logical changeset generation v3
Hi,
In response to this you will soon find the 14 patches that currently
implement $subject. I'll go over each one after showing off for a bit:
Start a postgres instance (with pg_hba.conf allowing replication connections):
$ postgres -D ~/tmp/pgdev-lcr \
-c wal_level=logical \
-c max_wal_senders=10 \
-c max_logical_slots=10 \
-c wal_keep_segments=100 \
-c log_line_prefix="[%p %x] "
Start the changelog receiver:
$ pg_receivellog -h /tmp -f /dev/stderr -d postgres -v
Generate changes:
$ psql -h /tmp postgres <<EOF
DROP TABLE IF EXISTS replication_example;
CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
-- plain insert
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
-- plain update
UPDATE replication_example SET somedata = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));
-- plain delete
DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq'));
-- wrapped in a transaction
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
UPDATE replication_example SET somedata = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));
DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq'));
COMMIT;
-- don't write out aborted data
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (2, 1);
UPDATE replication_example SET somedata = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));
DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq'));
ROLLBACK;
-- add a column
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (3, 1);
ALTER TABLE replication_example ADD COLUMN bar int;
INSERT INTO replication_example(somedata, text, bar) VALUES (3, 1, 1);
COMMIT;
-- once more outside
INSERT INTO replication_example(somedata, text, bar) VALUES (4, 1, 1);
-- DDL with table rewrite
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (5, 1);
ALTER TABLE replication_example RENAME COLUMN text TO somenum;
INSERT INTO replication_example(somedata, somenum) VALUES (5, 2);
ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING (somenum::int4);
INSERT INTO replication_example(somedata, somenum) VALUES (5, 3);
COMMIT;
EOF
And the results printed by pg_receivellog:
BEGIN 16556826
COMMIT 16556826
BEGIN 16556827
table "replication_example_id_seq": INSERT: sequence_name[name]:replication_example_id_seq last_value[int8]:1 start_value[int8]:1 increment_by[int8]:1 max_value[int8]:9223372036854775807 min_value[int8]:1 cache_value[int8]:1 log_cnt[int8]:0 is_cycled[bool]:f is_called[bool]:f
COMMIT 16556827
BEGIN 16556828
table "replication_example": INSERT: id[int4]:1 somedata[int4]:1 text[varchar]:1
COMMIT 16556828
BEGIN 16556829
table "replication_example": UPDATE: id[int4]:1 somedata[int4]:-1 text[varchar]:1
COMMIT 16556829
BEGIN 16556830
table "replication_example": DELETE (pkey): id[int4]:1
COMMIT 16556830
BEGIN 16556833
table "replication_example": INSERT: id[int4]:4 somedata[int4]:3 text[varchar]:1
table "replication_example": INSERT: id[int4]:5 somedata[int4]:3 text[varchar]:1 bar[int4]:1
COMMIT 16556833
BEGIN 16556834
table "replication_example": INSERT: id[int4]:6 somedata[int4]:4 text[varchar]:1 bar[int4]:1
COMMIT 16556834
BEGIN 16556835
table "replication_example": INSERT: id[int4]:7 somedata[int4]:5 text[varchar]:1 bar[int4]:(null)
table "replication_example": INSERT: id[int4]:8 somedata[int4]:5 somenum[varchar]:2 bar[int4]:(null)
table "pg_temp_74943": INSERT: id[int4]:4 somedata[int4]:3 somenum[int4]:1 bar[int4]:(null)
table "pg_temp_74943": INSERT: id[int4]:5 somedata[int4]:3 somenum[int4]:1 bar[int4]:1
table "pg_temp_74943": INSERT: id[int4]:6 somedata[int4]:4 somenum[int4]:1 bar[int4]:1
table "pg_temp_74943": INSERT: id[int4]:7 somedata[int4]:5 somenum[int4]:1 bar[int4]:(null)
table "pg_temp_74943": INSERT: id[int4]:8 somedata[int4]:5 somenum[int4]:2 bar[int4]:(null)
table "replication_example": INSERT: id[int4]:9 somedata[int4]:5 somenum[int4]:3 bar[int4]:(null)
COMMIT 16556835
As you can see above, we can decode WAL in the presence of nearly all
forms of DDL. The plugin that produced these changes is intended to be
added to contrib and is fairly small and uncomplicated.
An interesting data point: in the very preliminary benchmarking I have
done, even the textual decoding could keep up with a full-tilt pgbench
-c16 -j16 -M prepared on my (somewhat larger) workstation. The WAL space
overhead was less than 1% between two freshly initdb'ed clusters,
comparing wal_level=hot_standby with wal_level=logical.
With a custom pgbench script I can saturate the decoding to the effect
that it lags a second or so, but once I write out the data in a binary
format it can keep up again.
The biggest overhead currently is that Global/RecentXmin advance more
slowly, but that can be greatly improved by logging xl_running_xacts
records more frequently than just at every checkpoint.
A short overview over the patches in this series:
* Add minimal binary heap implementation
Abhijit submitted a nicer version of this; the plan is to rebase on top
of that once people are happy with the interface.
(unchanged)
* Add support for a generic wal reading facility dubbed XLogReader
There's some discussion in a separate CF topic about what's the best way
to implement this.
(unchanged)
* Add simple xlogdump tool
Very nice for debugging; I couldn't have developed this without it.
Obviously not a prerequisite for committing this feature, but still
pretty worthwhile.
(quite a bit updated, still bad build infrastructure)
* Add a new RELFILENODE syscache to fetch a pg_class entry via
(reltablespace, relfilenode)
Relatively simple, somewhat contentious due to some uniqueness
issues. I would very much welcome input from somebody with syscache
experience on this. It was previously suggested to write something like
attoptcache.c for this, but to me that seems like code duplication. We
can go that route though.
(unchanged)
* Add a new relmapper.c function RelationMapFilenodeToOid that acts as a
reverse of RelationMapOidToFilenode
Simple. I don't even think it's contentious... Just wasn't needed before.
(unchanged)
* Add a new function pg_relation_by_filenode to look up a relation
given the tablespace and the filenode OIDs
Just a nice to have thing for debugging, not a prerequisite for the
feature.
(unchanged)
* Introduce InvalidCommandId and declare that to be the new maximum for
CommandCounterIncrement
Uncomplicated and I hope uncontentious.
(new)
* Store the number of subtransactions in xl_running_xacts separately from
toplevel xids
Increases the size of xl_running_xacts by 4 bytes in the worst case,
decreases it in some others. Improves the efficiency of some HS
operations.
Should be ok?
(new)
* Adjust all *Satisfies routines to take a HeapTuple instead of a
HeapTupleHeader
Not sure if people will complain about this? It's rather simple, given
that the HeapTupleSatisfiesVisibility wrapper already took a
HeapTupleHeader as parameter.
(new)
* Allow walsenders to connect to a specific database
This has been requested by others. I think we need to work on the
external interface a bit, should be ok otherwise.
(new)
* Introduce wal decoding via catalog timetravel
This is the meat of the feature. I think this is going in a good
direction, still needs some work, but architectural review can really
start now. (more later)
(heavily changed)
* Add a simple decoding module in contrib named 'test_decoding'
The much requested example contrib module.
(new)
* Introduce pg_receivellog, the pg_receivexlog equivalent for logical
changes
Debugging tool to receive changes and write them to a file. Needs some
more options and probably shouldn't live inside pg_basebackup's
directory.
(new)
* design document v2.3 and snapshot building design doc v0.2
(unchanged)
There remains quite a bit to be done, but I think the state of the patch
has improved considerably. The biggest thing now is to get input on the
user-facing parts so we can reach agreement there.
Todo:
* testing infrastructure (isolationtester)
* persistence/spilling to disk of built snapshots, long-running
transactions
* user docs
* more frequent lowering of xmins
* more docs about the internals
* support for user declared catalog tables
* actual exporting of initial pg_export snapshots after
INIT_LOGICAL_REPLICATION
* own shared memory segment instead of piggybacking on walsender's
* nicer interface between snapbuild.c, reorderbuffer.c, decode.c and the
outside.
* more frequent xl_running_xacts records so xmin can be upped more frequently
Please comment!
Happy and tired,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Will be replaced by the "binaryheap.[ch]" from Abhijit once it's been reviewed.
---
src/backend/lib/Makefile | 3 +-
src/backend/lib/simpleheap.c | 255 +++++++++++++++++++++++++++++++++++++++++++
src/include/lib/simpleheap.h | 91 +++++++++++++++
3 files changed, 348 insertions(+), 1 deletion(-)
create mode 100644 src/backend/lib/simpleheap.c
create mode 100644 src/include/lib/simpleheap.h
Attachments:
0001-Add-minimal-binary-heap-implementation.patch (text/x-patch)
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 98ce3d7..c2bc5ba 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,6 +12,7 @@ subdir = src/backend/lib
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = ilist.o stringinfo.o
+
+OBJS = ilist.o simpleheap.o stringinfo.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/simpleheap.c b/src/backend/lib/simpleheap.c
new file mode 100644
index 0000000..825d0a8
--- /dev/null
+++ b/src/backend/lib/simpleheap.c
@@ -0,0 +1,255 @@
+/*-------------------------------------------------------------------------
+ *
+ * simpleheap.c
+ * A simple binary heap implementation
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/lib/simpleheap.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <math.h>
+
+#include "lib/simpleheap.h"
+
+static inline int
+simpleheap_left_off(size_t i)
+{
+ return 2 * i + 1;
+}
+
+static inline int
+simpleheap_right_off(size_t i)
+{
+ return 2 * i + 2;
+}
+
+static inline int
+simpleheap_parent_off(size_t i)
+{
+ return floor((i - 1) / 2);
+}
+
+/* sift up */
+static void
+simpleheap_sift_up(simpleheap *heap, size_t node_off);
+
+/* sift down */
+static void
+simpleheap_sift_down(simpleheap *heap, size_t node_off);
+
+static inline void
+simpleheap_swap(simpleheap *heap, size_t a, size_t b)
+{
+ simpleheap_kv swap;
+ swap.value = heap->values[a].value;
+ swap.key = heap->values[a].key;
+
+ heap->values[a].value = heap->values[b].value;
+ heap->values[a].key = heap->values[b].key;
+
+ heap->values[b].key = swap.key;
+ heap->values[b].value = swap.value;
+}
+
+/* sift down */
+static void
+simpleheap_sift_down(simpleheap *heap, size_t node_off)
+{
+ /* manually unrolled tail recursion */
+ while (true)
+ {
+ size_t left_off = simpleheap_left_off(node_off);
+ size_t right_off = simpleheap_right_off(node_off);
+ size_t swap_off = 0;
+
+ /* only one child can violate the heap property after a change */
+
+ /* check left child */
+ if (left_off < heap->size &&
+ heap->compare(&heap->values[left_off],
+ &heap->values[node_off]) < 0)
+ {
+ /* heap condition violated */
+ swap_off = left_off;
+ }
+
+ /* check right child */
+ if (right_off < heap->size &&
+ heap->compare(&heap->values[right_off],
+ &heap->values[node_off]) < 0)
+ {
+ /* heap condition violated */
+
+ /* swap with the smaller child */
+ if (!swap_off ||
+ heap->compare(&heap->values[right_off],
+ &heap->values[left_off]) < 0)
+ {
+ swap_off = right_off;
+ }
+ }
+
+ if (!swap_off)
+ {
+ /* heap condition fulfilled, abort */
+ break;
+ }
+
+ /* swap node with the child violating the property */
+ simpleheap_swap(heap, swap_off, node_off);
+
+ /* recurse, check child subtree */
+ node_off = swap_off;
+ }
+}
+
+/* sift up */
+static void
+simpleheap_sift_up(simpleheap *heap, size_t node_off)
+{
+ /* manually unrolled tail recursion */
+ /* stop at the root; node 0 has no parent (parent_off would underflow) */
+ while (node_off != 0)
+ {
+ size_t parent_off = simpleheap_parent_off(node_off);
+
+ if (heap->compare(&heap->values[node_off],
+ &heap->values[parent_off]) < 0)
+ {
+ /* heap property violated */
+ simpleheap_swap(heap, node_off, parent_off);
+
+ /* recurse */
+ node_off = parent_off;
+ }
+ else
+ break;
+ }
+}
+
+simpleheap*
+simpleheap_allocate(size_t allocate)
+{
+ simpleheap* heap = palloc(sizeof(simpleheap));
+ heap->values = palloc(sizeof(simpleheap_kv) * allocate);
+ heap->size = 0;
+ heap->space = allocate;
+ return heap;
+}
+
+void
+simpleheap_free(simpleheap* heap)
+{
+ pfree(heap->values);
+ pfree(heap);
+}
+
+/* initial building of a heap */
+void
+simpleheap_build(simpleheap *heap)
+{
+ int i;
+
+ for (i = simpleheap_parent_off(heap->size - 1); i >= 0; i--)
+ {
+ simpleheap_sift_down(heap, i);
+ }
+}
+
+/*
+ * Change the key of the first (top) element and restore the heap property.
+ */
+void
+simpleheap_change_key(simpleheap *heap, void* key)
+{
+ size_t next_off = 0;
+ int ret;
+ simpleheap_kv* kv;
+
+ heap->values[0].key = key;
+
+ /* no need to do anything if there is only one element */
+ if (heap->size == 1)
+ {
+ return;
+ }
+ else if (heap->size == 2)
+ {
+ next_off = 1;
+ }
+ else
+ {
+ ret = heap->compare(
+ &heap->values[simpleheap_left_off(0)],
+ &heap->values[simpleheap_right_off(0)]);
+
+ if (ret == -1)
+ next_off = simpleheap_left_off(0);
+ else
+ next_off = simpleheap_right_off(0);
+ }
+
+ /*
+ * compare with the next key. If we're still smaller we can skip
+ * restructuring heap
+ */
+ ret = heap->compare(
+ &heap->values[0],
+ &heap->values[next_off]);
+
+ if (ret == -1)
+ return;
+
+ kv = simpleheap_remove_first(heap);
+ simpleheap_add(heap, kv->key, kv->value);
+}
+
+void
+simpleheap_add_unordered(simpleheap* heap, void *key, void *value)
+{
+ if (heap->size >= heap->space)
+ elog(ERROR, "cannot add to a full simpleheap");
+ heap->values[heap->size].key = key;
+ heap->values[heap->size++].value = value;
+}
+
+void
+simpleheap_add(simpleheap* heap, void *key, void *value)
+{
+ simpleheap_add_unordered(heap, key, value);
+ simpleheap_sift_up(heap, heap->size - 1);
+}
+
+simpleheap_kv*
+simpleheap_first(simpleheap* heap)
+{
+ if (heap->size == 0)
+ elog(ERROR, "simpleheap is empty");
+ return &heap->values[0];
+}
+
+
+simpleheap_kv*
+simpleheap_remove_first(simpleheap* heap)
+{
+ if (heap->size == 0)
+ elog(ERROR, "simpleheap is empty");
+
+ if (heap->size == 1)
+ {
+ heap->size--;
+ return &heap->values[0];
+ }
+
+ simpleheap_swap(heap, 0, heap->size - 1);
+ simpleheap_sift_down(heap, 0);
+
+ heap->size--;
+ return &heap->values[heap->size];
+}
diff --git a/src/include/lib/simpleheap.h b/src/include/lib/simpleheap.h
new file mode 100644
index 0000000..ab2d2ea
--- /dev/null
+++ b/src/include/lib/simpleheap.h
@@ -0,0 +1,91 @@
+/*
+ * simpleheap.h
+ *
+ * A simple binary heap implementation
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * src/include/lib/simpleheap.h
+ */
+
+#ifndef SIMPLEHEAP_H
+#define SIMPLEHEAP_H
+
+typedef struct simpleheap_kv
+{
+ void* key;
+ void* value;
+} simpleheap_kv;
+
+typedef struct simpleheap
+{
+ size_t size;
+ size_t space;
+ /*
+ * Has to return:
+ * -1 iff a < b
+ * 0 iff a == b
+ * +1 iff a > b
+ */
+ int (*compare)(simpleheap_kv* a, simpleheap_kv* b);
+
+ simpleheap_kv *values;
+} simpleheap;
+
+simpleheap*
+simpleheap_allocate(size_t capacity);
+
+void
+simpleheap_free(simpleheap* heap);
+
+/*
+ * Add values without enforcing the heap property.
+ *
+ * simpleheap_build has to be called before relying on anything that needs a
+ * valid heap. This is mostly useful for initially filling a heap and staying
+ * in O(n) instead of O(n log n).
+ */
+void
+simpleheap_add_unordered(simpleheap* heap, void *key, void *value);
+
+/*
+ * Insert key/value pair
+ *
+ * O(log n)
+ */
+void
+simpleheap_add(simpleheap* heap, void *key, void *value);
+
+/*
+ * Returns the first element as indicated by comparisons of the ->compare()
+ * operator
+ *
+ * O(1)
+ */
+simpleheap_kv*
+simpleheap_first(simpleheap* heap);
+
+/*
+ * Returns and removes the first element as indicated by comparisons of the
+ * ->compare() operator
+ *
+ * O(log n)
+ */
+simpleheap_kv*
+simpleheap_remove_first(simpleheap* heap);
+
+void
+simpleheap_change_key(simpleheap *heap, void* newkey);
+
+
+/*
+ * make the heap fulfill the heap condition. Only needed if elements were
+ * added with simpleheap_add_unordered()
+ *
+ * O(n)
+ */
+void
+simpleheap_build(simpleheap *heap);
+
+
+#endif /* SIMPLEHEAP_H */
Features:
- streaming reading/writing
- filtering
- reassembly of records
Reusing the ReadRecord infrastructure from code that is not tightly
integrated into xlog.c is rather hard, and would require changes to
rather integral parts of the recovery code, which doesn't seem like a
good idea.
Missing:
- "compressing" the stream when removing uninteresting records
- writing out correct CRCs
- separating reader/writer
---
src/backend/access/transam/Makefile | 2 +-
src/backend/access/transam/xlogreader.c | 1032 +++++++++++++++++++++++++++++++
src/include/access/xlogreader.h | 264 ++++++++
3 files changed, 1297 insertions(+), 1 deletion(-)
create mode 100644 src/backend/access/transam/xlogreader.c
create mode 100644 src/include/access/xlogreader.h
Attachments:
0002-Add-support-for-a-generic-wal-reading-facility-dubbe.patch (text/x-patch)
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 700cfd8..eb6cfc5 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -14,7 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
timeline.o twophase.o twophase_rmgr.o xlog.o xlogarchive.o xlogfuncs.o \
- xlogutils.o
+ xlogreader.o xlogutils.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
new file mode 100644
index 0000000..71e7d52
--- /dev/null
+++ b/src/backend/access/transam/xlogreader.c
@@ -0,0 +1,1032 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogreader.c
+ * Generic xlog reading facility
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogreader.c
+ *
+ * NOTES
+ * Documentation about how to use this interface can be found in
+ * xlogreader.h, more specifically in the definition of the
+ * XLogReaderState struct where all parameters are documented.
+ *
+ * TODO:
+ * * more extensive validation of read records
+ * * separation of reader/writer
+ * * customizable error response
+ * * usable without backend code around
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog_internal.h"
+#include "access/transam.h"
+#include "catalog/pg_control.h"
+#include "access/xlogreader.h"
+
+/* If (very) verbose debugging is needed:
+ * #define VERBOSE_DEBUG
+ */
+
+XLogReaderState*
+XLogReaderAllocate(void)
+{
+ XLogReaderState* state = (XLogReaderState*)malloc(sizeof(XLogReaderState));
+ int i;
+
+ if (!state)
+ goto oom;
+
+ memset(&state->buf.record, 0, sizeof(XLogRecord));
+ state->buf.record_data_size = XLOG_BLCKSZ*8;
+ state->buf.record_data =
+ malloc(state->buf.record_data_size);
+
+ if (!state->buf.record_data)
+ goto oom;
+
+ memset(state->buf.record_data, 0, state->buf.record_data_size);
+ state->buf.origptr = InvalidXLogRecPtr;
+
+ for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
+ {
+ state->buf.bkp_block_data[i] =
+ malloc(BLCKSZ);
+
+ if (!state->buf.bkp_block_data[i])
+ goto oom;
+ }
+
+ state->is_record_interesting = NULL;
+ state->writeout_data = NULL;
+ state->finished_record = NULL;
+ state->private_data = NULL;
+ state->output_buffer_size = 0;
+
+ XLogReaderReset(state);
+ return state;
+
+oom:
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("failed while allocating an XLogReader")));
+ return NULL;
+}
+
+void
+XLogReaderFree(XLogReaderState* state)
+{
+ int i;
+
+ for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
+ {
+ free(state->buf.bkp_block_data[i]);
+ }
+
+ free(state->buf.record_data);
+
+ free(state);
+}
+
+void
+XLogReaderReset(XLogReaderState* state)
+{
+ state->in_record = false;
+ state->in_record_header = false;
+ state->do_reassemble_record = false;
+ state->in_bkp_blocks = 0;
+ state->in_bkp_block_header = false;
+ state->in_skip = false;
+ state->remaining_size = 0;
+ state->already_written_size = 0;
+ state->incomplete = false;
+ state->initialized = false;
+ state->needs_input = false;
+ state->needs_output = false;
+ state->stop_at_record_boundary = false;
+}
+
+static inline bool
+XLogReaderHasInput(XLogReaderState* state, Size size)
+{
+ XLogRecPtr tmp = state->curptr;
+ XLByteAdvance(tmp, size);
+ if (XLByteLE(state->endptr, tmp))
+ return false;
+ return true;
+}
+
+static inline bool
+XLogReaderHasOutput(XLogReaderState* state, Size size)
+{
+ /* if we don't do output or have no limits in the output size */
+ if (state->writeout_data == NULL || state->output_buffer_size == 0)
+ return true;
+
+ if (state->already_written_size + size > state->output_buffer_size)
+ return false;
+
+ return true;
+}
+
+static inline bool
+XLogReaderHasSpace(XLogReaderState* state, Size size)
+{
+ if (!XLogReaderHasInput(state, size))
+ return false;
+
+ if (!XLogReaderHasOutput(state, size))
+ return false;
+
+ return true;
+}
+
+/* ----------------------------------------------------------------------------
+ * Write out data iff
+ * 1. we have a writeout_data callback
+ * 2. we are at or past startptr
+ *
+ * The 2nd condition requires that we will never start a write before startptr
+ * and finish after it. The code needs to guarantee this.
+ * ----------------------------------------------------------------------------
+ */
+static void
+XLogReaderInternalWrite(XLogReaderState* state, char* data, Size size)
+{
+ /* no point in doing any checks if we don't have a write callback */
+ if (!state->writeout_data)
+ return;
+
+ if (XLByteLT(state->curptr, state->startptr))
+ return;
+
+ state->writeout_data(state, data, size);
+}
+
+/*
+ * Change state so we read the next bkp block if there is one. If there is none
+ * return false so that the caller can consider the record finished.
+ */
+static bool
+XLogReaderInternalNextBkpBlock(XLogReaderState* state)
+{
+ Assert(state->in_record);
+ Assert(state->remaining_size == 0);
+
+ /*
+ * only continue with in_record=true if we have bkp block
+ */
+ while (state->in_bkp_blocks)
+ {
+ if (state->buf.record.xl_info &
+ XLR_BKP_BLOCK(XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks))
+ {
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "reading bkp block %u", XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks);
+#endif
+ break;
+ }
+ state->in_bkp_blocks--;
+ }
+
+ if (!state->in_bkp_blocks)
+ return false;
+
+ /* bkp blocks are stored without regard for alignment */
+
+ state->in_bkp_block_header = true;
+ state->remaining_size = sizeof(BkpBlock);
+
+ return true;
+}
+
+void
+XLogReaderRead(XLogReaderState* state)
+{
+ state->needs_input = false;
+ state->needs_output = false;
+
+ /*
+ * Do some basic sanity checking and setup if we're starting anew.
+ */
+ if (!state->initialized)
+ {
+ if (!state->read_page)
+ elog(ERROR, "The read_page callback needs to be set");
+
+ state->initialized = true;
+ /*
+ * we need to start reading at the beginning of the page to understand
+ * what we are currently reading. We will skip over that because we
+ * check curptr < startptr later.
+ */
+ state->curptr = state->startptr;
+ state->curptr -= state->startptr % XLOG_BLCKSZ;
+
+ Assert(state->curptr % XLOG_BLCKSZ == 0);
+
+ elog(LOG, "start reading from %X/%X, scrolled back to %X/%X",
+ (uint32) (state->startptr >> 32), (uint32) state->startptr,
+ (uint32) (state->curptr >> 32), (uint32) state->curptr);
+ }
+ else
+ {
+ /*
+ * We didn't finish reading the last time round. Since then new data
+ * could have been appended to the current page. So we need to update
+ * our copy of that.
+ *
+ * XXX: We could tie that to state->needs_input but that doesn't seem
+ * worth the complication atm.
+ */
+ XLogRecPtr rereadptr = state->curptr;
+ rereadptr -= rereadptr % XLOG_BLCKSZ;
+
+ XLByteAdvance(rereadptr, SizeOfXLogShortPHD);
+
+ if(!XLByteLE(rereadptr, state->endptr))
+ goto not_enough_input;
+
+ rereadptr -= rereadptr % XLOG_BLCKSZ;
+
+ state->read_page(state, state->cur_page, rereadptr);
+
+ /*
+ * we will only rely on this data being valid if we are allowed to read
+ * that far, so its safe to just always read the header. read_page
+ * always returns a complete page even though its contents may be
+ * invalid.
+ */
+ state->page_header = (XLogPageHeader)state->cur_page;
+ state->page_header_size = XLogPageHeaderSize(state->page_header);
+ }
+
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "starting reading for %X/%X from %X/%X",
+ (uint32)(state->startptr >> 32), (uint32) state->startptr,
+ (uint32)(state->curptr >> 32), (uint32) state->curptr);
+#endif
+ /*
+ * Iterate over the data and reassemble it until we reached the end of the
+ * data. As we advance curptr inside the loop we need to recheck whether we
+ * have space inside as well.
+ */
+ while (XLByteLT(state->curptr, state->endptr))
+ {
+ /* how much space is left in the current block */
+ uint32 len_in_block;
+
+ /*
+ * did we read a partial xlog record due to input/output constraints?
+ * If yes, we need to signal that to the caller so it can be handled
+ * sensibly there. E.g. by waiting on a latch till more xlog is
+ * available.
+ */
+ bool partial_read = false;
+ bool partial_write = false;
+
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "one loop start: record: %u header %u, skip: %u bkb_block: %d in_bkp_header: %u curptr: %X/%X remaining: %u, off: %u",
+ state->in_record, state->in_record_header, state->in_skip,
+ state->in_bkp_blocks, state->in_bkp_block_header,
+ (uint32)(state->curptr >> 32), (uint32)state->curptr,
+ state->remaining_size,
+ (uint32)(state->curptr % XLOG_BLCKSZ));
+#endif
+
+ /*
+ * at a page boundary, read the header
+ */
+ if (state->curptr % XLOG_BLCKSZ == 0)
+ {
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "reading page header, at %X/%X",
+ (uint32)(state->curptr >> 32), (uint32)state->curptr);
+#endif
+ /*
+ * check whether we can read enough to see the short header, we
+ * need to read the short header's xlp_info to know whether this is
+ * a short or a long header.
+ */
+ if (!XLogReaderHasInput(state, SizeOfXLogShortPHD))
+ goto not_enough_input;
+
+ state->read_page(state, state->cur_page, state->curptr);
+ state->page_header = (XLogPageHeader)state->cur_page;
+ state->page_header_size = XLogPageHeaderSize(state->page_header);
+
+ /* check that we have enough space to read/write the full header */
+ if (!XLogReaderHasInput(state, state->page_header_size))
+ goto not_enough_input;
+
+ if (!XLogReaderHasOutput(state, state->page_header_size))
+ goto not_enough_output;
+
+ XLogReaderInternalWrite(state, state->cur_page, state->page_header_size);
+
+ XLByteAdvance(state->curptr, state->page_header_size);
+
+ if (state->page_header->xlp_info & XLP_FIRST_IS_CONTRECORD)
+ {
+ if (!state->in_record)
+ {
+ /*
+ * we need to support this case for initializing a cluster
+ * because we need to read/writeout a full page but there
+ * may be none without records being split across.
+ *
+ * If we are before startptr there is nothing special about
+ * this case. Most pages start with a contrecord.
+ */
+ if(!XLByteLT(state->curptr, state->startptr))
+ {
+ elog(WARNING, "contrecord although we are not in a record at %X/%X, starting at %X/%X",
+ (uint32)(state->curptr >> 32), (uint32)state->curptr,
+ (uint32)(state->startptr >> 32), (uint32)state->startptr);
+ }
+ state->in_record = true;
+ state->check_crc = false;
+ state->do_reassemble_record = false;
+ state->remaining_size = state->page_header->xlp_rem_len;
+ continue;
+ }
+ else
+ {
+ if (state->page_header->xlp_rem_len < state->remaining_size)
+ elog(PANIC, "remaining length is smaller than to be read data. xlp_rem_len: %u needed: %u",
+ state->page_header->xlp_rem_len, state->remaining_size
+ );
+ }
+ }
+ else if (state->in_record)
+ {
+ elog(PANIC, "no contrecord although were in a record that continued onto the next page. info %hhu at page %X/%X",
+ state->page_header->xlp_info,
+ (uint32)(state->page_header->xlp_pageaddr >> 32),
+ (uint32)state->page_header->xlp_pageaddr);
+ }
+ }
+
+ /*
+ * If a record will start next, skip over alignment padding.
+ */
+ if (!state->in_record)
+ {
+ /*
+ * a record must be stored aligned. So skip as far we need to
+ * comply with that.
+ */
+ Size skiplen;
+ skiplen = MAXALIGN(state->curptr) - state->curptr;
+
+ if (skiplen)
+ {
+ if (!XLogReaderHasSpace(state, skiplen))
+ {
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "not aligning bc of space");
+#endif
+ /*
+ * We don't have enough space to read/write the alignment
+ * bytes, so fake up a skip-state
+ */
+ state->in_record = true;
+ state->check_crc = false;
+ state->in_skip = true;
+ state->remaining_size = skiplen;
+
+ if (!XLogReaderHasInput(state, skiplen))
+ goto not_enough_input;
+ goto not_enough_output;
+ }
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "aligning from %X/%X to %X/%X, skips %lu",
+ (uint32)(state->curptr >> 32), (uint32)state->curptr,
+ (uint32)((state->curptr + skiplen) >> 32),
+ (uint32)(state->curptr + skiplen),
+ skiplen
+ );
+#endif
+ XLogReaderInternalWrite(state, NULL, skiplen);
+
+ XLByteAdvance(state->curptr, skiplen);
+
+ /*
+ * full pages are not treated as continuations, so restart on
+ * the beginning of the new page.
+ */
+ if ((state->curptr % XLOG_BLCKSZ) == 0)
+ continue;
+ }
+ }
+
+ /*
+ * --------------------------------------------------------------------
+ * Start to read a record
+ * --------------------------------------------------------------------
+ */
+ if (!state->in_record)
+ {
+ state->in_record = true;
+ state->in_record_header = true;
+ state->check_crc = true;
+
+ /*
+ * If the record starts before startptr we're not interested in its
+ * contents. There is also no point in reassembling if we're not
+ * analyzing the contents.
+ *
+ * If every record needs to be processed by finished_record, restarts
+ * need to begin after the end of the last record.
+ *
+ * See state->restart_ptr for that point.
+ */
+ if ((state->finished_record == NULL &&
+ !state->stop_at_record_boundary) ||
+ XLByteLT(state->curptr, state->startptr)){
+ state->do_reassemble_record = false;
+ }
+ else
+ state->do_reassemble_record = true;
+
+ state->remaining_size = SizeOfXLogRecord;
+
+ /*
+ * we quickly lose the original address of a record as we can skip
+ * records and such, so keep the original addresses.
+ */
+ state->buf.origptr = state->curptr;
+
+ INIT_CRC32(state->next_crc);
+ }
+
+ Assert(state->in_record);
+
+ /*
+ * Compute how much space on the current page is left and how much of
+ * that we actually are interested in.
+ */
+
+ /* amount of space on page */
+ if (state->curptr % XLOG_BLCKSZ == 0)
+ len_in_block = 0;
+ else
+ len_in_block = XLOG_BLCKSZ - (state->curptr % XLOG_BLCKSZ);
+
+ /* we have more data available than we need, so read only as much as needed */
+ if (len_in_block > state->remaining_size)
+ len_in_block = state->remaining_size;
+
+ /*
+ * Handle constraints set by startptr, endptr and the size of the
+ * output buffer.
+ *
+ * Normally we use XLogReaderHasSpace for that, but that's not
+ * convenient here because we want to read data in parts. It also
+ * doesn't handle splitting around startptr. So, open-code the logic
+ * for that.
+ */
+
+ /* to make sure we always write out in the same chunks, split at startptr */
+ if (XLByteLT(state->curptr, state->startptr) &&
+ (state->curptr + len_in_block) > state->startptr )
+ {
+#ifdef VERBOSE_DEBUG
+ Size cur_len = len_in_block;
+#endif
+ len_in_block = state->startptr - state->curptr;
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "truncating len_in_block due to startptr from %lu to %u",
+ cur_len, len_in_block);
+#endif
+ }
+
+ /* do we have enough valid data to read the current block? */
+ if (state->curptr + len_in_block > state->endptr)
+ {
+#ifdef VERBOSE_DEBUG
+ Size cur_len = len_in_block;
+#endif
+ len_in_block = state->endptr - state->curptr;
+ partial_read = true;
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "truncating len_in_block due to endptr %X/%X %lu to %i at %X/%X",
+ (uint32)(state->startptr >> 32), (uint32)state->startptr,
+ cur_len, len_in_block,
+ (uint32)(state->curptr >> 32), (uint32)state->curptr);
+#endif
+ }
+
+ /* can we write what we read? */
+ if (state->writeout_data != NULL && state->output_buffer_size != 0
+ && len_in_block > (state->output_buffer_size - state->already_written_size))
+ {
+#ifdef VERBOSE_DEBUG
+ Size cur_len = len_in_block;
+#endif
+ len_in_block = state->output_buffer_size - state->already_written_size;
+ partial_write = true;
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "truncating len_in_block due to output_buffer_size %lu to %i",
+ cur_len, len_in_block);
+#endif
+ }
+
+ /* --------------------------------------------------------------------
+ * copy data of the size determined above to whatever we are currently
+ * reading.
+ * --------------------------------------------------------------------
+ */
+
+ /* if we're skipping, just write out zeroes */
+ if (state->in_skip)
+ {
+ /* write out zero data, the original content is boring */
+ XLogReaderInternalWrite(state, NULL, len_in_block);
+
+ /*
+ * we may not need this here because we're skipping over something
+ * really uninteresting but keeping track of that would be
+ * unnecessarily complicated.
+ */
+ COMP_CRC32(state->next_crc,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+ }
+ /* reassemble the XLogRecord struct, quite likely in one-go */
+ else if (state->in_record_header)
+ {
+ /*
+ * Need to clamp to sizeof(XLogRecord); we don't have the padding
+ * in buf.record.
+ */
+ Size already_written = SizeOfXLogRecord - state->remaining_size;
+ Size padding_size = SizeOfXLogRecord - sizeof(XLogRecord);
+ Size copysize = len_in_block;
+
+ if (state->remaining_size - len_in_block < padding_size)
+ copysize = Max(0, state->remaining_size - (int)padding_size);
+
+ memcpy((char*)&state->buf.record + already_written,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ copysize);
+
+ XLogReaderInternalWrite(state,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "copied part of the record. len_in_block %u, remaining: %u",
+ len_in_block, state->remaining_size);
+#endif
+ }
+ /*
+ * copy data into the current backup block header so we have enough
+ * knowledge to read the actual backup block afterwards
+ */
+ else if (state->in_bkp_block_header)
+ {
+ int blockno = XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks;
+ BkpBlock* bkpb = &state->buf.bkp_block[blockno];
+
+ Assert(state->in_bkp_blocks);
+
+ memcpy((char*)bkpb + sizeof(BkpBlock) - state->remaining_size,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+
+ XLogReaderInternalWrite(state,
+ state->cur_page + ((uint32)state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+
+ COMP_CRC32(state->next_crc,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "copying bkp header for block %d, %u bytes, complete %lu at %X/%X rem %u",
+ blockno, len_in_block, sizeof(BkpBlock),
+ (uint32)(state->curptr >> 32), (uint32)state->curptr,
+ state->remaining_size);
+
+ if (state->remaining_size == len_in_block)
+ {
+ elog(LOG, "block off %u len %u", bkpb->hole_offset, bkpb->hole_length);
+ }
+#endif
+ }
+ /*
+ * Reassemble the current backup block; these are usually the biggest
+ * parts of individual XLogRecords, so this might take several rounds.
+ */
+ else if (state->in_bkp_blocks)
+ {
+ int blockno = XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks;
+ BkpBlock* bkpb = &state->buf.bkp_block[blockno];
+ char* data = state->buf.bkp_block_data[blockno];
+
+ if (state->do_reassemble_record)
+ {
+ memcpy(data + BLCKSZ - bkpb->hole_length - state->remaining_size,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+ }
+
+ XLogReaderInternalWrite(state,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+
+ COMP_CRC32(state->next_crc,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "copying %u bytes of data for bkp block %d, complete %u",
+ len_in_block, blockno, state->remaining_size);
+#endif
+ }
+ /*
+ * read the (rest) of the XLogRecord's data. Note that this is not the
+ * XLogRecord struct itself!
+ */
+ else if (state->in_record)
+ {
+ if (state->do_reassemble_record)
+ {
+ if(state->buf.record_data_size < state->buf.record.xl_len){
+ state->buf.record_data_size = state->buf.record.xl_len;
+ state->buf.record_data =
+ realloc(state->buf.record_data,
+ state->buf.record_data_size);
+ if(!state->buf.record_data)
+ elog(ERROR, "could not allocate memory for contents of an xlog record");
+ }
+
+ memcpy(state->buf.record_data
+ + state->buf.record.xl_len
+ - state->remaining_size,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+ }
+ XLogReaderInternalWrite(state,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+
+
+ COMP_CRC32(state->next_crc,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "copying %u bytes into a record at off %u",
+ len_in_block, (uint32)(state->curptr % XLOG_BLCKSZ));
+#endif
+ }
+
+ /* should handle wrapping around to next page */
+ XLByteAdvance(state->curptr, len_in_block);
+
+ /* do the math of how much we need to read next round */
+ state->remaining_size -= len_in_block;
+
+ /*
+ * --------------------------------------------------------------------
+ * we completed whatever we were reading. So, handle going to the next
+ * state.
+ * --------------------------------------------------------------------
+ */
+ if (state->remaining_size == 0)
+ {
+ /* completed reading - and potentially reassembling - the record */
+ if (state->in_record_header)
+ {
+ state->in_record_header = false;
+
+ /* ------------------------------------------------------------
+ * normally we don't look at the contents of xlog records here;
+ * XLOG_SWITCH is a special case though, as everything left in
+ * that segment won't be sensible content.
+ * So skip to the next segment.
+ * ------------------------------------------------------------
+ */
+ if (state->buf.record.xl_rmid == RM_XLOG_ID
+ && (state->buf.record.xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
+ {
+ /*
+ * Pretend the current data extends to end of segment
+ */
+ elog(LOG, "XLOG_SWITCH");
+ state->curptr += XLogSegSize - 1;
+ state->curptr -= state->curptr % XLogSegSize;
+
+ state->in_record = false;
+ Assert(!state->in_bkp_blocks);
+ Assert(!state->in_skip);
+ continue;
+ }
+ else if (state->is_record_interesting == NULL ||
+ state->is_record_interesting(state, &state->buf.record))
+ {
+ state->remaining_size = state->buf.record.xl_len;
+ Assert(state->in_bkp_blocks == 0);
+ Assert(!state->in_bkp_block_header);
+ Assert(!state->in_skip);
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "found interesting record at %X/%X, prev: %X/%X, rmid %hhu, tx %u, len %u tot %u",
+ (uint32)(state->buf.origptr >> 32), (uint32)state->buf.origptr,
+ (uint32)(state->buf.record.xl_prev >> 32), (uint32)(state->buf.record.xl_prev),
+ state->buf.record.xl_rmid, state->buf.record.xl_xid,
+ state->buf.record.xl_len, state->buf.record.xl_tot_len);
+#endif
+
+ }
+ /* ------------------------------------------------------------
+ * OK, everybody agrees: the contents of the current record are
+ * just plain boring. So fake up a record that replaces it with
+ * a NOOP record.
+ *
+ * FIXME: we should allow "compressing" the output here. That
+ * is write something that shows how long the record should be
+ * if everything is decompressed again. This can radically
+ * reduce space-usage over the wire.
+ * It could also be very useful for traditional SR by keeping
+ * unneeded BKP blocks from being transferred. For that we
+ * would need to recompute CRCs though, which we currently
+ * don't support.
+ * ------------------------------------------------------------
+ */
+ else
+ {
+ /*
+ * we need to fix up a fake record with correct length that
+ * can be written out.
+ */
+ XLogRecord spacer;
+
+ elog(LOG, "found boring record at %X/%X, rmid %hhu, tx %u, len %u tot %u",
+ (uint32)(state->buf.origptr >> 32), (uint32)state->buf.origptr,
+ state->buf.record.xl_rmid, state->buf.record.xl_xid,
+ state->buf.record.xl_len, state->buf.record.xl_tot_len);
+
+ /*
+ * xl_tot_len includes the size of the XLogRecord itself,
+ * which we have already read.
+ */
+ state->remaining_size = state->buf.record.xl_tot_len
+ - SizeOfXLogRecord;
+
+ state->in_record = true;
+ state->check_crc = true;
+ state->in_bkp_blocks = 0;
+ state->in_skip = true;
+
+ spacer.xl_prev = state->buf.origptr;
+ spacer.xl_xid = InvalidTransactionId;
+ spacer.xl_tot_len = state->buf.record.xl_tot_len;
+ spacer.xl_len = state->buf.record.xl_tot_len - SizeOfXLogRecord;
+ spacer.xl_rmid = RM_XLOG_ID;
+ spacer.xl_info = XLOG_NOOP;
+
+ XLogReaderInternalWrite(state, (char*)&spacer,
+ sizeof(XLogRecord));
+
+ /*
+ * write out the padding in a separate write, otherwise we
+ * would overrun the stack
+ */
+ XLogReaderInternalWrite(state, NULL,
+ SizeOfXLogRecord - sizeof(XLogRecord));
+
+ }
+ }
+ /*
+ * in the in_skip case we have already read past the backup blocks,
+ * since remaining_size was set from record->xl_tot_len, so everything
+ * is finished.
+ */
+ else if (state->in_skip)
+ {
+ state->in_record = false;
+ state->in_bkp_blocks = 0;
+ state->in_skip = false;
+ /* alignment is handled when starting to read a record */
+ }
+ /*
+ * We read the header of the current block. Start reading the
+ * content of that now.
+ */
+ else if (state->in_bkp_block_header)
+ {
+ BkpBlock* bkpb;
+ int blockno = XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks;
+
+ Assert(state->in_bkp_blocks);
+
+ bkpb = &state->buf.bkp_block[blockno];
+
+ if(bkpb->hole_length >= BLCKSZ)
+ {
+ elog(ERROR, "hole_length of block %u is %u but maximum is %u",
+ blockno, bkpb->hole_length, BLCKSZ);
+ }
+
+ if(bkpb->hole_offset >= BLCKSZ)
+ {
+ elog(ERROR, "hole_offset of block %u is %u but maximum is %u",
+ blockno, bkpb->hole_offset, BLCKSZ);
+ }
+
+ state->remaining_size = BLCKSZ - bkpb->hole_length;
+ state->in_bkp_block_header = false;
+
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "completed reading of header for %d, reading data now %u hole %u, off %u",
+ blockno, state->remaining_size, bkpb->hole_length,
+ bkpb->hole_offset);
+#endif
+ }
+ /*
+ * The current backup block is finished, more could be following
+ */
+ else if (state->in_bkp_blocks)
+ {
+ int blockno = XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks;
+ BkpBlock* bkpb;
+ char* bkpb_data;
+
+ Assert(!state->in_bkp_block_header);
+
+ bkpb = &state->buf.bkp_block[blockno];
+ bkpb_data = state->buf.bkp_block_data[blockno];
+
+ /*
+ * reassemble block to its entirety by removing the bkp_hole
+ * "compression"
+ */
+ if(bkpb->hole_length){
+ memmove(bkpb_data + bkpb->hole_offset + bkpb->hole_length,
+ bkpb_data + bkpb->hole_offset,
+ BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
+ memset(bkpb_data + bkpb->hole_offset,
+ 0,
+ bkpb->hole_length);
+ }
+
+ state->in_bkp_blocks--;
+
+ state->in_skip = false;
+
+ if(!XLogReaderInternalNextBkpBlock(state))
+ goto all_bkp_finished;
+
+ }
+ /*
+ * read a non-skipped record, start reading bkp blocks afterwards
+ */
+ else if (state->in_record)
+ {
+ Assert(!state->in_skip);
+
+ state->in_bkp_blocks = XLR_MAX_BKP_BLOCKS;
+
+ if(!XLogReaderInternalNextBkpBlock(state))
+ goto all_bkp_finished;
+ }
+ }
+ /*
+ * Something could only be partially read inside a single block
+ * because of input or output space constraints.
+ */
+ else if (partial_read)
+ {
+ partial_read = false;
+ goto not_enough_input;
+ }
+ else if (partial_write)
+ {
+ partial_write = false;
+ goto not_enough_output;
+ }
+ /*
+ * Data continues into the next block.
+ */
+ else
+ {
+ }
+
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "one loop end: record: %u header: %u, skip: %u bkb_block: %d in_bkp_header: %u curpos: %X/%X remaining: %u, off: %u",
+ state->in_record, state->in_record_header, state->in_skip,
+ state->in_bkp_blocks, state->in_bkp_block_header,
+ (uint32)(state->curptr >> 32), (uint32)state->curptr,
+ state->remaining_size,
+ (uint32)(state->curptr % XLOG_BLCKSZ));
+#endif
+ continue;
+
+ /*
+ * we fully read a record. Process its contents if needed and start
+ * reading the next record afterwards
+ */
+ all_bkp_finished:
+ {
+ Assert(state->in_record);
+ Assert(!state->in_skip);
+ Assert(!state->in_bkp_block_header);
+ Assert(!state->in_bkp_blocks);
+
+ state->in_record = false;
+
+ /* compute and verify crc */
+ COMP_CRC32(state->next_crc,
+ &state->buf.record,
+ offsetof(XLogRecord, xl_crc));
+
+ FIN_CRC32(state->next_crc);
+
+ if (state->check_crc &&
+ state->next_crc != state->buf.record.xl_crc) {
+ elog(ERROR, "crc mismatch: newly computed : %x, existing is %x",
+ state->next_crc, state->buf.record.xl_crc);
+ }
+
+ /*
+ * if we haven't reassembled the record there is no point in
+ * calling the finished callback because we do not have any
+ * interesting data. do_reassemble_record is false if we don't have
+ * a finished_record callback.
+ */
+ if (state->do_reassemble_record)
+ {
+ /* in stop_at_record_boundary mode that's a valid case */
+ if (state->finished_record)
+ {
+ state->finished_record(state, &state->buf);
+ }
+
+ if (state->stop_at_record_boundary)
+ goto out;
+ }
+
+ /* alignment is handled when starting to read a record */
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "finished record at %X/%X to %X/%X, already_written_size: %lu, reas = %d",
+ (uint32)(state->curptr >> 32), (uint32)state->curptr,
+ (uint32)(state->endptr >> 32), (uint32)state->endptr,
+ state->already_written_size, state->do_reassemble_record);
+#endif
+
+ }
+ }
+out:
+ /*
+ * We are finished; check whether we read everything completely, which
+ * may be useful for the caller.
+ */
+ if (state->in_skip)
+ {
+ state->incomplete = true;
+ }
+ else if (state->in_record)
+ {
+ state->incomplete = true;
+ }
+ else
+ {
+ state->incomplete = false;
+ }
+ return;
+
+not_enough_input:
+ /* signal we need more xlog and finish */
+ state->needs_input = true;
+ goto out;
+
+not_enough_output:
+ /* signal we need more space to write output to */
+ state->needs_output = true;
+ goto out;
+}
+
+XLogRecordBuffer*
+XLogReaderReadOne(XLogReaderState* state)
+{
+ bool was_set_to_stop = state->stop_at_record_boundary;
+ XLogRecPtr last_record = state->buf.origptr;
+
+ if (!was_set_to_stop)
+ state->stop_at_record_boundary = true;
+
+ XLogReaderRead(state);
+
+ if (!was_set_to_stop)
+ state->stop_at_record_boundary = false;
+
+ /* check that we fully read it and that it's not the same as the last one */
+ if (state->incomplete ||
+ XLByteEQ(last_record, state->buf.origptr))
+ return NULL;
+
+ return &state->buf;
+}
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
new file mode 100644
index 0000000..f45c90b
--- /dev/null
+++ b/src/include/access/xlogreader.h
@@ -0,0 +1,264 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogreader.h
+ *
+ * Generic xlog reading facility.
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/xlogreader.h
+ *
+ * NOTES
+ * Check the definition of the XLogReaderState struct for instructions on
+ * how to use the XLogReader infrastructure.
+ *
+ * The basic idea is to allocate an XLogReaderState via
+ * XLogReaderAllocate, fill out the wanted callbacks, set startptr/endptr
+ * and call XLogReaderRead(state). That will iterate over the records
+ * as long as it has enough input to reassemble them, calling
+ * is_record_interesting/finished_record for every record found.
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGREADER_H
+#define XLOGREADER_H
+
+#include "access/xlog_internal.h"
+
+/*
+ * Used to store a reassembled record.
+ */
+typedef struct XLogRecordBuffer
+{
+ /* the record itself */
+ XLogRecord record;
+
+ /* the LSN at which the record was found */
+ XLogRecPtr origptr;
+
+ /* the data for the xlog record */
+ char* record_data;
+ uint32 record_data_size;
+
+ BkpBlock bkp_block[XLR_MAX_BKP_BLOCKS];
+ char* bkp_block_data[XLR_MAX_BKP_BLOCKS];
+} XLogRecordBuffer;
+
+
+struct XLogReaderState;
+
+/*
+ * The callbacks are explained in more detail inside the XLogReaderState
+ * struct.
+ */
+typedef bool (*XLogReaderStateInterestingCB)(struct XLogReaderState* state,
+ XLogRecord* r);
+typedef void (*XLogReaderStateWriteoutCB)(struct XLogReaderState* state,
+ char* data, Size length);
+typedef void (*XLogReaderStateFinishedRecordCB)(struct XLogReaderState* state,
+ XLogRecordBuffer* buf);
+typedef void (*XLogReaderStateReadPageCB)(struct XLogReaderState* state,
+ char* cur_page, XLogRecPtr at);
+
+typedef struct XLogReaderState
+{
+ /* ----------------------------------------
+ * Public parameters
+ * ----------------------------------------
+ */
+
+ /* callbacks */
+
+ /*
+ * Called to decide whether an xlog record is interesting and should be
+ * assembled, analyzed (finished_record) and written out or skipped.
+ *
+ * Gets passed the current state as the first parameter and the record
+ * *header* to decide over as the second.
+ *
+ * Return false to skip the record - and output a NOOP record instead - and
+ * true to reassemble it fully.
+ *
+ * If set to NULL every record is considered to be interesting.
+ */
+ XLogReaderStateInterestingCB is_record_interesting;
+
+ /*
+ * Writeout xlog data.
+ *
+ * The 'state' parameter is passed as the first parameter, a pointer to
+ * the 'data' and its 'length' as the second and third parameters. If
+ * 'data' is NULL, zeroes need to be written out.
+ */
+ XLogReaderStateWriteoutCB writeout_data;
+
+ /*
+ * If set to anything but NULL this callback gets called after a record,
+ * including the backup blocks, has been fully reassembled.
+ *
+ * The first parameter is the current 'state'. 'buf', an XLogRecordBuffer,
+ * gets passed as the second parameter and contains the record header, its
+ * data, original position/lsn and backup block.
+ */
+ XLogReaderStateFinishedRecordCB finished_record;
+
+ /*
+ * Data input function.
+ *
+ * This callback *has* to be implemented.
+ *
+ * Has to read the XLOG_BLCKSZ bytes at the location 'at' into the
+ * memory pointed to by cur_page, although everything past endptr does
+ * not have to be valid.
+ */
+ XLogReaderStateReadPageCB read_page;
+
+ /*
+ * This can be used by the caller to pass state to the callbacks
+ * without resorting to global variables. It will neither be read nor
+ * set by anything but your code.
+ */
+ void* private_data;
+
+
+ /* from where to where are we reading */
+
+ /*
+ * so we know where interesting data starts after scrolling back to
+ * the beginning of a page
+ */
+ XLogRecPtr startptr;
+
+ /* continue up to here in this run */
+ XLogRecPtr endptr;
+
+ /*
+ * Size of the output buffer. If set to zero (the default), there is
+ * no limit on the output buffer size.
+ */
+ Size output_buffer_size;
+
+ /*
+ * Stop reading and return after every completed record.
+ */
+ bool stop_at_record_boundary;
+
+ /* ----------------------------------------
+ * output parameters
+ * ----------------------------------------
+ */
+
+ /* we need new input data - a later endptr - to continue reading */
+ bool needs_input;
+
+ /* we need new output space to continue reading */
+ bool needs_output;
+
+ /* track our progress */
+ XLogRecPtr curptr;
+
+ /*
+ * are we in the middle of something? This is useful for the outside to
+ * know whether to start reading anew
+ */
+ bool incomplete;
+
+ /* ----------------------------------------
+ * private/internal state
+ * ----------------------------------------
+ */
+
+ char cur_page[XLOG_BLCKSZ];
+ XLogPageHeader page_header;
+ uint32 page_header_size;
+ XLogRecordBuffer buf;
+ pg_crc32 next_crc;
+
+ /* ----------------------------------------
+ * state machine variables
+ * ----------------------------------------
+ */
+
+ bool initialized;
+
+ /* are we currently reading a record? */
+ bool in_record;
+
+ /* are we currently reading a record header? */
+ bool in_record_header;
+
+ /* do we want to reassemble the record or just read/write it? */
+ bool do_reassemble_record;
+
+ /* how many bkp blocks remain to be read? */
+ int in_bkp_blocks;
+
+ /*
+ * the header of a bkp block can be split across pages, so we need to
+ * support reading that incrementally
+ */
+ bool in_bkp_block_header;
+
+ /*
+ * We are not interested in the next `remaining_size` bytes. Don't
+ * reassemble their contents; write out zeroes instead.
+ */
+ bool in_skip;
+
+ /*
+ * Should we check the crc of the currently read record? In some situations
+ * - e.g. if we just skip till the start of a record - this doesn't make
+ * sense.
+ *
+ * This needs to be separate from in_skip because we want to be able
+ * to not write out records but still verify them, e.g. records that
+ * are "not interesting".
+ */
+ bool check_crc;
+
+ /* how much more to read in the current state */
+ uint32 remaining_size;
+
+ /* size of already written data */
+ Size already_written_size;
+
+} XLogReaderState;
+
+/*
+ * Get a new XLogReader
+ *
+ * At least the read_page callback, startptr and endptr have to be set before
+ * the reader can be used.
+ */
+extern XLogReaderState* XLogReaderAllocate(void);
+
+/*
+ * Free an XLogReader
+ */
+extern void XLogReaderFree(XLogReaderState*);
+
+/*
+ * Reset internal state so it can be used without continuing from the last
+ * state.
+ *
+ * The callbacks and private_data won't be reset
+ */
+extern void XLogReaderReset(XLogReaderState* state);
+
+/*
+ * Read the xlog and call the appropriate callbacks as far as possible within
+ * the constraints of input data (startptr, endptr) and output space.
+ */
+extern void XLogReaderRead(XLogReaderState* state);
+
+/*
+ * Read the next xlog record if enough input/output is available.
+ *
+ * This is a bit less efficient than XLogReaderRead.
+ *
+ * Returns NULL if the next record couldn't be read for some reason. Check
+ * state->incomplete, ->needs_input, ->needs_output.
+ *
+ * Be careful to check that there is anything further to read when
+ * using ->endptr, otherwise it's easy to get into an endless loop.
+ */
+extern XLogRecordBuffer* XLogReaderReadOne(XLogReaderState* state);
+
+#endif /* XLOGREADER_H */
---
src/bin/Makefile | 2 +-
src/bin/xlogdump/Makefile | 25 +++
src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 494 insertions(+), 1 deletion(-)
create mode 100644 src/bin/xlogdump/Makefile
create mode 100644 src/bin/xlogdump/xlogdump.c
diff --git a/src/bin/Makefile b/src/bin/Makefile
index b4dfdba..9992f7a 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -14,7 +14,7 @@ top_builddir = ../..
include $(top_builddir)/src/Makefile.global
SUBDIRS = initdb pg_ctl pg_dump \
- psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup
+ psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup xlogdump
ifeq ($(PORTNAME), win32)
SUBDIRS += pgevent
diff --git a/src/bin/xlogdump/Makefile b/src/bin/xlogdump/Makefile
new file mode 100644
index 0000000..d54640a
--- /dev/null
+++ b/src/bin/xlogdump/Makefile
@@ -0,0 +1,25 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/xlogdump
+#
+# Copyright (c) 1998-2012, PostgreSQL Global Development Group
+#
# src/bin/xlogdump/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "xlogdump"
+PGAPPICON=win32
+
+subdir = src/bin/xlogdump
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS= xlogdump.o \
+ $(WIN32RES)
+
+all: xlogdump
+
+
+xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name objfiles.txt|xargs cat|tr -s " " "\012"|grep -v /main.o|sed 's/^/..\/..\/..\//')
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
diff --git a/src/bin/xlogdump/xlogdump.c b/src/bin/xlogdump/xlogdump.c
new file mode 100644
index 0000000..0f984e4
--- /dev/null
+++ b/src/bin/xlogdump/xlogdump.c
@@ -0,0 +1,468 @@
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xlogreader.h"
+#include "access/rmgr.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+
+#include "getopt_long.h"
+
+/*
+ * needs to be declared because otherwise its defined in main.c which we cannot
+ * link from here.
+ */
+const char *progname = "xlogdump";
+
+typedef struct XLogDumpPrivateData {
+ TimeLineID timeline;
+ char* outpath;
+ char* inpath;
+} XLogDumpPrivateData;
+
+static void
+XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
+ XLogRecPtr startptr, char *buf, Size count);
+
+static void
+XLogDumpXLogWrite(const char *directory, TimeLineID timeline_id,
+ XLogRecPtr startptr, const char *buf, Size count);
+
+#define XLogFilePathWrite(path, base, tli, logSegNo) \
+ snprintf(path, MAXPGPATH, "%s/%08X%08X%08X", base, tli, \
+ (uint32) ((logSegNo) / XLogSegmentsPerXLogId), \
+ (uint32) ((logSegNo) % XLogSegmentsPerXLogId))
+
+static void
+XLogDumpXLogWrite(const char *directory, TimeLineID timeline_id,
+ XLogRecPtr startptr, const char *buf, Size count)
+{
+ const char *p;
+ XLogRecPtr recptr;
+ Size nbytes;
+
+ static int sendFile = -1;
+ static XLogSegNo sendSegNo = 0;
+ static uint32 sendOff = 0;
+
+ p = buf;
+ recptr = startptr;
+ nbytes = count;
+
+ while (nbytes > 0)
+ {
+ uint32 startoff;
+ int segbytes;
+ int writebytes;
+
+ startoff = recptr % XLogSegSize;
+
+ if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
+ {
+ char path[MAXPGPATH];
+
+ /* Switch to another logfile segment */
+ if (sendFile >= 0)
+ close(sendFile);
+
+ XLByteToSeg(recptr, sendSegNo);
+ XLogFilePathWrite(path, directory, timeline_id, sendSegNo);
+
+ sendFile = open(path, O_WRONLY|O_CREAT, S_IRUSR | S_IWUSR);
+ if (sendFile < 0)
+ {
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m",
+ path)));
+ }
+ sendOff = 0;
+ }
+
+ /* Need to seek in the file? */
+ if (sendOff != startoff)
+ {
+ if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0){
+ char fname[MAXPGPATH];
+ XLogFileName(fname, timeline_id, sendSegNo);
+
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not seek in log segment %s to offset %u: %m",
+ fname,
+ startoff)));
+ }
+ sendOff = startoff;
+ }
+
+ /* How many bytes are within this segment? */
+ if (nbytes > (XLogSegSize - startoff))
+ segbytes = XLogSegSize - startoff;
+ else
+ segbytes = nbytes;
+
+ writebytes = write(sendFile, p, segbytes);
+ if (writebytes <= 0)
+ {
+ char fname[MAXPGPATH];
+ XLogFileName(fname, timeline_id, sendSegNo);
+
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to log segment %s, offset %u, length %lu: %m",
+ fname,
+ sendOff, (unsigned long) segbytes)));
+ }
+
+ /* Update state for read */
+ XLByteAdvance(recptr, writebytes);
+
+ sendOff += writebytes;
+ nbytes -= writebytes;
+ p += writebytes;
+ }
+}
+
+/* this should probably be put in a general implementation */
+static void
+XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
+ XLogRecPtr startptr, char *buf, Size count)
+{
+ char *p;
+ XLogRecPtr recptr;
+ Size nbytes;
+
+ static int sendFile = -1;
+ static XLogSegNo sendSegNo = 0;
+ static uint32 sendOff = 0;
+
+ p = buf;
+ recptr = startptr;
+ nbytes = count;
+
+ while (nbytes > 0)
+ {
+ uint32 startoff;
+ int segbytes;
+ int readbytes;
+
+ startoff = recptr % XLogSegSize;
+
+ if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
+ {
+ char fname[MAXFNAMELEN];
+ char fpath[MAXPGPATH];
+
+ /* Switch to another logfile segment */
+ if (sendFile >= 0)
+ close(sendFile);
+
+ XLByteToSeg(recptr, sendSegNo);
+
+ XLogFileName(fname, timeline_id, sendSegNo);
+
+ snprintf(fpath, MAXPGPATH, "%s/%s",
+ directory == NULL ? XLOGDIR : directory, fname);
+
+ sendFile = open(fpath, O_RDONLY, 0);
+ if (sendFile < 0)
+ {
+ /*
+ * If the file is not found, assume it's because the requested
+ * WAL segment is too old and has already been removed or
+ * recycled.
+ */
+ if (errno == ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("requested WAL segment %s has already been removed",
+ fname)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m",
+ fpath)));
+ }
+ sendOff = 0;
+ }
+
+ /* Need to seek in the file? */
+ if (sendOff != startoff)
+ {
+ if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0){
+ char fname[MAXPGPATH];
+ XLogFileName(fname, timeline_id, sendSegNo);
+
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not seek in log segment %s to offset %u: %m",
+ fname,
+ startoff)));
+ }
+ sendOff = startoff;
+ }
+
+ /* How many bytes are within this segment? */
+ if (nbytes > (XLogSegSize - startoff))
+ segbytes = XLogSegSize - startoff;
+ else
+ segbytes = nbytes;
+
+ readbytes = read(sendFile, p, segbytes);
+ if (readbytes <= 0)
+ {
+ char fname[MAXPGPATH];
+ XLogFileName(fname, timeline_id, sendSegNo);
+
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from log segment %s, offset %u, length %lu: %m",
+ fname,
+ sendOff, (unsigned long) segbytes)));
+ }
+
+ /* Update state for read */
+ XLByteAdvance(recptr, readbytes);
+
+ sendOff += readbytes;
+ nbytes -= readbytes;
+ p += readbytes;
+ }
+}
+
+static void
+XLogDumpReadPage(XLogReaderState* state, char* cur_page, XLogRecPtr startptr)
+{
+ XLogPageHeader page_header;
+ XLogDumpPrivateData *private = state->private_data;
+
+ Assert((startptr % XLOG_BLCKSZ) == 0);
+
+ XLogDumpXLogRead(private->inpath, private->timeline, startptr,
+ cur_page, XLOG_BLCKSZ);
+
+ page_header = (XLogPageHeader)cur_page;
+
+ if (page_header->xlp_magic != XLOG_PAGE_MAGIC)
+ {
+ elog(FATAL, "page header magic %x, should be %x at %X/%X", page_header->xlp_magic,
+ XLOG_PAGE_MAGIC, (uint32)(startptr >> 32), (uint32)startptr);
+ }
+}
+
+static void
+XLogDumpWrite(XLogReaderState* state, char* data, Size len)
+{
+ static char zero[XLOG_BLCKSZ];
+ XLogDumpPrivateData *private = state->private_data;
+
+ if (data == NULL)
+ data = zero;
+
+ if (private->outpath == NULL)
+ return;
+
+ XLogDumpXLogWrite(private->outpath, private->timeline, state->curptr,
+ data, len);
+}
+
+static void
+XLogDumpFinishedRecord(XLogReaderState* state, XLogRecordBuffer* buf)
+{
+ XLogRecord *record = &buf->record;
+ const RmgrData *rmgr = &RmgrTable[record->xl_rmid];
+
+ StringInfo str = makeStringInfo();
+
+ rmgr->rm_desc(str, state->buf.record.xl_info, buf->record_data);
+
+ fprintf(stdout, "xlog record: rmgr: %-11s, record_len: %6u, tot_len: %6u, tx: %10u, lsn: %X/%-8X, prev %X/%-8X, bkp: %u%u%u%u, desc: %s\n",
+ rmgr->rm_name,
+ record->xl_len, record->xl_tot_len,
+ record->xl_xid,
+ (uint32)(buf->origptr >> 32), (uint32)buf->origptr,
+ (uint32)(record->xl_prev >> 32), (uint32)record->xl_prev,
+ !!(XLR_BKP_BLOCK(0) & buf->record.xl_info),
+ !!(XLR_BKP_BLOCK(1) & buf->record.xl_info),
+ !!(XLR_BKP_BLOCK(2) & buf->record.xl_info),
+ !!(XLR_BKP_BLOCK(3) & buf->record.xl_info),
+ str->data);
+
+}
+
+
+static void init(void)
+{
+ MemoryContextInit();
+ IsPostmasterEnvironment = false;
+ log_min_messages = DEBUG1;
+ Log_error_verbosity = PGERROR_TERSE;
+ pg_timezone_initialize();
+}
+
+static void
+usage(void)
+{
+ printf(_("%s reads/writes postgres transaction logs for debugging.\n\n"),
+ progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -v, --version output version information, then exit\n"));
+ printf(_(" -h, --help show this help, then exit\n"));
+ printf(_(" -s, --start recptr from which to start reading\n"));
+ printf(_(" -e, --end recptr up to which to read\n"));
+ printf(_(" -t, --timeline timeline to read from\n"));
+ printf(_(" -i, --inpath directory to read WAL from (default: cwd/pg_xlog)\n"));
+ printf(_(" -o, --outpath where to write the [start, end] range of WAL\n"));
+ printf(_(" -f, --file wal file to parse\n"));
+}
+
+int main(int argc, char **argv)
+{
+ uint32 xlogid;
+ uint32 xrecoff;
+ XLogReaderState *xlogreader_state;
+ XLogDumpPrivateData private;
+ XLogRecPtr from = InvalidXLogRecPtr;
+ XLogRecPtr to = InvalidXLogRecPtr;
+ bool bad_argument = false;
+
+ static struct option long_options[] = {
+ {"help", no_argument, NULL, 'h'},
+ {"version", no_argument, NULL, 'v'},
+ {"start", required_argument, NULL, 's'},
+ {"end", required_argument, NULL, 'e'},
+ {"timeline", required_argument, NULL, 't'},
+ {"inpath", required_argument, NULL, 'i'},
+ {"outpath", required_argument, NULL, 'o'},
+ {"file", required_argument, NULL, 'f'},
+ {NULL, 0, NULL, 0}
+ };
+ int c;
+ int option_index;
+
+ memset(&private, 0, sizeof(XLogDumpPrivateData));
+
+ while ((c = getopt_long(argc, argv, "hvs:e:t:i:o:f:",
+ long_options, &option_index)) != -1)
+ {
+ switch (c)
+ {
+ case 'h':
+ usage();
+ exit(0);
+ break;
+ case 'v':
+ printf("Version: 0.1\n");
+ exit(0);
+ break;
+ case 's':
+ if (sscanf(optarg, "%X/%X", &xlogid, &xrecoff) != 2)
+ {
+ bad_argument = true;
+ fprintf(stderr, "couldn't parse -s\n");
+ }
+ else
+ from = (uint64)xlogid << 32 | xrecoff;
+ break;
+ case 'e':
+ if (sscanf(optarg, "%X/%X", &xlogid, &xrecoff) != 2)
+ {
+ bad_argument = true;
+ fprintf(stderr, "couldn't parse -e\n");
+ }
+ else
+ to = (uint64)xlogid << 32 | xrecoff;
+ break;
+ case 't':
+ if (sscanf(optarg, "%d", &private.timeline) != 1)
+ {
+ bad_argument = true;
+ fprintf(stderr, "couldn't parse timeline -t\n");
+ }
+ break;
+ case 'i':
+ private.inpath = strdup(optarg);
+ break;
+ case 'o':
+ private.outpath = strdup(optarg);
+ break;
+ case 'f':
+ fprintf(stderr, "--file is not yet implemented\n");
+ bad_argument = true;
+ break;
+ default:
+ bad_argument = true;
+ break;
+ }
+ }
+
+ if (optind < argc)
+ {
+ bad_argument = true;
+ fprintf(stderr,
+ _("%s: too many command-line arguments (first is \"%s\")\n"),
+ progname, argv[optind]);
+ }
+
+ if (XLByteEQ(from, InvalidXLogRecPtr))
+ {
+ bad_argument = true;
+ fprintf(stderr, _("%s: -s invalid or missing\n"), progname);
+ }
+ else if (XLByteEQ(to, InvalidXLogRecPtr))
+ {
+ bad_argument = true;
+ fprintf(stderr, _("%s: -e invalid or missing\n"), progname);
+ }
+ else if (private.timeline == 0)
+ {
+ bad_argument = true;
+ fprintf(stderr, _("%s: -t invalid or missing\n"), progname);
+ }
+
+ if (bad_argument)
+ {
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+ progname);
+ exit(-1);
+ }
+
+ init();
+
+ xlogreader_state = XLogReaderAllocate();
+
+ /*
+ * not set because we want all records, perhaps we want filtering later?
+ * xlogreader_state->is_record_interesting =
+ */
+ xlogreader_state->finished_record = XLogDumpFinishedRecord;
+
+ /* write out data; XLogDumpWrite is a no-op unless an output path was given */
+ xlogreader_state->writeout_data = XLogDumpWrite;
+
+ xlogreader_state->read_page = XLogDumpReadPage;
+
+ xlogreader_state->private_data = &private;
+
+ xlogreader_state->startptr = from;
+ xlogreader_state->endptr = to;
+
+ XLogReaderRead(xlogreader_state);
+ XLogReaderFree(xlogreader_state);
+ return 0;
+}
This cache is somewhat problematic because, formally, indexes used by syscaches
need to be unique, and this one is not. That is "just" because of the
0/InvalidOid values stored in pg_class.relfilenode for nailed/shared catalog
relations. The syscache will never be queried for InvalidOid relfilenodes,
however, so it seems safe even though it bends the rules somewhat.
It might be nicer to add infrastructure to do this properly, e.g. a partial
index; it's not clear, though, what the best way to do that is.
Needs a CATVERSION bump.
---
src/backend/utils/cache/syscache.c | 11 +++++++++++
src/include/catalog/indexing.h | 2 ++
src/include/catalog/pg_proc.h | 1 +
src/include/utils/syscache.h | 1 +
4 files changed, 15 insertions(+)
Attachment: 0004-Add-a-new-RELFILENODE-syscache-to-fetch-a-pg_class-e.patch
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index ca22efd..9d2f6b7 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -613,6 +613,17 @@ static const struct cachedesc cacheinfo[] = {
},
1024
},
+ {RelationRelationId, /* RELFILENODE */
+ ClassTblspcRelfilenodeIndexId,
+ 2,
+ {
+ Anum_pg_class_reltablespace,
+ Anum_pg_class_relfilenode,
+ 0,
+ 0
+ },
+ 1024
+ },
{RewriteRelationId, /* RULERELNAME */
RewriteRelRulenameIndexId,
2,
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index 238fe58..c3db3ff 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -106,6 +106,8 @@ DECLARE_UNIQUE_INDEX(pg_class_oid_index, 2662, on pg_class using btree(oid oid_o
#define ClassOidIndexId 2662
DECLARE_UNIQUE_INDEX(pg_class_relname_nsp_index, 2663, on pg_class using btree(relname name_ops, relnamespace oid_ops));
#define ClassNameNspIndexId 2663
+DECLARE_INDEX(pg_class_tblspc_relfilenode_index, 3171, on pg_class using btree(reltablespace oid_ops, relfilenode oid_ops));
+#define ClassTblspcRelfilenodeIndexId 3171
DECLARE_UNIQUE_INDEX(pg_collation_name_enc_nsp_index, 3164, on pg_collation using btree(collname name_ops, collencoding int4_ops, collnamespace oid_ops));
#define CollationNameEncNspIndexId 3164
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f935eb1..16033c7 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4673,6 +4673,7 @@ DATA(insert OID = 3473 ( spg_range_quad_leaf_consistent PGNSP PGUID 12 1 0 0 0
DESCR("SP-GiST support for quad tree over range");
+
/*
* Symbolic values for provolatile column: these indicate whether the result
* of a function is dependent *only* on the values of its explicit arguments,
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index d1a9855..9a39077 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -77,6 +77,7 @@ enum SysCacheIdentifier
RANGETYPE,
RELNAMENSP,
RELOID,
+ RELFILENODE,
RULERELNAME,
STATRELATTINH,
TABLESPACEOID,
---
src/backend/utils/cache/relmapper.c | 53 +++++++++++++++++++++++++++++++++++++
src/include/catalog/indexing.h | 4 +--
src/include/utils/relmapper.h | 2 ++
3 files changed, 57 insertions(+), 2 deletions(-)
Attachment: 0005-Add-a-new-relmapper.c-function-RelationMapFilenodeTo.patch
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 6f21495..771f34d 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -180,6 +180,59 @@ RelationMapOidToFilenode(Oid relationId, bool shared)
return InvalidOid;
}
+/* RelationMapFilenodeToOid
+ *
+ * Do the reverse of the normal direction of mapping done in
+ * RelationMapOidToFilenode.
+ *
+ * This is not supposed to be used during normal running but rather for
+ * information purposes when looking at the filesystem or the xlog.
+ *
+ * Returns InvalidOid if the OID is not known, which can easily happen if the
+ * filenode is not that of a nailed or shared relation, or if it simply
+ * doesn't exist anywhere.
+ */
+Oid
+RelationMapFilenodeToOid(Oid filenode, bool shared)
+{
+ const RelMapFile *map;
+ int32 i;
+
+ /* If there are active updates, believe those over the main maps */
+ if (shared)
+ {
+ map = &active_shared_updates;
+ for (i = 0; i < map->num_mappings; i++)
+ {
+ if (filenode == map->mappings[i].mapfilenode)
+ return map->mappings[i].mapoid;
+ }
+ map = &shared_map;
+ for (i = 0; i < map->num_mappings; i++)
+ {
+ if (filenode == map->mappings[i].mapfilenode)
+ return map->mappings[i].mapoid;
+ }
+ }
+ else
+ {
+ map = &active_local_updates;
+ for (i = 0; i < map->num_mappings; i++)
+ {
+ if (filenode == map->mappings[i].mapfilenode)
+ return map->mappings[i].mapoid;
+ }
+ map = &local_map;
+ for (i = 0; i < map->num_mappings; i++)
+ {
+ if (filenode == map->mappings[i].mapfilenode)
+ return map->mappings[i].mapoid;
+ }
+ }
+
+ return InvalidOid;
+}
+
/*
* RelationMapUpdateMap
*
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index c3db3ff..81811f1 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -106,8 +106,8 @@ DECLARE_UNIQUE_INDEX(pg_class_oid_index, 2662, on pg_class using btree(oid oid_o
#define ClassOidIndexId 2662
DECLARE_UNIQUE_INDEX(pg_class_relname_nsp_index, 2663, on pg_class using btree(relname name_ops, relnamespace oid_ops));
#define ClassNameNspIndexId 2663
-DECLARE_INDEX(pg_class_tblspc_relfilenode_index, 3171, on pg_class using btree(reltablespace oid_ops, relfilenode oid_ops));
-#define ClassTblspcRelfilenodeIndexId 3171
+DECLARE_INDEX(pg_class_tblspc_relfilenode_index, 3455, on pg_class using btree(reltablespace oid_ops, relfilenode oid_ops));
+#define ClassTblspcRelfilenodeIndexId 3455
DECLARE_UNIQUE_INDEX(pg_collation_name_enc_nsp_index, 3164, on pg_collation using btree(collname name_ops, collencoding int4_ops, collnamespace oid_ops));
#define CollationNameEncNspIndexId 3164
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 111a05c..4e56508 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -36,6 +36,8 @@ typedef struct xl_relmap_update
extern Oid RelationMapOidToFilenode(Oid relationId, bool shared);
+extern Oid RelationMapFilenodeToOid(Oid relationId, bool shared);
+
extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
bool immediate);
This requires the RELFILENODE syscache and the RelationMapFilenodeToOid
function added in previous commits.
---
doc/src/sgml/func.sgml | 23 +++++++++++-
src/backend/utils/adt/dbsize.c | 79 ++++++++++++++++++++++++++++++++++++++++++
src/include/catalog/pg_proc.h | 2 ++
src/include/utils/builtins.h | 1 +
4 files changed, 104 insertions(+), 1 deletion(-)
Attachment: 0006-Add-a-new-function-pg_relation_by_filenode-to-lookup.patch
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f8f63d8..708da35 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15170,7 +15170,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
<para>
The functions shown in <xref linkend="functions-admin-dblocation"> assist
- in identifying the specific disk files associated with database objects.
+ in identifying the specific disk files associated with database objects or doing the reverse.
</para>
<indexterm>
@@ -15179,6 +15179,9 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
<indexterm>
<primary>pg_relation_filepath</primary>
</indexterm>
+ <indexterm>
+ <primary>pg_relation_by_filenode</primary>
+ </indexterm>
<table id="functions-admin-dblocation">
<title>Database Object Location Functions</title>
@@ -15207,6 +15210,15 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
File path name of the specified relation
</entry>
</row>
+ <row>
+ <entry>
+ <literal><function>pg_relation_by_filenode(<parameter>tablespace</parameter> <type>oid</type>, <parameter>filenode</parameter> <type>oid</type>)</function></literal>
+ </entry>
+ <entry><type>regclass</type></entry>
+ <entry>
+ Find the associated relation of a filenode
+ </entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -15230,6 +15242,15 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
the relation.
</para>
+ <para>
+ <function>pg_relation_by_filenode</> is the reverse of
+ <function>pg_relation_filenode</>. Given a <quote>tablespace</> OID and
+ a <quote>filenode</> it returns the associated relation. The default
+ tablespace for user tables can be specified as 0. See the
+ documentation of <function>pg_relation_filenode</> for an explanation of why
+ this cannot always easily be answered by querying <structname>pg_class</>.
+ </para>
+
</sect2>
<sect2 id="functions-admin-genfile">
diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index cd23334..ec26291 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -741,6 +741,85 @@ pg_relation_filenode(PG_FUNCTION_ARGS)
}
/*
+ * Get the relation via (reltablespace, relfilenode)
+ *
+ * This is expected to be used when somebody wants to match an individual file
+ * on the filesystem back to its table. That's not trivially possible via
+ * pg_class because that doesn't contain the relfilenodes of shared and nailed
+ * tables.
+ *
+ * We don't fail but return NULL if we cannot find a mapping.
+ *
+ * Instead of knowing DEFAULTTABLESPACE_OID you can pass 0.
+ */
+Datum
+pg_relation_by_filenode(PG_FUNCTION_ARGS)
+{
+ Oid reltablespace = PG_GETARG_OID(0);
+ Oid relfilenode = PG_GETARG_OID(1);
+ Oid lookup_tablespace = reltablespace;
+ Oid result = InvalidOid;
+ HeapTuple tuple;
+
+ if (reltablespace == 0)
+ reltablespace = DEFAULTTABLESPACE_OID;
+
+ /* pg_class stores 0 instead of DEFAULTTABLESPACE_OID */
+ if (reltablespace == DEFAULTTABLESPACE_OID)
+ lookup_tablespace = 0;
+
+ tuple = SearchSysCache2(RELFILENODE,
+ lookup_tablespace,
+ relfilenode);
+
+ /* found it in the system catalog, so it cannot be a shared/nailed table */
+ if (HeapTupleIsValid(tuple))
+ {
+ result = HeapTupleHeaderGetOid(tuple->t_data);
+ ReleaseSysCache(tuple);
+ }
+ else
+ {
+ if (reltablespace == GLOBALTABLESPACE_OID)
+ {
+ result = RelationMapFilenodeToOid(relfilenode, true);
+ }
+ else
+ {
+ Form_pg_class relform;
+
+ result = RelationMapFilenodeToOid(relfilenode, false);
+
+ if (result != InvalidOid)
+ {
+ /* check that we found the correct relation */
+ tuple = SearchSysCache1(RELOID,
+ result);
+
+ if (!HeapTupleIsValid(tuple))
+ {
+ elog(ERROR, "Couldn't refind previously looked up relation with oid %u",
+ result);
+ }
+
+ relform = (Form_pg_class) GETSTRUCT(tuple);
+
+ if (relform->reltablespace != reltablespace &&
+ relform->reltablespace != lookup_tablespace)
+ result = InvalidOid;
+
+ ReleaseSysCache(tuple);
+ }
+ }
+ }
+
+ if (!OidIsValid(result))
+ PG_RETURN_NULL();
+ else
+ PG_RETURN_OID(result);
+}
+
+/*
* Get the pathname (relative to $PGDATA) of a relation
*
* See comments for pg_relation_filenode.
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 16033c7..d28db63 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3436,6 +3436,8 @@ DATA(insert OID = 2998 ( pg_indexes_size PGNSP PGUID 12 1 0 0 0 f f f f t f v 1
DESCR("disk space usage for all indexes attached to the specified table");
DATA(insert OID = 2999 ( pg_relation_filenode PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 26 "2205" _null_ _null_ _null_ _null_ pg_relation_filenode _null_ _null_ _null_ ));
DESCR("filenode identifier of relation");
+DATA(insert OID = 3454 ( pg_relation_by_filenode PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 2205 "26 26" _null_ _null_ _null_ _null_ pg_relation_by_filenode _null_ _null_ _null_ ));
+DESCR("relation for a filenode and tablespace");
DATA(insert OID = 3034 ( pg_relation_filepath PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 25 "2205" _null_ _null_ _null_ _null_ pg_relation_filepath _null_ _null_ _null_ ));
DESCR("file path of relation");
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 5bc3a75..e30b8c4 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -458,6 +458,7 @@ extern Datum pg_table_size(PG_FUNCTION_ARGS);
extern Datum pg_indexes_size(PG_FUNCTION_ARGS);
extern Datum pg_relation_filenode(PG_FUNCTION_ARGS);
extern Datum pg_relation_filepath(PG_FUNCTION_ARGS);
+extern Datum pg_relation_by_filenode(PG_FUNCTION_ARGS);
/* genfile.c */
extern bytea *read_binary_file(const char *filename,
This is useful to be able to represent an invalid CommandId; there was no
such value before.
This decreases the possible number of commands in a transaction by one, which
seems unproblematic. It's also not a problem for pg_upgrade, because cmin/cmax
are never looked at outside the context of their own transaction (barring
timetravel access, but that's new anyway).
---
src/backend/access/transam/xact.c | 4 ++--
src/include/c.h | 1 +
2 files changed, 3 insertions(+), 2 deletions(-)
Attachment: 0007-Introduce-InvalidCommandId-and-declare-that-to-be-th.patch
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 10386da..f28b4c8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -745,12 +745,12 @@ CommandCounterIncrement(void)
if (currentCommandIdUsed)
{
currentCommandId += 1;
- if (currentCommandId == FirstCommandId) /* check for overflow */
+ if (currentCommandId == InvalidCommandId)
{
currentCommandId -= 1;
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("cannot have more than 2^32-1 commands in a transaction")));
+ errmsg("cannot have more than 2^32-2 commands in a transaction")));
}
currentCommandIdUsed = false;
diff --git a/src/include/c.h b/src/include/c.h
index a6c0e6e..e52af3b 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -367,6 +367,7 @@ typedef uint32 MultiXactOffset;
typedef uint32 CommandId;
#define FirstCommandId ((CommandId) 0)
+#define InvalidCommandId (~(CommandId)0)
/*
* Array indexing support
To avoid complicating the logic we store both the toplevel xids and the
subxids in ->xip: first the ->xcnt toplevel ones, then the ->subxcnt subxids.
Also skip logging any subxids if the snapshot is suboverflowed; they aren't
useful in that case anyway.
This makes some operations cheaper, and it allows faster startup for the
future logical decoding feature because that doesn't care about
subtransactions/suboverflowedness.
---
src/backend/access/transam/xlog.c | 2 ++
src/backend/storage/ipc/procarray.c | 65 ++++++++++++++++++++++++-------------
src/backend/storage/ipc/standby.c | 8 +++--
src/include/storage/standby.h | 2 ++
4 files changed, 52 insertions(+), 25 deletions(-)
Attachment: 0008-Store-the-number-of-subtransactions-in-xl_running_xa.patch
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1faf666..1749f46 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5629,6 +5629,7 @@ StartupXLOG(void)
* subxids are listed with their parent prepared transactions.
*/
running.xcnt = nxids;
+ running.subxcnt = 0;
running.subxid_overflow = false;
running.nextXid = checkPoint.nextXid;
running.oldestRunningXid = oldestActiveXID;
@@ -7813,6 +7814,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
* with their parent prepared transactions.
*/
running.xcnt = nxids;
+ running.subxcnt = 0;
running.subxid_overflow = false;
running.nextXid = checkPoint.nextXid;
running.oldestRunningXid = oldestActiveXID;
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 8c0d7b0..a98358d 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -501,6 +501,13 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
* Remove stale transactions, if any.
*/
ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid);
+
+ /*
+ * Remove stale locks, if any.
+ *
+ * Locks are always assigned to the toplevel xid so we don't need to care
+ * about subxcnt/subxids (and by extension not about ->suboverflowed).
+ */
StandbyReleaseOldLocks(running->xcnt, running->xids);
/*
@@ -581,13 +588,13 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
* Allocate a temporary array to avoid modifying the array passed as
* argument.
*/
- xids = palloc(sizeof(TransactionId) * running->xcnt);
+ xids = palloc(sizeof(TransactionId) * (running->xcnt + running->subxcnt));
/*
* Add to the temp array any xids which have not already completed.
*/
nxids = 0;
- for (i = 0; i < running->xcnt; i++)
+ for (i = 0; i < running->xcnt + running->subxcnt; i++)
{
TransactionId xid = running->xids[i];
@@ -1627,15 +1634,13 @@ GetRunningTransactionData(void)
oldestRunningXid = ShmemVariableCache->nextXid;
/*
- * Spin over procArray collecting all xids and subxids.
+ * Spin over procArray collecting all xids
*/
for (index = 0; index < arrayP->numProcs; index++)
{
int pgprocno = arrayP->pgprocnos[index];
- volatile PGPROC *proc = &allProcs[pgprocno];
volatile PGXACT *pgxact = &allPgXact[pgprocno];
TransactionId xid;
- int nxids;
/* Fetch xid just once - see GetNewTransactionId */
xid = pgxact->xid;
@@ -1652,30 +1657,46 @@ GetRunningTransactionData(void)
if (TransactionIdPrecedes(xid, oldestRunningXid))
oldestRunningXid = xid;
- /*
- * Save subtransaction XIDs. Other backends can't add or remove
- * entries while we're holding XidGenLock.
- */
- nxids = pgxact->nxids;
- if (nxids > 0)
- {
- memcpy(&xids[count], (void *) proc->subxids.xids,
- nxids * sizeof(TransactionId));
- count += nxids;
- subcount += nxids;
+ if (pgxact->overflowed)
+ suboverflowed = true;
+ }
- if (pgxact->overflowed)
- suboverflowed = true;
+ /*
+ * Spin over procArray collecting all subxids, but only if there hasn't
+ * been a suboverflow.
+ */
+ if (!suboverflowed)
+ {
+ for (index = 0; index < arrayP->numProcs; index++)
+ {
+ int pgprocno = arrayP->pgprocnos[index];
+ volatile PGPROC *proc = &allProcs[pgprocno];
+ volatile PGXACT *pgxact = &allPgXact[pgprocno];
+ int nxids;
/*
- * Top-level XID of a transaction is always less than any of its
- * subxids, so we don't need to check if any of the subxids are
- * smaller than oldestRunningXid
+ * Save subtransaction XIDs. Other backends can't add or remove
+ * entries while we're holding XidGenLock.
*/
+ nxids = pgxact->nxids;
+ if (nxids > 0)
+ {
+ memcpy(&xids[count], (void *) proc->subxids.xids,
+ nxids * sizeof(TransactionId));
+ count += nxids;
+ subcount += nxids;
+
+ /*
+ * Top-level XID of a transaction is always less than any of
+ * its subxids, so we don't need to check if any of the subxids
+ * are smaller than oldestRunningXid
+ */
+ }
}
}
- CurrentRunningXacts->xcnt = count;
+ CurrentRunningXacts->xcnt = count - subcount;
+ CurrentRunningXacts->subxcnt = subcount;
CurrentRunningXacts->subxid_overflow = suboverflowed;
CurrentRunningXacts->nextXid = ShmemVariableCache->nextXid;
CurrentRunningXacts->oldestRunningXid = oldestRunningXid;
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 905d331..0cab243 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -775,6 +775,7 @@ standby_redo(XLogRecPtr lsn, XLogRecord *record)
RunningTransactionsData running;
running.xcnt = xlrec->xcnt;
+ running.subxcnt = xlrec->subxcnt;
running.subxid_overflow = xlrec->subxid_overflow;
running.nextXid = xlrec->nextXid;
running.latestCompletedXid = xlrec->latestCompletedXid;
@@ -942,6 +943,7 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
XLogRecPtr recptr;
xlrec.xcnt = CurrRunningXacts->xcnt;
+ xlrec.subxcnt = CurrRunningXacts->subxcnt;
xlrec.subxid_overflow = CurrRunningXacts->subxid_overflow;
xlrec.nextXid = CurrRunningXacts->nextXid;
xlrec.oldestRunningXid = CurrRunningXacts->oldestRunningXid;
@@ -957,7 +959,7 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
{
rdata[0].next = &(rdata[1]);
rdata[1].data = (char *) CurrRunningXacts->xids;
- rdata[1].len = xlrec.xcnt * sizeof(TransactionId);
+ rdata[1].len = (xlrec.xcnt + xlrec.subxcnt) * sizeof(TransactionId);
rdata[1].buffer = InvalidBuffer;
lastrdata = 1;
}
@@ -976,8 +978,8 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
CurrRunningXacts->nextXid);
else
elog(trace_recovery(DEBUG2),
- "snapshot of %u running transaction ids (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
- CurrRunningXacts->xcnt,
+ "snapshot of %u+%u running transaction ids (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
+ CurrRunningXacts->xcnt, CurrRunningXacts->subxcnt,
(uint32) (recptr >> 32), (uint32) recptr,
CurrRunningXacts->oldestRunningXid,
CurrRunningXacts->latestCompletedXid,
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 7024fc4..f917b89 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -68,6 +68,7 @@ typedef struct xl_standby_locks
typedef struct xl_running_xacts
{
int xcnt; /* # of xact ids in xids[] */
+ int subxcnt; /* # of subxact ids in xids[] */
bool subxid_overflow; /* snapshot overflowed, subxids missing */
TransactionId nextXid; /* copy of ShmemVariableCache->nextXid */
TransactionId oldestRunningXid; /* *not* oldestXmin */
@@ -98,6 +99,7 @@ extern void standby_desc(StringInfo buf, uint8 xl_info, char *rec);
typedef struct RunningTransactionsData
{
int xcnt; /* # of xact ids in xids[] */
+ int subxcnt; /* # of subxact ids in xids[] */
bool subxid_overflow; /* snapshot overflowed, subxids missing */
TransactionId nextXid; /* copy of ShmemVariableCache->nextXid */
TransactionId oldestRunningXid; /* *not* oldestXmin */
For the regular satisfies routines this is needed in preparation for logical
decoding; I changed the non-regular ones for consistency as well.
The naming between htup, tuple, and similar is rather confused; I could not
find any consistent naming anywhere.
This is preparatory work for the logical decoding feature, which needs to be
able to get at a valid relfilenode when checking the visibility of a tuple.
---
contrib/pgrowlocks/pgrowlocks.c | 2 +-
src/backend/access/heap/heapam.c | 13 ++++++----
src/backend/access/heap/pruneheap.c | 16 ++++++++++--
src/backend/catalog/index.c | 2 +-
src/backend/commands/analyze.c | 3 ++-
src/backend/commands/cluster.c | 2 +-
src/backend/commands/vacuumlazy.c | 3 ++-
src/backend/storage/lmgr/predicate.c | 2 +-
src/backend/utils/time/tqual.c | 50 +++++++++++++++++++++++++++++-------
src/include/utils/snapshot.h | 4 +--
src/include/utils/tqual.h | 20 +++++++--------
11 files changed, 83 insertions(+), 34 deletions(-)
Attachment: 0009-Adjust-all-Satisfies-routines-to-take-a-HeapTuple-in.patch
diff --git a/contrib/pgrowlocks/pgrowlocks.c b/contrib/pgrowlocks/pgrowlocks.c
index 20beed2..8f9db55 100644
--- a/contrib/pgrowlocks/pgrowlocks.c
+++ b/contrib/pgrowlocks/pgrowlocks.c
@@ -120,7 +120,7 @@ pgrowlocks(PG_FUNCTION_ARGS)
/* must hold a buffer lock to call HeapTupleSatisfiesUpdate */
LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
- if (HeapTupleSatisfiesUpdate(tuple->t_data,
+ if (HeapTupleSatisfiesUpdate(tuple,
GetCurrentCommandId(false),
scan->rs_cbuf) == HeapTupleBeingUpdated)
{
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 64aecf2..d025ff7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -276,6 +276,7 @@ heapgetpage(HeapScanDesc scan, BlockNumber page)
HeapTupleData loctup;
bool valid;
+ loctup.t_tableOid = RelationGetRelid(scan->rs_rd);
loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
loctup.t_len = ItemIdGetLength(lpp);
ItemPointerSet(&(loctup.t_self), page, lineoff);
@@ -1590,7 +1591,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
heapTuple->t_data = (HeapTupleHeader) PageGetItem(dp, lp);
heapTuple->t_len = ItemIdGetLength(lp);
- heapTuple->t_tableOid = relation->rd_id;
+ heapTuple->t_tableOid = RelationGetRelid(relation);
heapTuple->t_self = *tid;
/*
@@ -1638,7 +1639,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
* transactions.
*/
if (all_dead && *all_dead &&
- !HeapTupleIsSurelyDead(heapTuple->t_data, RecentGlobalXmin))
+ !HeapTupleIsSurelyDead(heapTuple, RecentGlobalXmin))
*all_dead = false;
/*
@@ -2418,12 +2419,13 @@ heap_delete(Relation relation, ItemPointer tid,
lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
Assert(ItemIdIsNormal(lp));
+ tp.t_tableOid = RelationGetRelid(relation);
tp.t_data = (HeapTupleHeader) PageGetItem(page, lp);
tp.t_len = ItemIdGetLength(lp);
tp.t_self = *tid;
l1:
- result = HeapTupleSatisfiesUpdate(tp.t_data, cid, buffer);
+ result = HeapTupleSatisfiesUpdate(&tp, cid, buffer);
if (result == HeapTupleInvisible)
{
@@ -2788,6 +2790,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
lp = PageGetItemId(page, ItemPointerGetOffsetNumber(otid));
Assert(ItemIdIsNormal(lp));
+ oldtup.t_tableOid = RelationGetRelid(relation);
oldtup.t_data = (HeapTupleHeader) PageGetItem(page, lp);
oldtup.t_len = ItemIdGetLength(lp);
oldtup.t_self = *otid;
@@ -2800,7 +2803,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
*/
l2:
- result = HeapTupleSatisfiesUpdate(oldtup.t_data, cid, buffer);
+ result = HeapTupleSatisfiesUpdate(&oldtup, cid, buffer);
if (result == HeapTupleInvisible)
{
@@ -3502,7 +3505,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
tuple->t_tableOid = RelationGetRelid(relation);
l3:
- result = HeapTupleSatisfiesUpdate(tuple->t_data, cid, *buffer);
+ result = HeapTupleSatisfiesUpdate(tuple, cid, *buffer);
if (result == HeapTupleInvisible)
{
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 97a2868..edb3a09 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -340,6 +340,9 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
OffsetNumber chainitems[MaxHeapTuplesPerPage];
int nchain = 0,
i;
+ HeapTupleData tup;
+
+ tup.t_tableOid = RelationGetRelid(relation);
rootlp = PageGetItemId(dp, rootoffnum);
@@ -349,6 +352,11 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
if (ItemIdIsNormal(rootlp))
{
htup = (HeapTupleHeader) PageGetItem(dp, rootlp);
+
+ tup.t_data = htup;
+ tup.t_len = ItemIdGetLength(rootlp);
+ ItemPointerSet(&(tup.t_self), BufferGetBlockNumber(buffer), rootoffnum);
+
if (HeapTupleHeaderIsHeapOnly(htup))
{
/*
@@ -369,7 +377,7 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
* either here or while following a chain below. Whichever path
* gets there first will mark the tuple unused.
*/
- if (HeapTupleSatisfiesVacuum(htup, OldestXmin, buffer)
+ if (HeapTupleSatisfiesVacuum(&tup, OldestXmin, buffer)
== HEAPTUPLE_DEAD && !HeapTupleHeaderIsHotUpdated(htup))
{
heap_prune_record_unused(prstate, rootoffnum);
@@ -432,6 +440,10 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
Assert(ItemIdIsNormal(lp));
htup = (HeapTupleHeader) PageGetItem(dp, lp);
+ tup.t_data = htup;
+ tup.t_len = ItemIdGetLength(lp);
+ ItemPointerSet(&(tup.t_self), BufferGetBlockNumber(buffer), offnum);
+
/*
* Check the tuple XMIN against prior XMAX, if any
*/
@@ -449,7 +461,7 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
*/
tupdead = recent_dead = false;
- switch (HeapTupleSatisfiesVacuum(htup, OldestXmin, buffer))
+ switch (HeapTupleSatisfiesVacuum(&tup, OldestXmin, buffer))
{
case HEAPTUPLE_DEAD:
tupdead = true;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index d2d91c1..18d0c5a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2267,7 +2267,7 @@ IndexBuildHeapScan(Relation heapRelation,
*/
LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
- switch (HeapTupleSatisfiesVacuum(heapTuple->t_data, OldestXmin,
+ switch (HeapTupleSatisfiesVacuum(heapTuple, OldestXmin,
scan->rs_cbuf))
{
case HEAPTUPLE_DEAD:
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9612a27..d9b971c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1134,10 +1134,11 @@ acquire_sample_rows(Relation onerel, int elevel,
ItemPointerSet(&targtuple.t_self, targblock, targoffset);
+ targtuple.t_tableOid = RelationGetRelid(onerel);
targtuple.t_data = (HeapTupleHeader) PageGetItem(targpage, itemid);
targtuple.t_len = ItemIdGetLength(itemid);
- switch (HeapTupleSatisfiesVacuum(targtuple.t_data,
+ switch (HeapTupleSatisfiesVacuum(&targtuple,
OldestXmin,
targbuffer))
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index de71a35..cc36f36 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -923,7 +923,7 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
LockBuffer(buf, BUFFER_LOCK_SHARE);
- switch (HeapTupleSatisfiesVacuum(tuple->t_data, OldestXmin, buf))
+ switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
{
case HEAPTUPLE_DEAD:
/* Definitely dead */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index c9253a9..0af174e 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -705,12 +705,13 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
Assert(ItemIdIsNormal(itemid));
+ tuple.t_tableOid = RelationGetRelid(onerel);
tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
tuple.t_len = ItemIdGetLength(itemid);
tupgone = false;
- switch (HeapTupleSatisfiesVacuum(tuple.t_data, OldestXmin, buf))
+ switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
{
case HEAPTUPLE_DEAD:
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index f164cfd..b376652 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -3894,7 +3894,7 @@ CheckForSerializableConflictOut(bool visible, Relation relation,
* tuple is visible to us, while HeapTupleSatisfiesVacuum checks what else
* is going on with it.
*/
- htsvResult = HeapTupleSatisfiesVacuum(tuple->t_data, TransactionXmin, buffer);
+ htsvResult = HeapTupleSatisfiesVacuum(tuple, TransactionXmin, buffer);
switch (htsvResult)
{
case HEAPTUPLE_LIVE:
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index b531db5..f64d52d 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -163,8 +163,12 @@ HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
* Xmax is not committed))) that has not been committed
*/
bool
-HeapTupleSatisfiesSelf(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
+HeapTupleSatisfiesSelf(HeapTuple htup, Snapshot snapshot, Buffer buffer)
{
+ HeapTupleHeader tuple = htup->t_data;
+ Assert(ItemPointerIsValid(&htup->t_self));
+ Assert(htup->t_tableOid != InvalidOid);
+
if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))
{
if (tuple->t_infomask & HEAP_XMIN_INVALID)
@@ -326,8 +330,12 @@ HeapTupleSatisfiesSelf(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
*
*/
bool
-HeapTupleSatisfiesNow(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
+HeapTupleSatisfiesNow(HeapTuple htup, Snapshot snapshot, Buffer buffer)
{
+ HeapTupleHeader tuple = htup->t_data;
+ Assert(ItemPointerIsValid(&htup->t_self));
+ Assert(htup->t_tableOid != InvalidOid);
+
if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))
{
if (tuple->t_infomask & HEAP_XMIN_INVALID)
@@ -471,7 +479,7 @@ HeapTupleSatisfiesNow(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
* Dummy "satisfies" routine: any tuple satisfies SnapshotAny.
*/
bool
-HeapTupleSatisfiesAny(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
+HeapTupleSatisfiesAny(HeapTuple htup, Snapshot snapshot, Buffer buffer)
{
return true;
}
@@ -491,9 +499,13 @@ HeapTupleSatisfiesAny(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
* table.
*/
bool
-HeapTupleSatisfiesToast(HeapTupleHeader tuple, Snapshot snapshot,
+HeapTupleSatisfiesToast(HeapTuple htup, Snapshot snapshot,
Buffer buffer)
{
+ HeapTupleHeader tuple = htup->t_data;
+ Assert(ItemPointerIsValid(&htup->t_self));
+ Assert(htup->t_tableOid != InvalidOid);
+
if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))
{
if (tuple->t_infomask & HEAP_XMIN_INVALID)
@@ -572,9 +584,13 @@ HeapTupleSatisfiesToast(HeapTupleHeader tuple, Snapshot snapshot,
* distinguish that case must test for it themselves.)
*/
HTSU_Result
-HeapTupleSatisfiesUpdate(HeapTupleHeader tuple, CommandId curcid,
+HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
Buffer buffer)
{
+ HeapTupleHeader tuple = htup->t_data;
+ Assert(ItemPointerIsValid(&htup->t_self));
+ Assert(htup->t_tableOid != InvalidOid);
+
if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))
{
if (tuple->t_infomask & HEAP_XMIN_INVALID)
@@ -739,9 +755,13 @@ HeapTupleSatisfiesUpdate(HeapTupleHeader tuple, CommandId curcid,
* for snapshot->xmax and the tuple's xmax.
*/
bool
-HeapTupleSatisfiesDirty(HeapTupleHeader tuple, Snapshot snapshot,
+HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
Buffer buffer)
{
+ HeapTupleHeader tuple = htup->t_data;
+ Assert(ItemPointerIsValid(&htup->t_self));
+ Assert(htup->t_tableOid != InvalidOid);
+
snapshot->xmin = snapshot->xmax = InvalidTransactionId;
if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))
@@ -902,9 +922,13 @@ HeapTupleSatisfiesDirty(HeapTupleHeader tuple, Snapshot snapshot,
* can't see it.)
*/
bool
-HeapTupleSatisfiesMVCC(HeapTupleHeader tuple, Snapshot snapshot,
+HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
Buffer buffer)
{
+ HeapTupleHeader tuple = htup->t_data;
+ Assert(ItemPointerIsValid(&htup->t_self));
+ Assert(htup->t_tableOid != InvalidOid);
+
if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))
{
if (tuple->t_infomask & HEAP_XMIN_INVALID)
@@ -1058,9 +1082,13 @@ HeapTupleSatisfiesMVCC(HeapTupleHeader tuple, Snapshot snapshot,
* even if we see that the deleting transaction has committed.
*/
HTSV_Result
-HeapTupleSatisfiesVacuum(HeapTupleHeader tuple, TransactionId OldestXmin,
+HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
Buffer buffer)
{
+ HeapTupleHeader tuple = htup->t_data;
+ Assert(ItemPointerIsValid(&htup->t_self));
+ Assert(htup->t_tableOid != InvalidOid);
+
/*
* Has inserting transaction committed?
*
@@ -1233,8 +1261,12 @@ HeapTupleSatisfiesVacuum(HeapTupleHeader tuple, TransactionId OldestXmin,
* just whether or not the tuple is surely dead).
*/
bool
-HeapTupleIsSurelyDead(HeapTupleHeader tuple, TransactionId OldestXmin)
+HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin)
{
+ HeapTupleHeader tuple = htup->t_data;
+ Assert(ItemPointerIsValid(&htup->t_self));
+ Assert(htup->t_tableOid != InvalidOid);
+
/*
* If the inserting transaction is marked invalid, then it aborted, and
* the tuple is definitely dead. If it's marked neither committed nor
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 900272e..0e86258 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -27,8 +27,8 @@ typedef struct SnapshotData *Snapshot;
* The specific semantics of a snapshot are encoded by the "satisfies"
* function.
*/
-typedef bool (*SnapshotSatisfiesFunc) (HeapTupleHeader tuple,
- Snapshot snapshot, Buffer buffer);
+typedef bool (*SnapshotSatisfiesFunc) (HeapTuple htup,
+ Snapshot snapshot, Buffer buffer);
typedef struct SnapshotData
{
diff --git a/src/include/utils/tqual.h b/src/include/utils/tqual.h
index ff74f86..b129ae9 100644
--- a/src/include/utils/tqual.h
+++ b/src/include/utils/tqual.h
@@ -52,7 +52,7 @@ extern PGDLLIMPORT SnapshotData SnapshotToastData;
* if so, the indicated buffer is marked dirty.
*/
#define HeapTupleSatisfiesVisibility(tuple, snapshot, buffer) \
- ((*(snapshot)->satisfies) ((tuple)->t_data, snapshot, buffer))
+ ((*(snapshot)->satisfies) (tuple, snapshot, buffer))
/* Result codes for HeapTupleSatisfiesVacuum */
typedef enum
@@ -65,25 +65,25 @@ typedef enum
} HTSV_Result;
/* These are the "satisfies" test routines for the various snapshot types */
-extern bool HeapTupleSatisfiesMVCC(HeapTupleHeader tuple,
+extern bool HeapTupleSatisfiesMVCC(HeapTuple htup,
Snapshot snapshot, Buffer buffer);
-extern bool HeapTupleSatisfiesNow(HeapTupleHeader tuple,
+extern bool HeapTupleSatisfiesNow(HeapTuple htup,
Snapshot snapshot, Buffer buffer);
-extern bool HeapTupleSatisfiesSelf(HeapTupleHeader tuple,
+extern bool HeapTupleSatisfiesSelf(HeapTuple htup,
Snapshot snapshot, Buffer buffer);
-extern bool HeapTupleSatisfiesAny(HeapTupleHeader tuple,
+extern bool HeapTupleSatisfiesAny(HeapTuple htup,
Snapshot snapshot, Buffer buffer);
-extern bool HeapTupleSatisfiesToast(HeapTupleHeader tuple,
+extern bool HeapTupleSatisfiesToast(HeapTuple htup,
Snapshot snapshot, Buffer buffer);
-extern bool HeapTupleSatisfiesDirty(HeapTupleHeader tuple,
+extern bool HeapTupleSatisfiesDirty(HeapTuple htup,
Snapshot snapshot, Buffer buffer);
/* Special "satisfies" routines with different APIs */
-extern HTSU_Result HeapTupleSatisfiesUpdate(HeapTupleHeader tuple,
+extern HTSU_Result HeapTupleSatisfiesUpdate(HeapTuple htup,
CommandId curcid, Buffer buffer);
-extern HTSV_Result HeapTupleSatisfiesVacuum(HeapTupleHeader tuple,
+extern HTSV_Result HeapTupleSatisfiesVacuum(HeapTuple htup,
TransactionId OldestXmin, Buffer buffer);
-extern bool HeapTupleIsSurelyDead(HeapTupleHeader tuple,
+extern bool HeapTupleIsSurelyDead(HeapTuple htup,
TransactionId OldestXmin);
extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
Currently the decision whether or not to connect to a database is made by
checking whether the passed "dbname" parameter is "replication". Unfortunately
this makes it impossible to connect to a database actually named "replication"...
This is useful for future walsender commands which need database interaction.
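Under the patched check a walsender only stays database-less when the client asked for the literal dbname "replication"; any other dbname now yields a normal database connection. A minimal standalone model of that decision (a sketch, not the actual postmaster.c code):

```c
#include <stdbool.h>
#include <string.h>

/*
 * Simplified model of the patched startup-packet handling in
 * postmaster.c: a walsender only gets a database-less ("generic")
 * connection when the client literally asked for the magic dbname
 * "replication"; any other dbname is connected to normally.
 */
bool
walsender_connects_to_db(bool am_walsender, const char *dbname)
{
    if (am_walsender && strcmp(dbname, "replication") == 0)
        return false;               /* generic walsender, no database */
    return dbname[0] != '\0';       /* connect if a dbname was supplied */
}
```

Note that the magic string still shadows a database actually named "replication" on replication connections; the patch only opens up every other name.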
---
src/backend/postmaster/postmaster.c | 7 ++++--
.../libpqwalreceiver/libpqwalreceiver.c | 4 ++--
src/backend/replication/walsender.c | 27 ++++++++++++++++++----
src/backend/utils/init/postinit.c | 5 ++++
src/bin/pg_basebackup/pg_basebackup.c | 4 ++--
src/bin/pg_basebackup/pg_receivexlog.c | 4 ++--
src/bin/pg_basebackup/receivelog.c | 4 ++--
7 files changed, 41 insertions(+), 14 deletions(-)
Attachment: 0010-Allow-walsender-s-to-connect-to-a-specific-database.patch (text/x-patch)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b223fee..05048bc 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1806,10 +1806,13 @@ retry1:
if (strlen(port->user_name) >= NAMEDATALEN)
port->user_name[NAMEDATALEN - 1] = '\0';
- /* Walsender is not related to a particular database */
- if (am_walsender)
+ /* Generic Walsender is not related to a particular database */
+ if (am_walsender && strcmp(port->database_name, "replication") == 0)
port->database_name[0] = '\0';
+ if (am_walsender)
+ elog(WARNING, "connecting to %s", port->database_name);
+
/*
* Done putting stuff in TopMemoryContext.
*/
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index bfaebea..c39062b 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -114,7 +114,7 @@ libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
"the primary server: %s",
PQerrorMessage(streamConn))));
}
- if (PQnfields(res) != 3 || PQntuples(res) != 1)
+ if (PQnfields(res) != 4 || PQntuples(res) != 1)
{
int ntuples = PQntuples(res);
int nfields = PQnfields(res);
@@ -122,7 +122,7 @@ libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
PQclear(res);
ereport(ERROR,
(errmsg("invalid response from primary server"),
- errdetail("Expected 1 tuple with 3 fields, got %d tuples with %d fields.",
+ errdetail("Expected 1 tuple with 4 fields, got %d tuples with %d fields.",
ntuples, nfields)));
}
primary_sysid = PQgetvalue(res, 0, 0);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 8774d7e..6452c34 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -40,6 +40,7 @@
#include "access/transam.h"
#include "access/xlog_internal.h"
#include "catalog/pg_type.h"
+#include "commands/dbcommands.h"
#include "funcapi.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
@@ -202,10 +203,12 @@ IdentifySystem(void)
char tli[11];
char xpos[MAXFNAMELEN];
XLogRecPtr logptr;
+ char* dbname = NULL;
/*
- * Reply with a result set with one row, three columns. First col is
- * system ID, second is timeline ID, and third is current xlog location.
+ * Reply with a result set with one row, four columns. First col is system
+ * ID, second is timeline ID, third is current xlog location and the fourth
+ * contains the database name if we are connected to one.
*/
snprintf(sysid, sizeof(sysid), UINT64_FORMAT,
@@ -216,9 +219,14 @@ IdentifySystem(void)
snprintf(xpos, sizeof(xpos), "%X/%X", (uint32) (logptr >> 32), (uint32) logptr);
+ if (MyDatabaseId != InvalidOid)
+ dbname = get_database_name(MyDatabaseId);
+ else
+ dbname = "(none)";
+
/* Send a RowDescription message */
pq_beginmessage(&buf, 'T');
- pq_sendint(&buf, 3, 2); /* 3 fields */
+ pq_sendint(&buf, 4, 2); /* 4 fields */
/* first field */
pq_sendstring(&buf, "systemid"); /* col name */
@@ -246,17 +254,28 @@ IdentifySystem(void)
pq_sendint(&buf, -1, 2);
pq_sendint(&buf, 0, 4);
pq_sendint(&buf, 0, 2);
+
+ /* fourth field */
+ pq_sendstring(&buf, "dbname");
+ pq_sendint(&buf, 0, 4);
+ pq_sendint(&buf, 0, 2);
+ pq_sendint(&buf, TEXTOID, 4);
+ pq_sendint(&buf, -1, 2);
+ pq_sendint(&buf, 0, 4);
+ pq_sendint(&buf, 0, 2);
pq_endmessage(&buf);
/* Send a DataRow message */
pq_beginmessage(&buf, 'D');
- pq_sendint(&buf, 3, 2); /* # of columns */
+ pq_sendint(&buf, 4, 2); /* # of columns */
pq_sendint(&buf, strlen(sysid), 4); /* col1 len */
pq_sendbytes(&buf, (char *) &sysid, strlen(sysid));
pq_sendint(&buf, strlen(tli), 4); /* col2 len */
pq_sendbytes(&buf, (char *) tli, strlen(tli));
pq_sendint(&buf, strlen(xpos), 4); /* col3 len */
pq_sendbytes(&buf, (char *) xpos, strlen(xpos));
+ pq_sendint(&buf, strlen(dbname), 4); /* col4 len */
+ pq_sendbytes(&buf, (char *) dbname, strlen(dbname));
pq_endmessage(&buf);
}
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 2eb456d..3463d3d 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -690,7 +690,12 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
ereport(FATAL,
(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
errmsg("must be superuser or replication role to start walsender")));
+ }
+ if (am_walsender &&
+ (in_dbname == NULL || in_dbname[0] == '\0') &&
+ dboid == InvalidOid)
+ {
/* process any options passed in the startup packet */
if (MyProcPort != NULL)
process_startup_options(MyProcPort, am_superuser);
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 4f22116..48c68e9 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -970,11 +970,11 @@ BaseBackup(void)
progname, "IDENTIFY_SYSTEM", PQerrorMessage(conn));
disconnect_and_exit(1);
}
- if (PQntuples(res) != 1 || PQnfields(res) != 3)
+ if (PQntuples(res) != 1 || PQnfields(res) != 4)
{
fprintf(stderr,
_("%s: could not identify system: got %d rows and %d fields, expected %d rows and %d fields\n"),
- progname, PQntuples(res), PQnfields(res), 1, 3);
+ progname, PQntuples(res), PQnfields(res), 1, 4);
disconnect_and_exit(1);
}
sysidentifier = pg_strdup(PQgetvalue(res, 0, 0));
diff --git a/src/bin/pg_basebackup/pg_receivexlog.c b/src/bin/pg_basebackup/pg_receivexlog.c
index 843fc69..d6677ca 100644
--- a/src/bin/pg_basebackup/pg_receivexlog.c
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -242,11 +242,11 @@ StreamLog(void)
progname, "IDENTIFY_SYSTEM", PQerrorMessage(conn));
disconnect_and_exit(1);
}
- if (PQntuples(res) != 1 || PQnfields(res) != 3)
+ if (PQntuples(res) != 1 || PQnfields(res) != 4)
{
fprintf(stderr,
_("%s: could not identify system: got %d rows and %d fields, expected %d rows and %d fields\n"),
- progname, PQntuples(res), PQnfields(res), 1, 3);
+ progname, PQntuples(res), PQnfields(res), 1, 4);
disconnect_and_exit(1);
}
timeline = atoi(PQgetvalue(res, 0, 1));
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index de82ff5..b07e522 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -363,11 +363,11 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
PQclear(res);
return false;
}
- if (PQnfields(res) != 3 || PQntuples(res) != 1)
+ if (PQnfields(res) != 4 || PQntuples(res) != 1)
{
fprintf(stderr,
_("%s: could not identify system: got %d rows and %d fields, expected %d rows and %d fields\n"),
- progname, PQntuples(res), PQnfields(res), 1, 3);
+ progname, PQntuples(res), PQnfields(res), 1, 4);
PQclear(res);
return false;
}
This introduces several things:
* 'reorderbuffer' module which reassembles transactions from a stream of interspersed changes
* 'snapbuilder' which builds catalog snapshots so that tuples from wal can be understood
* logging more data into wal to facilitate logical decoding
* wal decoding into a reorderbuffer
* shared library output plugins with 5 callbacks
* init
* begin
* change
* commit
* walsender infrastructure to stream out changes and to keep the global xmin low enough
* INIT_LOGICAL_REPLICATION $plugin; waits until a consistent snapshot is built and returns
* initial LSN
* replication slot identifier
* id of a pg_export() style snapshot
* START_LOGICAL_REPLICATION $id $lsn; streams out changes
* uses named output plugins for output specification
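From a plugin author's point of view the callback table might look roughly like the sketch below. The signatures are illustrative only (the real ones live in the new output_plugin.h), and since only four of the five callbacks are named above, the fifth slot is shown as a guessed cleanup hook:

```c
/* Stand-ins for the backend types; the real definitions come from
 * the patched replication/ headers. */
typedef struct ReorderBufferTXN { unsigned xid; } ReorderBufferTXN;
typedef struct ReorderBufferChange { int kind; } ReorderBufferChange;

/* A callback table in the style the commit message describes. */
typedef struct OutputPluginCallbacks
{
    void (*init_cb)(void);
    void (*begin_cb)(ReorderBufferTXN *txn);
    void (*change_cb)(ReorderBufferTXN *txn, ReorderBufferChange *change);
    void (*commit_cb)(ReorderBufferTXN *txn);
    void (*cleanup_cb)(void);               /* hypothetical fifth callback */
} OutputPluginCallbacks;

/* Counters so a toy driver can demonstrate the invocation order. */
int n_begin, n_change, n_commit;

static void demo_init(void)                  { }
static void demo_begin(ReorderBufferTXN *t)  { (void) t; n_begin++; }
static void demo_change(ReorderBufferTXN *t, ReorderBufferChange *c)
                                             { (void) t; (void) c; n_change++; }
static void demo_commit(ReorderBufferTXN *t) { (void) t; n_commit++; }
static void demo_cleanup(void)               { }

/* Drive one decoded transaction through the table, the way the
 * walsender side would: begin, each change in order, then commit. */
void
replay_txn(const OutputPluginCallbacks *cb, ReorderBufferTXN *txn,
           ReorderBufferChange *changes, int nchanges)
{
    cb->begin_cb(txn);
    for (int i = 0; i < nchanges; i++)
        cb->change_cb(txn, &changes[i]);
    cb->commit_cb(txn);
}
```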
Todo:
* testing infrastructure (isolationtester)
* persistence/spilling to disk of built snapshots and long-running
transactions
* user docs
* more frequent lowering of xmins
* more docs about the internals
* support for user declared catalog tables
* actual exporting of initial pg_export snapshots after
INIT_LOGICAL_REPLICATION
* own shared memory segment instead of piggybacking on walsender's
* nicer interface between snapbuild.c, reorderbuffer.c, decode.c and the
outside.
* more frequent xl_running_xid's so xmin can be upped more frequently
* add STOP_LOGICAL_REPLICATION $id
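The reorderbuffer idea from the list above can be modelled in a few lines: changes arrive interleaved in WAL order, get queued per-xid, and are only replayed once the commit record is seen, while aborted transactions are simply dropped. This toy (with made-up `toy_*` names) skips everything the real reorderbuffer.c does about spilling to disk, subtransactions and memory management:

```c
#include <stdlib.h>
#include <string.h>

#define MAX_XACTS   8
#define MAX_CHANGES 32

typedef struct ToyTxn
{
    unsigned    xid;                    /* 0 means "slot unused" */
    int         nchanges;
    int         changes[MAX_CHANGES];   /* stand-in change payload */
} ToyTxn;

static ToyTxn toybuf[MAX_XACTS];

static ToyTxn *
toy_get_txn(unsigned xid)
{
    for (int i = 0; i < MAX_XACTS; i++)
        if (toybuf[i].xid == xid || toybuf[i].xid == 0)
        {
            toybuf[i].xid = xid;
            return &toybuf[i];
        }
    abort();                    /* toy model: fixed capacity */
}

/* A change decoded from WAL: queue it under its transaction. */
void
toy_add_change(unsigned xid, int change)
{
    ToyTxn *txn = toy_get_txn(xid);

    txn->changes[txn->nchanges++] = change;
}

/* Commit record seen: hand the queued changes out in order. */
int
toy_commit(unsigned xid, int *out)
{
    ToyTxn *txn = toy_get_txn(xid);
    int     n = txn->nchanges;

    memcpy(out, txn->changes, n * sizeof(int));
    memset(txn, 0, sizeof(*txn));
    return n;
}

/* Abort record seen: nothing ever reaches the output plugin. */
void
toy_abort(unsigned xid)
{
    memset(toy_get_txn(xid), 0, sizeof(ToyTxn));
}
```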
---
src/backend/access/heap/heapam.c | 280 +++++-
src/backend/access/transam/xlog.c | 1 +
src/backend/catalog/index.c | 74 ++
src/backend/replication/Makefile | 2 +
src/backend/replication/logical/Makefile | 19 +
src/backend/replication/logical/decode.c | 496 ++++++++++
src/backend/replication/logical/logicalfuncs.c | 247 +++++
src/backend/replication/logical/reorderbuffer.c | 1156 +++++++++++++++++++++++
src/backend/replication/logical/snapbuild.c | 1144 ++++++++++++++++++++++
src/backend/replication/repl_gram.y | 32 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 566 ++++++++++-
src/backend/storage/ipc/procarray.c | 23 +
src/backend/storage/ipc/standby.c | 8 +-
src/backend/utils/cache/inval.c | 2 +-
src/backend/utils/cache/relcache.c | 3 +-
src/backend/utils/misc/guc.c | 11 +
src/backend/utils/time/tqual.c | 249 +++++
src/bin/pg_controldata/pg_controldata.c | 2 +
src/include/access/heapam_xlog.h | 23 +
src/include/access/transam.h | 5 +
src/include/access/xlog.h | 3 +-
src/include/catalog/index.h | 4 +
src/include/nodes/nodes.h | 2 +
src/include/nodes/replnodes.h | 22 +
src/include/replication/decode.h | 21 +
src/include/replication/logicalfuncs.h | 44 +
src/include/replication/output_plugin.h | 76 ++
src/include/replication/reorderbuffer.h | 284 ++++++
src/include/replication/snapbuild.h | 128 +++
src/include/replication/walsender.h | 1 +
src/include/replication/walsender_private.h | 34 +-
src/include/storage/itemptr.h | 3 +
src/include/storage/sinval.h | 2 +
src/include/utils/tqual.h | 31 +-
35 files changed, 4966 insertions(+), 34 deletions(-)
create mode 100644 src/backend/replication/logical/Makefile
create mode 100644 src/backend/replication/logical/decode.c
create mode 100644 src/backend/replication/logical/logicalfuncs.c
create mode 100644 src/backend/replication/logical/reorderbuffer.c
create mode 100644 src/backend/replication/logical/snapbuild.c
create mode 100644 src/include/replication/decode.h
create mode 100644 src/include/replication/logicalfuncs.h
create mode 100644 src/include/replication/output_plugin.h
create mode 100644 src/include/replication/reorderbuffer.h
create mode 100644 src/include/replication/snapbuild.h
Attachment: 0011-Introduce-wal-decoding-via-catalog-timetravel.patch (text/x-patch)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d025ff7..7765cae 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -53,6 +53,7 @@
#include "access/xact.h"
#include "access/xlogutils.h"
#include "catalog/catalog.h"
+#include "catalog/index.h"
#include "catalog/namespace.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -86,6 +87,7 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
ItemPointerData from, Buffer newbuf, HeapTuple newtup,
bool all_visible_cleared, bool new_all_visible_cleared);
+static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
HeapTuple oldtup, HeapTuple newtup);
@@ -1618,10 +1620,16 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
*/
if (!skip)
{
+ /* setup the redirected t_self for the benefit of timetravel access */
+ ItemPointerSet(&(heapTuple->t_self), BufferGetBlockNumber(buffer), offnum);
+
/* If it's visible per the snapshot, we must return it */
valid = HeapTupleSatisfiesVisibility(heapTuple, snapshot, buffer);
CheckForSerializableConflictOut(valid, relation, heapTuple,
buffer, snapshot);
+ /* reset original, non-redirected, tid */
+ heapTuple->t_self = *tid;
+
if (valid)
{
ItemPointerSetOffsetNumber(tid, offnum);
@@ -1960,10 +1968,24 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
xl_heap_insert xlrec;
xl_heap_header xlhdr;
XLogRecPtr recptr;
- XLogRecData rdata[3];
+ XLogRecData rdata[4];
Page page = BufferGetPage(buffer);
uint8 info = XLOG_HEAP_INSERT;
+ /*
+ * For the logical replication case we need the tuple even if we're
+ * doing a full page write. We could alternatively store a pointer into
+ * the fpw though.
+ * For that to work we add another rdata entry for the buffer in that
+ * case.
+ */
+ bool need_tuple_data = wal_level >= WAL_LEVEL_LOGICAL
+ && RelationGetRelid(relation) >= FirstNormalObjectId;
+
+ /* For logical decode we need combocids to properly decode the catalog */
+ if (wal_level >= WAL_LEVEL_LOGICAL && RelationGetRelid(relation) < FirstNormalObjectId)
+ log_heap_new_cid(relation, heaptup);
+
xlrec.all_visible_cleared = all_visible_cleared;
xlrec.target.node = relation->rd_node;
xlrec.target.tid = heaptup->t_self;
@@ -1983,18 +2005,33 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
*/
rdata[1].data = (char *) &xlhdr;
rdata[1].len = SizeOfHeapHeader;
- rdata[1].buffer = buffer;
+ rdata[1].buffer = need_tuple_data ? InvalidBuffer : buffer;
rdata[1].buffer_std = true;
rdata[1].next = &(rdata[2]);
/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
rdata[2].data = (char *) heaptup->t_data + offsetof(HeapTupleHeaderData, t_bits);
rdata[2].len = heaptup->t_len - offsetof(HeapTupleHeaderData, t_bits);
- rdata[2].buffer = buffer;
+ rdata[2].buffer = need_tuple_data ? InvalidBuffer : buffer;
rdata[2].buffer_std = true;
rdata[2].next = NULL;
/*
+ * add a record for the buffer without actual content; it is removed if
+ * a fpw is done for that buffer
+ */
+ if (need_tuple_data)
+ {
+ rdata[2].next = &(rdata[3]);
+
+ rdata[3].data = NULL;
+ rdata[3].len = 0;
+ rdata[3].buffer = buffer;
+ rdata[3].buffer_std = true;
+ rdata[3].next = NULL;
+ }
+
+ /*
* If this is the single and first tuple on page, we can reinit the
* page instead of restoring the whole thing. Set flag, and hide
* buffer references from XLogInsert.
@@ -2003,7 +2040,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
{
info |= XLOG_HEAP_INIT_PAGE;
- rdata[1].buffer = rdata[2].buffer = InvalidBuffer;
+ rdata[1].buffer = rdata[2].buffer = rdata[3].buffer = InvalidBuffer;
}
recptr = XLogInsert(RM_HEAP_ID, info, rdata);
@@ -2123,6 +2160,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
Page page;
bool needwal;
Size saveFreeSpace;
+ bool need_tuple_data = wal_level >= WAL_LEVEL_LOGICAL
+ && RelationGetRelid(relation) >= FirstNormalObjectId;
+ bool need_cids = wal_level >= WAL_LEVEL_LOGICAL &&
+ RelationGetRelid(relation) < FirstNormalObjectId;
needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
@@ -2205,7 +2246,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
{
XLogRecPtr recptr;
xl_heap_multi_insert *xlrec;
- XLogRecData rdata[2];
+ XLogRecData rdata[3];
uint8 info = XLOG_HEAP2_MULTI_INSERT;
char *tupledata;
int totaldatalen;
@@ -2267,6 +2308,15 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
datalen);
tuphdr->datalen = datalen;
scratchptr += datalen;
+
+ /*
+ * We don't use heap_multi_insert for catalog tuples yet, but
+ * better be prepared...
+ */
+ if (need_cids)
+ {
+ log_heap_new_cid(relation, heaptup);
+ }
}
totaldatalen = scratchptr - tupledata;
Assert((scratchptr - scratch) < BLCKSZ);
@@ -2278,17 +2328,32 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
rdata[1].data = tupledata;
rdata[1].len = totaldatalen;
- rdata[1].buffer = buffer;
+ rdata[1].buffer = need_tuple_data ? InvalidBuffer : buffer;
rdata[1].buffer_std = true;
rdata[1].next = NULL;
/*
+ * add a record for the buffer without actual content; it is removed if
+ * a fpw is done for that buffer
+ */
+ if (need_tuple_data)
+ {
+ rdata[1].next = &(rdata[2]);
+
+ rdata[2].data = NULL;
+ rdata[2].len = 0;
+ rdata[2].buffer = buffer;
+ rdata[2].buffer_std = true;
+ rdata[2].next = NULL;
+ }
+
+ /*
* If we're going to reinitialize the whole page using the WAL
* record, hide buffer reference from XLogInsert.
*/
if (init)
{
- rdata[1].buffer = InvalidBuffer;
+ rdata[1].buffer = rdata[2].buffer = InvalidBuffer;
info |= XLOG_HEAP_INIT_PAGE;
}
@@ -2595,7 +2660,14 @@ l1:
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
- XLogRecData rdata[2];
+ XLogRecData rdata[4];
+
+ bool need_tuple_data = wal_level >= WAL_LEVEL_LOGICAL &&
+ RelationGetRelid(relation) >= FirstNormalObjectId;
+
+ /* For logical decode we need combocids to properly decode the catalog */
+ if (wal_level >= WAL_LEVEL_LOGICAL && RelationGetRelid(relation) < FirstNormalObjectId)
+ log_heap_new_cid(relation, &tp);
xlrec.all_visible_cleared = all_visible_cleared;
xlrec.target.node = relation->rd_node;
@@ -2611,6 +2683,76 @@ l1:
rdata[1].buffer_std = true;
rdata[1].next = NULL;
+ /*
+ * XXX: We could decide not to log changes when the origin is not the
+ * local node; that should reduce redundant logging.
+ */
+ if (need_tuple_data)
+ {
+ xl_heap_header xlhdr;
+
+ Oid indexoid = InvalidOid;
+ int16 pknratts;
+ int16 pkattnum[INDEX_MAX_KEYS];
+ Oid pktypoid[INDEX_MAX_KEYS];
+ Oid pkopclass[INDEX_MAX_KEYS];
+ TupleDesc desc = RelationGetDescr(relation);
+ Relation index_rel;
+ TupleDesc indexdesc;
+ int natt;
+
+ Datum idxvals[INDEX_MAX_KEYS];
+ bool idxisnull[INDEX_MAX_KEYS];
+ HeapTuple idxtuple;
+
+ MemSet(pkattnum, 0, sizeof(pkattnum));
+ MemSet(pktypoid, 0, sizeof(pktypoid));
+ MemSet(pkopclass, 0, sizeof(pkopclass));
+ MemSet(idxvals, 0, sizeof(idxvals));
+ MemSet(idxisnull, 0, sizeof(idxisnull));
+ relationFindPrimaryKey(relation, &indexoid, &pknratts, pkattnum, pktypoid, pkopclass);
+
+ if (!indexoid)
+ {
+ elog(WARNING, "could not find primary key for table with oid %u",
+ RelationGetRelid(relation));
+ goto no_index_found;
+ }
+
+ index_rel = index_open(indexoid, AccessShareLock);
+
+ indexdesc = RelationGetDescr(index_rel);
+
+ for (natt = 0; natt < indexdesc->natts; natt++)
+ {
+ idxvals[natt] =
+ fastgetattr(&tp, pkattnum[natt], desc, &idxisnull[natt]);
+ Assert(!idxisnull[natt]);
+ }
+
+ idxtuple = heap_form_tuple(indexdesc, idxvals, idxisnull);
+
+ xlhdr.t_infomask2 = idxtuple->t_data->t_infomask2;
+ xlhdr.t_infomask = idxtuple->t_data->t_infomask;
+ xlhdr.t_hoff = idxtuple->t_data->t_hoff;
+
+ rdata[1].next = &(rdata[2]);
+ rdata[2].data = (char*)&xlhdr;
+ rdata[2].len = SizeOfHeapHeader;
+ rdata[2].buffer = InvalidBuffer;
+ rdata[2].next = NULL;
+
+ rdata[2].next = &(rdata[3]);
+ rdata[3].data = (char *) idxtuple->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ rdata[3].len = idxtuple->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ rdata[3].buffer = InvalidBuffer;
+ rdata[3].next = NULL;
+
+ heap_close(index_rel, NoLock);
+ no_index_found:
+ ;
+ }
+
recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE, rdata);
PageSetLSN(page, recptr);
@@ -3203,10 +3345,20 @@ l2:
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
- XLogRecPtr recptr = log_heap_update(relation, buffer, oldtup.t_self,
- newbuf, heaptup,
- all_visible_cleared,
- all_visible_cleared_new);
+ XLogRecPtr recptr;
+
+ /* For logical decode we need combocids to properly decode the catalog */
+ if (wal_level >= WAL_LEVEL_LOGICAL &&
+ RelationGetRelid(relation) < FirstNormalObjectId)
+ {
+ log_heap_new_cid(relation, &oldtup);
+ log_heap_new_cid(relation, heaptup);
+ }
+
+ recptr = log_heap_update(relation, buffer, oldtup.t_self,
+ newbuf, heaptup,
+ all_visible_cleared,
+ all_visible_cleared_new);
if (newbuf != buffer)
{
@@ -4445,9 +4597,15 @@ log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
xl_heap_header xlhdr;
uint8 info;
XLogRecPtr recptr;
- XLogRecData rdata[4];
+ XLogRecData rdata[5];
Page page = BufferGetPage(newbuf);
+ /*
+ * Just as for XLOG_HEAP_INSERT we need to make sure the tuple data
+ * is logged even if a full-page write is performed.
+ */
+ bool need_tuple_data = wal_level >= WAL_LEVEL_LOGICAL
+ && RelationGetRelid(reln) >= FirstNormalObjectId;
+
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -4478,28 +4636,44 @@ log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
xlhdr.t_hoff = newtup->t_data->t_hoff;
/*
- * As with insert records, we need not store the rdata[2] segment if we
- * decide to store the whole buffer instead.
+ * As with insert's logging, we need not store the tuple data separately
+ * from the buffer, unless we are doing logical replication.
*/
rdata[2].data = (char *) &xlhdr;
rdata[2].len = SizeOfHeapHeader;
- rdata[2].buffer = newbuf;
+ rdata[2].buffer = need_tuple_data ? InvalidBuffer : newbuf;
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].buffer = newbuf;
+ rdata[3].buffer = need_tuple_data ? InvalidBuffer : newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
+ /*
+ * separate storage for the buffer reference of the new page in the
+ * wal_level >= logical case
+ */
+ if (need_tuple_data)
+ {
+ rdata[3].next = &(rdata[4]);
+
+ rdata[4].data = NULL;
+ rdata[4].len = 0;
+ rdata[4].buffer = newbuf;
+ rdata[4].buffer_std = true;
+ rdata[4].next = NULL;
+ }
+
/* If new tuple is the single and first tuple on page... */
if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
{
info |= XLOG_HEAP_INIT_PAGE;
- rdata[2].buffer = rdata[3].buffer = InvalidBuffer;
+ rdata[2].buffer = rdata[3].buffer = rdata[4].buffer = InvalidBuffer;
}
recptr = XLogInsert(RM_HEAP_ID, info, rdata);
@@ -4608,6 +4782,64 @@ log_newpage_buffer(Buffer buffer)
}
/*
+ * Perform XLogInsert of a XLOG_HEAP2_NEW_CID record
+ *
+ * The HeapTuple must already have its command id (possibly a combocid)
+ * set; otherwise we cannot correctly reconstruct combocid/cmin/cmax.
+ *
+ * This is only used with wal_level >= WAL_LEVEL_LOGICAL.
+ */
+static XLogRecPtr
+log_heap_new_cid(Relation relation, HeapTuple tup)
+{
+ xl_heap_new_cid xlrec;
+
+ XLogRecPtr recptr;
+ XLogRecData rdata[1];
+ HeapTupleHeader hdr = tup->t_data;
+
+ Assert(ItemPointerIsValid(&tup->t_self));
+ Assert(tup->t_tableOid != InvalidOid);
+
+ xlrec.top_xid = GetTopTransactionId();
+ xlrec.target.node = relation->rd_node;
+ xlrec.target.tid = tup->t_self;
+
+ if (hdr->t_infomask & HEAP_COMBOCID)
+ {
+ xlrec.cmin = HeapTupleHeaderGetCmin(hdr);
+ xlrec.cmax = HeapTupleHeaderGetCmax(hdr);
+ xlrec.combocid = HeapTupleHeaderGetRawCommandId(hdr);
+ }
+ else
+ {
+ /* tuple inserted */
+ if (hdr->t_infomask & HEAP_XMAX_INVALID)
+ {
+ xlrec.cmin = HeapTupleHeaderGetRawCommandId(hdr);
+ xlrec.cmax = InvalidCommandId;
+ }
+ /* tuple from a different tx updated or deleted */
+ else
+ {
+ xlrec.cmin = InvalidCommandId;
+ xlrec.cmax = HeapTupleHeaderGetRawCommandId(hdr);
+
+ }
+ xlrec.combocid = InvalidCommandId;
+ }
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfHeapNewCid;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = NULL;
+
+ recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_NEW_CID, rdata);
+
+ return recptr;
+}
+
+/*
* Handles CLEANUP_INFO
*/
static void
@@ -5676,6 +5908,9 @@ heap2_redo(XLogRecPtr lsn, XLogRecord *record)
case XLOG_HEAP2_MULTI_INSERT:
heap_xlog_multi_insert(lsn, record);
break;
+ case XLOG_HEAP2_NEW_CID:
+ /* nothing to do on a real replay, only during logical decoding */
+ break;
default:
elog(PANIC, "heap2_redo: unknown op code %u", info);
}
@@ -5825,6 +6060,15 @@ heap2_desc(StringInfo buf, uint8 xl_info, char *rec)
xlrec->node.spcNode, xlrec->node.dbNode, xlrec->node.relNode,
xlrec->blkno, xlrec->ntuples);
}
+ else if (info == XLOG_HEAP2_NEW_CID)
+ {
+ xl_heap_new_cid *xlrec = (xl_heap_new_cid *) rec;
+
+ appendStringInfo(buf, "new_cid: ");
+ out_target(buf, &(xlrec->target));
+ appendStringInfo(buf, "; cmin: %u, cmax: %u, combo: %u",
+ xlrec->cmin, xlrec->cmax, xlrec->combocid);
+ }
else
appendStringInfo(buf, "UNKNOWN");
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1749f46..e6fb04e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -106,6 +106,7 @@ const struct config_enum_entry wal_level_options[] = {
{"minimal", WAL_LEVEL_MINIMAL, false},
{"archive", WAL_LEVEL_ARCHIVE, false},
{"hot_standby", WAL_LEVEL_HOT_STANDBY, false},
+ {"logical", WAL_LEVEL_LOGICAL, false},
{NULL, 0, false}
};
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 18d0c5a..1d86b87 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -50,6 +50,7 @@
#include "nodes/nodeFuncs.h"
#include "optimizer/clauses.h"
#include "parser/parser.h"
+#include "parser/parse_relation.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -3426,3 +3427,76 @@ ResetReindexPending(void)
{
pendingReindexedIndexes = NIL;
}
+
+/*
+ * relationFindPrimaryKey
+ * Find primary key for a relation if it exists.
+ *
+ * If no primary key is found, *indexOid is set to InvalidOid.
+ *
+ * This is quite similar to tablecmds.c's transformFkeyGetPrimaryKey.
+ *
+ * XXX: It might be a good idea to change pg_class.relhaspkey into the OID
+ * of the primary key index to make this more efficient.
+ */
+void
+relationFindPrimaryKey(Relation pkrel, Oid *indexOid,
+ int16 *nratts, int16 *attnums, Oid *atttypids,
+ Oid *opclasses)
+{
+ List *indexoidlist;
+ ListCell *indexoidscan;
+ HeapTuple indexTuple = NULL;
+ Datum indclassDatum;
+ bool isnull;
+ oidvector *indclass;
+ int i;
+ Form_pg_index indexStruct = NULL;
+
+ *indexOid = InvalidOid;
+
+ indexoidlist = RelationGetIndexList(pkrel);
+
+ foreach(indexoidscan, indexoidlist)
+ {
+ Oid indexoid = lfirst_oid(indexoidscan);
+
+ indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(indexoid));
+ if (!HeapTupleIsValid(indexTuple))
+ elog(ERROR, "cache lookup failed for index %u", indexoid);
+
+ indexStruct = (Form_pg_index) GETSTRUCT(indexTuple);
+ if (indexStruct->indisprimary && indexStruct->indimmediate)
+ {
+ *indexOid = indexoid;
+ break;
+ }
+ ReleaseSysCache(indexTuple);
+ }
+ list_free(indexoidlist);
+
+ if (!OidIsValid(*indexOid))
+ return;
+
+ /* Must get indclass the hard way */
+ indclassDatum = SysCacheGetAttr(INDEXRELID, indexTuple,
+ Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ indclass = (oidvector *) DatumGetPointer(indclassDatum);
+
+ *nratts = indexStruct->indnatts;
+ /*
+ * Now build the list of PK attributes from the indkey definition (we
+ * assume a primary key cannot have expressional elements)
+ */
+ for (i = 0; i < indexStruct->indnatts; i++)
+ {
+ int pkattno = indexStruct->indkey.values[i];
+
+ attnums[i] = pkattno;
+ atttypids[i] = attnumTypeId(pkrel, pkattno);
+ opclasses[i] = indclass->values[i];
+ }
+
+ ReleaseSysCache(indexTuple);
+}
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index 2dde011..2e13e27 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -17,6 +17,8 @@ override CPPFLAGS := -I$(srcdir) $(CPPFLAGS)
OBJS = walsender.o walreceiverfuncs.o walreceiver.o basebackup.o \
repl_gram.o syncrep.o
+SUBDIRS = logical
+
include $(top_srcdir)/src/backend/common.mk
# repl_scanner is compiled as part of repl_gram
diff --git a/src/backend/replication/logical/Makefile b/src/backend/replication/logical/Makefile
new file mode 100644
index 0000000..cf040ef
--- /dev/null
+++ b/src/backend/replication/logical/Makefile
@@ -0,0 +1,19 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for src/backend/replication/logical
+#
+# IDENTIFICATION
+# src/backend/replication/logical/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/logical
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(srcdir) $(CPPFLAGS)
+
+OBJS = decode.o logicalfuncs.o reorderbuffer.o snapbuild.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
new file mode 100644
index 0000000..15d261b
--- /dev/null
+++ b/src/backend/replication/logical/decode.c
@@ -0,0 +1,496 @@
+/*-------------------------------------------------------------------------
+ *
+ * decode.c
+ * Decode WAL records from an xlogreader.h callback into a reorderbuffer
+ * while building the snapshots necessary to decode their contents
+ *
+ * NOTE:
+ * It's possible that the separation between decode.c and snapbuild.c is
+ * a bit too strict; in the end they contain just about the same switch.
+ *
+ * Portions Copyright (c) 2010-2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/logical/decode.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/heapam_xlog.h"
+#include "access/transam.h"
+#include "access/xlog_internal.h"
+#include "access/xact.h"
+#include "access/xlogreader.h"
+
+#include "catalog/pg_control.h"
+
+#include "replication/reorderbuffer.h"
+#include "replication/decode.h"
+#include "replication/snapbuild.h"
+#include "replication/logicalfuncs.h"
+
+#include "utils/memutils.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+
+static void DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tuple);
+
+static void DecodeInsert(ReorderBuffer *cache, XLogRecordBuffer *buf);
+
+static void DecodeUpdate(ReorderBuffer *cache, XLogRecordBuffer *buf);
+
+static void DecodeDelete(ReorderBuffer *cache, XLogRecordBuffer *buf);
+
+static void DecodeMultiInsert(ReorderBuffer *cache, XLogRecordBuffer *buf);
+
+static void DecodeCommit(ReaderApplyState *state, XLogRecordBuffer *buf, TransactionId xid,
+ TransactionId *sub_xids, int nsubxacts);
+
+static void DecodeAbort(ReorderBuffer * cache, XLogRecPtr lsn, TransactionId xid,
+ TransactionId *sub_xids, int nsubxacts);
+
+
+void
+DecodeRecordIntoReorderBuffer(XLogReaderState *reader,
+ ReaderApplyState *state,
+ XLogRecordBuffer *buf)
+{
+ XLogRecord *r = &buf->record;
+ uint8 info = r->xl_info & ~XLR_INFO_MASK;
+ ReorderBuffer *reorder = state->reorderbuffer;
+ SnapBuildAction action;
+
+ /*
+ * FIXME: The existence of the snapshot builder is pretty visible to the
+ * outside right now; that doesn't seem to be very good...
+ */
+ if (!state->snapstate)
+ {
+ state->snapstate = AllocateSnapshotBuilder(reorder);
+ }
+
+ /*---------
+ * Call the snapshot builder. It needs to be called before we analyze
+ * tuples for two reasons:
+ *
+ * * Only the snapshot building logic knows whether we have enough
+ * information to decode a particular tuple
+ *
+ * * The Snapshot/CommandIds computed by the SnapshotBuilder need to be
+ * added to the ReorderBuffer before we add tuples using them
+ *---------
+ */
+ action = SnapBuildDecodeCallback(reorder, state->snapstate, buf);
+
+ if (state->stop_after_consistent && state->snapstate->state == SNAPBUILD_CONSISTENT)
+ {
+ /* Assert(action == SNAPBUILD_SKIP); */
+ reader->stop_at_record_boundary = true;
+ elog(WARNING, "reached consistent point, stopping!");
+ return;
+ }
+
+ if (action == SNAPBUILD_SKIP)
+ return;
+
+ switch (r->xl_rmid)
+ {
+ case RM_HEAP_ID:
+ {
+ info &= XLOG_HEAP_OPMASK;
+ switch (info)
+ {
+ case XLOG_HEAP_INSERT:
+ DecodeInsert(reorder, buf);
+ break;
+
+ /*
+ * there is no guarantee that the update stays HOT when it
+ * is reapplied, so handle it as a normal update
+ */
+ case XLOG_HEAP_HOT_UPDATE:
+ case XLOG_HEAP_UPDATE:
+ DecodeUpdate(reorder, buf);
+ break;
+
+ case XLOG_HEAP_NEWPAGE:
+ /*
+ * XXX: There doesn't seem to be a use case for decoding
+ * HEAP_NEWPAGE records. They are only used by various index
+ * AMs and CLUSTER, neither of which is relevant for the
+ * logical change stream.
+ */
+ break;
+
+ case XLOG_HEAP_DELETE:
+ DecodeDelete(reorder, buf);
+ break;
+ default:
+ break;
+ }
+ break;
+ }
+ case RM_HEAP2_ID:
+ {
+ info &= XLOG_HEAP_OPMASK;
+ switch (info)
+ {
+ case XLOG_HEAP2_MULTI_INSERT:
+ DecodeMultiInsert(reorder, buf);
+ break;
+ default:
+ /*
+ * everything else here is just physical stuff we're not
+ * interested in
+ */
+ break;
+ }
+ break;
+ }
+
+ case RM_XACT_ID:
+ {
+ switch (info)
+ {
+ case XLOG_XACT_COMMIT:
+ {
+ TransactionId *sub_xids;
+ xl_xact_commit *xlrec =
+ (xl_xact_commit *) buf->record_data;
+
+ /*
+ * FIXME: theoretically computing this address is
+ * not really allowed if there are no
+ * subtransactions
+ */
+ sub_xids = (TransactionId *) &(
+ xlrec->xnodes[xlrec->nrels]);
+
+ DecodeCommit(state, buf, r->xl_xid,
+ sub_xids, xlrec->nsubxacts);
+
+
+ break;
+ }
+ case XLOG_XACT_COMMIT_PREPARED:
+ {
+ TransactionId *sub_xids;
+ xl_xact_commit_prepared *xlrec =
+ (xl_xact_commit_prepared*) buf->record_data;
+
+ sub_xids = (TransactionId *) &(
+ xlrec->crec.xnodes[xlrec->crec.nrels]);
+
+ DecodeCommit(state, buf, r->xl_xid, sub_xids,
+ xlrec->crec.nsubxacts);
+
+ break;
+ }
+ case XLOG_XACT_COMMIT_COMPACT:
+ {
+ xl_xact_commit_compact *xlrec =
+ (xl_xact_commit_compact *) buf->record_data;
+ DecodeCommit(state, buf, r->xl_xid,
+ xlrec->subxacts, xlrec->nsubxacts);
+ break;
+ }
+ case XLOG_XACT_ABORT:
+ {
+ TransactionId *sub_xids;
+ xl_xact_abort *xlrec =
+ (xl_xact_abort *) buf->record_data;
+
+ sub_xids = (TransactionId *) &(
+ xlrec->xnodes[xlrec->nrels]);
+
+ DecodeAbort(reorder, buf->origptr, r->xl_xid,
+ sub_xids, xlrec->nsubxacts);
+ break;
+ }
+ case XLOG_XACT_ABORT_PREPARED:
+ {
+ TransactionId *sub_xids;
+ xl_xact_abort_prepared *xlrec =
+ (xl_xact_abort_prepared *)buf->record_data;
+ xl_xact_abort *arec = &xlrec->arec;
+
+ sub_xids = (TransactionId *) &(
+ arec->xnodes[arec->nrels]);
+
+ DecodeAbort(reorder, buf->origptr, xlrec->xid,
+ sub_xids, arec->nsubxacts);
+ /* XXX: any reason for also aborting r->xl_xid? */
+ break;
+ }
+
+ case XLOG_XACT_ASSIGNMENT:
+ /*
+ * XXX: We could reassign transactions to the parent
+ * here to save space and effort when merging
+ * transactions at commit.
+ */
+ break;
+ case XLOG_XACT_PREPARE:
+ /*
+ * FIXME: we should replay the transaction and prepare
+ * it as well.
+ */
+ break;
+ default:
+ break;
+ }
+ break;
+ }
+ case RM_XLOG_ID:
+ {
+ switch (info)
+ {
+ /* this is also used in END_OF_RECOVERY checkpoints */
+ case XLOG_CHECKPOINT_SHUTDOWN:
+ /*
+ * abort all transactions that still are in progress,
+ * they aren't in progress anymore. do not abort
+ * prepared transactions that have been prepared for
+ * commit.
+ *
+ * FIXME: implement.
+ */
+ break;
+ }
+ break;
+ }
+ default:
+ break;
+ }
+}
+
+static void
+DecodeCommit(ReaderApplyState *state, XLogRecordBuffer *buf, TransactionId xid,
+ TransactionId *sub_xids, int nsubxacts)
+{
+ int i;
+
+ /* not interested in that part of the stream */
+ if (XLByteLE(buf->origptr, state->snapstate->transactions_after))
+ {
+ DecodeAbort(state->reorderbuffer, buf->origptr, xid,
+ sub_xids, nsubxacts);
+ return;
+ }
+
+ for (i = 0; i < nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(state->reorderbuffer, xid, *sub_xids,
+ buf->origptr);
+ sub_xids++;
+ }
+
+ /* replay actions of all transaction + subtransactions in order */
+ ReorderBufferCommit(state->reorderbuffer, xid, buf->origptr);
+}
+
+static void
+DecodeAbort(ReorderBuffer *reorder, XLogRecPtr lsn, TransactionId xid,
+ TransactionId *sub_xids, int nsubxacts)
+{
+ int i;
+
+ elog(WARNING, "ABORT %u", xid);
+
+ for (i = 0; i < nsubxacts; i++)
+ {
+ ReorderBufferAbort(reorder, *sub_xids, lsn);
+ sub_xids++;
+ }
+
+ ReorderBufferAbort(reorder, xid, lsn);
+}
+
+static void
+DecodeInsert(ReorderBuffer *reorder, XLogRecordBuffer *buf)
+{
+ XLogRecord *r = &buf->record;
+ xl_heap_insert *xlrec = (xl_heap_insert *) buf->record_data;
+
+ ReorderBufferChange *change;
+
+ if ((r->xl_info & XLR_BKP_BLOCK(0)) &&
+ r->xl_len < (SizeOfHeapInsert + SizeOfHeapHeader))
+ {
+ elog(DEBUG2, "huh, no tuple data on wal_level = logical?");
+ return;
+ }
+
+ change = ReorderBufferGetChange(reorder);
+ change->action = REORDER_BUFFER_CHANGE_INSERT;
+
+ memcpy(&change->relnode, &xlrec->target.node, sizeof(RelFileNode));
+
+ change->newtuple = ReorderBufferGetTupleBuf(reorder);
+
+ DecodeXLogTuple((char *) xlrec + SizeOfHeapInsert,
+ r->xl_len - SizeOfHeapInsert,
+ change->newtuple);
+
+ ReorderBufferAddChange(reorder, r->xl_xid, buf->origptr, change);
+}
+
+static void
+DecodeUpdate(ReorderBuffer *reorder, XLogRecordBuffer *buf)
+{
+ XLogRecord *r = &buf->record;
+ xl_heap_update *xlrec = (xl_heap_update *) buf->record_data;
+
+
+ ReorderBufferChange *change;
+
+ if ((r->xl_info & XLR_BKP_BLOCK(0) || r->xl_info & XLR_BKP_BLOCK(1)) &&
+ (r->xl_len < (SizeOfHeapUpdate + SizeOfHeapHeader)))
+ {
+ elog(DEBUG2, "huh, no tuple data on wal_level = logical?");
+ return;
+ }
+
+ change = ReorderBufferGetChange(reorder);
+ change->action = REORDER_BUFFER_CHANGE_UPDATE;
+
+ memcpy(&change->relnode, &xlrec->target.node, sizeof(RelFileNode));
+
+ /*
+ * FIXME: need to get/save the old tuple as well if we want primary key
+ * changes to work.
+ */
+ change->newtuple = ReorderBufferGetTupleBuf(reorder);
+
+ DecodeXLogTuple((char *) xlrec + SizeOfHeapUpdate,
+ r->xl_len - SizeOfHeapUpdate,
+ change->newtuple);
+
+ ReorderBufferAddChange(reorder, r->xl_xid, buf->origptr, change);
+}
+
+static void
+DecodeDelete(ReorderBuffer *reorder, XLogRecordBuffer *buf)
+{
+ XLogRecord *r = &buf->record;
+
+ xl_heap_delete *xlrec = (xl_heap_delete *) buf->record_data;
+
+ ReorderBufferChange *change;
+
+ if (r->xl_len <= (SizeOfHeapDelete + SizeOfHeapHeader))
+ {
+ elog(DEBUG2, "huh, no primary key for a delete on wal_level = logical?");
+ return;
+ }
+
+ change = ReorderBufferGetChange(reorder);
+ change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+ memcpy(&change->relnode, &xlrec->target.node, sizeof(RelFileNode));
+
+ change->oldtuple = ReorderBufferGetTupleBuf(reorder);
+
+ DecodeXLogTuple((char *) xlrec + SizeOfHeapDelete,
+ r->xl_len - SizeOfHeapDelete,
+ change->oldtuple);
+
+ ReorderBufferAddChange(reorder, r->xl_xid, buf->origptr, change);
+}
+
+/*
+ * Decode xl_heap_multi_insert record into multiple changes.
+ *
+ * Due to a slightly different layout we can't reuse DecodeXLogTuple
+ * without making it even harder to understand than it already is.
+ */
+static void
+DecodeMultiInsert(ReorderBuffer *reorder, XLogRecordBuffer *buf)
+{
+ XLogRecord *r = &buf->record;
+ xl_heap_multi_insert *xlrec = (xl_heap_multi_insert *)buf->record_data;
+ int i;
+ char *data = buf->record_data;
+ bool isinit = (r->xl_info & XLOG_HEAP_INIT_PAGE) != 0;
+
+ data += SizeOfHeapMultiInsert;
+
+ /* OffsetNumbers are only stored if it's not a HEAP_INIT_PAGE record */
+ if (!isinit)
+ data += sizeof(OffsetNumber) * xlrec->ntuples;
+
+ for (i = 0; i < xlrec->ntuples; i++)
+ {
+ ReorderBufferChange *change;
+ xl_multi_insert_tuple *xlhdr;
+ int datalen;
+ ReorderBufferTupleBuf *tuple;
+
+ change = ReorderBufferGetChange(reorder);
+ change->action = REORDER_BUFFER_CHANGE_INSERT;
+ change->newtuple = ReorderBufferGetTupleBuf(reorder);
+ memcpy(&change->relnode, &xlrec->node, sizeof(RelFileNode));
+
+ tuple = change->newtuple;
+ /* not a disk based tuple */
+ ItemPointerSetInvalid(&tuple->tuple.t_self);
+
+ xlhdr = (xl_multi_insert_tuple *) SHORTALIGN(data);
+ data = ((char *) xlhdr) + SizeOfMultiInsertTuple;
+ datalen = xlhdr->datalen;
+
+ /* we can only figure this out after reassembling the transactions */
+ tuple->tuple.t_tableOid = InvalidOid;
+ tuple->tuple.t_data = &tuple->header;
+ tuple->tuple.t_len = datalen + offsetof(HeapTupleHeaderData, t_bits);
+
+ memset(&tuple->header, 0, sizeof(HeapTupleHeaderData));
+
+ memcpy((char *) &tuple->header + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) data,
+ datalen);
+ data += datalen;
+
+ tuple->header.t_infomask = xlhdr->t_infomask;
+ tuple->header.t_infomask2 = xlhdr->t_infomask2;
+ tuple->header.t_hoff = xlhdr->t_hoff;
+
+ ReorderBufferAddChange(reorder, r->xl_xid, buf->origptr, change);
+ }
+}
+
+
+static void
+DecodeXLogTuple(char *data, Size len, ReorderBufferTupleBuf *tuple)
+{
+ xl_heap_header xlhdr;
+ int datalen = len - SizeOfHeapHeader;
+
+ Assert(datalen >= 0);
+ Assert(datalen <= MaxHeapTupleSize);
+
+ tuple->tuple.t_len = datalen + offsetof(HeapTupleHeaderData, t_bits);
+
+ /* not a disk based tuple */
+ ItemPointerSetInvalid(&tuple->tuple.t_self);
+
+ /* we can only figure this out after reassembling the transactions */
+ tuple->tuple.t_tableOid = InvalidOid;
+ tuple->tuple.t_data = &tuple->header;
+
+ /* data is not stored aligned */
+ memcpy((char *) &xlhdr,
+ data,
+ SizeOfHeapHeader);
+
+ memset(&tuple->header, 0, sizeof(HeapTupleHeaderData));
+
+ memcpy((char *) &tuple->header + offsetof(HeapTupleHeaderData, t_bits),
+ data + SizeOfHeapHeader,
+ datalen);
+
+ tuple->header.t_infomask = xlhdr.t_infomask;
+ tuple->header.t_infomask2 = xlhdr.t_infomask2;
+ tuple->header.t_hoff = xlhdr.t_hoff;
+}
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
new file mode 100644
index 0000000..41f2ec8
--- /dev/null
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -0,0 +1,247 @@
+/*-------------------------------------------------------------------------
+ *
+ * logicalfuncs.c
+ *
+ * Support functions for using xlog decoding
+ *
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/logical/logicalfuncs.c
+ *
+ */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+
+#include "replication/logicalfuncs.h"
+
+#include "catalog/pg_class.h"
+#include "catalog/pg_type.h"
+
+#include "replication/reorderbuffer.h"
+#include "replication/decode.h"
+/*FIXME: XLogRead*/
+#include "replication/walsender_private.h"
+#include "replication/snapbuild.h"
+
+#include "utils/lsyscache.h"
+#include "utils/syscache.h"
+#include "utils/typcache.h"
+
+
+/*
+ * XLogReader callbacks
+ */
+static bool
+replay_record_is_interesting(XLogReaderState* state, XLogRecord* r)
+{
+ return true;
+}
+
+static void
+replay_writeout_data(XLogReaderState* state, char* data, Size len)
+{
+ return;
+}
+
+static void
+replay_finished_record(XLogReaderState* state, XLogRecordBuffer* buf)
+{
+ ReaderApplyState* apply_state = state->private_data;
+ DecodeRecordIntoReorderBuffer(state, apply_state, buf);
+}
+
+static void
+replay_read_page(XLogReaderState* state, char* cur_page, XLogRecPtr startptr)
+{
+ XLogPageHeader page_header;
+
+ Assert((startptr % XLOG_BLCKSZ) == 0);
+
+ /* FIXME: more sensible/efficient implementation */
+ XLogRead(cur_page, startptr, XLOG_BLCKSZ);
+
+ page_header = (XLogPageHeader)cur_page;
+
+ if (page_header->xlp_magic != XLOG_PAGE_MAGIC)
+ {
+ elog(FATAL, "page header magic %x, should be %x at %X/%X", page_header->xlp_magic,
+ XLOG_PAGE_MAGIC, (uint32)(startptr >> 32), (uint32)startptr);
+ }
+}
+
+/*
+ * Callbacks for ReorderBuffer which add in some more information and then call
+ * output_plugin.h plugins.
+ */
+static void
+begin_txn_wrapper(ReorderBuffer* cache, ReorderBufferTXN* txn)
+{
+ ReaderApplyState *state = cache->private_data;
+ bool send;
+
+ resetStringInfo(state->out);
+ WalSndPrepareWrite(state->out, txn->lsn);
+
+ send = state->begin_cb(state->user_private, state->out, txn);
+
+ if (send)
+ {
+ WalSndWriteData(state->out);
+ }
+}
+
+static void
+commit_txn_wrapper(ReorderBuffer* cache, ReorderBufferTXN* txn, XLogRecPtr commit_lsn)
+{
+ ReaderApplyState *state = cache->private_data;
+ bool send;
+
+ resetStringInfo(state->out);
+ WalSndPrepareWrite(state->out, commit_lsn);
+
+ send = state->commit_cb(state->user_private, state->out, txn, commit_lsn);
+
+ if (send)
+ {
+ WalSndWriteData(state->out);
+ }
+}
+
+static void
+change_wrapper(ReorderBuffer* cache, ReorderBufferTXN* txn, ReorderBufferChange* change)
+{
+ ReaderApplyState *state = cache->private_data;
+ bool send;
+ HeapTuple table;
+ Oid reloid;
+
+ resetStringInfo(state->out);
+ WalSndPrepareWrite(state->out, change->lsn);
+
+ table = LookupTableByRelFileNode(&change->relnode);
+ Assert(table);
+ reloid = HeapTupleHeaderGetOid(table->t_data);
+ ReleaseSysCache(table);
+
+
+ send = state->change_cb(state->user_private, state->out, txn,
+ reloid, change);
+
+ if (send)
+ {
+ WalSndWriteData(state->out);
+ }
+}
+
+/*
+ * Build a snapshot reader that never outputs or decodes anything and just
+ * waits for the first point in the LSN stream at which it reaches a
+ * consistent state.
+ */
+XLogReaderState *
+initial_snapshot_reader(void)
+{
+ ReorderBuffer *reorder;
+ XLogReaderState *xlogreader_state = XLogReaderAllocate();
+ ReaderApplyState *apply_state;
+
+ xlogreader_state->is_record_interesting = replay_record_is_interesting;
+ xlogreader_state->finished_record = replay_finished_record;
+ xlogreader_state->writeout_data = replay_writeout_data;
+ xlogreader_state->read_page = replay_read_page;
+ xlogreader_state->private_data = calloc(1, sizeof(ReaderApplyState));
+
+ if (!xlogreader_state->private_data)
+ elog(ERROR, "Could not allocate the ReaderApplyState struct");
+
+ apply_state = (ReaderApplyState*)xlogreader_state->private_data;
+
+ reorder = ReorderBufferAllocate();
+ reorder->begin = NULL; /* not decoding yet */
+ reorder->apply_change = NULL;
+ reorder->commit = NULL;
+ reorder->private_data = xlogreader_state->private_data;
+
+ apply_state->reorderbuffer = reorder;
+ apply_state->stop_after_consistent = true;
+
+ return xlogreader_state;
+}
+
+/*
+ * Build a snapshot reader with callbacks found in the shared library "plugin"
+ * under the symbol names found in output_plugin.h.
+ * It wraps those callbacks so they send out their changes via a logical
+ * walsender.
+ */
+XLogReaderState *
+normal_snapshot_reader(char *plugin, XLogRecPtr valid_after)
+{
+ ReorderBuffer *reorder;
+ XLogReaderState *xlogreader_state = XLogReaderAllocate();
+ ReaderApplyState *apply_state;
+
+ xlogreader_state->is_record_interesting = replay_record_is_interesting;
+ xlogreader_state->finished_record = replay_finished_record;
+ xlogreader_state->writeout_data = replay_writeout_data;
+ xlogreader_state->read_page = replay_read_page;
+ xlogreader_state->private_data = calloc(1, sizeof(ReaderApplyState));
+
+ if (!xlogreader_state->private_data)
+ elog(ERROR, "Could not allocate the ReaderApplyState struct");
+
+ apply_state = (ReaderApplyState*)xlogreader_state->private_data;
+
+ reorder = ReorderBufferAllocate();
+
+ apply_state->reorderbuffer = reorder;
+ apply_state->stop_after_consistent = false;
+
+ apply_state->snapstate = AllocateSnapshotBuilder(reorder);
+ apply_state->snapstate->transactions_after = valid_after;
+
+ reorder->begin = begin_txn_wrapper;
+ reorder->apply_change = change_wrapper;
+ reorder->commit = commit_txn_wrapper;
+
+ /* look up symbols in the shared library */
+
+ /* optional */
+ apply_state->init_cb = (LogicalDecodeInitCB)
+ load_external_function(plugin, "pg_decode_init", false, NULL);
+
+ apply_state->begin_cb = (LogicalDecodeBeginCB)
+ load_external_function(plugin, "pg_decode_begin_txn", true, NULL);
+
+ apply_state->change_cb = (LogicalDecodeChangeCB)
+ load_external_function(plugin, "pg_decode_change", true, NULL);
+
+ apply_state->commit_cb = (LogicalDecodeCommitCB)
+ load_external_function(plugin, "pg_decode_commit_txn", true, NULL);
+
+ /* optional */
+ apply_state->cleanup_cb = (LogicalDecodeCleanupCB)
+ load_external_function(plugin, "pg_decode_clean", false, NULL);
+
+ reorder->private_data = xlogreader_state->private_data;
+
+ apply_state->out = makeStringInfo();
+
+ /* initialize output plugin */
+ if (apply_state->init_cb)
+ apply_state->init_cb(&apply_state->user_private);
+
+ return xlogreader_state;
+}
+
+/* has the initial snapshot found a consistent state? */
+bool
+initial_snapshot_ready(XLogReaderState *reader)
+{
+ ReaderApplyState* state = reader->private_data;
+ return state->snapstate->state == SNAPBUILD_CONSISTENT;
+}
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
new file mode 100644
index 0000000..b80b054
--- /dev/null
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -0,0 +1,1156 @@
+/*-------------------------------------------------------------------------
+ *
+ * reorderbuffer.c
+ *
+ * PostgreSQL logical replay "cache" management
+ *
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/replication/logical/reorderbuffer.c
+ *
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/xact.h"
+
+#include "catalog/pg_class.h"
+#include "catalog/pg_control.h"
+
+#include "lib/simpleheap.h"
+
+#include "replication/reorderbuffer.h"
+#include "replication/snapbuild.h"
+
+#include "storage/sinval.h"
+#include "storage/bufmgr.h"
+
+#include "utils/builtins.h"
+#include "utils/combocid.h"
+#include "utils/memutils.h"
+#include "utils/relcache.h"
+#include "utils/tqual.h"
+#include "utils/syscache.h"
+
+
+
+const Size max_memtries = 1 << 16;
+
+const size_t max_cached_changes = 1024;
+const size_t max_cached_tuplebufs = 1024; /* ~8MB */
+const size_t max_cached_transactions = 512;
+
+/* entry for a hash table we use to map from xid to our transaction state */
+typedef struct ReorderBufferTXNByIdEnt
+{
+ TransactionId xid;
+ ReorderBufferTXN *txn;
+} ReorderBufferTXNByIdEnt;
+
+typedef struct ReorderBufferTupleCidKey
+{
+ RelFileNode relnode;
+ ItemPointerData tid;
+} ReorderBufferTupleCidKey;
+
+typedef struct ReorderBufferTupleCidEnt
+{
+ ReorderBufferTupleCidKey key;
+ CommandId cmin;
+ CommandId cmax;
+ CommandId combocid;
+} ReorderBufferTupleCidEnt;
+
+/*
+ * For efficiency and simplicity reasons we want to keep Snapshots, CommandIds
+ * and ComboCids in the same list with the user visible INSERT/UPDATE/DELETE
+ * changes. We don't want to leak those internal values to external users
+ * though (they would just use switch()...default:) because that would make
+ * it harder to add new user-visible values.
+ *
+ * This needs to be synchronized with ReorderBufferChangeType! Adjust the
+ * StaticAssertExprs in ReorderBufferAllocate if you add anything!
+ */
+enum ReorderBufferChangeTypeInternal
+{
+ REORDER_BUFFER_CHANGE_INTERNAL_INSERT,
+ REORDER_BUFFER_CHANGE_INTERNAL_UPDATE,
+ REORDER_BUFFER_CHANGE_INTERNAL_DELETE,
+ REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT,
+ REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
+ REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID
+};
+
+/* Get an unused, but potentially cached, ReorderBufferTXN entry */
+static ReorderBufferTXN *ReorderBufferGetTXN(ReorderBuffer *cache);
+
+/* Return an unused ReorderBufferTXN entry */
+static void ReorderBufferReturnTXN(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
+static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *cache, TransactionId xid,
+ bool create, bool *is_new);
+
+
+/*
+ * support functions for lsn-order iterating over the ->changes of a
+ * transaction and its subtransactions
+ */
+
+/*
+ * used for iteration over the k-way heap merge of a transaction and its
+ * subtransactions
+ */
+typedef struct ReorderBufferIterTXNState
+{
+ simpleheap *heap;
+} ReorderBufferIterTXNState;
+
+/* allocate & initialize an iterator */
+static ReorderBufferIterTXNState *
+ReorderBufferIterTXNInit(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
+/* get the next change */
+static ReorderBufferChange *
+ReorderBufferIterTXNNext(ReorderBuffer *cache, ReorderBufferIterTXNState *state);
+
+/* deallocate iterator */
+static void
+ReorderBufferIterTXNFinish(ReorderBuffer *cache, ReorderBufferIterTXNState *state);
+
+/* where to put this? */
+static void
+ReorderBufferProcessInvalidations(ReorderBuffer *cache, ReorderBufferTXN *txn);
+
+ReorderBuffer *
+ReorderBufferAllocate(void)
+{
+ ReorderBuffer *cache = (ReorderBuffer *) malloc(sizeof(ReorderBuffer));
+ HASHCTL hash_ctl;
+
+ StaticAssertExpr((int)REORDER_BUFFER_CHANGE_INTERNAL_INSERT == (int)REORDER_BUFFER_CHANGE_INSERT, "out of sync enums");
+ StaticAssertExpr((int)REORDER_BUFFER_CHANGE_INTERNAL_UPDATE == (int)REORDER_BUFFER_CHANGE_UPDATE, "out of sync enums");
+ StaticAssertExpr((int)REORDER_BUFFER_CHANGE_INTERNAL_DELETE == (int)REORDER_BUFFER_CHANGE_DELETE, "out of sync enums");
+
+ if (!cache)
+ elog(ERROR, "Could not allocate the ReorderBuffer");
+
+ cache->build_snapshots = true;
+
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+
+ cache->context = AllocSetContextCreate(TopMemoryContext,
+ "ReorderBuffer",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+
+ hash_ctl.keysize = sizeof(TransactionId);
+ hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
+ hash_ctl.hash = tag_hash;
+ hash_ctl.hcxt = cache->context;
+
+ cache->by_txn = hash_create("ReorderBufferByXid", 1000, &hash_ctl,
+ HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);
+
+ cache->nr_cached_transactions = 0;
+ cache->nr_cached_changes = 0;
+ cache->nr_cached_tuplebufs = 0;
+
+ dlist_init(&cache->cached_transactions);
+ dlist_init(&cache->cached_changes);
+ slist_init(&cache->cached_tuplebufs);
+
+ return cache;
+}
+
+/*
+ * Free a ReorderBuffer
+ */
+void
+ReorderBufferFree(ReorderBuffer *cache)
+{
+ /* FIXME: check for in-progress transactions */
+ /* FIXME: clean up cached transaction */
+ /* FIXME: clean up cached changes */
+ /* FIXME: clean up cached tuplebufs */
+ hash_destroy(cache->by_txn);
+ free(cache);
+}
+
+/*
+ * Get an unused, possibly preallocated, ReorderBufferTXN.
+ */
+static ReorderBufferTXN *
+ReorderBufferGetTXN(ReorderBuffer *cache)
+{
+ ReorderBufferTXN *txn;
+
+ if (cache->nr_cached_transactions)
+ {
+ cache->nr_cached_transactions--;
+ txn = dlist_container(ReorderBufferTXN, node,
+ dlist_pop_head_node(&cache->cached_transactions));
+ }
+ else
+ {
+ txn = (ReorderBufferTXN *)
+ malloc(sizeof(ReorderBufferTXN));
+
+ if (!txn)
+ elog(ERROR, "Could not allocate a ReorderBufferTXN struct");
+ }
+
+ memset(txn, 0, sizeof(ReorderBufferTXN));
+
+ dlist_init(&txn->changes);
+ dlist_init(&txn->tuplecids);
+ dlist_init(&txn->subtxns);
+
+ return txn;
+}
+
+/*
+ * Free a ReorderBufferTXN. Deallocation might be delayed for efficiency
+ * purposes.
+ */
+void
+ReorderBufferReturnTXN(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ if (txn->tuplecid_hash != NULL)
+ {
+ hash_destroy(txn->tuplecid_hash);
+ txn->tuplecid_hash = NULL;
+ }
+
+ if (txn->invalidations)
+ {
+ free(txn->invalidations);
+ txn->invalidations = NULL;
+ }
+
+
+ if (cache->nr_cached_transactions < max_cached_transactions)
+ {
+ cache->nr_cached_transactions++;
+ dlist_push_head(&cache->cached_transactions, &txn->node);
+ }
+ else
+ {
+ free(txn);
+ }
+}
+
+/*
+ * Get an unused, possibly preallocated, ReorderBufferChange.
+ */
+ReorderBufferChange *
+ReorderBufferGetChange(ReorderBuffer *cache)
+{
+ ReorderBufferChange *change;
+
+ if (cache->nr_cached_changes)
+ {
+ cache->nr_cached_changes--;
+ change = dlist_container(ReorderBufferChange, node,
+ dlist_pop_head_node(&cache->cached_changes));
+ }
+ else
+ {
+ change = (ReorderBufferChange *) malloc(sizeof(ReorderBufferChange));
+
+ if (!change)
+ elog(ERROR, "Could not allocate a ReorderBufferChange struct");
+ }
+
+
+ memset(change, 0, sizeof(ReorderBufferChange));
+ return change;
+}
+
+/*
+ * Free a ReorderBufferChange. Deallocation might be delayed for efficiency
+ * purposes.
+ */
+void
+ReorderBufferReturnChange(ReorderBuffer *cache, ReorderBufferChange *change)
+{
+ switch ((enum ReorderBufferChangeTypeInternal)change->action_internal)
+ {
+ case REORDER_BUFFER_CHANGE_INTERNAL_INSERT:
+ case REORDER_BUFFER_CHANGE_INTERNAL_UPDATE:
+ case REORDER_BUFFER_CHANGE_INTERNAL_DELETE:
+ if (change->newtuple)
+ {
+ ReorderBufferReturnTupleBuf(cache, change->newtuple);
+ change->newtuple = NULL;
+ }
+
+ if (change->oldtuple)
+ {
+ ReorderBufferReturnTupleBuf(cache, change->oldtuple);
+ change->oldtuple = NULL;
+ }
+ break;
+		case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+			if (change->snapshot)
+			{
+				SnapBuildSnapDecRefcount(change->snapshot);
+				change->snapshot = NULL;
+			}
+			break;
+ case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+ break;
+ case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+ break;
+ }
+
+ if (cache->nr_cached_changes < max_cached_changes)
+ {
+ cache->nr_cached_changes++;
+ dlist_push_head(&cache->cached_changes, &change->node);
+ }
+ else
+ {
+ free(change);
+ }
+}
+
+
+/*
+ * Get an unused, possibly preallocated, ReorderBufferTupleBuf.
+ */
+ReorderBufferTupleBuf *
+ReorderBufferGetTupleBuf(ReorderBuffer *cache)
+{
+ ReorderBufferTupleBuf *tuple;
+
+ if (cache->nr_cached_tuplebufs)
+ {
+ cache->nr_cached_tuplebufs--;
+ tuple = slist_container(ReorderBufferTupleBuf, node,
+ slist_pop_head_node(&cache->cached_tuplebufs));
+ }
+ else
+ {
+ tuple =
+ (ReorderBufferTupleBuf *) malloc(sizeof(ReorderBufferTupleBuf));
+
+ if (!tuple)
+ elog(ERROR, "Could not allocate a ReorderBufferTupleBuf struct");
+ }
+
+ return tuple;
+}
+
+/*
+ * Free a ReorderBufferTupleBuf. Deallocation might be delayed for efficiency
+ * purposes.
+ */
+void
+ReorderBufferReturnTupleBuf(ReorderBuffer *cache, ReorderBufferTupleBuf *tuple)
+{
+ if (cache->nr_cached_tuplebufs < max_cached_tuplebufs)
+ {
+ cache->nr_cached_tuplebufs++;
+ slist_push_head(&cache->cached_tuplebufs, &tuple->node);
+ }
+ else
+ {
+ free(tuple);
+ }
+}
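The Get/Return pairs above (for transactions, changes and tuple buffers alike) implement a bounded free list: Return pushes a node back for reuse up to a cap, and Get pops from the cache before falling back to malloc. A minimal standalone sketch of just that pattern (names here are illustrative, not from the patch):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * A bounded free list: node_return() keeps up to max_cached nodes for
 * reuse, node_get() pops from the cache before falling back to malloc().
 */
typedef struct Node { struct Node *next; int payload; } Node;

static Node *freelist = NULL;
static size_t nr_cached = 0;
static const size_t max_cached = 2;

Node *
node_get(void)
{
	Node *n;

	if (freelist)
	{
		n = freelist;
		freelist = n->next;
		nr_cached--;
	}
	else
		n = malloc(sizeof(Node));

	n->payload = 0;
	return n;
}

void
node_return(Node *n)
{
	if (nr_cached < max_cached)
	{
		n->next = freelist;
		freelist = n;
		nr_cached++;
	}
	else
		free(n);
}
```

Capping the cache bounds the memory retained after a burst of activity while still avoiding most allocator traffic in the steady state.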
+
+/*
+ * Access the transactions being processed via xid. Optionally create a new
+ * entry.
+ */
+static
+ReorderBufferTXN *
+ReorderBufferTXNByXid(ReorderBuffer *cache, TransactionId xid, bool create, bool *is_new)
+{
+ ReorderBufferTXNByIdEnt *ent;
+ bool found;
+
+ /* FIXME: add one entry fast-path cache */
+
+ ent = (ReorderBufferTXNByIdEnt *)
+ hash_search(cache->by_txn,
+ (void *)&xid,
+ (create ? HASH_ENTER : HASH_FIND),
+ &found);
+
+ if (found)
+ {
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "found cache entry for %u at %p", xid, ent);
+#endif
+ }
+ else
+ {
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "didn't find cache entry for %u in %p at %p, creating %u",
+ xid, cache, ent, create);
+#endif
+ }
+
+ if (!found && !create)
+ return NULL;
+
+ if (!found)
+ {
+ ent->txn = ReorderBufferGetTXN(cache);
+ ent->txn->xid = xid;
+ }
+
+ if (is_new)
+ *is_new = !found;
+
+ return ent->txn;
+}
+
+/*
+ * Queue a change into a transaction so it can be replayed upon commit.
+ */
+void
+ReorderBufferAddChange(ReorderBuffer *cache, TransactionId xid, XLogRecPtr lsn,
+ ReorderBufferChange *change)
+{
+ ReorderBufferTXN *txn = ReorderBufferTXNByXid(cache, xid, true, NULL);
+ txn->lsn = lsn;
+ dlist_push_tail(&txn->changes, &change->node);
+}
+
+
+/*
+ * Associate a subtransaction with its toplevel transaction.
+ */
+void
+ReorderBufferCommitChild(ReorderBuffer *cache, TransactionId xid,
+ TransactionId subxid, XLogRecPtr lsn)
+{
+ ReorderBufferTXN *txn;
+ ReorderBufferTXN *subtxn;
+
+ subtxn = ReorderBufferTXNByXid(cache, subxid, false, NULL);
+
+ /*
+ * No need to do anything if that subtxn didn't contain any changes
+ */
+ if (!subtxn)
+ return;
+
+ subtxn->lsn = lsn;
+
+ txn = ReorderBufferTXNByXid(cache, xid, true, NULL);
+
+ dlist_push_tail(&txn->subtxns, &subtxn->node);
+ txn->nsubtxns++;
+}
+
+
+/*
+ * Support for efficiently iterating over a transaction's and its
+ * subtransactions' changes.
+ *
+ * We do this by performing a k-way merge between the toplevel transaction and
+ * its subtransactions. For that we model the current head of each
+ * (sub-)transaction's change list as an entry in a binary heap, so we always
+ * know cheaply which (sub-)transaction has the change with the smallest LSN
+ * next. Luckily the changes within each individual transaction are already
+ * sorted by LSN.
+ */
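The merge described above can be illustrated standalone: sorted arrays of LSNs stand in for per-(sub-)transaction change lists, and a small hand-rolled array heap stands in for simpleheap (all names here are hypothetical, not from the patch):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct Stream { const uint64_t *lsns; size_t len, pos; } Stream;

/* min-heap of stream indices, ordered by each stream's current head LSN */
static void
sift_down(Stream *s, size_t *heap, size_t n, size_t i)
{
	for (;;)
	{
		size_t l = 2 * i + 1, r = l + 1, min = i;

		if (l < n && s[heap[l]].lsns[s[heap[l]].pos] < s[heap[min]].lsns[s[heap[min]].pos])
			min = l;
		if (r < n && s[heap[r]].lsns[s[heap[r]].pos] < s[heap[min]].lsns[s[heap[min]].pos])
			min = r;
		if (min == i)
			return;
		{ size_t tmp = heap[i]; heap[i] = heap[min]; heap[min] = tmp; }
		i = min;
	}
}

/* merge k already-sorted streams into out[], returning the number of LSNs */
size_t
kway_merge(Stream *s, size_t k, uint64_t *out)
{
	size_t *heap = malloc(k * sizeof(size_t));
	size_t n = 0, nout = 0;

	/* add_unordered + build, as in ReorderBufferIterTXNInit */
	for (size_t i = 0; i < k; i++)
		if (s[i].pos < s[i].len)
			heap[n++] = i;
	for (size_t i = n; i-- > 0;)
		sift_down(s, heap, n, i);

	while (n > 0)
	{
		Stream *top = &s[heap[0]];

		out[nout++] = top->lsns[top->pos++];
		if (top->pos == top->len)
			heap[0] = heap[--n];	/* like simpleheap_remove_first */
		sift_down(s, heap, n, 0);	/* like simpleheap_change_key */
	}
	free(heap);
	return nout;
}
```

Since each stream is already LSN-sorted, every pop only needs to re-sift the heap root, giving O(total changes * log k) overall.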
+
+/*
+ * Binary heap comparison function.
+ */
+static int
+ReorderBufferIterCompare(simpleheap_kv *a, simpleheap_kv *b)
+{
+ ReorderBufferChange *change_a = dlist_container(ReorderBufferChange, node,
+ (dlist_node*)a->key);
+ ReorderBufferChange *change_b = dlist_container(ReorderBufferChange, node,
+ (dlist_node*)b->key);
+
+ if (change_a->lsn < change_b->lsn)
+ return -1;
+
+ else if (change_a->lsn == change_b->lsn)
+ return 0;
+
+ return 1;
+}
+
+/*
+ * Initialize an iterator which iterates in lsn order over a transaction and
+ * all its subtransactions.
+ */
+static ReorderBufferIterTXNState *
+ReorderBufferIterTXNInit(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ size_t nr_txns = 0; /* main txn */
+ ReorderBufferIterTXNState *state;
+ dlist_iter cur_txn_i;
+ ReorderBufferTXN *cur_txn;
+ ReorderBufferChange *cur_change;
+
+ if (!dlist_is_empty(&txn->changes))
+ nr_txns++;
+
+ /* count how large our heap must be */
+ dlist_foreach(cur_txn_i, &txn->subtxns)
+ {
+ cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
+
+ if (!dlist_is_empty(&cur_txn->changes))
+ nr_txns++;
+ }
+
+ /*
+ * FIXME: Add fastpath for the rather common nr_txns=1 case, no need to
+ * allocate/build a heap in that case.
+ */
+
+ /* allocate iteration state */
+ state = calloc(1, sizeof(ReorderBufferIterTXNState));
+
+ /* allocate heap */
+ state->heap = simpleheap_allocate(nr_txns);
+ state->heap->compare = ReorderBufferIterCompare;
+
+	/*
+	 * Fill the array with elements; the heap condition is not yet fulfilled.
+	 * Properly building the heap afterwards is more efficient.
+	 */
+
+	/* add the toplevel transaction if it contains changes */
+ if (!dlist_is_empty(&txn->changes))
+ {
+ cur_change = dlist_head_element(ReorderBufferChange, node, &txn->changes);
+
+ simpleheap_add_unordered(state->heap, &cur_change->node, txn);
+ }
+
+ /* add subtransactions if they contain changes */
+ dlist_foreach(cur_txn_i, &txn->subtxns)
+ {
+ cur_txn = dlist_container(ReorderBufferTXN, node, cur_txn_i.cur);
+
+ if (!dlist_is_empty(&cur_txn->changes))
+ {
+ cur_change = dlist_head_element(ReorderBufferChange, node,
+ &cur_txn->changes);
+
+			simpleheap_add_unordered(state->heap, &cur_change->node, cur_txn);
+ }
+ }
+
+	/* make the array fulfill the heap property */
+ simpleheap_build(state->heap);
+ return state;
+}
+
+/*
+ * Return the next change when iterating over a transaction and its
+ * subtransaction.
+ */
+static ReorderBufferChange *
+ReorderBufferIterTXNNext(ReorderBuffer *cache, ReorderBufferIterTXNState *state)
+{
+ ReorderBufferTXN *txn = NULL;
+ ReorderBufferChange *change;
+ simpleheap_kv *kv;
+
+ /* nothing there anymore */
+ if (state->heap->size == 0)
+ return NULL;
+
+ kv = simpleheap_first(state->heap);
+
+ change = dlist_container(ReorderBufferChange, node, (dlist_node*)kv->key);
+
+ txn = (ReorderBufferTXN *) kv->value;
+
+ if (!dlist_has_next(&txn->changes, &change->node))
+ {
+ simpleheap_remove_first(state->heap);
+ }
+ else
+ {
+ simpleheap_change_key(state->heap, change->node.next);
+ }
+ return change;
+}
+
+/*
+ * Deallocate the iterator
+ */
+static void
+ReorderBufferIterTXNFinish(ReorderBuffer *cache, ReorderBufferIterTXNState *state)
+{
+ simpleheap_free(state->heap);
+ free(state);
+}
+
+
+/*
+ * Clean up the contents of a transaction, usually after the transaction has
+ * committed or aborted.
+ */
+static void
+ReorderBufferCleanupTXN(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ bool found;
+ dlist_mutable_iter cur_change;
+ dlist_mutable_iter cur_txn;
+
+ /* cleanup subtransactions & their changes */
+ dlist_foreach_modify(cur_txn, &txn->subtxns)
+ {
+ ReorderBufferTXN *subtxn = dlist_container(ReorderBufferTXN, node, cur_txn.cur);
+
+ dlist_foreach_modify(cur_change, &subtxn->changes)
+ {
+ ReorderBufferChange *change =
+ dlist_container(ReorderBufferChange, node, cur_change.cur);
+
+ ReorderBufferReturnChange(cache, change);
+ }
+ ReorderBufferReturnTXN(cache, subtxn);
+ }
+
+ /* cleanup changes in the toplevel txn */
+ dlist_foreach_modify(cur_change, &txn->changes)
+ {
+ ReorderBufferChange *change =
+ dlist_container(ReorderBufferChange, node, cur_change.cur);
+
+ ReorderBufferReturnChange(cache, change);
+ }
+
+	/*
+	 * Clean up the tuplecids we stored for timetravel access. They are always
+	 * stored in the toplevel transaction.
+	 */
+ dlist_foreach_modify(cur_change, &txn->tuplecids)
+ {
+ ReorderBufferChange *change =
+ dlist_container(ReorderBufferChange, node, cur_change.cur);
+ Assert(change->action_internal == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+ ReorderBufferReturnChange(cache, change);
+ }
+
+ /* now remove reference from cache */
+ hash_search(cache->by_txn,
+ (void *)&txn->xid,
+ HASH_REMOVE,
+ &found);
+ Assert(found);
+
+ ReorderBufferReturnTXN(cache, txn);
+}
+
+/*
+ * Build a hash with a (relfilenode, itempoint) -> (cmin, cmax) mapping for use
+ * by tqual.c's HeapTupleSatisfiesMVCCDuringDecoding.
+ */
+static void
+ReorderBufferBuildTupleCidHash(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ dlist_iter cur_change;
+ HASHCTL hash_ctl;
+
+ if (!txn->does_timetravel || dlist_is_empty(&txn->tuplecids))
+ return;
+
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+
+ hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
+ hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt);
+ hash_ctl.hash = tag_hash;
+ hash_ctl.hcxt = cache->context;
+
+ /*
+ * create the hash with the exact number of to-be-stored tuplecids from the
+ * start
+ */
+ txn->tuplecid_hash =
+ hash_create("ReorderBufferTupleCid", txn->ntuplecids, &hash_ctl,
+ HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);
+
+ dlist_foreach(cur_change, &txn->tuplecids)
+ {
+ ReorderBufferTupleCidKey key;
+ ReorderBufferTupleCidEnt *ent;
+ bool found;
+ ReorderBufferChange *change =
+ dlist_container(ReorderBufferChange, node, cur_change.cur);
+
+ Assert(change->action_internal == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
+
+ /* be careful about padding */
+ memset(&key, 0, sizeof(ReorderBufferTupleCidKey));
+
+ key.relnode = change->tuplecid.node;
+
+ ItemPointerCopy(&change->tuplecid.tid,
+ &key.tid);
+
+ ent = (ReorderBufferTupleCidEnt *)
+ hash_search(txn->tuplecid_hash,
+ (void *)&key,
+					HASH_ENTER,
+ &found);
+ if (!found)
+ {
+ ent->cmin = change->tuplecid.cmin;
+ ent->cmax = change->tuplecid.cmax;
+ ent->combocid = change->tuplecid.combocid;
+ }
+ else
+ {
+ Assert(ent->cmin == change->tuplecid.cmin);
+ Assert(ent->cmax == InvalidCommandId ||
+ ent->cmax == change->tuplecid.cmax);
+			/*
+			 * If the tuple became valid in this transaction and is now being
+			 * deleted, we already have a valid cmin stored; the stored cmax
+			 * will still be InvalidCommandId though, so update it.
+			 */
+ ent->cmax = change->tuplecid.cmax;
+ }
+ }
+}
+
+/*
+ * Copy a provided snapshot so we can modify it privately. This is needed so
+ * that catalog modifying transactions can look into intermediate catalog
+ * states.
+ */
+static Snapshot
+ReorderBufferCopySnap(ReorderBuffer *cache, Snapshot orig_snap,
+ ReorderBufferTXN *txn, CommandId cid)
+{
+ Snapshot snap;
+ dlist_iter sub_txn_i;
+ ReorderBufferTXN *sub_txn;
+ int i = 0;
+	Size		size = sizeof(SnapshotData) +
+		sizeof(TransactionId) * orig_snap->xcnt +
+		sizeof(TransactionId) * (txn->nsubtxns + 1);
+
+ elog(DEBUG1, "copying a non-transaction-specific snapshot into timetravel tx %u", txn->xid);
+
+ /* we only want to start with snapshots as provided by snapbuild.c */
+ Assert(!orig_snap->subxip);
+ Assert(!orig_snap->copied);
+
+ snap = calloc(1, size);
+ memcpy(snap, orig_snap, sizeof(SnapshotData));
+
+ snap->copied = true;
+ snap->active_count = 0;
+ snap->regd_count = 0;
+ snap->xip = (TransactionId *) (snap + 1);
+
+ memcpy(snap->xip, orig_snap->xip, sizeof(TransactionId) * snap->xcnt);
+
+	/*
+	 * ->subxip contains all txids that belong to our transaction which we need
+	 * to check via cmin/cmax. That's why we store the toplevel transaction in
+	 * there as well.
+	 */
+ snap->subxip = snap->xip + snap->xcnt;
+ snap->subxip[i++] = txn->xid;
+ snap->subxcnt = txn->nsubtxns + 1;
+
+ dlist_foreach(sub_txn_i, &txn->subtxns)
+ {
+ sub_txn = dlist_container(ReorderBufferTXN, node, sub_txn_i.cur);
+ snap->subxip[i++] = sub_txn->xid;
+ }
+
+ /* bsearch()ability */
+ qsort(snap->subxip, snap->subxcnt,
+ sizeof(TransactionId), xidComparator);
+
+ /*
+ * store the specified current CommandId
+ */
+ snap->curcid = cid;
+
+ return snap;
+}
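ReorderBufferCopySnap relies on a single-allocation layout: one calloc holds the snapshot struct followed by the xip and subxip arrays, so one free() releases everything (as ReorderBufferFreeSnap does). A simplified sketch of that layout, with hypothetical names standing in for SnapshotData:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t TransactionId;

typedef struct MiniSnap
{
	uint32_t	xcnt, subxcnt;
	TransactionId *xip;			/* points just past the struct */
	TransactionId *subxip;		/* points just past xip */
} MiniSnap;

/* one calloc holds the struct plus both arrays, so a single free() suffices */
MiniSnap *
minisnap_copy(const TransactionId *xip, uint32_t xcnt,
			  const TransactionId *subxip, uint32_t subxcnt)
{
	MiniSnap   *snap = calloc(1, sizeof(MiniSnap) +
							  sizeof(TransactionId) * (xcnt + subxcnt));

	snap->xcnt = xcnt;
	snap->subxcnt = subxcnt;
	snap->xip = (TransactionId *) (snap + 1);
	snap->subxip = snap->xip + xcnt;
	memcpy(snap->xip, xip, sizeof(TransactionId) * xcnt);
	memcpy(snap->subxip, subxip, sizeof(TransactionId) * subxcnt);
	return snap;
}
```

Bundling the arrays into the same block also keeps the copied snapshot cache-friendly and makes ownership unambiguous.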
+
+/*
+ * Free a previously ReorderBufferCopySnap'ed snapshot
+ */
+static void
+ReorderBufferFreeSnap(ReorderBuffer *cache, Snapshot snap)
+{
+ Assert(snap->copied);
+ free(snap);
+}
+
+/*
+ * Commit a transaction and replay all actions that previously have been
+ * ReorderBufferAddChange'd in the toplevel TX or any of the subtransactions
+ * assigned via ReorderBufferCommitChild.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *cache, TransactionId xid, XLogRecPtr lsn)
+{
+ ReorderBufferTXN *txn = ReorderBufferTXNByXid(cache, xid, false, NULL);
+ ReorderBufferIterTXNState *iterstate = NULL;
+ ReorderBufferChange *change;
+ CommandId command_id = FirstCommandId;
+ Snapshot snapshot_mvcc = NULL;
+ Snapshot snapshot_now = NULL;
+ bool snapshot_copied = false;
+
+ if (!txn)
+ return;
+
+ txn->lsn = lsn;
+
+ cache->begin(cache, txn);
+
+ PG_TRY();
+ {
+ ReorderBufferBuildTupleCidHash(cache, txn);
+
+ iterstate = ReorderBufferIterTXNInit(cache, txn);
+ while ((change = ReorderBufferIterTXNNext(cache, iterstate)))
+ {
+ switch ((enum ReorderBufferChangeTypeInternal)change->action_internal)
+ {
+ case REORDER_BUFFER_CHANGE_INTERNAL_INSERT:
+ case REORDER_BUFFER_CHANGE_INTERNAL_UPDATE:
+ case REORDER_BUFFER_CHANGE_INTERNAL_DELETE:
+ Assert(snapshot_mvcc != NULL);
+ if (!SnapBuildHasCatalogChanges(NULL, xid, &change->relnode))
+ cache->apply_change(cache, txn, change);
+ break;
+ case REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT:
+ /* XXX: we could skip snapshots in non toplevel txns */
+ if (snapshot_copied)
+ {
+ ReorderBufferFreeSnap(cache, snapshot_now);
+ snapshot_now = ReorderBufferCopySnap(cache, change->snapshot,
+ txn, command_id);
+ }
+ else
+ {
+ snapshot_now = change->snapshot;
+ }
+
+ /*
+ * the first snapshot seen in a transaction is its mvcc
+ * snapshot
+ */
+ if (!snapshot_mvcc)
+ snapshot_mvcc = snapshot_now;
+ else
+ RevertFromDecodingSnapshots();
+
+ SetupDecodingSnapshots(snapshot_now, txn->tuplecid_hash);
+ break;
+
+ case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
+ if (!snapshot_copied && snapshot_now)
+ {
+ /* we don't use the global one anymore */
+ snapshot_copied = true;
+ snapshot_now = ReorderBufferCopySnap(cache, snapshot_now,
+ txn, command_id);
+ }
+
+ command_id = Max(command_id, change->command_id);
+
+ if (snapshot_now && command_id != InvalidCommandId)
+ {
+ snapshot_now->curcid = command_id;
+
+ RevertFromDecodingSnapshots();
+ SetupDecodingSnapshots(snapshot_now, txn->tuplecid_hash);
+ }
+
+					/*
+					 * Every time the CommandId is incremented, we could see
+					 * new catalog contents.
+					 */
+ ReorderBufferProcessInvalidations(cache, txn);
+
+ break;
+
+ case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
+ elog(ERROR, "tuplecid value in normal queue");
+ break;
+ }
+ }
+
+ ReorderBufferIterTXNFinish(cache, iterstate);
+
+ cache->commit(cache, txn, lsn);
+
+ RevertFromDecodingSnapshots();
+ ReorderBufferProcessInvalidations(cache, txn);
+
+ ReorderBufferCleanupTXN(cache, txn);
+
+
+ if (snapshot_copied)
+ {
+ ReorderBufferFreeSnap(cache, snapshot_now);
+ }
+ }
+ PG_CATCH();
+ {
+ if (iterstate)
+ ReorderBufferIterTXNFinish(cache, iterstate);
+
+ /*
+ * XXX: do we want to do this here?
+ * ReorderBufferCleanupTXN(cache, txn);
+ */
+
+ RevertFromDecodingSnapshots();
+ ReorderBufferProcessInvalidations(cache, txn);
+
+ if (snapshot_copied)
+ {
+ ReorderBufferFreeSnap(cache, snapshot_now);
+ }
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+}
+
+/*
+ * Abort a transaction that possibly has previous changes. Needs to be done
+ * independently for toplevel and subtransactions.
+ */
+void
+ReorderBufferAbort(ReorderBuffer *cache, TransactionId xid, XLogRecPtr lsn)
+{
+ ReorderBufferTXN *txn = ReorderBufferTXNByXid(cache, xid, false, NULL);
+
+	/* no changes in this transaction, nothing to do */
+ if (!txn)
+ return;
+
+ ReorderBufferCleanupTXN(cache, txn);
+}
+
+/*
+ * Check whether a transaction is already known in this module
+ */
+bool
+ReorderBufferIsXidKnown(ReorderBuffer *cache, TransactionId xid)
+{
+ bool is_new;
+	/*
+	 * FIXME: for efficiency reasons we create an entry for the xid here, which
+	 * doesn't seem like a good idea.
+	 */
+ ReorderBufferTXNByXid(cache, xid, true, &is_new);
+
+	/* the xid is only known if we didn't just create the entry */
+ return !is_new;
+}
+
+/*
+ * Add a new snapshot to this transaction which is the "base" of snapshots we
+ * modify if this is a catalog modifying transaction.
+ */
+void
+ReorderBufferAddBaseSnapshot(ReorderBuffer *cache, TransactionId xid,
+ XLogRecPtr lsn, Snapshot snap)
+{
+ ReorderBufferTXN *txn = ReorderBufferTXNByXid(cache, xid, true, NULL);
+ ReorderBufferChange *change = ReorderBufferGetChange(cache);
+
+ change->snapshot = snap;
+ change->action_internal = REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT;
+
+ if (lsn == InvalidXLogRecPtr)
+ txn->has_base_snapshot = true;
+
+ ReorderBufferAddChange(cache, xid, lsn, change);
+}
+
+/*
+ * Access the catalog with this CommandId at this point in the changestream.
+ */
+void
+ReorderBufferAddNewCommandId(ReorderBuffer *cache, TransactionId xid,
+ XLogRecPtr lsn, CommandId cid)
+{
+ ReorderBufferChange *change = ReorderBufferGetChange(cache);
+
+ change->command_id = cid;
+ change->action_internal = REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID;
+
+ ReorderBufferAddChange(cache, xid, lsn, change);
+}
+
+
+/*
+ * Add new (relfilenode, tid) -> (cmin, cmax) mappings.
+ */
+void
+ReorderBufferAddNewTupleCids(ReorderBuffer *cache, TransactionId xid, XLogRecPtr lsn,
+ RelFileNode node, ItemPointerData tid,
+ CommandId cmin, CommandId cmax, CommandId combocid)
+{
+ ReorderBufferChange *change = ReorderBufferGetChange(cache);
+ ReorderBufferTXN *txn = ReorderBufferTXNByXid(cache, xid, true, NULL);
+
+ change->tuplecid.node = node;
+ change->tuplecid.tid = tid;
+ change->tuplecid.cmin = cmin;
+ change->tuplecid.cmax = cmax;
+ change->tuplecid.combocid = combocid;
+ change->lsn = lsn;
+ change->action_internal = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+ dlist_push_tail(&txn->tuplecids, &change->node);
+ txn->ntuplecids++;
+}
+
+/*
+ * Setup the invalidation of the toplevel transaction.
+ *
+ * This needs to be done before ReorderBufferCommit is called!
+ */
+void
+ReorderBufferAddInvalidations(ReorderBuffer *cache, TransactionId xid, XLogRecPtr lsn,
+ Size nmsgs, SharedInvalidationMessage* msgs)
+{
+ ReorderBufferTXN *txn = ReorderBufferTXNByXid(cache, xid, true, NULL);
+
+ if (txn->ninvalidations)
+ elog(ERROR, "only ever add one set of invalidations");
+ /* FIXME: free */
+ txn->invalidations = malloc(sizeof(SharedInvalidationMessage) * nmsgs);
+
+ if (!txn->invalidations)
+ elog(ERROR, "could not allocate memory for invalidations");
+
+ memcpy(txn->invalidations, msgs, sizeof(SharedInvalidationMessage) * nmsgs);
+ txn->ninvalidations = nmsgs;
+}
+
+/*
+ * Apply all invalidations we know. Possibly we only need parts at this point
+ * in the changestream but we don't know which those are.
+ */
+static void
+ReorderBufferProcessInvalidations(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ int i;
+ for (i = 0; i < txn->ninvalidations; i++)
+ {
+ LocalExecuteInvalidationMessage(&txn->invalidations[i]);
+ }
+}
+
+/*
+ * Mark a transaction as doing timetravel.
+ */
+void
+ReorderBufferXidSetTimetravel(ReorderBuffer *cache, TransactionId xid)
+{
+ ReorderBufferTXN *txn = ReorderBufferTXNByXid(cache, xid, true, NULL);
+ txn->does_timetravel = true;
+}
+
+/*
+ * Query whether a transaction is *known* to be doing timetravel. This can be
+ * wrong until directly before the commit!
+ */
+bool
+ReorderBufferXidDoesTimetravel(ReorderBuffer *cache, TransactionId xid)
+{
+ ReorderBufferTXN *txn = ReorderBufferTXNByXid(cache, xid, true, NULL);
+ return txn->does_timetravel;
+}
+
+/*
+ * Have we already added the first snapshot?
+ */
+bool
+ReorderBufferXidHasBaseSnapshot(ReorderBuffer *cache, TransactionId xid)
+{
+ ReorderBufferTXN *txn = ReorderBufferTXNByXid(cache, xid, true, NULL);
+ return txn->has_base_snapshot;
+}
+
+/*
+ * Visibility support routines
+ */
+
+/*-------------------------------------------------------------------------
+ * Lookup actual cmin/cmax values during timetravel access. We can't always
+ * rely on stored cmin/cmax values because of two scenarios:
+ *
+ * * A tuple got changed multiple times during a single transaction and thus
+ *   has got a combocid. Combocids are only valid for the duration of a single
+ *   transaction.
+ * * A tuple with a cmin but no cmax (and thus no combocid) got deleted/updated
+ *   in a different transaction than the one that created it, which is the one
+ *   we are looking at right now. As only one of cmin, cmax or combocid is
+ *   actually stored in the heap, we don't have access to the value we need
+ *   anymore.
+ *
+ * To resolve those problems we have a per-transaction hash of (cmin, cmax)
+ * tuples keyed by (relfilenode, ctid) which contains the actual (cmin, cmax)
+ * values. That also takes care of combocids by simply not caring about them at
+ * all. As we have the real cmin/cmax values, that's enough.
+ *
+ * As we only care about catalog tuples here the overhead of this hashtable
+ * should be acceptable.
+ * -------------------------------------------------------------------------
+ */
+extern bool
+ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data,
+ HeapTuple htup, Buffer buffer,
+ CommandId *cmin, CommandId *cmax)
+{
+ ReorderBufferTupleCidKey key;
+ ReorderBufferTupleCidEnt* ent;
+ ForkNumber forkno;
+ BlockNumber blockno;
+
+ /* be careful about padding */
+ memset(&key, 0, sizeof(key));
+
+ Assert(!BufferIsLocal(buffer));
+
+ /*
+ * get relfilenode from the buffer, no convenient way to access it other
+ * than that.
+ */
+ BufferGetTag(buffer, &key.relnode, &forkno, &blockno);
+
+ /* tuples can only be in the main fork */
+ Assert(forkno == MAIN_FORKNUM);
+ Assert(blockno == ItemPointerGetBlockNumber(&htup->t_self));
+
+ ItemPointerCopy(&htup->t_self,
+ &key.tid);
+
+ ent = (ReorderBufferTupleCidEnt *)
+ hash_search(tuplecid_data,
+ (void *)&key,
+ HASH_FIND,
+ NULL);
+
+ if (ent == NULL)
+ return false;
+
+ if (cmin)
+ *cmin = ent->cmin;
+ if (cmax)
+ *cmax = ent->cmax;
+ return true;
+}
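Taken together, ReorderBufferBuildTupleCidHash and ResolveCminCmaxDuringDecoding implement a (relfilenode, ctid) -> (cmin, cmax) mapping: the first sighting of a tuple stores cmin/cmax, later sightings only update cmax. A simplified sketch of that insert-or-update and lookup logic (a linear array stands in for the dynahash table, and the key types are reduced to plain integers; all names are illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* stand-ins for RelFileNode and ItemPointerData, reduced to two integers */
typedef struct TupleCidKey { uint32_t relnode; uint32_t tid; } TupleCidKey;
typedef struct TupleCidEnt { TupleCidKey key; uint32_t cmin, cmax; } TupleCidEnt;

/* a tiny fixed-capacity map; capacity checks omitted for brevity */
typedef struct TupleCidMap { TupleCidEnt ents[64]; size_t n; } TupleCidMap;

/*
 * Insert-or-update, mirroring ReorderBufferBuildTupleCidHash: the first
 * sighting stores cmin and cmax, later sightings only update cmax (the
 * tuple was created and then deleted within the same transaction).
 */
void
tuplecid_put(TupleCidMap *m, TupleCidKey key, uint32_t cmin, uint32_t cmax)
{
	for (size_t i = 0; i < m->n; i++)
		if (memcmp(&m->ents[i].key, &key, sizeof(key)) == 0)
		{
			m->ents[i].cmax = cmax;
			return;
		}
	m->ents[m->n].key = key;
	m->ents[m->n].cmin = cmin;
	m->ents[m->n].cmax = cmax;
	m->n++;
}

/* lookup, mirroring ResolveCminCmaxDuringDecoding */
int
tuplecid_get(TupleCidMap *m, TupleCidKey key, uint32_t *cmin, uint32_t *cmax)
{
	for (size_t i = 0; i < m->n; i++)
		if (memcmp(&m->ents[i].key, &key, sizeof(key)) == 0)
		{
			*cmin = m->ents[i].cmin;
			*cmax = m->ents[i].cmax;
			return 1;
		}
	return 0;
}
```

Because the map keeps the real cmin and cmax side by side, combocids never need to be decoded at lookup time.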
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
new file mode 100644
index 0000000..df24b33
--- /dev/null
+++ b/src/backend/replication/logical/snapbuild.c
@@ -0,0 +1,1144 @@
+/*-------------------------------------------------------------------------
+ *
+ * snapbuild.c
+ *
+ * Support for building timetravel snapshots based on the contents of the
+ * wal
+ *
+ * NOTES:
+ *
+ * We build snapshots which can *only* be used to read catalog contents by
+ * reading the wal stream. The aim is to provide mvcc and SnapshotNow snapshots
+ * that behave the same as their respective counterparts would have at the time
+ * the XLogRecord was generated. This is done to provide a reliable environment
+ * for decoding those records into every format that pleases the author of an
+ * output plugin.
+ *
+ * To build the snapshots we reuse the infrastructure built for hot
+ * standby. The snapshots we build look different than HS' because we have
+ * different needs. To successfully decode data from the WAL we only need to
+ * access catalogs/(sys|rel|cat)cache, not the actual user tables. And we need
+ * to build multiple, vastly different, ones, without being able to fully rely
+ * on the clog for information about committed transactions because they might
+ * commit in the future from the POV of the wal entry we're currently decoding.
+ *
+ * As the percentage of transactions modifying the catalog normally is fairly
+ * small, instead of keeping track of all running transactions and treating
+ * everything inside (xmin, xmax) that's not known to be running as committed,
+ * we do the contrary. That is, we keep a list of transactions between
+ * snapshot->(xmin, xmax) that we consider committed, everything else is
+ * considered aborted/in progress.
+ * That also allows us not to care about subtransactions before they have
+ * committed.
+ *
+ * Classic SnapshotNow behaviour - which is mainly used for efficiency, not for
+ * correctness - is not actually required by any of the routines that we need
+ * during decoding and is hard to emulate fully. Instead we build snapshots
+ * with MVCC behaviour that are updated whenever another transaction commits.
+ *
+ * One additional complexity of doing this is that to handle mixed DDL/DML
+ * transactions we need Snapshots that see intermediate states in a
+ * transaction. In normal operation this is achieved by using
+ * CommandIds/cmin/cmax. The problem with this however is that for space
+ * efficiency reasons only one value of that is stored (cf. combocid.c). To get
+ * around that we log additional information which allows us to get the
+ * original (cmin, cmax) pair during visibility checks.
+ *
+ * To facilitate all this we need our own visibility routine, as the normal
+ * ones are optimized for different use cases. We also need the code to use our
+ * special snapshots automatically whenever SnapshotNow behaviour is expected
+ * (specifying our snapshot everywhere would be far too invasive).
+ *
+ * To replace the normal SnapshotNow snapshots, use the SetupDecodingSnapshots
+ * and RevertFromDecodingSnapshots functions.
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/replication/snapbuild.c
+ *
+ *-------------------------------------------------------------------------
+ */
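The inverted bookkeeping described in the NOTES above — tracking the committed xids between (xmin, xmax) rather than the running ones — boils down to a check like the following sketch (heavily simplified: no xid wraparound handling, and real snapshots also carry subxip and more; all names are illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/*
 * Simplified decoding snapshot: everything below xmin is known committed,
 * everything at or above xmax is invisible, and in between only xids found
 * in the sorted 'committed' array count as committed. Everything else is
 * treated as aborted or in progress.
 */
typedef struct DecodingSnapshot
{
	TransactionId xmin, xmax;
	const TransactionId *committed;		/* sorted ascending */
	size_t		ncommitted;
} DecodingSnapshot;

static int
xid_cmp(const void *a, const void *b)
{
	TransactionId xa = *(const TransactionId *) a;
	TransactionId xb = *(const TransactionId *) b;

	return (xa < xb) ? -1 : (xa > xb) ? 1 : 0;
}

/* was 'xid' committed as of this snapshot? */
int
snap_xid_committed(const DecodingSnapshot *snap, TransactionId xid)
{
	if (xid < snap->xmin)
		return 1;
	if (xid >= snap->xmax)
		return 0;
	return bsearch(&xid, snap->committed, snap->ncommitted,
				   sizeof(TransactionId), xid_cmp) != NULL;
}
```

Keeping the small committed list instead of the full running list is what lets subtransactions be ignored until they actually commit.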
+
+#include "postgres.h"
+
+#include "access/heapam_xlog.h"
+#include "access/rmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlogreader.h"
+
+#include "catalog/catalog.h"
+#include "catalog/pg_control.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_tablespace.h"
+
+#include "miscadmin.h"
+
+#include "replication/reorderbuffer.h"
+#include "replication/snapbuild.h"
+#include "replication/walsender_private.h"
+
+#include "utils/builtins.h"
+#include "utils/catcache.h"
+#include "utils/inval.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/relmapper.h"
+#include "utils/snapshot.h"
+#include "utils/syscache.h"
+#include "utils/tqual.h"
+
+#include "storage/block.h" /* debugging output */
+#include "storage/standby.h"
+#include "storage/sinval.h"
+
+/* transaction state manipulation functions */
+static void SnapBuildEndTxn(Snapstate *snapstate, TransactionId xid);
+
+static void SnapBuildAbortTxn(Snapstate *state, TransactionId xid, int nsubxacts,
+ TransactionId *subxacts);
+
+static void SnapBuildCommitTxn(Snapstate *snapstate, ReorderBuffer *reorder,
+ XLogRecPtr lsn, TransactionId xid,
+ int nsubxacts, TransactionId *subxacts);
+
+/* ->running manipulation */
+static bool SnapBuildTxnIsRunning(Snapstate *snapstate, TransactionId xid);
+
+/* ->committed manipulation */
+static void SnapBuildPurgeCommittedTxn(Snapstate *snapstate);
+
+/* snapshot building/manipulation/distribution functions */
+/* XXX */
+static Snapshot SnapBuildBuildSnapshot(Snapstate *snapstate, TransactionId xid);
+
+static void SnapBuildFreeSnapshot(Snapshot snap);
+
+static void SnapBuildSnapIncRefcount(Snapshot snap);
+
+static void SnapBuildDistributeSnapshotNow(Snapstate *snapstate, ReorderBuffer *reorder, XLogRecPtr lsn);
+
+/*
+ * Lookup a table via its current relfilenode.
+ *
+ * This requires that some snapshot in which that relfilenode is actually
+ * visible to be set up.
+ *
+ * The result of this function needs to be released from the syscache.
+ */
+HeapTuple
+LookupTableByRelFileNode(RelFileNode *relfilenode)
+{
+ Oid spc;
+ HeapTuple tuple;
+ Oid heaprel;
+
+ /*
+ * relations in the default tablespace are stored with a reltablespace = 0
+ * for some reason.
+ */
+ spc = relfilenode->spcNode == DEFAULTTABLESPACE_OID ?
+ InvalidOid : relfilenode->spcNode;
+
+ tuple = SearchSysCache2(RELFILENODE,
+ spc,
+ relfilenode->relNode);
+
+ if (!HeapTupleIsValid(tuple))
+ {
+ if (relfilenode->spcNode == GLOBALTABLESPACE_OID)
+ {
+ heaprel = RelationMapFilenodeToOid(relfilenode->relNode, true);
+ }
+ else
+ {
+ heaprel = RelationMapFilenodeToOid(relfilenode->relNode, false);
+ }
+
+ if (heaprel != InvalidOid)
+ {
+ tuple = SearchSysCache1(RELOID,
+ heaprel);
+ }
+ }
+ return tuple;
+}
+
+/*
+ * Does this relation carry catalog information? Important for knowing whether
+ * a transaction made changes to the catalog, in which case it needs to be
+ * included in snapshots.
+ *
+ * Requires that an appropriate timetravel snapshot is set up!
+ */
+bool
+SnapBuildHasCatalogChanges(Snapstate *snapstate, TransactionId xid, RelFileNode *relfilenode)
+{
+ HeapTuple table;
+ Form_pg_class class_form;
+ bool ret;
+
+ if (relfilenode->spcNode == GLOBALTABLESPACE_OID)
+ return true;
+
+
+ table = LookupTableByRelFileNode(relfilenode);
+
+ /*
+ * tables in the default tablespace are stored in pg_class with 0 as their
+ * reltablespace
+ */
+ if (!HeapTupleIsValid(table))
+ {
+ elog(FATAL, "failed pg_class lookup for %u:%u",
+ relfilenode->spcNode, relfilenode->relNode);
+ return false;
+ }
+
+ class_form = (Form_pg_class) GETSTRUCT(table);
+ ret = IsSystemClass(class_form);
+
+ ReleaseSysCache(table);
+ return ret;
+}
+
+/*
+ * Allocate a new snapshot builder.
+ */
+Snapstate *
+AllocateSnapshotBuilder(ReorderBuffer *reorder)
+{
+	Snapstate  *snapstate = malloc(sizeof(Snapstate));
+
+	if (!snapstate)
+		elog(ERROR, "could not allocate memory for snapstate");
+
+	snapstate->state = SNAPBUILD_START;
+
+ snapstate->nrrunning = 0;
+ snapstate->nrrunning_initial = 0;
+ snapstate->running = NULL;
+
+ snapstate->nrcommitted = 0;
+ snapstate->nrcommitted_space = 128; /* arbitrary number */
+ snapstate->committed = malloc(snapstate->nrcommitted_space * sizeof(TransactionId));
+ snapstate->transactions_after = InvalidXLogRecPtr;
+
+ if (!snapstate->committed)
+ elog(ERROR, "could not allocate memory for snapstate->committed");
+
+ snapstate->snapshot = NULL;
+
+ return snapstate;
+}
+
+/*
+ * Free a snapshot builder.
+ */
+void
+FreeSnapshotBuilder(Snapstate *snapstate)
+{
+ if (snapstate->snapshot)
+ SnapBuildFreeSnapshot(snapstate->snapshot);
+
+ if (snapstate->committed)
+ free(snapstate->committed);
+
+ if (snapstate->running)
+ free(snapstate->running);
+
+ free(snapstate);
+}
+
+/*
+ * Free an unreferenced snapshot that has previously been built by us.
+ */
+static void
+SnapBuildFreeSnapshot(Snapshot snap)
+{
+ /* make sure we don't get passed an external snapshot */
+ Assert(snap->satisfies == HeapTupleSatisfiesMVCCDuringDecoding);
+
+ /* make sure nobody modified our snapshot */
+ Assert(snap->curcid == FirstCommandId);
+ Assert(!snap->suboverflowed);
+ Assert(!snap->takenDuringRecovery);
+ Assert(!snap->regd_count);
+
+ /* slightly more likely, so it's checked even without casserts */
+ if (snap->copied)
+ elog(ERROR, "we don't deal with copied snapshots here.");
+
+ if (snap->active_count)
+ elog(ERROR, "freeing active snapshot");
+
+ free(snap);
+}
+
+/*
+ * Increase refcount of a snapshot.
+ *
+ * This is used when handing out a snapshot to some external resource or when
+ * adding a Snapshot as snapstate->snapshot.
+ */
+static void
+SnapBuildSnapIncRefcount(Snapshot snap)
+{
+ snap->active_count++;
+}
+
+/*
+ * Decrease refcount of a snapshot and free if the refcount reaches zero.
+ *
+ * Externally visible so external resources that have been handed an IncRef'ed
+ * Snapshot can free it easily.
+ */
+void
+SnapBuildSnapDecRefcount(Snapshot snap)
+{
+ /* make sure we don't get passed an external snapshot */
+ Assert(snap->satisfies == HeapTupleSatisfiesMVCCDuringDecoding);
+
+ /* make sure nobody modified our snapshot */
+ Assert(snap->curcid == FirstCommandId);
+ Assert(!snap->suboverflowed);
+ Assert(!snap->takenDuringRecovery);
+ Assert(!snap->regd_count);
+
+ Assert(snap->active_count);
+
+ /* slightly more likely, so it's checked even without casserts */
+ if (snap->copied)
+ elog(ERROR, "we don't deal with copied snapshots here.");
+
+ snap->active_count--;
+ if (!snap->active_count)
+ SnapBuildFreeSnapshot(snap);
+}
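As an aside, the inc/dec refcount discipline used above is easiest to see outside the patch. This is a toy sketch with hypothetical names (MiniSnap stands in for SnapshotData; a flag stands in for the free() in SnapBuildFreeSnapshot), not part of the patch itself:

```c
#include <assert.h>

/* Toy model of the snapshot refcounting: the builder holds one reference,
 * each transaction handed the snapshot holds another; the last decrement
 * frees it. */
typedef struct MiniSnap
{
    int active_count;
    int freed;              /* stands in for the actual free() call */
} MiniSnap;

static void
mini_snap_inc(MiniSnap *snap)
{
    snap->active_count++;
}

static void
mini_snap_dec(MiniSnap *snap)
{
    assert(snap->active_count > 0);
    if (--snap->active_count == 0)
        snap->freed = 1;
}
```

The point is that SnapBuildSnapDecRefcount is safe to call from external resources without them knowing whether the builder still references the same snapshot.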
+
+/*
+ * Build a new snapshot, based on currently committed, catalog modifying
+ * transactions.
+ *
+ * In-progress transactions with catalog access are *not* allowed to modify
+ * these snapshots; they have to copy them and fill in the appropriate
+ * ->curcid and ->subxip/subxcnt values.
+ */
+static Snapshot
+SnapBuildBuildSnapshot(Snapstate *snapstate, TransactionId xid)
+{
+ Snapshot snapshot = malloc(sizeof(SnapshotData) +
+ sizeof(TransactionId) * snapstate->nrcommitted +
+ sizeof(TransactionId) * 1 /* toplevel xid */);
+
+ if (!snapshot)
+ elog(ERROR, "could not allocate memory for snapshot");
+
+ snapshot->satisfies = HeapTupleSatisfiesMVCCDuringDecoding;
+ /*
+ * We store the xids of all catalog modifying transactions that
+ * committed while the snapshot was being built in ->xip - those need to
+ * be treated as visible under SnapshotNow semantics. ->subxip is left
+ * for the using transaction to fill in its own subtransaction ids.
+ *
+ * XXX: Do we want extra fields for those two instead?
+ */
+ snapshot->xmin = snapstate->xmin;
+ snapshot->xmax = snapstate->xmax;
+
+ /* store all transactions to be treated as committed */
+ snapshot->xip = (TransactionId *) ((char *) snapshot + sizeof(SnapshotData));
+ snapshot->xcnt = snapstate->nrcommitted;
+ memcpy(snapshot->xip, snapstate->committed,
+ snapstate->nrcommitted * sizeof(TransactionId));
+ /* sort so we can bsearch() */
+ qsort(snapshot->xip, snapshot->xcnt, sizeof(TransactionId), xidComparator);
+
+
+ snapshot->subxcnt = 0;
+ snapshot->subxip = NULL;
+
+ snapshot->suboverflowed = false;
+ snapshot->takenDuringRecovery = false;
+ snapshot->copied = false;
+ snapshot->curcid = FirstCommandId;
+ snapshot->active_count = 0;
+ snapshot->regd_count = 0;
+
+ return snapshot;
+}
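Since ->xip is sorted with xidComparator at build time, a consumer of the built snapshot can test an xid with a single bsearch(). A standalone sketch of that lookup (simplified comparator, hypothetical helper name - not part of the patch):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Simplified stand-in for xidComparator: plain numeric order, which is
 * also what the real comparator uses for sorting purposes. */
static int
xid_cmp(const void *a, const void *b)
{
    TransactionId xa = *(const TransactionId *) a;
    TransactionId xb = *(const TransactionId *) b;

    if (xa < xb)
        return -1;
    if (xa > xb)
        return 1;
    return 0;
}

/* How a consumer tests whether an xid has to be treated as committed:
 * one bsearch() over the sorted ->xip array of the built snapshot. */
static int
xip_contains(const TransactionId *xip, size_t xcnt, TransactionId xid)
{
    return bsearch(&xid, xip, xcnt, sizeof(TransactionId), xid_cmp) != NULL;
}
```

This is why the qsort() above pays off: each visibility check during decoding is O(log n) rather than a linear scan.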
+
+/*
+ * Handle the effects of a single heap change, appropriate to the current state
+ * of the snapshot builder.
+ */
+static SnapBuildAction
+SnapBuildProcessChange(ReorderBuffer *reorder, Snapstate *snapstate,
+ TransactionId xid, XLogRecordBuffer *buf,
+ RelFileNode *relfilenode)
+{
+ SnapBuildAction ret = SNAPBUILD_SKIP;
+
+ /*
+ * We can't handle data in transactions if we haven't built a snapshot yet,
+ * so don't store them.
+ */
+ if (snapstate->state < SNAPBUILD_FULL_SNAPSHOT)
+ ;
+ /*
+ * No point in keeping track of changes in transactions that we don't have
+ * enough information about to decode.
+ */
+ else if (snapstate->state < SNAPBUILD_CONSISTENT &&
+ SnapBuildTxnIsRunning(snapstate, xid))
+ ;
+ else
+ {
+ bool old_tx = ReorderBufferIsXidKnown(reorder, xid);
+
+ ret = SNAPBUILD_DECODE;
+
+ if (!old_tx || !ReorderBufferXidHasBaseSnapshot(reorder, xid))
+ {
+ if (!snapstate->snapshot) {
+ snapstate->snapshot = SnapBuildBuildSnapshot(snapstate, xid);
+ /* refcount of the snapshot builder */
+ SnapBuildSnapIncRefcount(snapstate->snapshot);
+ }
+
+ /* refcount of the transaction */
+ SnapBuildSnapIncRefcount(snapstate->snapshot);
+ ReorderBufferAddBaseSnapshot(reorder, xid,
+ InvalidXLogRecPtr,
+ snapstate->snapshot);
+ }
+ }
+
+ return ret;
+}
+
+/*
+ * Process a single xlog record.
+ */
+SnapBuildAction
+SnapBuildDecodeCallback(ReorderBuffer *reorder, Snapstate *snapstate,
+ XLogRecordBuffer *buf)
+{
+ XLogRecord *r = &buf->record;
+ uint8 info = r->xl_info & ~XLR_INFO_MASK;
+ TransactionId xid = buf->record.xl_xid;
+
+ SnapBuildAction ret = SNAPBUILD_SKIP;
+
+#if DEBUG_ME_LOUDLY
+ {
+ StringInfoData s;
+
+ initStringInfo(&s);
+ RmgrTable[r->xl_rmid].rm_desc(&s,
+ r->xl_info,
+ buf->record_data);
+
+ /* don't bother emitting empty description */
+ if (s.len > 0)
+ elog(LOG, "xlog redo %u: %s", xid, s.data);
+ }
+#endif
+
+ /*
+ * Only search for an initial starting point if we haven't reached a
+ * consistent state yet
+ */
+ if (snapstate->state < SNAPBUILD_CONSISTENT)
+ {
+ /*
+ * Build the snapshot incrementally using information about the
+ * currently running transactions. As soon as all of those have
+ * finished we have reached a consistent state.
+ */
+ if (r->xl_rmid == RM_STANDBY_ID &&
+ info == XLOG_RUNNING_XACTS)
+ {
+ xl_running_xacts *running = (xl_running_xacts *) buf->record_data;
+
+ snapstate->state = SNAPBUILD_FULL_SNAPSHOT;
+
+ /*
+ * Increase the xmin kept in shared memory, so vacuum doesn't
+ * remove tuples we still need for decoding.
+ */
+ IncreaseLogicalXminForSlot(buf->origptr, running->oldestRunningXid);
+
+ if (running->xcnt == 0)
+ {
+ /*
+ * We might already have started to incrementally assemble
+ * transactions.
+ */
+ snapstate->transactions_after = buf->origptr;
+
+ snapstate->xmin_running = InvalidTransactionId;
+ snapstate->xmax_running = InvalidTransactionId;
+
+ /*
+ * FIXME: abort everything we have stored about running
+ * transactions, relevant e.g. after a crash.
+ */
+ snapstate->state = SNAPBUILD_CONSISTENT;
+ }
+ /* first encounter of a xl_running_xacts record */
+ else if (!snapstate->nrrunning)
+ {
+ /*
+ * We only care about toplevel xids as those are the ones we
+ * definitely see in the WAL stream. As snapbuild.c tracks
+ * committed instead of running transactions we don't need to
+ * know anything about uncommitted subtransactions.
+ */
+ snapstate->xmin = running->oldestRunningXid;
+ TransactionIdRetreat(snapstate->xmin);
+ snapstate->xmax = running->latestCompletedXid;
+ TransactionIdAdvance(snapstate->xmax);
+
+ snapstate->nrrunning = running->xcnt;
+ snapstate->nrrunning_initial = running->xcnt;
+
+ snapstate->running = malloc(snapstate->nrrunning
+ * sizeof(TransactionId));
+
+ memcpy(snapstate->running, running->xids,
+ snapstate->nrrunning_initial * sizeof(TransactionId));
+
+ /* sort so we can do a binary search */
+ qsort(snapstate->running, snapstate->nrrunning_initial,
+ sizeof(TransactionId), xidComparator);
+
+ snapstate->xmin_running = snapstate->running[0];
+ snapstate->xmax_running = snapstate->running[running->xcnt - 1];
+
+ /* makes comparisons cheaper later */
+ TransactionIdRetreat(snapstate->xmin_running);
+ TransactionIdAdvance(snapstate->xmax_running);
+
+ snapstate->state = SNAPBUILD_FULL_SNAPSHOT;
+ }
+
+ elog(LOG, "found initial snapshot (via running xacts). Done: %i",
+ snapstate->state == SNAPBUILD_CONSISTENT);
+ }
+ else if (r->xl_rmid == RM_XLOG_ID &&
+ (info == XLOG_CHECKPOINT_SHUTDOWN || info == XLOG_CHECKPOINT_ONLINE))
+ {
+ /* FIXME: Check whether there is a valid state dumped to disk */
+ }
+ }
+
+ if (snapstate->state == SNAPBUILD_START)
+ return SNAPBUILD_SKIP;
+
+ /*
+ * This switch is - partially due to PG's indentation rules - rather deep
+ * and large. Maybe break it into separate functions?
+ */
+ switch (r->xl_rmid)
+ {
+ case RM_XLOG_ID:
+ {
+ switch (info)
+ {
+ case XLOG_CHECKPOINT_SHUTDOWN:
+#ifdef NOT_YET
+ {
+ /*
+ * FIXME: abort everything but prepared xacts, we
+ * don't track prepared xacts though so far. It
+ * might be neccesary to do this to handle subtxn
+ * ids that haven't been assigned to a toplevel xid
+ * after a crash.
+ */
+ for ( /* FIXME */ )
+ {
+ }
+ }
+#endif
+ case XLOG_CHECKPOINT_ONLINE:
+ {
+ /*
+ * FIXME: dump state to disk so we can restart
+ * from here later
+ */
+ break;
+ }
+ }
+ break;
+ }
+ case RM_STANDBY_ID:
+ {
+ switch (info)
+ {
+ case XLOG_RUNNING_XACTS:
+ {
+ xl_running_xacts *running =
+ (xl_running_xacts *) buf->record_data;
+
+ /*
+ * Update the range of interesting xids. We don't increase
+ * ->xmax here because once we are in a consistent state we
+ * can do that ourselves, and much more efficiently, since we
+ * only need to do it for catalog transactions.
+ */
+ snapstate->xmin = running->oldestRunningXid;
+ TransactionIdRetreat(snapstate->xmin);
+
+ /*
+ * Remove transactions we don't need to keep track
+ * of anymore.
+ */
+ SnapBuildPurgeCommittedTxn(snapstate);
+
+ /*
+ * Increase the xmin kept in shared memory, so vacuum
+ * doesn't remove tuples we still need for decoding.
+ */
+ IncreaseLogicalXminForSlot(buf->origptr, running->oldestRunningXid);
+
+ break;
+ }
+ case XLOG_STANDBY_LOCK:
+ break;
+ }
+ break;
+ }
+ case RM_XACT_ID:
+ {
+ switch (info)
+ {
+ case XLOG_XACT_COMMIT:
+ {
+ xl_xact_commit *xlrec =
+ (xl_xact_commit *) buf->record_data;
+
+ ret = SNAPBUILD_DECODE;
+
+ /*
+ * Queue cache invalidation messages.
+ */
+ if (xlrec->nmsgs)
+ {
+ TransactionId *subxacts;
+ SharedInvalidationMessage *inval_msgs;
+
+ /* subxid array follows relfilenodes */
+ subxacts = (TransactionId *)
+ &(xlrec->xnodes[xlrec->nrels]);
+ /* invalidation messages follow subxids */
+ inval_msgs = (SharedInvalidationMessage *)
+ &(subxacts[xlrec->nsubxacts]);
+
+ /*
+ * no need to check
+ * XactCompletionRelcacheInitFileInval, we will
+ * process the sinval messages that the
+ * relmapper change has generated.
+ */
+ ReorderBufferAddInvalidations(reorder, xid,
+ InvalidXLogRecPtr,
+ xlrec->nmsgs,
+ inval_msgs);
+
+ /*
+ * Let everyone know that this transaction
+ * modified the catalog. We need this at commit
+ * time.
+ */
+ ReorderBufferXidSetTimetravel(reorder, xid);
+
+ }
+
+ SnapBuildCommitTxn(snapstate, reorder,
+ buf->origptr, xid,
+ xlrec->nsubxacts,
+ (TransactionId *) &xlrec->xnodes);
+ break;
+ }
+ case XLOG_XACT_COMMIT_COMPACT:
+ {
+ xl_xact_commit_compact *xlrec =
+ (xl_xact_commit_compact *) buf->record_data;
+
+ ret = SNAPBUILD_DECODE;
+
+ SnapBuildCommitTxn(snapstate, reorder,
+ buf->origptr, xid,
+ xlrec->nsubxacts,
+ xlrec->subxacts);
+ break;
+ }
+ case XLOG_XACT_COMMIT_PREPARED:
+ {
+ xl_xact_commit_prepared *xlrec =
+ (xl_xact_commit_prepared *) buf->record_data;
+
+ /* FIXME: check for invalidation messages! */
+
+ SnapBuildCommitTxn(snapstate, reorder,
+ buf->origptr, xlrec->xid,
+ xlrec->crec.nsubxacts,
+ (TransactionId *) &xlrec->crec.xnodes);
+
+ ret = SNAPBUILD_DECODE;
+ break;
+ }
+ case XLOG_XACT_ABORT:
+ {
+ xl_xact_abort *xlrec =
+ (xl_xact_abort *) buf->record_data;
+
+ SnapBuildAbortTxn(snapstate, xid, xlrec->nsubxacts,
+ (TransactionId *) &(xlrec->xnodes[xlrec->nrels]));
+ ret = SNAPBUILD_DECODE;
+ break;
+ }
+ case XLOG_XACT_ABORT_PREPARED:
+ {
+ xl_xact_abort_prepared *xlrec =
+ (xl_xact_abort_prepared *) buf->record_data;
+ xl_xact_abort *arec = &xlrec->arec;
+
+ SnapBuildAbortTxn(snapstate, xlrec->xid,
+ arec->nsubxacts,
+ (TransactionId *) &(arec->xnodes[arec->nrels]));
+ ret = SNAPBUILD_DECODE;
+ break;
+ }
+ case XLOG_XACT_ASSIGNMENT:
+ break;
+ case XLOG_XACT_PREPARE:
+ /*
+ * XXX: We could take note of all in-progress prepared
+ * xacts so we can use shutdown checkpoints to abort
+ * in-progress transactions...
+ */
+ default:
+ break;
+ }
+ break;
+ }
+ case RM_HEAP_ID:
+ {
+ switch (info & XLOG_HEAP_OPMASK)
+ {
+ case XLOG_HEAP_INPLACE:
+ {
+ xl_heap_inplace *xlrec =
+ (xl_heap_inplace *) buf->record_data;
+
+ ret = SnapBuildProcessChange(reorder, snapstate,
+ xid, buf,
+ &xlrec->target.node);
+
+ /*
+ * Inplace records only happen in catalog modifying
+ * transactions.
+ */
+ ReorderBufferXidSetTimetravel(reorder, xid);
+
+ break;
+ }
+ /*
+ * we only ever read changes, so row level locks
+ * aren't interesting
+ */
+ case XLOG_HEAP_LOCK:
+ break;
+
+ case XLOG_HEAP_INSERT:
+ {
+ xl_heap_insert *xlrec =
+ (xl_heap_insert *) buf->record_data;
+
+ ret = SnapBuildProcessChange(reorder, snapstate,
+ xid, buf,
+ &xlrec->target.node);
+ break;
+ }
+ /* HEAP(_HOT)?_UPDATE use the same data layout */
+ case XLOG_HEAP_UPDATE:
+ case XLOG_HEAP_HOT_UPDATE:
+ {
+ xl_heap_update *xlrec =
+ (xl_heap_update *) buf->record_data;
+
+ ret = SnapBuildProcessChange(reorder, snapstate,
+ xid, buf,
+ &xlrec->target.node);
+ break;
+ }
+ case XLOG_HEAP_DELETE:
+ {
+ xl_heap_delete *xlrec =
+ (xl_heap_delete *) buf->record_data;
+
+ ret = SnapBuildProcessChange(reorder, snapstate,
+ xid, buf,
+ &xlrec->target.node);
+ break;
+ }
+ default:
+ ;
+ }
+ break;
+ }
+ case RM_HEAP2_ID:
+ {
+ switch (info)
+ {
+ case XLOG_HEAP2_MULTI_INSERT:
+ {
+ xl_heap_multi_insert *xlrec =
+ (xl_heap_multi_insert *) buf->record_data;
+
+ ret = SnapBuildProcessChange(reorder, snapstate, xid,
+ buf, &xlrec->node);
+ break;
+ }
+ case XLOG_HEAP2_NEW_CID:
+ {
+ CommandId cid;
+
+ xl_heap_new_cid *xlrec =
+ (xl_heap_new_cid *) buf->record_data;
+#if 0
+ elog(WARNING, "found new cid in xid %u: relfilenode %u/%u/%u: tid: (%u, %u) cmin: %u, cmax: %u, combo: %u",
+ xlrec->top_xid,
+ xlrec->target.node.dbNode, xlrec->target.node.spcNode, xlrec->target.node.relNode,
+ BlockIdGetBlockNumber(&xlrec->target.tid.ip_blkid), xlrec->target.tid.ip_posid,
+ xlrec->cmin, xlrec->cmax, xlrec->combocid);
+#endif
+ /* we only log new_cid's if a catalog tuple was modified */
+ ReorderBufferXidSetTimetravel(reorder, xid);
+
+ if (!ReorderBufferXidHasBaseSnapshot(reorder, xid))
+ {
+ if (!snapstate->snapshot) {
+ snapstate->snapshot = SnapBuildBuildSnapshot(snapstate, xid);
+ /* refcount of the snapshot builder */
+ SnapBuildSnapIncRefcount(snapstate->snapshot);
+ }
+
+ /* refcount of the transaction */
+ SnapBuildSnapIncRefcount(snapstate->snapshot);
+
+ ReorderBufferAddBaseSnapshot(reorder, xid,
+ InvalidXLogRecPtr,
+ snapstate->snapshot);
+ }
+
+ ReorderBufferAddNewTupleCids(reorder, xlrec->top_xid, buf->origptr,
+ xlrec->target.node, xlrec->target.tid,
+ xlrec->cmin, xlrec->cmax, xlrec->combocid);
+
+ /* figure out new command id */
+ if (xlrec->cmin != InvalidCommandId && xlrec->cmax != InvalidCommandId)
+ cid = Max(xlrec->cmin, xlrec->cmax);
+ else if (xlrec->cmax != InvalidCommandId)
+ cid = xlrec->cmax;
+ else if (xlrec->cmin != InvalidCommandId)
+ cid = xlrec->cmin;
+ else
+ {
+ cid = InvalidCommandId; /* silence compiler */
+ elog(ERROR, "no valid command id in new_cid record");
+ }
+ /*
+ * FIXME: potential race condition here: if
+ * multiple snapshots were running & generating
+ * changes in the same transaction on the source
+ * side this could be problematic. But this cannot
+ * happen for system catalogs, right?
+ */
+ ReorderBufferAddNewCommandId(reorder, xid, buf->origptr,
+ cid + 1);
+ }
+ default:
+ ;
+ }
+ }
+ break;
+ }
+
+ return ret;
+}
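The cid derivation in the XLOG_HEAP2_NEW_CID branch is a small piece of logic that is easy to get wrong, so here it is isolated into a standalone sketch (INVALID_CID and next_command_id are hypothetical stand-ins for InvalidCommandId and the inline code, not names from the patch):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t CommandId;

/* hypothetical stand-in for InvalidCommandId */
#define INVALID_CID ((CommandId) 0xFFFFFFFF)

/* Mirrors the cid derivation above: take the larger of cmin/cmax if both
 * are valid, else whichever one is valid; the caller records cid + 1 as
 * the transaction's next command id. */
static CommandId
next_command_id(CommandId cmin, CommandId cmax)
{
    CommandId cid;

    if (cmin != INVALID_CID && cmax != INVALID_CID)
        cid = cmin > cmax ? cmin : cmax;
    else if (cmax != INVALID_CID)
        cid = cmax;
    else
        cid = cmin;     /* caller errors out if this is invalid too */

    return cid + 1;
}
```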
+
+
+/*
+ * Check whether `xid` is one of the transactions that was running when we
+ * started building the snapshot.
+ */
+static bool
+SnapBuildTxnIsRunning(Snapstate *snapstate, TransactionId xid)
+{
+ if (snapstate->nrrunning &&
+ NormalTransactionIdFollows(xid, snapstate->xmin_running) &&
+ NormalTransactionIdPrecedes(xid, snapstate->xmax_running))
+ {
+ TransactionId *search =
+ bsearch(&xid, snapstate->running, snapstate->nrrunning_initial,
+ sizeof(TransactionId), xidComparator);
+
+ if (search != NULL)
+ {
+ Assert(*search == xid);
+ return true;
+ }
+ }
+
+ return false;
+}
+
+/*
+ * FIXME: Analogous struct to the private one in reorderbuffer.c.
+ *
+ * Maybe introduce reorderbuffer_internal.h?
+ */
+typedef struct ReorderBufferTXNByIdEnt
+{
+ TransactionId xid;
+ ReorderBufferTXN *txn;
+} ReorderBufferTXNByIdEnt;
+
+/*
+ * Add a new SnapshotNow to all transactions we're decoding that are
+ * currently in progress, so they can see new catalog contents made by the
+ * transaction that just committed.
+ */
+static void
+SnapBuildDistributeSnapshotNow(Snapstate *snapstate, ReorderBuffer *reorder, XLogRecPtr lsn)
+{
+ HASH_SEQ_STATUS status;
+ ReorderBufferTXNByIdEnt* ent;
+ elog(DEBUG1, "distributing snapshots to all running transactions");
+
+ hash_seq_init(&status, reorder->by_txn);
+
+ /*
+ * FIXME: we're providing a snapshot to the txn that committed just now...
+ *
+ * XXX: If we would handle XLOG_ASSIGNMENT records we could avoid handing
+ * out snapshots to transactions that we recognize as being subtransactions.
+ */
+ while ((ent = (ReorderBufferTXNByIdEnt*) hash_seq_search(&status)) != NULL)
+ {
+ if (ReorderBufferXidHasBaseSnapshot(reorder, ent->xid))
+ {
+ SnapBuildSnapIncRefcount(snapstate->snapshot);
+ ReorderBufferAddBaseSnapshot(reorder, ent->xid, lsn, snapstate->snapshot);
+ }
+ }
+}
+
+/*
+ * Keep track of a new catalog changing transaction that has committed
+ */
+static void
+SnapBuildAddCommittedTxn(Snapstate *snapstate, TransactionId xid)
+{
+ if (snapstate->nrcommitted == snapstate->nrcommitted_space)
+ {
+ elog(WARNING, "increasing space for committed transactions");
+
+ snapstate->nrcommitted_space *= 2;
+ snapstate->committed = realloc(snapstate->committed,
+ snapstate->nrcommitted_space * sizeof(TransactionId));
+ if (!snapstate->committed)
+ elog(ERROR, "couldn't enlarge space for committed transactions");
+ }
+ snapstate->committed[snapstate->nrcommitted++] = xid;
+}
+
+/*
+ * Remove all transactions we treat as committed that are smaller than
+ * ->xmin. Those won't ever get checked via the ->committed array anyway.
+ */
+static void
+SnapBuildPurgeCommittedTxn(Snapstate *snapstate)
+{
+ int off;
+ TransactionId *workspace;
+ int surviving_xids = 0;
+
+ /* XXX: Neater algorithm? */
+ workspace = malloc(snapstate->nrcommitted * sizeof(TransactionId));
+
+ if (!workspace)
+ elog(ERROR, "could not allocate memory for workspace during xmin purging");
+
+ for (off = 0; off < snapstate->nrcommitted; off++)
+ {
+ if (NormalTransactionIdFollows(snapstate->committed[off], snapstate->xmin))
+ workspace[surviving_xids++] = snapstate->committed[off];
+ }
+
+ memcpy(snapstate->committed, workspace,
+ surviving_xids * sizeof(TransactionId));
+
+ snapstate->nrcommitted = surviving_xids;
+ free(workspace);
+}
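The XXX about a neater algorithm seems answerable: the purge can be done in place without the workspace allocation. A toy sketch under simplified assumptions (plain > instead of a wraparound-aware xid comparison, hypothetical function name):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* In-place alternative to the workspace malloc: compact survivors to the
 * front of the array and return the new count. The real code would need
 * a wraparound-aware comparison instead of plain >. */
static size_t
purge_below_xmin(TransactionId *committed, size_t n, TransactionId xmin)
{
    size_t keep = 0;
    size_t off;

    for (off = 0; off < n; off++)
    {
        if (committed[off] > xmin)
            committed[keep++] = committed[off];
    }
    return keep;
}
```

Since the write index never overtakes the read index, this is safe on the same array and avoids the error path for a failed malloc entirely.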
+
+/*
+ * Common logic for SnapBuildAbortTxn and SnapBuildCommitTxn dealing with
+ * keeping track of the number of running transactions.
+ */
+static void
+SnapBuildEndTxn(Snapstate *snapstate, TransactionId xid)
+{
+ if (snapstate->state == SNAPBUILD_CONSISTENT)
+ return;
+
+ if (SnapBuildTxnIsRunning(snapstate, xid))
+ {
+ if (!--snapstate->nrrunning)
+ {
+ /*
+ * None of the originally running transactions is running
+ * anymore, so our incrementally built snapshot is now
+ * complete.
+ */
+ elog(LOG, "found consistent point due to SnapBuildEndTxn + running: %u", xid);
+ snapstate->state = SNAPBUILD_CONSISTENT;
+ }
+ }
+}
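The state progression driven by SnapBuildEndTxn is just a countdown over the initially running set: START, then FULL_SNAPSHOT once a xl_running_xacts record is seen, then CONSISTENT once the last initially running transaction ends. A toy model (hypothetical names, collapsing SnapBuildTxnIsRunning into a flag):

```c
#include <assert.h>

enum mini_state { MINI_START, MINI_FULL_SNAPSHOT, MINI_CONSISTENT };

typedef struct
{
    enum mini_state state;
    int nrrunning;          /* transactions still open from the initial set */
} MiniBuilder;

/* Toy version of SnapBuildEndTxn's countdown: once the last of the
 * initially running transactions finishes, the builder is consistent. */
static void
mini_end_txn(MiniBuilder *b, int was_running)
{
    if (b->state == MINI_CONSISTENT)
        return;
    if (was_running && --b->nrrunning == 0)
        b->state = MINI_CONSISTENT;
}
```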
+
+/* Abort a transaction, throw away all state we kept */
+static void
+SnapBuildAbortTxn(Snapstate *snapstate, TransactionId xid, int nsubxacts, TransactionId *subxacts)
+{
+ int i;
+
+ for (i = 0; i < nsubxacts; i++)
+ {
+ TransactionId subxid = subxacts[i];
+ SnapBuildEndTxn(snapstate, subxid);
+ }
+
+ SnapBuildEndTxn(snapstate, xid);
+}
+
+/* Handle everything that needs to be done when a transaction commits */
+static void
+SnapBuildCommitTxn(Snapstate *snapstate, ReorderBuffer *reorder,
+ XLogRecPtr lsn, TransactionId xid,
+ int nsubxacts, TransactionId *subxacts)
+{
+ int nxact;
+
+ bool forced_timetravel = false;
+ bool sub_does_timetravel = false;
+ bool top_does_timetravel = false;
+
+ /*
+ * If we couldn't observe every change of a transaction, because it was
+ * already running when we started to observe, we have to assume it made
+ * catalog changes.
+ *
+ * This has the positive benefit that afterwards we have enough
+ * information to build an exportable snapshot that's usable by pg_dump
+ * et al.
+ */
+ if (snapstate->state < SNAPBUILD_CONSISTENT)
+ {
+ if (XLByteLT(snapstate->transactions_after, lsn))
+ snapstate->transactions_after = lsn;
+
+ if (SnapBuildTxnIsRunning(snapstate, xid))
+ {
+ elog(LOG, "forced to assume catalog changes for xid %u because it was running too early", xid);
+ SnapBuildAddCommittedTxn(snapstate, xid);
+ forced_timetravel = true;
+ }
+ }
+
+ for (nxact = 0; nxact < nsubxacts; nxact++)
+ {
+ TransactionId subxid = subxacts[nxact];
+
+ /*
+ * make sure txn is not tracked in running txn's anymore, switch
+ * state
+ */
+ SnapBuildEndTxn(snapstate, subxid);
+
+ if (forced_timetravel)
+ {
+ SnapBuildAddCommittedTxn(snapstate, subxid);
+ }
+ /*
+ * add subtransaction to base snapshot, we don't distinguish after
+ * that
+ */
+ else if (ReorderBufferXidDoesTimetravel(reorder, subxid))
+ {
+ sub_does_timetravel = true;
+
+ elog(DEBUG1, "found subtransaction %u:%u with catalog changes.",
+ xid, subxid);
+
+ SnapBuildAddCommittedTxn(snapstate, subxid);
+ }
+
+
+ if ((forced_timetravel || sub_does_timetravel) &&
+ NormalTransactionIdFollows(subxid, snapstate->xmax))
+ {
+ snapstate->xmax = subxid;
+ TransactionIdAdvance(snapstate->xmax);
+ }
+ }
+
+ /*
+ * make sure txn is not tracked in running txn's anymore, switch state
+ */
+ SnapBuildEndTxn(snapstate, xid);
+
+ if (forced_timetravel)
+ {
+ elog(DEBUG1, "forced transaction %u to do timetravel.", xid);
+
+ SnapBuildAddCommittedTxn(snapstate, xid);
+ }
+ /* add toplevel transaction to base snapshot */
+ else if (ReorderBufferXidDoesTimetravel(reorder, xid))
+ {
+ elog(DEBUG1, "found top level transaction %u, with catalog changes!", xid);
+
+ top_does_timetravel = true;
+ SnapBuildAddCommittedTxn(snapstate, xid);
+ }
+ else if (sub_does_timetravel)
+ {
+ /* mark toplevel txn as timetravel as well */
+ SnapBuildAddCommittedTxn(snapstate, xid);
+ }
+
+ if (forced_timetravel || top_does_timetravel || sub_does_timetravel)
+ {
+ if (!TransactionIdIsValid(snapstate->xmax) ||
+ NormalTransactionIdFollows(xid, snapstate->xmax))
+ {
+ snapstate->xmax = xid;
+ TransactionIdAdvance(snapstate->xmax);
+ }
+
+ if (snapstate->state < SNAPBUILD_FULL_SNAPSHOT)
+ return;
+
+ if (snapstate->snapshot) {
+ /* refcount of the transaction */
+ SnapBuildSnapDecRefcount(snapstate->snapshot);
+ }
+
+ snapstate->snapshot = SnapBuildBuildSnapshot(snapstate, xid);
+
+ /* refcount of the snapshot builder */
+ SnapBuildSnapIncRefcount(snapstate->snapshot);
+
+ /* add a new SnapshotNow to all currently running transactions */
+ SnapBuildDistributeSnapshotNow(snapstate, reorder, lsn);
+ }
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index b6cfdac..07f1cdb 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,9 +76,11 @@ Node *replication_parse_result;
%token K_NOWAIT
%token K_WAL
%token K_START_REPLICATION
+%token K_INIT_LOGICAL_REPLICATION
+%token K_START_LOGICAL_REPLICATION
%type <node> command
-%type <node> base_backup start_replication identify_system
+%type <node> base_backup start_replication identify_system start_logical_replication init_logical_replication
%type <list> base_backup_opt_list
%type <defelt> base_backup_opt
%%
@@ -97,6 +99,8 @@ command:
identify_system
| base_backup
| start_replication
+ | init_logical_replication
+ | start_logical_replication
;
/*
@@ -166,6 +170,32 @@ start_replication:
$$ = (Node *) cmd;
}
;
+
+/* FIXME: don't use SCONST */
+init_logical_replication:
+ K_INIT_LOGICAL_REPLICATION SCONST
+ {
+ InitLogicalReplicationCmd *cmd;
+ cmd = makeNode(InitLogicalReplicationCmd);
+ cmd->plugin = $2;
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
+/* FIXME: don't use SCONST */
+start_logical_replication:
+ K_START_LOGICAL_REPLICATION SCONST RECPTR
+ {
+ StartLogicalReplicationCmd *cmd;
+ cmd = makeNode(StartLogicalReplicationCmd);
+ cmd->name = $2;
+ cmd->startpoint = $3;
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
%%
#include "repl_scanner.c"
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 51f381d..58f7972 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -64,6 +64,8 @@ NOWAIT { return K_NOWAIT; }
PROGRESS { return K_PROGRESS; }
WAL { return K_WAL; }
START_REPLICATION { return K_START_REPLICATION; }
+INIT_LOGICAL_REPLICATION { return K_INIT_LOGICAL_REPLICATION; }
+START_LOGICAL_REPLICATION { return K_START_LOGICAL_REPLICATION; }
"," { return ','; }
";" { return ';'; }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 6452c34..4713847 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -52,6 +52,7 @@
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "replication/walsender_private.h"
+#include "replication/logicalfuncs.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
@@ -83,6 +84,9 @@ WalSndCtlData *WalSndCtl = NULL;
/* My slot in the shared memory array */
WalSnd *MyWalSnd = NULL;
+/* My slot for logical rep in the shared memory array */
+LogicalWalSnd *MyLogicalWalSnd = NULL;
+
/* Global state */
bool am_walsender = false; /* Am I a walsender process ? */
bool am_cascading_walsender = false; /* Am I cascading WAL to
@@ -92,6 +96,7 @@ static bool replication_started = false; /* Started streaming yet? */
/* User-settable parameters for walsender */
int max_wal_senders = 0; /* the maximum number of concurrent walsenders */
+int max_logical_slots = 0; /* the maximum number of logical slots */
int wal_sender_timeout = 60 * 1000; /* maximum time to send one
* WAL data message */
/*
@@ -129,18 +134,30 @@ static bool ping_sent = false;
static volatile sig_atomic_t got_SIGHUP = false;
volatile sig_atomic_t walsender_ready_to_stop = false;
+/* XXX reader */
+static MemoryContext decoding_ctx = NULL;
+static MemoryContext old_decoding_ctx = NULL;
+
+static XLogReaderState *logical_reader = NULL;
+
+
/* Signal handlers */
static void WalSndSigHupHandler(SIGNAL_ARGS);
static void WalSndXLogSendHandler(SIGNAL_ARGS);
static void WalSndLastCycleHandler(SIGNAL_ARGS);
/* Prototypes for private functions */
-static void WalSndLoop(void) __attribute__((noreturn));
+typedef void (*WalSndSendData)(bool *);
+static void WalSndLoop(WalSndSendData send_data) __attribute__((noreturn));
static void InitWalSenderSlot(void);
static void WalSndKill(int code, Datum arg);
-static void XLogSend(bool *caughtup);
+static void XLogSendPhysical(bool *caughtup);
+static void XLogSendLogical(bool *caughtup);
static void IdentifySystem(void);
static void StartReplication(StartReplicationCmd *cmd);
+static void InitLogicalReplication(InitLogicalReplicationCmd *cmd);
+static void StartLogicalReplication(StartLogicalReplicationCmd *cmd);
+static void ComputeLogicalXmin(void);
static void ProcessStandbyMessage(void);
static void ProcessStandbyReplyMessage(void);
static void ProcessStandbyHSFeedbackMessage(void);
@@ -192,6 +209,58 @@ WalSndErrorCleanup()
proc_exit(0);
}
+extern void
+IncreaseLogicalXminForSlot(XLogRecPtr lsn, TransactionId xmin)
+{
+ Assert(MyLogicalWalSnd != NULL);
+
+ SpinLockAcquire(&MyLogicalWalSnd->mutex);
+ /*
+ * Only increase if the previous value has been applied...
+ */
+ if (!TransactionIdIsValid(MyLogicalWalSnd->candidate_xmin))
+ {
+ MyLogicalWalSnd->candidate_xmin_after = lsn;
+ MyLogicalWalSnd->candidate_xmin = xmin;
+ elog(LOG, "got new xmin %u at %X/%X", xmin, (uint32) (lsn >> 32), (uint32) lsn);
+ }
+ SpinLockRelease(&MyLogicalWalSnd->mutex);
+}
+
+static void
+ComputeLogicalXmin(void)
+{
+ int slot;
+ TransactionId xmin = InvalidTransactionId;
+ LogicalWalSnd *logical_base;
+ LogicalWalSnd *walsnd;
+
+ logical_base = (LogicalWalSnd*)&WalSndCtl->walsnds[max_wal_senders];
+
+ LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+
+ for (slot = 0; slot < max_logical_slots; slot++)
+ {
+ walsnd = &logical_base[slot];
+
+ SpinLockAcquire(&walsnd->mutex);
+ if (walsnd->in_use &&
+ TransactionIdIsValid(walsnd->xmin) && (
+ !TransactionIdIsValid(xmin) ||
+ TransactionIdPrecedes(walsnd->xmin, xmin))
+ )
+ {
+ xmin = walsnd->xmin;
+ }
+ SpinLockRelease(&walsnd->mutex);
+ }
+ WalSndCtl->logical_xmin = xmin;
+ LWLockRelease(ProcArrayLock);
+
+ elog(LOG, "computed new xmin: %u", xmin);
+
+}
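The gist of ComputeLogicalXmin: the global horizon is the oldest valid xmin over all slots, with slots lacking a valid xmin imposing no constraint. A standalone sketch, using 0 as a stand-in for InvalidTransactionId and plain < for TransactionIdPrecedes (hypothetical helper, not part of the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define INVALID_XID ((TransactionId) 0) /* stand-in for InvalidTransactionId */

/* Oldest valid xmin over all slots; slots with INVALID_XID don't
 * constrain the horizon. Plain < stands in for TransactionIdPrecedes. */
static TransactionId
min_slot_xmin(const TransactionId *xmins, size_t nslots)
{
    TransactionId xmin = INVALID_XID;
    size_t slot;

    for (slot = 0; slot < nslots; slot++)
    {
        if (xmins[slot] != INVALID_XID &&
            (xmin == INVALID_XID || xmins[slot] < xmin))
            xmin = xmins[slot];
    }
    return xmin;
}
```

In the patch this runs under ProcArrayLock so the computed value can be published as WalSndCtl->logical_xmin without racing slot creation or teardown.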
+
/*
* IDENTIFY_SYSTEM
*/
@@ -376,7 +445,362 @@ StartReplication(StartReplicationCmd *cmd)
SyncRepInitConfig();
/* Main loop of walsender */
- WalSndLoop();
+ WalSndLoop(XLogSendPhysical);
+}
+
+static void
+InitLogicalReplication(InitLogicalReplicationCmd *cmd)
+{
+ int slot;
+ LogicalWalSnd *logical_base;
+ LogicalWalSnd *walsnd;
+ char *slot_name;
+ StringInfoData buf;
+ char xpos[MAXFNAMELEN];
+
+ elog(WARNING, "Initiating logical rep");
+
+ logical_base = (LogicalWalSnd*)&WalSndCtl->walsnds[max_wal_senders];
+
+ Assert(!MyLogicalWalSnd);
+
+ for (slot = 0; slot < max_logical_slots; slot++)
+ {
+ walsnd = &logical_base[slot];
+
+ SpinLockAcquire(&walsnd->mutex);
+ if (!walsnd->in_use)
+ {
+ Assert(!walsnd->active);
+ /* NOT releasing the lock yet */
+ break;
+ }
+ SpinLockRelease(&walsnd->mutex);
+ walsnd = NULL;
+ }
+
+ if (!walsnd)
+ {
+ elog(ERROR, "could not find a free logical slot; free one or increase max_logical_slots");
+ }
+
+ /* so we get reset on exit/failure */
+ MyLogicalWalSnd = walsnd;
+ MyLogicalWalSnd->last_required_checkpoint = GetRedoRecPtr();
+
+ walsnd->in_use = true;
+ /* mark slot as active till we build the base snapshot */
+ walsnd->active = true;
+ walsnd->xmin = InvalidTransactionId;
+
+ strcpy(NameStr(walsnd->plugin), cmd->plugin);
+
+ /* FIXME: permanent name allocation scheme */
+ slot_name = NameStr(walsnd->name);
+ sprintf(slot_name, "id-%d", slot);
+
+ /* release spinlock, so this slot can be examined */
+ SpinLockRelease(&walsnd->mutex);
+
+ /*
+ * FIXME: think about race conditions here...
+ *
+ * We need to do this *after* releasing the spinlock, otherwise
+ * GetOldestXmin will deadlock with ourselves.
+ */
+ walsnd->xmin = GetOldestXmin(true, true);
+ ComputeLogicalXmin();
+
+ decoding_ctx = AllocSetContextCreate(TopMemoryContext,
+ "decoding context",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+ old_decoding_ctx = MemoryContextSwitchTo(decoding_ctx);
+ TopTransactionContext = decoding_ctx;
+
+ logical_reader = initial_snapshot_reader();
+
+ logical_reader->startptr = MyLogicalWalSnd->last_required_checkpoint;
+
+ for (;;)
+ {
+ ResetLatch(&MyWalSnd->latch);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (!PostmasterIsAlive())
+ exit(1);
+
+ /* Process any requests or signals received recently */
+ if (got_SIGHUP)
+ {
+ got_SIGHUP = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ logical_reader->endptr = GetFlushRecPtr();
+
+ /* continue building initial snapshot */
+ XLogReaderRead(logical_reader);
+
+ if (logical_reader->needs_input || !initial_snapshot_ready(logical_reader))
+ {
+ long sleeptime = 10000; /* 10 s */
+ int wakeEvents;
+
+ wakeEvents = WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT;
+
+ /* Sleep until something happens or we time out */
+ ImmediateInterruptOK = true;
+ CHECK_FOR_INTERRUPTS();
+ WaitLatch(&MyWalSnd->latch, wakeEvents,
+ sleeptime);
+ ImmediateInterruptOK = false;
+ }
+ else
+ break;
+ }
+
+ walsnd->confirmed_flush = logical_reader->curptr;
+
+ snprintf(xpos, sizeof(xpos), "%X/%X",
+ (uint32) (logical_reader->curptr >> 32), (uint32) logical_reader->curptr);
+
+ pq_beginmessage(&buf, 'T');
+ pq_sendint(&buf, 4, 2); /* 4 fields */
+
+ /* first field */
+ pq_sendstring(&buf, "replication_id"); /* col name */
+ pq_sendint(&buf, 0, 4); /* table oid */
+ pq_sendint(&buf, 0, 2); /* attnum */
+ pq_sendint(&buf, TEXTOID, 4); /* type oid */
+ pq_sendint(&buf, -1, 2); /* typlen */
+ pq_sendint(&buf, 0, 4); /* typmod */
+ pq_sendint(&buf, 0, 2); /* format code */
+
+ pq_sendstring(&buf, "consistent_point"); /* col name */
+ pq_sendint(&buf, 0, 4); /* table oid */
+ pq_sendint(&buf, 0, 2); /* attnum */
+ pq_sendint(&buf, TEXTOID, 4); /* type oid */
+ pq_sendint(&buf, -1, 2); /* typlen */
+ pq_sendint(&buf, 0, 4); /* typmod */
+ pq_sendint(&buf, 0, 2); /* format code */
+
+ pq_sendstring(&buf, "snapshot_name"); /* col name */
+ pq_sendint(&buf, 0, 4); /* table oid */
+ pq_sendint(&buf, 0, 2); /* attnum */
+ pq_sendint(&buf, TEXTOID, 4); /* type oid */
+ pq_sendint(&buf, -1, 2); /* typlen */
+ pq_sendint(&buf, 0, 4); /* typmod */
+ pq_sendint(&buf, 0, 2); /* format code */
+
+ pq_sendstring(&buf, "plugin"); /* col name */
+ pq_sendint(&buf, 0, 4); /* table oid */
+ pq_sendint(&buf, 0, 2); /* attnum */
+ pq_sendint(&buf, TEXTOID, 4); /* type oid */
+ pq_sendint(&buf, -1, 2); /* typlen */
+ pq_sendint(&buf, 0, 4); /* typmod */
+ pq_sendint(&buf, 0, 2); /* format code */
+
+ pq_endmessage(&buf);
+
+ /* Send a DataRow message */
+ pq_beginmessage(&buf, 'D');
+ pq_sendint(&buf, 4, 2); /* # of columns */
+
+ pq_sendint(&buf, strlen(slot_name), 4); /* col1 len */
+ pq_sendbytes(&buf, slot_name, strlen(slot_name));
+
+ pq_sendint(&buf, strlen(xpos), 4); /* col2 len */
+ pq_sendbytes(&buf, xpos, strlen(xpos));
+
+ pq_sendint(&buf, strlen("0xDEADBEEF"), 4); /* col3 len */
+ pq_sendbytes(&buf, "0xDEADBEEF", strlen("0xDEADBEEF"));
+
+ pq_sendint(&buf, strlen(cmd->plugin), 4); /* col4 len */
+ pq_sendbytes(&buf, cmd->plugin, strlen(cmd->plugin));
+
+ pq_endmessage(&buf);
+
+ SpinLockAcquire(&walsnd->mutex);
+ walsnd->active = false;
+ MyLogicalWalSnd = NULL;
+ SpinLockRelease(&walsnd->mutex);
+
+ MemoryContextSwitchTo(old_decoding_ctx);
+ TopTransactionContext = NULL;
+}
+
+
+
+static void
+StartLogicalReplication(StartLogicalReplicationCmd *cmd)
+{
+ StringInfoData buf;
+
+ int slot;
+ LogicalWalSnd *logical_base;
+ logical_base = (LogicalWalSnd *) &WalSndCtl->walsnds[max_wal_senders];
+
+ elog(WARNING, "starting logical replication");
+
+ Assert(!MyLogicalWalSnd);
+
+ for (slot = 0; slot < max_logical_slots; slot++)
+ {
+ LogicalWalSnd *walsnd = &logical_base[slot];
+
+ SpinLockAcquire(&walsnd->mutex);
+ if (walsnd->in_use && strcmp(cmd->name, NameStr(walsnd->name)) == 0)
+ {
+ MyLogicalWalSnd = walsnd;
+ /* NOT releasing the lock yet */
+ break;
+ }
+ SpinLockRelease(&walsnd->mutex);
+ }
+
+ if (!MyLogicalWalSnd)
+ elog(ERROR, "could not find logical slot for \"%s\"",
+ cmd->name);
+
+ if (MyLogicalWalSnd->active)
+ {
+ SpinLockRelease(&MyLogicalWalSnd->mutex);
+ elog(ERROR, "slot already active");
+ }
+
+ MyLogicalWalSnd->active = true;
+ SpinLockRelease(&MyLogicalWalSnd->mutex);
+
+ MarkPostmasterChildWalSender();
+ SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
+ replication_started = true;
+
+ if (am_cascading_walsender && !RecoveryInProgress())
+ {
+ ereport(LOG,
+ (errmsg("terminating walsender process to force cascaded standby "
+ "to update timeline and reconnect")));
+ walsender_ready_to_stop = true;
+ }
+
+ WalSndSetState(WALSNDSTATE_CATCHUP);
+
+ /* Send a CopyBothResponse message, and start streaming */
+ pq_beginmessage(&buf, 'W');
+ pq_sendbyte(&buf, 0);
+ pq_sendint(&buf, 0, 2);
+ pq_endmessage(&buf);
+ pq_flush();
+
+ /*
+ * Initialize position to the received one, then the xlog records begin to
+ * be shipped from that position
+ */
+ sentPtr = MyLogicalWalSnd->last_required_checkpoint;
+
+ /* Also update the start position status in shared memory */
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = MyWalSnd;
+
+ SpinLockAcquire(&walsnd->mutex);
+ walsnd->sentPtr = MyLogicalWalSnd->last_required_checkpoint;
+ SpinLockRelease(&walsnd->mutex);
+ }
+
+ SyncRepInitConfig();
+
+ logical_reader = normal_snapshot_reader(NameStr(MyLogicalWalSnd->plugin),
+ cmd->startpoint);
+
+ logical_reader->startptr = MyLogicalWalSnd->last_required_checkpoint;
+ logical_reader->curptr = logical_reader->startptr;
+ logical_reader->endptr = GetFlushRecPtr();
+
+ /* Main loop of walsender */
+ WalSndLoop(XLogSendLogical);
+}
+
+/*
+ * Prepare a write into a StringInfo.
+ *
+ * Don't do anything lasting in here; it's quite possible that nothing will be
+ * done with the data.
+ */
+void
+WalSndPrepareWrite(StringInfo out, XLogRecPtr lsn)
+{
+ pq_sendbyte(out, 'w');
+ pq_sendint64(out, lsn); /* dataStart */
+ pq_sendint64(out, lsn); /* walEnd */
+ /* XXX: gather that value later just as it's done in XLogSendPhysical */
+ pq_sendint64(out, 0 /*GetCurrentIntegerTimestamp() */);/* sendtime */
+}
+
+/*
+ * Actually write out data previously prepared by WalSndPrepareWrite to the
+ * network; take as long as needed, but process replies from the other side
+ * while doing so.
+ */
+void
+WalSndWriteData(StringInfo data)
+{
+ long sleeptime = 10000; /* 10 s */
+ int wakeEvents;
+
+ pq_putmessage_noblock('d', data->data, data->len);
+
+ for (;;)
+ {
+ if (!pq_is_send_pending())
+ return;
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (!PostmasterIsAlive())
+ exit(1);
+
+ /* Process any requests or signals received recently */
+ if (got_SIGHUP)
+ {
+ got_SIGHUP = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ SyncRepInitConfig();
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Check for input from the client */
+ ProcessRepliesIfAny();
+
+ /* Clear any already-pending wakeups */
+ ResetLatch(&MyWalSnd->latch);
+
+ /* Try to flush pending output to the client */
+ if (pq_flush_if_writable() != 0)
+ break;
+
+ if (!pq_is_send_pending())
+ return;
+
+ /* FIXME: wal_sender_timeout integration */
+ wakeEvents = WL_LATCH_SET | WL_POSTMASTER_DEATH |
+ WL_SOCKET_WRITEABLE | WL_SOCKET_READABLE | WL_TIMEOUT;
+
+ ImmediateInterruptOK = true;
+ CHECK_FOR_INTERRUPTS();
+ WaitLatchOrSocket(&MyWalSnd->latch, wakeEvents,
+ MyProcPort->sock, sleeptime);
+ ImmediateInterruptOK = false;
+ }
+ SetLatch(&MyWalSnd->latch);
}
/*
@@ -421,6 +845,14 @@ exec_replication_command(const char *cmd_string)
StartReplication((StartReplicationCmd *) cmd_node);
break;
+ case T_InitLogicalReplicationCmd:
+ InitLogicalReplication((InitLogicalReplicationCmd *) cmd_node);
+ break;
+
+ case T_StartLogicalReplicationCmd:
+ StartLogicalReplication((StartLogicalReplicationCmd *) cmd_node);
+ break;
+
case T_BaseBackupCmd:
SendBaseBackup((BaseBackupCmd *) cmd_node);
break;
@@ -588,6 +1020,37 @@ ProcessStandbyReplyMessage(void)
SpinLockRelease(&walsnd->mutex);
}
+ /*
+ * Do an unlocked check for candidate_xmin first.
+ */
+ if (MyLogicalWalSnd &&
+ TransactionIdIsValid(MyLogicalWalSnd->candidate_xmin))
+ {
+ bool updated_xmin = false;
+
+ /* use volatile pointer to prevent code rearrangement */
+ volatile LogicalWalSnd *walsnd = MyLogicalWalSnd;
+
+ SpinLockAcquire(&walsnd->mutex);
+
+ /* if we're past the location required for bumping xmin, do so */
+ if (TransactionIdIsValid(walsnd->candidate_xmin) &&
+ flushPtr != InvalidXLogRecPtr &&
+ XLByteLE(walsnd->candidate_xmin_after, flushPtr)
+ )
+ {
+ walsnd->xmin = walsnd->candidate_xmin;
+ walsnd->candidate_xmin = InvalidTransactionId;
+ walsnd->candidate_xmin_after = InvalidXLogRecPtr;
+ updated_xmin = true;
+ }
+
+ SpinLockRelease(&walsnd->mutex);
+
+ if (updated_xmin)
+ ComputeLogicalXmin();
+ }
+
if (!am_cascading_walsender)
SyncRepReleaseWaiters();
}
@@ -669,7 +1132,7 @@ ProcessStandbyHSFeedbackMessage(void)
/* Main loop of walsender process that streams the WAL over Copy messages. */
static void
-WalSndLoop(void)
+WalSndLoop(WalSndSendData send_data)
{
bool caughtup = false;
@@ -713,12 +1176,12 @@ WalSndLoop(void)
/*
* If we don't have any pending data in the output buffer, try to send
- * some more. If there is some, we don't bother to call XLogSend
+ * some more. If there is some, we don't bother to call send_data
* again until we've flushed it ... but we'd better assume we are not
* caught up.
*/
if (!pq_is_send_pending())
- XLogSend(&caughtup);
+ send_data(&caughtup);
else
caughtup = false;
@@ -754,7 +1217,7 @@ WalSndLoop(void)
if (walsender_ready_to_stop)
{
/* ... let's just be real sure we're caught up ... */
- XLogSend(&caughtup);
+ send_data(&caughtup);
if (caughtup && !pq_is_send_pending())
{
/* Inform the standby that XLOG streaming is done */
@@ -769,7 +1232,7 @@ WalSndLoop(void)
/*
* We don't block if not caught up, unless there is unsent data
* pending in which case we'd better block until the socket is
- * write-ready. This test is only needed for the case where XLogSend
+ * write-ready. This test is only needed for the case where send_data
* loaded a subset of the available data but then pq_flush_if_writable
* flushed it all --- we should immediately try to send more.
*/
@@ -917,6 +1380,13 @@ WalSndKill(int code, Datum arg)
* for this.
*/
MyWalSnd->pid = 0;
+
+ /* LOCK? */
+ if (MyLogicalWalSnd && MyLogicalWalSnd->active)
+ {
+ MyLogicalWalSnd->active = false;
+ }
+
DisownLatch(&MyWalSnd->latch);
/* WalSnd struct isn't mine anymore */
@@ -1068,6 +1538,8 @@ retry:
}
/*
+ * Send out the WAL in its normal physical/stored form.
+ *
* Read up to MAX_SEND_SIZE bytes of WAL that's been flushed to disk,
* but not yet sent to the client, and buffer it in the libpq output
* buffer.
@@ -1076,7 +1548,7 @@ retry:
* *caughtup is set to false.
*/
static void
-XLogSend(bool *caughtup)
+XLogSendPhysical(bool *caughtup)
{
XLogRecPtr SendRqstPtr;
XLogRecPtr startptr;
@@ -1210,6 +1682,65 @@ XLogSend(bool *caughtup)
}
/*
+ * Send out the WAL after it has been decoded into a logical format by the
+ * output plugin specified in INIT_LOGICAL_REPLICATION.
+ */
+static void
+XLogSendLogical(bool *caughtup)
+{
+ XLogRecPtr endptr;
+ XLogRecPtr curptr;
+
+ if (decoding_ctx == NULL)
+ {
+ decoding_ctx = AllocSetContextCreate(TopMemoryContext,
+ "decoding context",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+ }
+
+ logical_reader->endptr = logical_reader->curptr;
+ curptr = logical_reader->curptr;
+
+ /*
+ * Read at most MAX_SEND_SIZE of WAL. We chunk the reading only to allow
+ * processing keepalives and such in between.
+ */
+ XLByteAdvance(logical_reader->endptr, MAX_SEND_SIZE);
+
+ /* only read up to already flushed wal */
+ endptr = GetFlushRecPtr();
+ if (XLByteLT(endptr, logical_reader->endptr))
+ logical_reader->endptr = endptr;
+
+ old_decoding_ctx = MemoryContextSwitchTo(decoding_ctx);
+ TopTransactionContext = decoding_ctx;
+
+ /* read and decode the next chunk of WAL */
+ XLogReaderRead(logical_reader);
+
+ MemoryContextSwitchTo(old_decoding_ctx);
+ TopTransactionContext = NULL;
+
+ if (curptr == logical_reader->curptr ||
+ logical_reader->curptr == endptr)
+ *caughtup = true;
+ else
+ *caughtup = false;
+
+ /* Update shared memory status */
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = MyWalSnd;
+
+ SpinLockAcquire(&walsnd->mutex);
+ walsnd->sentPtr = logical_reader->curptr;
+ SpinLockRelease(&walsnd->mutex);
+ }
+}
+
+/*
* Request walsenders to reload the currently-open WAL file
*/
void
@@ -1301,6 +1832,8 @@ WalSndShmemSize(void)
size = offsetof(WalSndCtlData, walsnds);
size = add_size(size, mul_size(max_wal_senders, sizeof(WalSnd)));
+ size = add_size(size, mul_size(max_logical_slots, sizeof(LogicalWalSnd)));
+
return size;
}
@@ -1329,6 +1862,21 @@ WalSndShmemInit(void)
SpinLockInit(&walsnd->mutex);
InitSharedLatch(&walsnd->latch);
}
+
+ WalSndCtl->logical_xmin = InvalidTransactionId;
+
+ if (max_logical_slots > 0)
+ {
+ LogicalWalSnd *logical_base;
+ logical_base = (LogicalWalSnd *) &WalSndCtl->walsnds[max_wal_senders];
+
+ for (i = 0; i < max_logical_slots; i++)
+ {
+ LogicalWalSnd *walsnd = &logical_base[i];
+ walsnd->xmin = InvalidTransactionId;
+ SpinLockInit(&walsnd->mutex);
+ }
+ }
}
}
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index a98358d..da9e7e5 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -51,6 +51,8 @@
#include "access/xact.h"
#include "access/twophase.h"
#include "miscadmin.h"
+#include "replication/walsender.h"
+#include "replication/walsender_private.h"
#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/spin.h"
@@ -1156,6 +1158,14 @@ GetOldestXmin(bool allDbs, bool ignoreVacuum)
}
}
+ if (max_logical_slots > 0 &&
+ TransactionIdIsValid(WalSndCtl->logical_xmin) &&
+ TransactionIdPrecedes(WalSndCtl->logical_xmin, result))
+ {
+ result = WalSndCtl->logical_xmin;
+ }
+
+
if (RecoveryInProgress())
{
/*
@@ -1442,10 +1452,17 @@ GetSnapshotData(Snapshot snapshot)
suboverflowed = true;
}
+ /* FIXME: comment & concurrency */
+ if (TransactionIdIsValid(WalSndCtl->logical_xmin) &&
+ TransactionIdPrecedes(WalSndCtl->logical_xmin, xmin))
+ xmin = WalSndCtl->logical_xmin;
+
if (!TransactionIdIsValid(MyPgXact->xmin))
MyPgXact->xmin = TransactionXmin = xmin;
+
LWLockRelease(ProcArrayLock);
+
/*
* Update globalxmin to include actual process xids. This is a slightly
* different way of computing it than GetOldestXmin uses, but should give
@@ -1695,6 +1712,12 @@ GetRunningTransactionData(void)
}
}
+ /*
+ * It's important *not* to track decoding tasks here because snapbuild.c
+ * uses ->oldestRunningXid to manage its xmin. If they were included
+ * here, the initial value could never increase.
+ */
+
CurrentRunningXacts->xcnt = count - subcount;
CurrentRunningXacts->subxcnt = subcount;
CurrentRunningXacts->subxid_overflow = suboverflowed;
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 0cab243..6cc112d 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -804,7 +804,13 @@ standby_desc_running_xacts(StringInfo buf, xl_running_xacts *xlrec)
appendStringInfo(buf, " %u", xlrec->xids[i]);
}
- if (xlrec->subxid_overflow)
+ if (xlrec->subxcnt > 0)
+ {
+ appendStringInfo(buf, "; %d subxacts:", xlrec->subxcnt);
+ for (i = 0; i < xlrec->subxcnt; i++)
+ appendStringInfo(buf, " %u", xlrec->xids[xlrec->xcnt + i]);
+ }
+ else if (xlrec->subxid_overflow)
appendStringInfo(buf, "; subxid ovf");
}
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index e26bf0b..7839a14 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -475,7 +475,7 @@ RegisterRelcacheInvalidation(Oid dbId, Oid relId)
* Only the local caches are flushed; this does not transmit the message
* to other backends.
*/
-static void
+void
LocalExecuteInvalidationMessage(SharedInvalidationMessage *msg)
{
if (msg->id >= 0)
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8c9ebe0..7bd2c27 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -506,9 +506,10 @@ RelationBuildTupleDesc(Relation relation)
heap_close(pg_attribute_desc, AccessShareLock);
if (need != 0)
+ {
elog(ERROR, "catalog is missing %d attribute(s) for relid %u",
need, RelationGetRelid(relation));
-
+ }
/*
* The attcacheoff values we read from pg_attribute should all be -1
* ("unknown"). Verify this if assert checking is on. They will be
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 745e7be..19a03cef 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2030,6 +2030,17 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ /* see max_connections */
+ {"max_logical_slots", PGC_POSTMASTER, REPLICATION_SENDING,
+ gettext_noop("Sets the maximum number of simultaneously defined WAL decoding slots."),
+ NULL
+ },
+ &max_logical_slots,
+ 0, 0, MAX_BACKENDS /*?*/,
+ NULL, NULL, NULL
+ },
+
+ {
{"wal_sender_timeout", PGC_SIGHUP, REPLICATION_SENDING,
gettext_noop("Sets the maximum time to wait for WAL replication."),
NULL,
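To actually use the new GUC, wal_level needs to be raised as well; a minimal postgresql.conf fragment matching the invocation at the top of this mail (values are illustrative):

```
wal_level = logical
max_wal_senders = 10
max_logical_slots = 10
```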
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index f64d52d..e24c712 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -64,6 +64,8 @@
#include "access/xact.h"
#include "storage/bufmgr.h"
#include "storage/procarray.h"
+#include "utils/builtins.h"
+#include "utils/combocid.h"
#include "utils/tqual.h"
@@ -73,6 +75,8 @@ SnapshotData SnapshotSelfData = {HeapTupleSatisfiesSelf};
SnapshotData SnapshotAnyData = {HeapTupleSatisfiesAny};
SnapshotData SnapshotToastData = {HeapTupleSatisfiesToast};
+static Snapshot SnapshotNowDecoding;
+
/* local functions */
static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
@@ -1407,3 +1411,248 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
return false;
}
+
+static bool
+TransactionIdInArray(TransactionId xid, TransactionId *xip, Size num)
+{
+ return bsearch(&xid, xip, num,
+ sizeof(TransactionId), xidComparator) != NULL;
+}
+
+static HTAB *tuplecid_data = NULL;
+
+/*
+ * See the comments for HeapTupleSatisfiesMVCC for the semantics this function
+ * obeys.
+ *
+ * Only usable on tuples from catalog tables!
+ *
+ * We don't need to support HEAP_MOVED_(IN|OFF) for now because we only support
+ * reading catalog pages which couldn't have been created in an older version.
+ *
+ * We don't set any hint bits in here, as it seems unlikely to be beneficial:
+ * those should already be set by normal access. It also seems too dangerous,
+ * since the semantics of doing so during timetravel are more complicated than
+ * when dealing "only" with the present.
+ */
+bool
+HeapTupleSatisfiesMVCCDuringDecoding(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer)
+{
+ HeapTupleHeader tuple = htup->t_data;
+/*#define DEBUG_ME*/
+ TransactionId xmin = HeapTupleHeaderGetXmin(tuple);
+ TransactionId xmax = HeapTupleHeaderGetXmax(tuple);
+
+ Assert(ItemPointerIsValid(&htup->t_self));
+ Assert(htup->t_tableOid != InvalidOid);
+
+ /* transaction aborted */
+ if (tuple->t_infomask & HEAP_XMIN_INVALID)
+ {
+ Assert(!TransactionIdDidCommit(xmin));
+ goto invisible;
+ }
+ /* check if its one of our txids, toplevel is also in there */
+ else if (TransactionIdInArray(xmin, snapshot->subxip, snapshot->subxcnt))
+ {
+ CommandId cmin = HeapTupleHeaderGetRawCommandId(tuple);
+ CommandId cmax = InvalidCommandId;
+
+ /*
+ * If another transaction deleted this tuple, or if cmin/cmax is stored
+ * in a combocid, we need to look up the actual values externally.
+ */
+ if ((!(tuple->t_infomask & HEAP_XMAX_INVALID) &&
+ !TransactionIdInArray(xmax, snapshot->subxip, snapshot->subxcnt)) ||
+ tuple->t_infomask & HEAP_COMBOCID
+ )
+ {
+ bool resolved;
+
+ resolved = ResolveCminCmaxDuringDecoding(tuplecid_data, htup,
+ buffer, &cmin, &cmax);
+
+ if (!resolved)
+ elog(ERROR, "could not resolve cmin/cmax of catalog tuple");
+ }
+
+#ifdef DEBUG_ME
+ elog(LOG, "curcid: %u cmin: %u; invisible: %u", snapshot->curcid, cmin,
+ cmin >= snapshot->curcid);
+#endif
+ if (cmin >= snapshot->curcid)
+ goto invisible; /* inserted after scan started */
+ }
+ /* normal transaction state counts */
+ else if (TransactionIdPrecedes(xmin, snapshot->xmin))
+ {
+ Assert(!(tuple->t_infomask & HEAP_XMIN_COMMITTED &&
+ !TransactionIdDidCommit(xmin)));
+
+ if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED) &&
+ !TransactionIdDidCommit(xmin))
+ goto invisible;
+ }
+ /* beyond our xmax horizon, i.e. invisible */
+ else if (TransactionIdFollows(xmin, snapshot->xmax))
+ {
+ goto invisible;
+ }
+ /* check if we know the transaction has committed */
+ else if (TransactionIdInArray(xmin, snapshot->xip, snapshot->xcnt))
+ {
+ }
+ else
+ {
+ goto invisible;
+ }
+
+ /* at this point we know xmin is visible */
+
+ /* why should those be in catalog tables? */
+ Assert(!(tuple->t_infomask & HEAP_XMAX_IS_MULTI));
+
+ /* xid invalid or aborted */
+ if (tuple->t_infomask & HEAP_XMAX_INVALID)
+ goto visible;
+ /* locked tuples are always visible */
+ else if (tuple->t_infomask & HEAP_IS_LOCKED)
+ goto visible;
+ /* check if its one of our txids, toplevel is also in there */
+ else if (TransactionIdInArray(xmax, snapshot->subxip, snapshot->subxcnt))
+ {
+ CommandId cmin;
+ CommandId cmax = HeapTupleHeaderGetRawCommandId(tuple);
+
+ /* Lookup actual cmin/cmax values */
+ if (tuple->t_infomask & HEAP_COMBOCID) {
+#ifdef DEBUG_ME
+ CommandId combocid = cmax;
+#endif
+ bool resolved;
+
+ resolved = ResolveCminCmaxDuringDecoding(tuplecid_data, htup,
+ buffer, &cmin, &cmax);
+
+ if (!resolved)
+ {
+ elog(FATAL, "could not resolve combocid to cmax");
+ goto invisible;
+ }
+
+
+#ifdef DEBUG_ME
+ elog(LOG, "converting combocid %u to cmax %u (cmin %u)",
+ combocid, cmax, cmin);
+#endif
+ }
+#ifdef DEBUG_ME
+ elog(LOG, "curcid: %u, cmax %u, visible: %u", snapshot->curcid, cmax,
+ cmax >= snapshot->curcid);
+#endif
+ if (cmax >= snapshot->curcid)
+ goto visible; /* deleted after scan started */
+ else
+ goto invisible; /* deleted before scan started */
+ }
+ /* we cannot possibly see the deleting transaction */
+ else if (TransactionIdFollows(xmax, snapshot->xmax))
+ {
+ goto visible;
+ }
+ /* normal transaction state is valid */
+ else if (TransactionIdPrecedes(xmax, snapshot->xmin))
+ {
+ Assert(!(tuple->t_infomask & HEAP_XMAX_COMMITTED &&
+ !TransactionIdDidCommit(xmax)));
+
+ if (tuple->t_infomask & HEAP_XMAX_COMMITTED)
+ goto invisible;
+
+ if (TransactionIdDidCommit(xmax))
+ goto invisible;
+ else
+ goto visible;
+ }
+ /* do we know that the deleting txn is valid? */
+ else if (TransactionIdInArray(xmax, snapshot->xip, snapshot->xcnt))
+ {
+ goto invisible;
+ }
+ else
+ {
+ goto visible;
+ }
+visible:
+#ifdef DEBUG_ME
+ if (xmin > FirstNormalTransactionId)
+ elog(DEBUG1, "visible tuple with xmin: %u, xmax: %u, cmin %u, snapmin: %u, snapmax: %u, snapcid: %u, owncnt: %u top: %u, combo: %u, (%u, %u)",
+ xmin, xmax, HeapTupleHeaderGetRawCommandId(tuple),
+ snapshot->xmin, snapshot->xmax, snapshot->curcid,
+ snapshot->subxcnt, snapshot->subxcnt ? snapshot->subxip[0] : 0,
+ !!(tuple->t_infomask & HEAP_COMBOCID),
+ BlockIdGetBlockNumber(&htup->t_self.ip_blkid), htup->t_self.ip_posid);
+#endif
+ return true;
+
+invisible:
+#ifdef DEBUG_ME
+ if (xmin > FirstNormalTransactionId)
+ elog(DEBUG1, "invisible tuple with xmin: %u, xmax: %u, cmin %u, snapmin: %u, snapmax: %u, snapcid: %u, owncnt: %u top: %u, combo: %u, (%u, %u)",
+ xmin, xmax, HeapTupleHeaderGetRawCommandId(tuple),
+ snapshot->xmin, snapshot->xmax, snapshot->curcid,
+ snapshot->subxcnt, snapshot->subxcnt ? snapshot->subxip[0] : 0,
+ !!(tuple->t_infomask & HEAP_COMBOCID),
+ BlockIdGetBlockNumber(&htup->t_self.ip_blkid), htup->t_self.ip_posid);
+#endif
+ return false;
+}
+
+static bool
+FailsSatisfies(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ elog(ERROR, "normal snapshots cannot be used during timetravel access");
+ return false;
+}
+
+static bool
+RedirectSatisfiesNow(HeapTuple htup, Snapshot snapshot, Buffer buffer)
+{
+ Assert(SnapshotNowDecoding != NULL);
+ return HeapTupleSatisfiesMVCCDuringDecoding(htup, SnapshotNowDecoding,
+ buffer);
+}
+
+void
+SetupDecodingSnapshots(Snapshot snapshot_now, HTAB *tuplecids)
+{
+ /* prevent recursively setting up decoding snapshots */
+ Assert(SnapshotNowData.satisfies != RedirectSatisfiesNow);
+
+ SnapshotNowData.satisfies = RedirectSatisfiesNow;
+ /* make sure normal snapshots aren't used */
+ SnapshotSelfData.satisfies = FailsSatisfies;
+ SnapshotAnyData.satisfies = FailsSatisfies;
+ SnapshotToastData.satisfies = FailsSatisfies;
+
+ /* setup the timetravel snapshot */
+ SnapshotNowDecoding = snapshot_now;
+
+ /* setup (cmin, cmax) lookup hash */
+ tuplecid_data = tuplecids;
+}
+
+
+void
+RevertFromDecodingSnapshots(void)
+{
+ SnapshotNowDecoding = NULL;
+ tuplecid_data = NULL;
+
+ /* restore the regular visibility routines */
+ SnapshotNowData.satisfies = HeapTupleSatisfiesNow;
+ SnapshotSelfData.satisfies = HeapTupleSatisfiesSelf;
+ SnapshotAnyData.satisfies = HeapTupleSatisfiesAny;
+ SnapshotToastData.satisfies = HeapTupleSatisfiesToast;
+}
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 129c4d0..10080d0 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -77,6 +77,8 @@ wal_level_str(WalLevel wal_level)
return "archive";
case WAL_LEVEL_HOT_STANDBY:
return "hot_standby";
+ case WAL_LEVEL_LOGICAL:
+ return "logical";
}
return _("unrecognized wal_level");
}
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 8ec710e..1405317 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -54,6 +54,7 @@
#define XLOG_HEAP2_CLEANUP_INFO 0x30
#define XLOG_HEAP2_VISIBLE 0x40
#define XLOG_HEAP2_MULTI_INSERT 0x50
+#define XLOG_HEAP2_NEW_CID 0x60
/*
* All what we need to find changed tuple
@@ -238,6 +239,28 @@ typedef struct xl_heap_visible
#define SizeOfHeapVisible (offsetof(xl_heap_visible, cutoff_xid) + sizeof(TransactionId))
+typedef struct xl_heap_new_cid
+{
+ /*
+ * store toplevel xid so we don't have to merge cids from different
+ * transactions
+ */
+ TransactionId top_xid;
+ CommandId cmin;
+ CommandId cmax;
+ /*
+ * We don't really need the combocid, but the padding makes it free and
+ * it's useful for debugging.
+ */
+ CommandId combocid;
+ /*
+ * Store the relfilenode/ctid pair to facilitate lookups.
+ */
+ xl_heaptid target;
+} xl_heap_new_cid;
+
+#define SizeOfHeapNewCid (offsetof(xl_heap_new_cid, target) + SizeOfHeapTid)
+
extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
TransactionId *latestRemovedXid);
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 228f6a1..915b2cd 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -63,6 +63,11 @@
(AssertMacro(TransactionIdIsNormal(id1) && TransactionIdIsNormal(id2)), \
(int32) ((id1) - (id2)) < 0)
+/* compare two XIDs already known to be normal; this is a macro for speed */
+#define NormalTransactionIdFollows(id1, id2) \
+ (AssertMacro(TransactionIdIsNormal(id1) && TransactionIdIsNormal(id2)), \
+ (int32) ((id1) - (id2)) > 0)
+
/* ----------
* Object ID (OID) zero is InvalidOid.
*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 32c2e40..ae4f849 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -196,7 +196,8 @@ typedef enum WalLevel
{
WAL_LEVEL_MINIMAL = 0,
WAL_LEVEL_ARCHIVE,
- WAL_LEVEL_HOT_STANDBY
+ WAL_LEVEL_HOT_STANDBY,
+ WAL_LEVEL_LOGICAL
} WalLevel;
extern int wal_level;
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 298641b..8848fd2 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -103,4 +103,8 @@ extern bool ReindexIsProcessingHeap(Oid heapOid);
extern bool ReindexIsProcessingIndex(Oid indexOid);
extern Oid IndexGetRelation(Oid indexId, bool missing_ok);
+extern void relationFindPrimaryKey(Relation pkrel, Oid *indexOid,
+ int16 *nratts, int16 *attnums, Oid *atttypids,
+ Oid *opclasses);
+
#endif /* INDEX_H */
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 438a1d9..223849b 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -407,6 +407,8 @@ typedef enum NodeTag
T_IdentifySystemCmd,
T_BaseBackupCmd,
T_StartReplicationCmd,
+ T_InitLogicalReplicationCmd,
+ T_StartLogicalReplicationCmd,
/*
* TAGS FOR RANDOM OTHER STUFF
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 236a36d..91a4986 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -49,4 +49,26 @@ typedef struct StartReplicationCmd
XLogRecPtr startpoint;
} StartReplicationCmd;
+
+/* ----------------------
+ * INIT_LOGICAL_REPLICATION command
+ * ----------------------
+ */
+typedef struct InitLogicalReplicationCmd
+{
+ NodeTag type;
+ char *plugin;
+} InitLogicalReplicationCmd;
+
+/* ----------------------
+ * START_LOGICAL_REPLICATION command
+ * ----------------------
+ */
+typedef struct StartLogicalReplicationCmd
+{
+ NodeTag type;
+ char *name;
+ XLogRecPtr startpoint;
+} StartLogicalReplicationCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/include/replication/decode.h b/src/include/replication/decode.h
new file mode 100644
index 0000000..1caa98d
--- /dev/null
+++ b/src/include/replication/decode.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ * decode.h
+ * PostgreSQL WAL to logical transformation
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DECODE_H
+#define DECODE_H
+
+#include "access/xlogreader.h"
+#include "replication/reorderbuffer.h"
+#include "replication/logicalfuncs.h"
+
+void DecodeRecordIntoReorderBuffer(XLogReaderState *reader,
+ ReaderApplyState* state,
+ XLogRecordBuffer* buf);
+
+#endif
diff --git a/src/include/replication/logicalfuncs.h b/src/include/replication/logicalfuncs.h
new file mode 100644
index 0000000..db78797
--- /dev/null
+++ b/src/include/replication/logicalfuncs.h
@@ -0,0 +1,44 @@
+/*-------------------------------------------------------------------------
+ * logicalfuncs.h
+ * PostgreSQL WAL to logical transformation
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOGICALFUNCS_H
+#define LOGICALFUNCS_H
+
+#include "access/xlogreader.h"
+#include "replication/reorderbuffer.h"
+#include "replication/output_plugin.h"
+
+typedef struct ReaderApplyState
+{
+ struct ReorderBuffer *reorderbuffer;
+ bool stop_after_consistent;
+ struct Snapstate *snapstate;
+
+ LogicalDecodeInitCB init_cb;
+ LogicalDecodeBeginCB begin_cb;
+ LogicalDecodeChangeCB change_cb;
+ LogicalDecodeCommitCB commit_cb;
+ LogicalDecodeCleanupCB cleanup_cb;
+
+ StringInfo out;
+ void *user_private;
+
+
+} ReaderApplyState;
+
+XLogReaderState *
+initial_snapshot_reader(void);
+
+XLogReaderState *
+normal_snapshot_reader(char *plugin, XLogRecPtr valid_after);
+
+bool
+initial_snapshot_ready(XLogReaderState *);
+
+#endif
diff --git a/src/include/replication/output_plugin.h b/src/include/replication/output_plugin.h
new file mode 100644
index 0000000..27d5982
--- /dev/null
+++ b/src/include/replication/output_plugin.h
@@ -0,0 +1,76 @@
+/*-------------------------------------------------------------------------
+ * output_plugin.h
+ * PostgreSQL Logical Decode Plugin Interface
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef OUTPUT_PLUGIN_H
+#define OUTPUT_PLUGIN_H
+
+#include "lib/stringinfo.h"
+
+#include "replication/reorderbuffer.h"
+
+/*
+ * Callback that gets called in a user-defined plugin.
+ * 'private' can be set to some private data.
+ *
+ * Gets looked up as the library symbol pg_decode_init.
+ */
+typedef void (*LogicalDecodeInitCB) (
+ void **private
+ );
+
+/*
+ * Gets called for every BEGIN of a successful transaction.
+ *
+ * Return "true" if the message in "out" should get sent, false otherwise.
+ *
+ * Gets looked up in the library symbol pg_decode_begin_txn.
+ */
+typedef bool (*LogicalDecodeBeginCB) (
+ void *private,
+ StringInfo out,
+ ReorderBufferTXN *txn);
+
+/*
+ * Gets called for every change in a successful transaction.
+ *
+ * Return "true" if the message in "out" should get sent, false otherwise.
+ *
+ * Gets looked up in the library symbol pg_decode_change.
+ */
+typedef bool (*LogicalDecodeChangeCB) (
+ void *private,
+ StringInfo out,
+ ReorderBufferTXN *txn,
+ Oid tableoid,
+ ReorderBufferChange *change
+ );
+
+/*
+ * Gets called for every COMMIT of a successful transaction.
+ *
+ * Return "true" if the message in "out" should get sent, false otherwise.
+ *
+ * Gets looked up in the library symbol pg_decode_commit_txn.
+ */
+typedef bool (*LogicalDecodeCommitCB) (
+ void *private,
+ StringInfo out,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * Gets called to clean up the state of an output plugin.
+ *
+ * Gets looked up in the library symbol pg_decode_cleanup.
+ */
+typedef void (*LogicalDecodeCleanupCB) (
+ void *private
+ );
+
+#endif
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
new file mode 100644
index 0000000..a79dd79
--- /dev/null
+++ b/src/include/replication/reorderbuffer.h
@@ -0,0 +1,284 @@
+/*
+ * reorderbuffer.h
+ *
+ * PostgreSQL logical replay "cache" management
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/replication/reorderbuffer.h
+ */
+#ifndef REORDERBUFFER_H
+#define REORDERBUFFER_H
+
+#include "access/htup_details.h"
+#include "utils/hsearch.h"
+
+#include "lib/ilist.h"
+
+#include "storage/sinval.h"
+
+#include "utils/snapshot.h"
+
+
+typedef struct ReorderBuffer ReorderBuffer;
+
+enum ReorderBufferChangeType
+{
+ REORDER_BUFFER_CHANGE_INSERT,
+ REORDER_BUFFER_CHANGE_UPDATE,
+ REORDER_BUFFER_CHANGE_DELETE
+};
+
+typedef struct ReorderBufferTupleBuf
+{
+ /* position in preallocated list */
+ slist_node node;
+
+ HeapTupleData tuple;
+ HeapTupleHeaderData header;
+ char data[MaxHeapTupleSize];
+} ReorderBufferTupleBuf;
+
+typedef struct ReorderBufferChange
+{
+ XLogRecPtr lsn;
+
+ union {
+ enum ReorderBufferChangeType action;
+ /* do not leak internal enum values to the outside */
+ int action_internal;
+ };
+
+ RelFileNode relnode;
+
+ union
+ {
+ ReorderBufferTupleBuf *newtuple;
+ Snapshot snapshot;
+ CommandId command_id;
+ struct {
+ RelFileNode node;
+ ItemPointerData tid;
+ CommandId cmin;
+ CommandId cmax;
+ CommandId combocid;
+ } tuplecid;
+ };
+
+ ReorderBufferTupleBuf *oldtuple;
+
+ /*
+ * While in use, this is how a change is linked into a transaction's
+ * list of changes; otherwise it's in the preallocated list.
+ */
+ dlist_node node;
+} ReorderBufferChange;
+
+typedef struct ReorderBufferTXN
+{
+ TransactionId xid;
+
+ XLogRecPtr lsn;
+
+ /* did the TX have catalog changes */
+ bool does_timetravel;
+
+ bool has_base_snapshot;
+
+ /*
+ * How many ReorderBufferChange's do we have in this txn.
+ *
+ * Subtransactions are *not* included.
+ */
+ Size nentries;
+
+ /*
+ * How many of the above entries are stored in memory in contrast to being
+ * spilled to disk.
+ */
+ Size nentries_mem;
+
+ /*
+ * List of actual changes, those include new Snapshots and new CommandIds
+ */
+ dlist_head changes;
+
+ /*
+ * List of cmin/cmax pairs for catalog tuples
+ */
+ dlist_head tuplecids;
+
+ /*
+ * Number of stored cmin/cmax pairs. Used to create the tuplecid_hash with
+ * the correct size.
+ */
+ Size ntuplecids;
+
+ /*
+ * On-demand built hash for looking up the above values.
+ */
+ HTAB *tuplecid_hash;
+
+ /*
+ * non-hierarchical list of subtransactions that are *not* aborted
+ */
+ dlist_head subtxns;
+ Size nsubtxns;
+
+ /*
+ * Our position in a list of subtransactions while the TXN is in use;
+ * otherwise it's the position in the list of preallocated transactions.
+ */
+ dlist_node node;
+
+ /*
+ * Number of stored cache invalidations.
+ */
+ Size ninvalidations;
+
+ /*
+ * Stored cache invalidations. This is not a linked list because we get all
+ * the invalidations at once.
+ */
+ SharedInvalidationMessage *invalidations;
+
+} ReorderBufferTXN;
+
+
+/* XXX: we're currently passing the originating subtxn; not sure that's necessary */
+typedef void (*ReorderBufferApplyChangeCB) (
+ ReorderBuffer *cache,
+ ReorderBufferTXN *txn,
+ ReorderBufferChange *change);
+
+typedef void (*ReorderBufferBeginCB) (
+ ReorderBuffer *cache,
+ ReorderBufferTXN *txn);
+
+typedef void (*ReorderBufferCommitCB) (
+ ReorderBuffer *cache,
+ ReorderBufferTXN *txn,
+ XLogRecPtr commit_lsn);
+
+/*
+ * The maximum number of concurrent top-level transactions, or transactions
+ * where we don't yet know whether they are top-level, can be calculated as:
+ * (max_connections + max_prepared_xacts + ?) * PGPROC_MAX_CACHED_SUBXIDS
+ */
+struct ReorderBuffer
+{
+ /*
+ * Should snapshots for decoding be collected? If many catalog changes
+ * happen, this can be considerably expensive.
+ */
+ bool build_snapshots;
+
+ TransactionId last_txn;
+ ReorderBufferTXN *last_txn_cache;
+ HTAB *by_txn;
+
+ ReorderBufferBeginCB begin;
+ ReorderBufferApplyChangeCB apply_change;
+ ReorderBufferCommitCB commit;
+
+ void *private_data;
+
+ MemoryContext context;
+
+ /*
+ * We don't want to repeatedly (de-)allocate these structs, so we cache
+ * them for reuse.
+ */
+ dlist_head cached_transactions;
+ Size nr_cached_transactions;
+
+ dlist_head cached_changes;
+ Size nr_cached_changes;
+
+ slist_head cached_tuplebufs;
+ Size nr_cached_tuplebufs;
+};
+
+
+ReorderBuffer *ReorderBufferAllocate(void);
+
+void ReorderBufferFree(ReorderBuffer *);
+
+ReorderBufferTupleBuf *ReorderBufferGetTupleBuf(ReorderBuffer *);
+
+void ReorderBufferReturnTupleBuf(ReorderBuffer *cache, ReorderBufferTupleBuf * tuple);
+
+/*
+ * Returns a (potentially preallocated) change struct. Its lifetime is managed
+ * by the reorderbuffer module.
+ *
+ * If not added to a transaction with ReorderBufferAddChange it needs to be
+ * returned via ReorderBufferReturnChange
+ *
+ * FIXME: better name
+ */
+ReorderBufferChange *ReorderBufferGetChange(ReorderBuffer *);
+
+/*
+ * Return an unused ReorderBufferChange struct
+ */
+void ReorderBufferReturnChange(ReorderBuffer *, ReorderBufferChange *);
+
+
+/*
+ * record the transaction as in-progress if not already done, add the current
+ * change.
+ *
+ * We have a one-entry cache for looking up the current ReorderBufferTXN so we
+ * don't need to do a full hash lookup if the same xid is used
+ * sequentially; the same xid being used several times in a row is rather frequent.
+ */
+void ReorderBufferAddChange(ReorderBuffer *, TransactionId, XLogRecPtr lsn, ReorderBufferChange *);
+
+/*
+ * Commit a transaction, replaying its accumulated changes via the callbacks.
+ */
+void ReorderBufferCommit(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+
+void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr lsn);
+
+void ReorderBufferAbort(ReorderBuffer *, TransactionId, XLogRecPtr lsn);
+
+/*
+ * if lsn == InvalidXLogRecPtr this is the first snap for the transaction
+ *
+ * most callers don't need snapshot.h, so we use struct SnapshotData instead
+ */
+void ReorderBufferAddBaseSnapshot(ReorderBuffer *, TransactionId, XLogRecPtr lsn, struct SnapshotData *snap);
+
+/*
+ * Will only be called for command ids > 1
+ */
+void ReorderBufferAddNewCommandId(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+ CommandId cid);
+
+void ReorderBufferAddNewTupleCids(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+ RelFileNode node, ItemPointerData pt,
+ CommandId cmin, CommandId cmax, CommandId combocid);
+
+void ReorderBufferAddInvalidations(ReorderBuffer *, TransactionId, XLogRecPtr lsn,
+ Size nmsgs, SharedInvalidationMessage* msgs);
+
+bool ReorderBufferIsXidKnown(ReorderBuffer *cache, TransactionId xid);
+
+/*
+ * Announce that tx does timetravel. Relevant for the whole toplevel/subtxn
+ * tree.
+ */
+void ReorderBufferXidSetTimetravel(ReorderBuffer *cache, TransactionId xid);
+
+/*
+ * Does the transaction indicated by 'xid' do timetravel?
+ */
+bool ReorderBufferXidDoesTimetravel(ReorderBuffer *cache, TransactionId xid);
+
+bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *cache, TransactionId xid);
+
+#endif
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
new file mode 100644
index 0000000..e8c5fcb
--- /dev/null
+++ b/src/include/replication/snapbuild.h
@@ -0,0 +1,128 @@
+/*-------------------------------------------------------------------------
+ *
+ * snapbuild.h
+ * Exports from replication/logical/snapbuild.c.
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * src/include/replication/snapbuild.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SNAPBUILD_H
+#define SNAPBUILD_H
+
+#include "replication/reorderbuffer.h"
+
+#include "utils/hsearch.h"
+#include "utils/snapshot.h"
+#include "access/htup.h"
+
+typedef enum
+{
+ /*
+ * Initial state, we can't do much yet.
+ */
+ SNAPBUILD_START,
+
+ /*
+ * We have collected enough information to decode tuples in transactions
+ * that started after this.
+ *
+ * Once we have reached this state we start to collect changes. We cannot
+ * apply them yet because they might be based on transactions that were
+ * still running when we reached this state.
+ */
+ SNAPBUILD_FULL_SNAPSHOT,
+
+ /*
+ * Found a point, after reaching SNAPBUILD_FULL_SNAPSHOT, where all
+ * transactions that were running at that point have finished. Until we
+ * reach it we hold off calling any commit callbacks.
+ */
+ SNAPBUILD_CONSISTENT
+} SnapBuildState;
+
+typedef enum
+{
+ SNAPBUILD_SKIP,
+ SNAPBUILD_DECODE
+} SnapBuildAction;
+
+typedef struct Snapstate
+{
+ SnapBuildState state;
+
+ /* all transactions smaller than this have committed/aborted */
+ TransactionId xmin;
+
+ /* all transactions bigger than this are uncommitted */
+ TransactionId xmax;
+
+ /*
+ * All transactions in this window have to be checked via the running
+ * array. This will only be used initially till we are past xmax_running.
+ *
+ * Note that we initially treat already-running transactions as having
+ * catalog modifications, because we don't have enough information about
+ * them to judge that properly.
+ */
+ TransactionId xmin_running;
+ TransactionId xmax_running;
+
+ /*
+ * array of running transactions.
+ *
+ * Kept in xidComparator order so it can be searched with bsearch().
+ */
+ TransactionId *running;
+ /* how many transactions are still running */
+ size_t nrrunning;
+
+ /*
+ * We need to keep track of the initial number of transactions separately
+ * from nrrunning, as nrrunning_initial gives the range of valid xids in
+ * the array so bsearch() can work.
+ */
+ size_t nrrunning_initial;
+
+ XLogRecPtr transactions_after;
+
+ /*
+ * Transactions which could have catalog changes that committed between
+ * xmin and xmax
+ */
+ size_t nrcommitted;
+ size_t nrcommitted_space;
+ /*
+ * Array of committed transactions that have modified the catalog.
+ *
+ * As this array is frequently modified we do *not* keep it in
+ * xidComparator order. Instead we sort the array when building &
+ * distributing a snapshot.
+ */
+ TransactionId *committed;
+
+ /*
+ * Snapshot that's valid to see all committed transactions that made
+ * catalog modifications.
+ */
+ Snapshot snapshot;
+} Snapstate;
+
+extern Snapstate *AllocateSnapshotBuilder(ReorderBuffer *cache);
+
+extern void FreeSnapshotBuilder(Snapstate *cache);
+
+struct XLogRecordBuffer;
+
+extern SnapBuildAction SnapBuildDecodeCallback(ReorderBuffer *cache, Snapstate *snapstate, struct XLogRecordBuffer *buf);
+
+extern HeapTuple LookupTableByRelFileNode(RelFileNode *r);
+
+extern bool SnapBuildHasCatalogChanges(Snapstate *snapstate, TransactionId xid,
+ RelFileNode *relfilenode);
+
+extern void SnapBuildSnapDecRefcount(Snapshot snap);
+
+#endif /* SNAPBUILD_H */
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index df8e951..b2d9434 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -24,6 +24,7 @@ extern bool wake_wal_senders;
/* user-settable parameters */
extern int max_wal_senders;
+extern int max_logical_slots;
extern int wal_sender_timeout;
extern void InitWalSender(void);
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 66234cd..c712659 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -66,6 +66,28 @@ typedef struct WalSnd
extern WalSnd *MyWalSnd;
+typedef struct
+{
+ TransactionId xmin;
+ NameData name;
+ NameData plugin;
+
+ XLogRecPtr last_required_checkpoint;
+ XLogRecPtr confirmed_flush;
+
+ TransactionId candidate_xmin;
+ XLogRecPtr candidate_xmin_after;
+
+ /* is this slot defined */
+ bool in_use;
+ /* is somebody streaming out changes for this slot */
+ bool active;
+ slock_t mutex;
+} LogicalWalSnd;
+
+extern LogicalWalSnd *MyLogicalWalSnd;
+
+
/* There is one WalSndCtl struct for the whole database cluster */
typedef struct
{
@@ -88,12 +110,14 @@ typedef struct
*/
bool sync_standbys_defined;
+ TransactionId logical_xmin;
+
WalSnd walsnds[1]; /* VARIABLE LENGTH ARRAY */
+ LogicalWalSnd logical_walsnds[1]; /* VARIABLE LENGTH ARRAY */
} WalSndCtlData;
extern WalSndCtlData *WalSndCtl;
-
extern void WalSndSetState(WalSndState state);
extern void XLogRead(char *buf, XLogRecPtr startptr, Size count);
@@ -109,4 +133,12 @@ extern void replication_scanner_finish(void);
extern Node *replication_parse_result;
+/* change logical xmin */
+extern void IncreaseLogicalXminForSlot(XLogRecPtr lsn, TransactionId xmin);
+
+/* logical wal sender data gathering functions */
+extern void WalSndWriteData(StringInfo data);
+extern void WalSndPrepareWrite(StringInfo out, XLogRecPtr lsn);
+
+
#endif /* _WALSENDER_PRIVATE_H */
diff --git a/src/include/storage/itemptr.h b/src/include/storage/itemptr.h
index 331812b..fc5f86d 100644
--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -116,6 +116,9 @@ typedef ItemPointerData *ItemPointer;
/*
* ItemPointerCopy
* Copies the contents of one disk item pointer to another.
+ *
+ * Should there ever be padding in an ItemPointer this would need to be handled
+ * differently, as it's used in hashes.
*/
#define ItemPointerCopy(fromPointer, toPointer) \
( \
diff --git a/src/include/storage/sinval.h b/src/include/storage/sinval.h
index bcf2c81..1b68bab 100644
--- a/src/include/storage/sinval.h
+++ b/src/include/storage/sinval.h
@@ -136,4 +136,6 @@ extern void ProcessCommittedInvalidationMessages(SharedInvalidationMessage *msgs
int nmsgs, bool RelcacheInitFileInval,
Oid dbid, Oid tsid);
+extern void LocalExecuteInvalidationMessage(SharedInvalidationMessage *msg);
+
#endif /* SINVAL_H */
diff --git a/src/include/utils/tqual.h b/src/include/utils/tqual.h
index b129ae9..2e9a7d8 100644
--- a/src/include/utils/tqual.h
+++ b/src/include/utils/tqual.h
@@ -39,7 +39,8 @@ extern PGDLLIMPORT SnapshotData SnapshotToastData;
/* This macro encodes the knowledge of which snapshots are MVCC-safe */
#define IsMVCCSnapshot(snapshot) \
- ((snapshot)->satisfies == HeapTupleSatisfiesMVCC)
+ ((snapshot)->satisfies == HeapTupleSatisfiesMVCC || \
+ (snapshot)->satisfies == HeapTupleSatisfiesMVCCDuringDecoding)
/*
* HeapTupleSatisfiesVisibility
@@ -89,4 +90,32 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid);
+/*
+ * Special "satisfies" routines used during decoding xlog from a different
+ * point of lsn. Also used for timetravel SnapshotNow's.
+ */
+extern bool HeapTupleSatisfiesMVCCDuringDecoding(HeapTuple htup,
+ Snapshot snapshot, Buffer buffer);
+
+/*
+ * Install the 'snapshot_now' snapshot as a timetravelling snapshot, replacing
+ * the normal SnapshotNow behaviour. This snapshot needs to have been created
+ * by snapbuild.c, otherwise you will see crashes!
+ *
+ * FIXME: We need something resembling the real SnapshotNow to handle things
+ * like enum lookups from indices correctly.
+ */
+extern void SetupDecodingSnapshots(Snapshot snapshot_now, HTAB *tuplecids);
+extern void RevertFromDecodingSnapshots(void);
+
+/*
+ * resolve combocids and overwritten cmin values
+ *
+ * To avoid leaking too much knowledge about the reorderbuffer this is
+ * implemented in reorderbuffer.c not tqual.c.
+ */
+extern bool ResolveCminCmaxDuringDecoding(HTAB *tuplecid_data, HeapTuple htup,
+ Buffer buffer,
+ CommandId *cmin, CommandId *cmax);
+
#endif /* TQUAL_H */
---
contrib/Makefile | 1 +
contrib/test_decoding/Makefile | 16 +++
contrib/test_decoding/test_decoding.c | 192 ++++++++++++++++++++++++++++++++++
3 files changed, 209 insertions(+)
create mode 100644 contrib/test_decoding/Makefile
create mode 100644 contrib/test_decoding/test_decoding.c
Attachments:
0012-Add-a-simple-decoding-module-in-contrib-named-test_d.patchtext/x-patch; name=0012-Add-a-simple-decoding-module-in-contrib-named-test_d.patchDownload
diff --git a/contrib/Makefile b/contrib/Makefile
index d230451..8709992 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -48,6 +48,7 @@ SUBDIRS = \
tablefunc \
tcn \
test_parser \
+ test_decoding \
tsearch2 \
unaccent \
vacuumlo
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
new file mode 100644
index 0000000..2ac9653
--- /dev/null
+++ b/contrib/test_decoding/Makefile
@@ -0,0 +1,16 @@
+# contrib/test_decoding/Makefile
+
+MODULE_big = test_decoding
+OBJS = test_decoding.o
+
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_decoding
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
new file mode 100644
index 0000000..f3d90e3
--- /dev/null
+++ b/contrib/test_decoding/test_decoding.c
@@ -0,0 +1,192 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_decoding.c
+ * example output plugin for the logical replication functionality
+ *
+ * Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/test_decoding/test_decoding.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "catalog/pg_class.h"
+#include "catalog/pg_type.h"
+#include "catalog/index.h"
+
+#include "replication/output_plugin.h"
+#include "replication/snapbuild.h"
+
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+#include "utils/relcache.h"
+#include "utils/syscache.h"
+#include "utils/typcache.h"
+
+
+PG_MODULE_MAGIC;
+
+void _PG_init(void);
+
+void WalSndWriteData(XLogRecPtr lsn, const char *data, Size len);
+
+extern void pg_decode_init(void **private);
+
+extern bool pg_decode_begin_txn(void *private, StringInfo out, ReorderBufferTXN* txn);
+extern bool pg_decode_commit_txn(void *private, StringInfo out, ReorderBufferTXN* txn, XLogRecPtr commit_lsn);
+extern bool pg_decode_change(void *private, StringInfo out, ReorderBufferTXN* txn, Oid tableoid, ReorderBufferChange *change);
+
+
+void
+_PG_init(void)
+{
+}
+
+void
+pg_decode_init(void **private)
+{
+ AssertVariableIsOfType(&pg_decode_init, LogicalDecodeInitCB);
+ *private = AllocSetContextCreate(TopMemoryContext,
+ "text conversion context",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+}
+
+bool
+pg_decode_begin_txn(void *private, StringInfo out, ReorderBufferTXN* txn)
+{
+ AssertVariableIsOfType(&pg_decode_begin_txn, LogicalDecodeBeginCB);
+
+ appendStringInfo(out, "BEGIN %d", txn->xid);
+ return true;
+}
+
+bool
+pg_decode_commit_txn(void *private, StringInfo out, ReorderBufferTXN* txn, XLogRecPtr commit_lsn)
+{
+ AssertVariableIsOfType(&pg_decode_commit_txn, LogicalDecodeCommitCB);
+
+ appendStringInfo(out, "COMMIT %d", txn->xid);
+ return true;
+}
+
+static void
+tuple_to_stringinfo(StringInfo s, TupleDesc tupdesc, HeapTuple tuple)
+{
+ int i;
+ HeapTuple typeTuple;
+ Form_pg_type pt;
+
+ for (i = 0; i < tupdesc->natts; i++)
+ {
+ Oid typid, typoutput;
+ bool typisvarlena;
+ Datum origval, val;
+ char *outputstr;
+ bool isnull;
+ if (tupdesc->attrs[i]->attisdropped)
+ continue;
+ if (tupdesc->attrs[i]->attnum < 0)
+ continue;
+
+ typid = tupdesc->attrs[i]->atttypid;
+
+ typeTuple = SearchSysCache1(TYPEOID, ObjectIdGetDatum(typid));
+ if (!HeapTupleIsValid(typeTuple))
+ elog(ERROR, "cache lookup failed for type %u", typid);
+ pt = (Form_pg_type) GETSTRUCT(typeTuple);
+
+ appendStringInfoChar(s, ' ');
+ appendStringInfoString(s, NameStr(tupdesc->attrs[i]->attname));
+ appendStringInfoChar(s, '[');
+ appendStringInfoString(s, NameStr(pt->typname));
+ appendStringInfoChar(s, ']');
+
+ getTypeOutputInfo(typid,
+ &typoutput, &typisvarlena);
+
+ ReleaseSysCache(typeTuple);
+
+ origval = fastgetattr(tuple, i + 1, tupdesc, &isnull);
+
+ if (typisvarlena && !isnull)
+ val = PointerGetDatum(PG_DETOAST_DATUM(origval));
+ else
+ val = origval;
+
+ if (isnull)
+ outputstr = "(null)";
+ else
+ outputstr = OidOutputFunctionCall(typoutput, val);
+
+ appendStringInfoChar(s, ':');
+ appendStringInfoString(s, outputstr);
+ }
+}
+
+/* This is just for demonstration; don't ever use this code for anything real! */
+bool
+pg_decode_change(void *private, StringInfo out, ReorderBufferTXN* txn,
+ Oid tableoid, ReorderBufferChange *change)
+{
+ Relation relation = RelationIdGetRelation(tableoid);
+ Form_pg_class class_form = RelationGetForm(relation);
+ TupleDesc tupdesc = RelationGetDescr(relation);
+ MemoryContext context = (MemoryContext)private;
+ MemoryContext old = MemoryContextSwitchTo(context);
+
+ AssertVariableIsOfType(&pg_decode_change, LogicalDecodeChangeCB);
+
+ appendStringInfoString(out, "table \"");
+ appendStringInfoString(out, NameStr(class_form->relname));
+ appendStringInfoString(out, "\":");
+
+ switch (change->action)
+ {
+ case REORDER_BUFFER_CHANGE_INSERT:
+ appendStringInfoString(out, " INSERT:");
+ tuple_to_stringinfo(out, tupdesc, &change->newtuple->tuple);
+ break;
+ case REORDER_BUFFER_CHANGE_UPDATE:
+ appendStringInfoString(out, " UPDATE:");
+ tuple_to_stringinfo(out, tupdesc, &change->newtuple->tuple);
+ break;
+ case REORDER_BUFFER_CHANGE_DELETE:
+ {
+ Oid indexoid = InvalidOid;
+ Relation indexrel;
+ TupleDesc indexdesc;
+
+ int16 pknratts;
+ int16 pkattnum[INDEX_MAX_KEYS];
+ Oid pktypoid[INDEX_MAX_KEYS];
+ Oid pkopclass[INDEX_MAX_KEYS];
+
+ MemSet(pkattnum, 0, sizeof(pkattnum));
+ MemSet(pktypoid, 0, sizeof(pktypoid));
+ MemSet(pkopclass, 0, sizeof(pkopclass));
+
+ appendStringInfoString(out, " DELETE (pkey):");
+
+ relationFindPrimaryKey(relation, &indexoid, &pknratts,
+ pkattnum, pktypoid, pkopclass);
+ indexrel = RelationIdGetRelation(indexoid);
+
+ indexdesc = RelationGetDescr(indexrel);
+
+ tuple_to_stringinfo(out, indexdesc, &change->oldtuple->tuple);
+
+ RelationClose(indexrel);
+ break;
+ }
+ }
+ RelationClose(relation);
+
+ MemoryContextSwitchTo(old);
+ MemoryContextReset(context);
+ return true;
+}
---
src/bin/pg_basebackup/Makefile | 7 +-
src/bin/pg_basebackup/pg_receivellog.c | 717 +++++++++++++++++++++++++++++++++
src/bin/pg_basebackup/streamutil.c | 3 +-
src/bin/pg_basebackup/streamutil.h | 1 +
4 files changed, 725 insertions(+), 3 deletions(-)
create mode 100644 src/bin/pg_basebackup/pg_receivellog.c
Attachments:
0013-Introduce-pg_receivellog-the-pg_receivexlog-equivale.patchtext/x-patch; name=0013-Introduce-pg_receivellog-the-pg_receivexlog-equivale.patchDownload
diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
index 5a2a46a..3775c44 100644
--- a/src/bin/pg_basebackup/Makefile
+++ b/src/bin/pg_basebackup/Makefile
@@ -20,7 +20,7 @@ override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
OBJS=receivelog.o streamutil.o $(WIN32RES)
-all: pg_basebackup pg_receivexlog
+all: pg_basebackup pg_receivexlog pg_receivellog
pg_basebackup: pg_basebackup.o $(OBJS) | submake-libpq submake-libpgport
$(CC) $(CFLAGS) pg_basebackup.o $(OBJS) $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
@@ -28,6 +28,9 @@ pg_basebackup: pg_basebackup.o $(OBJS) | submake-libpq submake-libpgport
pg_receivexlog: pg_receivexlog.o $(OBJS) | submake-libpq submake-libpgport
$(CC) $(CFLAGS) pg_receivexlog.o $(OBJS) $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+pg_receivellog: pg_receivellog.o $(OBJS) | submake-libpq submake-libpgport
+ $(CC) $(CFLAGS) pg_receivellog.o $(OBJS) $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
install: all installdirs
$(INSTALL_PROGRAM) pg_basebackup$(X) '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
$(INSTALL_PROGRAM) pg_receivexlog$(X) '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
@@ -40,4 +43,4 @@ uninstall:
rm -f '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
clean distclean maintainer-clean:
- rm -f pg_basebackup$(X) pg_receivexlog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o
+ rm -f pg_basebackup$(X) pg_receivexlog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o pg_receivellog.o
diff --git a/src/bin/pg_basebackup/pg_receivellog.c b/src/bin/pg_basebackup/pg_receivellog.c
new file mode 100644
index 0000000..1a95991
--- /dev/null
+++ b/src/bin/pg_basebackup/pg_receivellog.c
@@ -0,0 +1,717 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_receivellog.c - receive streaming logical log data and write it
+ * to a local file.
+ *
+ * Author: Magnus Hagander <magnus@hagander.net>
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_basebackup/pg_receivellog.c
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * We have to use postgres.h not postgres_fe.h here, because there's so much
+ * backend-only stuff in the XLOG include files we need. But we need a
+ * frontend-ish environment otherwise. Hence this ugly hack.
+ */
+#define FRONTEND 1
+#include "postgres.h"
+#include "libpq-fe.h"
+#include "libpq/pqsignal.h"
+#include "access/xlog_internal.h"
+#include "utils/datetime.h"
+#include "utils/timestamp.h"
+
+#include "receivelog.h"
+#include "streamutil.h"
+
+#include <dirent.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "getopt_long.h"
+
+/* Time to sleep between reconnection attempts */
+#define RECONNECT_SLEEP_TIME 5
+
+/* Global options */
+char *outfile = NULL;
+int outfd = -1;
+int verbose = 0;
+int noloop = 0;
+int standby_message_timeout = 10 * 1000; /* 10 sec = default */
+volatile bool time_to_abort = false;
+
+
+static void usage(void);
+static void StreamLog(void);
+
+static void
+usage(void)
+{
+ printf(_("%s receives PostgreSQL streaming transaction logs.\n\n"),
+ progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -f, --file=FILE receive log into this file\n"));
+ printf(_(" -n, --no-loop do not loop on connection lost\n"));
+ printf(_(" -v, --verbose output verbose messages\n"));
+ printf(_(" -V, --version output version information, then exit\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+ printf(_("\nConnection options:\n"));
+ printf(_(" -h, --host=HOSTNAME database server host or socket directory\n"));
+ printf(_(" -p, --port=PORT database server port number\n"));
+ printf(_(" -d, --database=DBNAME database to connect to\n"));
+ printf(_(" -s, --status-interval=INTERVAL\n"
+ " time between status packets sent to server (in seconds)\n"));
+ printf(_(" -U, --username=NAME connect as specified database user\n"));
+ printf(_(" -w, --no-password never prompt for password\n"));
+ printf(_(" -W, --password force password prompt (should happen automatically)\n"));
+ printf(_("\nReport bugs to <pgsql-bugs@postgresql.org>.\n"));
+}
+
+
+/*
+ * Local version of GetCurrentTimestamp(), since we are not linked with
+ * backend code. The protocol always uses integer timestamps, regardless of
+ * server setting.
+ */
+static int64
+localGetCurrentTimestamp(void)
+{
+ int64 result;
+ struct timeval tp;
+
+ gettimeofday(&tp, NULL);
+
+ result = (int64) tp.tv_sec -
+ ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
+
+ result = (result * USECS_PER_SEC) + tp.tv_usec;
+
+ return result;
+}
+
+/*
+ * Local version of TimestampDifference(), since we are not linked with
+ * backend code.
+ */
+static void
+localTimestampDifference(int64 start_time, int64 stop_time,
+ long *secs, int *microsecs)
+{
+ int64 diff = stop_time - start_time;
+
+ if (diff <= 0)
+ {
+ *secs = 0;
+ *microsecs = 0;
+ }
+ else
+ {
+ *secs = (long) (diff / USECS_PER_SEC);
+ *microsecs = (int) (diff % USECS_PER_SEC);
+ }
+}
+
+/*
+ * Local version of TimestampDifferenceExceeds(), since we are not
+ * linked with backend code.
+ */
+static bool
+localTimestampDifferenceExceeds(int64 start_time,
+ int64 stop_time,
+ int msec)
+{
+ int64 diff = stop_time - start_time;
+
+ return (diff >= msec * INT64CONST(1000));
+}
+
+/*
+ * Converts an int64 to network byte order.
+ */
+static void
+sendint64(int64 i, char *buf)
+{
+ uint32 n32;
+
+ /* High order half first, since we're doing MSB-first */
+ n32 = (uint32) (i >> 32);
+ n32 = htonl(n32);
+ memcpy(&buf[0], &n32, 4);
+
+ /* Now the low order half */
+ n32 = (uint32) i;
+ n32 = htonl(n32);
+ memcpy(&buf[4], &n32, 4);
+}
+
+/*
+ * Converts an int64 from network byte order to native format.
+ */
+static int64
+recvint64(char *buf)
+{
+ int64 result;
+ uint32 h32;
+ uint32 l32;
+
+ memcpy(&h32, buf, 4);
+ memcpy(&l32, buf + 4, 4);
+ h32 = ntohl(h32);
+ l32 = ntohl(l32);
+
+ result = h32;
+ result <<= 32;
+ result |= l32;
+
+ return result;
+}
+
+/*
+ * Send a Standby Status Update message to server.
+ */
+static bool
+sendFeedback(PGconn *conn, XLogRecPtr blockpos, int64 now, bool replyRequested)
+{
+ char replybuf[1 + 8 + 8 + 8 + 8 + 1];
+ int len = 0;
+
+ replybuf[len] = 'r';
+ len += 1;
+ sendint64(blockpos, &replybuf[len]); /* write */
+ len += 8;
+ sendint64(blockpos, &replybuf[len]); /* flush */
+ len += 8;
+ sendint64(InvalidXLogRecPtr, &replybuf[len]); /* apply */
+ len += 8;
+ sendint64(now, &replybuf[len]); /* sendTime */
+ len += 8;
+ replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
+ len += 1;
+
+ if (PQputCopyData(conn, replybuf, len) <= 0 || PQflush(conn))
+ {
+ fprintf(stderr, _("%s: could not send feedback packet: %s"),
+ progname, PQerrorMessage(conn));
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * Start the log streaming
+ */
+static void
+StreamLog(void)
+{
+ PGresult *res;
+ char query[128];
+ XLogRecPtr startpos;
+ char *id;
+ uint32 hi,
+ lo;
+ char *copybuf = NULL;
+ int64 last_status = -1;
+ XLogRecPtr logoff = InvalidXLogRecPtr;
+
+ /*
+ * Connect in replication mode to the server
+ */
+ conn = GetConnection();
+ if (!conn)
+ /* Error message already written in GetConnection() */
+ return;
+
+ /*
+ * Run IDENTIFY_SYSTEM so we can get the timeline and current xlog
+ * position.
+ */
+ res = PQexec(conn, "IDENTIFY_SYSTEM");
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ {
+ fprintf(stderr, _("%s: could not send replication command \"%s\": %s"),
+ progname, "IDENTIFY_SYSTEM", PQerrorMessage(conn));
+ disconnect_and_exit(1);
+ }
+
+ if (PQntuples(res) != 1 || PQnfields(res) != 4)
+ {
+ fprintf(stderr,
+ _("%s: could not identify system: got %d rows and %d fields, expected %d rows and %d fields\n"),
+ progname, PQntuples(res), PQnfields(res), 1, 4);
+ disconnect_and_exit(1);
+ }
+ PQclear(res);
+
+ /*
+ * init a replication slot
+ */
+ if (verbose)
+ fprintf(stderr,
+ _("%s: init replication slot\n"),
+ progname);
+
+ res = PQexec(conn, "INIT_LOGICAL_REPLICATION 'test_decoding'");
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ {
+ fprintf(stderr, _("%s: could not send replication command \"%s\": %s"),
+ progname, "INIT_LOGICAL_REPLICATION", PQerrorMessage(conn));
+ goto error;
+ }
+
+ if (PQntuples(res) != 1 || PQnfields(res) != 4)
+ {
+ fprintf(stderr,
+ _("%s: could not initialize logical replication: got %d rows and %d fields, expected %d rows and %d fields\n"),
+ progname, PQntuples(res), PQnfields(res), 1, 4);
+ goto error;
+ }
+
+ if (sscanf(PQgetvalue(res, 0, 1), "%X/%X", &hi, &lo) != 2)
+ {
+ fprintf(stderr,
+ _("%s: could not parse log location \"%s\"\n"),
+ progname, PQgetvalue(res, 0, 1));
+ goto error;
+ }
+ startpos = ((uint64) hi) << 32 | lo;
+
+ id = strdup(PQgetvalue(res, 0, 0));
+ PQclear(res);
+
+ /*
+ * Start the replication
+ */
+ if (verbose)
+ fprintf(stderr,
+ _("%s: starting log streaming at %X/%X (slot %s)\n"),
+ progname, (uint32) (startpos >> 32), (uint32) startpos,
+ id);
+
+ /* Initiate the replication stream at specified location */
+ snprintf(query, sizeof(query), "START_LOGICAL_REPLICATION '%s' %X/%X",
+ id, (uint32) (startpos >> 32), (uint32) startpos);
+ res = PQexec(conn, query);
+ if (PQresultStatus(res) != PGRES_COPY_BOTH)
+ {
+ fprintf(stderr, _("%s: could not send replication command \"%s\": %s\n"),
+ progname, "START_LOGICAL_REPLICATION", PQresultErrorMessage(res));
+ PQclear(res);
+ goto error;
+ }
+ PQclear(res);
+
+ if (verbose)
+ fprintf(stderr,
+ _("%s: initiated streaming\n"),
+ progname);
+
+ while (!time_to_abort)
+ {
+ int r;
+ int bytes_left;
+ int bytes_written;
+ int64 now;
+ int hdr_len;
+
+ if (copybuf != NULL)
+ {
+ PQfreemem(copybuf);
+ copybuf = NULL;
+ }
+
+ /*
+ * Potentially send a status message to the master
+ */
+ now = localGetCurrentTimestamp();
+ if (standby_message_timeout > 0 &&
+ localTimestampDifferenceExceeds(last_status, now,
+ standby_message_timeout))
+ {
+ /* Time to send feedback! */
+ if (!sendFeedback(conn, logoff, now, false))
+ goto error;
+
+ last_status = now;
+ }
+
+ r = PQgetCopyData(conn, &copybuf, 1);
+ if (r == 0)
+ {
+ /*
+ * In async mode, and no data available. We block on reading but
+ * not more than the specified timeout, so that we can send a
+ * response back to the client.
+ */
+ fd_set input_mask;
+ struct timeval timeout;
+ struct timeval *timeoutptr;
+
+ FD_ZERO(&input_mask);
+ FD_SET(PQsocket(conn), &input_mask);
+ if (standby_message_timeout)
+ {
+ int64 targettime;
+ long secs;
+ int usecs;
+
+ targettime = last_status + (standby_message_timeout - 1) *
+ ((int64) 1000);
+ localTimestampDifference(now,
+ targettime,
+ &secs,
+ &usecs);
+ if (secs <= 0)
+ timeout.tv_sec = 1; /* Always sleep at least 1 sec */
+ else
+ timeout.tv_sec = secs;
+ timeout.tv_usec = usecs;
+ timeoutptr = &timeout;
+ }
+ else
+ timeoutptr = NULL;
+
+ r = select(PQsocket(conn) + 1, &input_mask, NULL, NULL, timeoutptr);
+ if (r == 0 || (r < 0 && errno == EINTR))
+ {
+ /*
+ * Got a timeout or signal. Continue the loop and either
+ * deliver a status packet to the server or just go back into
+ * blocking.
+ */
+ continue;
+ }
+ else if (r < 0)
+ {
+ fprintf(stderr, _("%s: select() failed: %s\n"),
+ progname, strerror(errno));
+ goto error;
+ }
+ /* Else there is actually data on the socket */
+ if (PQconsumeInput(conn) == 0)
+ {
+ fprintf(stderr,
+ _("%s: could not receive data from WAL stream: %s"),
+ progname, PQerrorMessage(conn));
+ goto error;
+ }
+ continue;
+ }
+ if (r == -1)
+ /* End of copy stream */
+ break;
+ if (r == -2)
+ {
+ fprintf(stderr, _("%s: could not read COPY data: %s"),
+ progname, PQerrorMessage(conn));
+ goto error;
+ }
+
+ /* Check the message type. */
+ if (copybuf[0] == 'k')
+ {
+ int pos;
+ bool replyRequested;
+
+ /*
+ * Parse the keepalive message, enclosed in the CopyData message.
+ * We just check if the server requested a reply, and ignore the
+ * rest.
+ */
+ pos = 1; /* skip msgtype 'k' */
+ pos += 8; /* skip walEnd */
+ pos += 8; /* skip sendTime */
+
+ if (r < pos + 1)
+ {
+ fprintf(stderr, _("%s: streaming header too small: %d\n"),
+ progname, r);
+ goto error;
+ }
+ replyRequested = copybuf[pos];
+
+ /* If the server requested an immediate reply, send one. */
+ if (replyRequested)
+ {
+ now = localGetCurrentTimestamp();
+ if (!sendFeedback(conn, logoff, now, false))
+ goto error;
+ last_status = now;
+ }
+ continue;
+ }
+ else if (copybuf[0] != 'w')
+ {
+ fprintf(stderr, _("%s: unrecognized streaming header: \"%c\"\n"),
+ progname, copybuf[0]);
+ goto error;
+ }
+
+
+ /*
+ * Read the header of the XLogData message, enclosed in the CopyData
+ * message. We only need the WAL location field (dataStart), the rest
+ * of the header is ignored.
+ */
+ hdr_len = 1; /* msgtype 'w' */
+ hdr_len += 8; /* dataStart */
+ hdr_len += 8; /* walEnd */
+ hdr_len += 8; /* sendTime */
+ if (r < hdr_len + 1)
+ {
+ fprintf(stderr, _("%s: streaming header too small: %d\n"),
+ progname, r);
+ goto error;
+ }
+
+ /* Extract WAL location for this block */
+ {
+ XLogRecPtr temp = recvint64(&copybuf[1]);
+ logoff = Max(temp, logoff);
+ }
+
+ if (outfd == -1)
+ {
+ outfd = open(outfile, O_CREAT|O_APPEND|O_WRONLY|PG_BINARY,
+ S_IRUSR | S_IWUSR);
+ if (outfd == -1)
+ {
+ fprintf(stderr,
+ _("%s: could not open log file \"%s\": %s\n"),
+ progname, outfile, strerror(errno));
+ goto error;
+ }
+ }
+
+ bytes_left = r - hdr_len;
+ bytes_written = 0;
+
+
+ while (bytes_left)
+ {
+ int ret;
+
+ ret = write(outfd,
+ copybuf + hdr_len + bytes_written,
+ bytes_left);
+
+ if (ret < 0)
+ {
+ fprintf(stderr,
+ _("%s: could not write %u bytes to log file \"%s\": %s\n"),
+ progname, bytes_left, outfile,
+ strerror(errno));
+ goto error;
+ }
+
+ /* Write was successful, advance our position */
+ bytes_written += ret;
+ bytes_left -= ret;
+ }
+
+ if (write(outfd, "\n", 1) != 1)
+ {
+ fprintf(stderr,
+ _("%s: could not write %u bytes to log file \"%s\": %s\n"),
+ progname, 1, outfile,
+ strerror(errno));
+ goto error;
+ }
+ }
+
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ {
+ fprintf(stderr,
+ _("%s: unexpected termination of replication stream: %s"),
+ progname, PQresultErrorMessage(res));
+ goto error;
+ }
+ PQclear(res);
+
+ if (copybuf != NULL)
+ PQfreemem(copybuf);
+
+ if (outfd != -1 && close(outfd) != 0)
+ fprintf(stderr, _("%s: could not close file \"%s\": %s\n"),
+ progname, outfile, strerror(errno));
+ outfd = -1;
+error:
+ PQfinish(conn);
+}
+
+/*
+ * When sigint is called, just tell the system to exit at the next possible
+ * moment.
+ */
+#ifndef WIN32
+
+static void
+sigint_handler(int signum)
+{
+ time_to_abort = true;
+}
+#endif
+
+int
+main(int argc, char **argv)
+{
+ static struct option long_options[] = {
+ {"help", no_argument, NULL, '?'},
+ {"version", no_argument, NULL, 'V'},
+ {"file", required_argument, NULL, 'f'},
+ {"host", required_argument, NULL, 'h'},
+ {"port", required_argument, NULL, 'p'},
+ {"database", required_argument, NULL, 'd'},
+ {"username", required_argument, NULL, 'U'},
+ {"no-loop", no_argument, NULL, 'n'},
+ {"no-password", no_argument, NULL, 'w'},
+ {"password", no_argument, NULL, 'W'},
+ {"status-interval", required_argument, NULL, 's'},
+ {"verbose", no_argument, NULL, 'v'},
+ {NULL, 0, NULL, 0}
+ };
+ int c;
+ int option_index;
+
+ progname = get_progname(argv[0]);
+ set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_receivellog"));
+
+ if (argc > 1)
+ {
+ if (strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "-?") == 0)
+ {
+ usage();
+ exit(0);
+ }
+ else if (strcmp(argv[1], "-V") == 0 ||
+ strcmp(argv[1], "--version") == 0)
+ {
+ puts("pg_receivellog (PostgreSQL) " PG_VERSION);
+ exit(0);
+ }
+ }
+
+ while ((c = getopt_long(argc, argv, "f:h:p:d:U:s:nwWv",
+ long_options, &option_index)) != -1)
+ {
+ switch (c)
+ {
+ case 'f':
+ outfile = pg_strdup(optarg);
+ break;
+ case 'h':
+ dbhost = pg_strdup(optarg);
+ break;
+ case 'd':
+ dbname = pg_strdup(optarg);
+ break;
+ case 'p':
+ if (atoi(optarg) <= 0)
+ {
+ fprintf(stderr, _("%s: invalid port number \"%s\"\n"),
+ progname, optarg);
+ exit(1);
+ }
+ dbport = pg_strdup(optarg);
+ break;
+ case 'U':
+ dbuser = pg_strdup(optarg);
+ break;
+ case 'w':
+ dbgetpassword = -1;
+ break;
+ case 'W':
+ dbgetpassword = 1;
+ break;
+ case 's':
+ standby_message_timeout = atoi(optarg) * 1000;
+ if (standby_message_timeout < 0)
+ {
+ fprintf(stderr, _("%s: invalid status interval \"%s\"\n"),
+ progname, optarg);
+ exit(1);
+ }
+ break;
+ case 'n':
+ noloop = 1;
+ break;
+ case 'v':
+ verbose++;
+ break;
+ default:
+
+ /*
+ * getopt_long already emitted a complaint
+ */
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+ progname);
+ exit(1);
+ }
+ }
+
+ /*
+ * Any non-option arguments?
+ */
+ if (optind < argc)
+ {
+ fprintf(stderr,
+ _("%s: too many command-line arguments (first is \"%s\")\n"),
+ progname, argv[optind]);
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+ progname);
+ exit(1);
+ }
+
+ /*
+ * Required arguments
+ */
+ if (outfile == NULL)
+ {
+ fprintf(stderr, _("%s: no target file specified\n"), progname);
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+ progname);
+ exit(1);
+ }
+
+ if (dbname == NULL)
+ {
+ fprintf(stderr, _("%s: no database specified\n"), progname);
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+ progname);
+ exit(1);
+ }
+
+#ifndef WIN32
+ pqsignal(SIGINT, sigint_handler);
+#endif
+
+ while (true)
+ {
+ StreamLog();
+ if (time_to_abort)
+ {
+ /*
+ * We've been Ctrl-C'ed. That's not an error, so exit without an
+ * errorcode.
+ */
+ exit(0);
+ }
+ else if (noloop)
+ {
+ fprintf(stderr, _("%s: disconnected.\n"), progname);
+ exit(1);
+ }
+ else
+ {
+ fprintf(stderr,
+ /* translator: check source for value for %d */
+ _("%s: disconnected. Waiting %d seconds to try again.\n"),
+ progname, RECONNECT_SLEEP_TIME);
+ pg_usleep(RECONNECT_SLEEP_TIME * 1000000);
+ }
+ }
+}
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 22f4128..fecdd52 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -27,6 +27,7 @@ const char *progname;
char *dbhost = NULL;
char *dbuser = NULL;
char *dbport = NULL;
+char *dbname = NULL;
int dbgetpassword = 0; /* 0=auto, -1=never, 1=always */
static char *dbpassword = NULL;
PGconn *conn = NULL;
@@ -96,7 +97,7 @@ GetConnection(void)
values = pg_malloc0((argcount + 1) * sizeof(*values));
keywords[0] = "dbname";
- values[0] = "replication";
+ values[0] = dbname == NULL ? "replication" : dbname;
keywords[1] = "replication";
values[1] = "true";
keywords[2] = "fallback_application_name";
diff --git a/src/bin/pg_basebackup/streamutil.h b/src/bin/pg_basebackup/streamutil.h
index fdf3641..555877e 100644
--- a/src/bin/pg_basebackup/streamutil.h
+++ b/src/bin/pg_basebackup/streamutil.h
@@ -4,6 +4,7 @@ extern const char *progname;
extern char *dbhost;
extern char *dbuser;
extern char *dbport;
+extern char *dbname;
extern int dbgetpassword;
/* Connection kept global so we can disconnect easily */
---
src/backend/replication/logical/DESIGN.txt | 603 +++++++++++++++++++++
src/backend/replication/logical/Makefile | 6 +
.../replication/logical/README.SNAPBUILD.txt | 298 ++++++++++
3 files changed, 907 insertions(+)
create mode 100644 src/backend/replication/logical/DESIGN.txt
create mode 100644 src/backend/replication/logical/README.SNAPBUILD.txt
diff --git a/src/backend/replication/logical/DESIGN.txt b/src/backend/replication/logical/DESIGN.txt
new file mode 100644
index 0000000..8383602
--- /dev/null
+++ b/src/backend/replication/logical/DESIGN.txt
@@ -0,0 +1,603 @@
+//-*- mode: adoc -*-
+= High Level Design for Logical Replication in Postgres =
+:copyright: PostgreSQL Global Development Group 2012
+:author: Andres Freund, 2ndQuadrant Ltd.
+:email: andres@2ndQuadrant.com
+
+== Introduction ==
+
+This document aims to first explain why we think postgres needs another
+replication solution and what that solution needs to offer in our opinion. Then
+it sketches out our proposed implementation.
+
+In contrast to an earlier version of the design document which talked about the
+implementation of four parts of replication solutions:
+
+1. Source data generation
+1. Transportation of that data
+1. Applying the changes
+1. Conflict resolution
+
+this version only talks about the first part in detail, as it is an
+independent and complex part usable for a wide range of use cases, and the one
+we want to get included into postgres in a first step.
+
+=== Previous discussions ===
+
+There are two rather large threads discussing several parts of the initial
+prototype and proposed architecture:
+
+- http://archives.postgresql.org/message-id/201206131327.24092.andres@2ndquadrant.com[Logical Replication/BDR prototype and architecture]
+- http://archives.postgresql.org/message-id/201206211341.25322.andres@2ndquadrant.com[Catalog/Metadata consistency during changeset extraction from WAL]
+
+Those discussions led to some fundamental design changes which are presented in this document.
+
+=== Changes from v1 ===
+* At least a partial decoding step required/possible on the source system
+* No intermediate ("schema only") instances required
+* DDL handling, without event triggers
+* A very simple text conversion is provided for debugging/demo purposes
+* Smaller scope
+
+== Existing approaches to replication in Postgres ==
+
+If any currently used approach to replication can be made to support every
+use-case/feature we need, it likely is not a good idea to implement something
+different. Currently three basic approaches are in use in/around postgres
+today:
+
+. Trigger based
+. Recovery based/Physical footnote:[Often referred to by terms like Hot Standby, Streaming Replication, Point In Time Recovery]
+. Statement based
+
+Statement based replication has obvious and known problems with consistency and
+correctness, making it hard to use in the general case, so we will not discuss
+it further here.
+
+Let's have a look at the advantages/disadvantages of the other approaches:
+
+=== Trigger based Replication ===
+
+This variant has a multitude of significant advantages:
+
+* implementable in userspace
+* easy to customize
+* just about everything can be made configurable
+* cross version support
+* cross architecture support
+* can feed into systems other than postgres
+* no overhead from writes to non-replicated tables
+* writable standbys
+* mature solutions
+* multimaster implementations possible & existing
+
+But also a number of disadvantages, some of them very hard to solve:
+
+* essentially doubles the amount of writes (or even more!)
+* synchronous replication hard or impossible to implement
+* noticeable CPU overhead
+** trigger functions
+** text conversion of data
+* complex parts implemented in several solutions
+* not in core
+
+Especially the higher amount of writes might seem easy to solve at first
+glance, but a solution not using a normal transactional table for its log/queue
+has to solve a lot of problems. The major ones are:
+
+* crash safety, restartability & spilling to disk
+* consistency with the commit status of transactions
+* only a minimal amount of synchronous work should be done inside individual
+transactions
+
+In our opinion those problems restrict the progress/wider distribution of this
+class of solutions. It is our aim though that existing solutions in this
+space - most prominently slony and londiste - can benefit from the work we are
+doing & planning to do by incorporating at least parts of the changeset
+generation infrastructure.
+
+=== Recovery based Replication ===
+
+This type of solution, being built into postgres and of increasing popularity,
+has and will have its use cases and we do not aim to replace but to complement
+it. We plan to reuse some of the infrastructure and to make it possible to mix
+both modes of replication.
+
+Advantages:
+
+* builtin
+* built on existing infrastructure from crash recovery
+* efficient
+** minimal CPU, memory overhead on primary
+** low amount of additional writes
+* synchronous operation mode
+* low maintenance once set up
+* handles DDL
+
+Disadvantages:
+
+* standbys are read only
+* no cross version support
+* no cross architecture support
+* no replication into foreign systems
+* hard to customize
+* not configurable on the level of database, tables, ...
+
+== Goals ==
+
+As seen in the previous short survey of the two major interesting classes of
+replication solutions, there is a significant gap between them. Our aim is to
+make it smaller.
+
+We aim for:
+
+* in core
+* low CPU overhead
+* low storage overhead
+* asynchronous, optionally synchronous operation modes
+* robust
+* modular
+* basis for other technologies (sharding, replication into other DBMS's, ...)
+* basis for at least one multi-master solution
+* make the implementation as unintrusive as possible, but not more
+
+== New Architecture ==
+
+=== Overview ===
+
+Our proposal is to reuse the basic principle of WAL based replication, namely
+reusing data that already needs to be written for another purpose, and extend
+it to allow most, but not all, the flexibility of trigger based solutions.
+We want to do that by decoding the WAL back into a non-physical form.
+
+To get the flexibility we and others want we propose that the last step of
+changeset generation, transforming it into a format that can be used by the
+replication consumer, is done in an extensible manner. In the schema the part
+that does that is described as 'Output Plugin'. To keep the amount of
+duplication between different plugins as low as possible the plugin should only
+do a very limited amount of work.
+
+The following paragraphs contain reasoning for the individual design decisions
+made and their high-level design.
+
+=== Schematics ===
+
+The basic proposed architecture for changeset extraction is presented in the
+following diagram. The first part should look familiar to anyone knowing
+postgres' architecture. The second is where most of the new magic happens.
+
+[[basic-schema]]
+.Architecture Schema
+["ditaa"]
+------------------------------------------------------------------------------
+ Traditional Stuff
+
+ +---------+---------+---------+---------+----+
+ | Backend | Backend | Backend | Autovac | ...|
+ +----+----+---+-----+----+----+----+----+-+--+
+ | | | | |
+ +------+ | +--------+ | |
+ +-+ | | | +----------------+ |
+ | | | | | |
+ | v v v v |
+ | +------------+ |
+ | | WAL writer |<------------------+
+ | +------------+
+ | | | | | |
+ v v v v v v +-------------------+
++--------+ +---------+ +->| Startup/Recovery |
+|{s} | |{s} | | +-------------------+
+|Catalog | | WAL |---+->| SR/Hot Standby |
+| | | | | +-------------------+
++--------+ +---------+ +->| Point in Time |
+ ^ | +-------------------+
+ ---|----------|--------------------------------
+ | New Stuff
++---+ |
+| v Running separately
+| +----------------+ +=-------------------------+
+| | Walsender | | | |
+| | v | | +-------------------+ |
+| +-------------+ | | +->| Logical Rep. | |
+| | WAL | | | | +-------------------+ |
++-| decoding | | | +->| Multimaster | |
+| +------+------/ | | | +-------------------+ |
+| | | | | +->| Slony | |
+| | v | | | +-------------------+ |
+| +-------------+ | | +->| Auditing | |
+| | TX | | | | +-------------------+ |
++-| reassembly | | | +->| Mysql/... | |
+| +-------------/ | | | +-------------------+ |
+| | | | | +->| Custom Solutions | |
+| | v | | | +-------------------+ |
+| +-------------+ | | +->| Debugging | |
+| | Output | | | | +-------------------+ |
++-| Plugin |--|--|-+->| Data Recovery | |
+ +-------------/ | | +-------------------+ |
+ | | | |
+ +----------------+ +--------------------------+
+------------------------------------------------------------------------------
+
+=== WAL enrichment ===
+
+To be able to decode individual WAL records, at the very minimum they need to
+contain enough information to reconstruct what has happened to which row. The
+action is already encoded in the WAL record's header in most cases.
+
+As an example of missing data, the WAL record emitted when a row gets deleted
+only contains its physical location. At the very least we need a way to
+identify the deleted row: in a relational database the minimal amount of data
+that does that should be the primary key footnote:[Yes, there are use cases
+where the whole row is needed, or where no primary key can be found].
+
+We propose that for now it is enough to extend the relevant WAL record with
+additional data when the newly introduced 'wal_level = logical' is set.
+
+Previously it has been argued on the hackers mailing list that a generic 'WAL
+record annotation' mechanism might be a good thing. That mechanism would allow
+attaching arbitrary data to individual wal records, making it easier to extend
+postgres to support something like what we propose. While we don't oppose that
+idea we think it is a largely orthogonal issue to this proposal as a whole,
+because the format of WAL records is version dependent by nature and the
+necessary changes for our simple approach are small, so not much effort is lost.
+
+A full annotation capability is a complex endeavour on its own, as the parts of
+the code generating the relevant WAL records have somewhat complex requirements
+and cannot easily be configured from the outside.
+
+Currently this is contained in the http://archives.postgresql.org/message-id/1347669575-14371-6-git-send-email-andres@2ndquadrant.com[Log enough data into the wal to reconstruct logical changes from it] patch.
+
+=== WAL parsing & decoding ===
+
+The main complexity when reading the WAL as stored on disk is that the format
+is somewhat complex and the existing parser is too deeply integrated in the
+recovery system to be directly reusable. Once a reusable parser exists decoding
+the binary data into individual WAL records is a small problem.
+
+Currently two competing proposals for this module exist, each having its own
+merits. In the grand scheme of this proposal it is irrelevant which one gets
+picked as long as the functionality gets integrated.
+
+The mailing list post
+http://archives.postgresql.org/message-id/1347669575-14371-3-git-send-email-andres@2ndquadrant.com[Add
+support for a generic wal reading facility dubbed XLogReader] contains both
+competing patches and discussion around which one is preferable.
+
+Once the WAL has been decoded into individual records two major issues exist:
+
+1. records from different transactions and even individual user level actions
+are intermingled
+1. the data attached to records cannot be interpreted on its own, it is only
+meaningful with a lot of required information (including table, columns, types
+and more)
+
+The solution to the first issue is described in the next section: <<tx-reassembly>>
+
+The second problem is probably the reason why no mature solution to reuse the
+WAL for logical changeset generation exists today. See the <<snapbuilder>>
+paragraph for some details.
+
+As decoding, Transaction reassembly and Snapshot building are interdependent
+they currently are implemented in the same patch:
+http://archives.postgresql.org/message-id/1347669575-14371-8-git-send-email-andres@2ndquadrant.com[Introduce
+wal decoding via catalog timetravel]
+
+That patch also includes a small demonstration that the approach works in the
+presence of DDL:
+
+[[example-of-decoding]]
+.Decoding example
+[NOTE]
+---------------------------
+/* just so we keep a sensible xmin horizon */
+ROLLBACK PREPARED 'f';
+BEGIN;
+CREATE TABLE keepalive();
+PREPARE TRANSACTION 'f';
+
+DROP TABLE IF EXISTS replication_example;
+
+SELECT pg_current_xlog_insert_location();
+CHECKPOINT;
+CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text
+varchar(120));
+begin;
+INSERT INTO replication_example(somedata, text) VALUES (1, 1);
+INSERT INTO replication_example(somedata, text) VALUES (1, 2);
+commit;
+
+
+ALTER TABLE replication_example ADD COLUMN bar int;
+
+INSERT INTO replication_example(somedata, text, bar) VALUES (2, 1, 4);
+
+BEGIN;
+INSERT INTO replication_example(somedata, text, bar) VALUES (2, 2, 4);
+INSERT INTO replication_example(somedata, text, bar) VALUES (2, 3, 4);
+INSERT INTO replication_example(somedata, text, bar) VALUES (2, 4, NULL);
+COMMIT;
+
+/* slightly more complex schema change, still no table rewrite */
+ALTER TABLE replication_example DROP COLUMN bar;
+INSERT INTO replication_example(somedata, text) VALUES (3, 1);
+
+BEGIN;
+INSERT INTO replication_example(somedata, text) VALUES (3, 2);
+INSERT INTO replication_example(somedata, text) VALUES (3, 3);
+commit;
+
+ALTER TABLE replication_example RENAME COLUMN text TO somenum;
+
+INSERT INTO replication_example(somedata, somenum) VALUES (4, 1);
+
+/* complex schema change, changing types of existing column, rewriting the table */
+ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING
+(somenum::int4);
+
+INSERT INTO replication_example(somedata, somenum) VALUES (5, 1);
+
+SELECT pg_current_xlog_insert_location();
+
+/* now decode what has been written to the WAL during that time */
+
+SELECT decode_xlog('0/1893D78', '0/18BE398');
+
+WARNING: BEGIN
+WARNING: COMMIT
+WARNING: BEGIN
+WARNING: tuple is: id[int4]:1 somedata[int4]:1 text[varchar]:1
+WARNING: tuple is: id[int4]:2 somedata[int4]:1 text[varchar]:2
+WARNING: COMMIT
+WARNING: BEGIN
+WARNING: COMMIT
+WARNING: BEGIN
+WARNING: tuple is: id[int4]:3 somedata[int4]:2 text[varchar]:1 bar[int4]:4
+WARNING: COMMIT
+WARNING: BEGIN
+WARNING: tuple is: id[int4]:4 somedata[int4]:2 text[varchar]:2 bar[int4]:4
+WARNING: tuple is: id[int4]:5 somedata[int4]:2 text[varchar]:3 bar[int4]:4
+WARNING: tuple is: id[int4]:6 somedata[int4]:2 text[varchar]:4 bar[int4]:
+(null)
+WARNING: COMMIT
+WARNING: BEGIN
+WARNING: COMMIT
+WARNING: BEGIN
+WARNING: tuple is: id[int4]:7 somedata[int4]:3 text[varchar]:1
+WARNING: COMMIT
+WARNING: BEGIN
+WARNING: tuple is: id[int4]:8 somedata[int4]:3 text[varchar]:2
+WARNING: tuple is: id[int4]:9 somedata[int4]:3 text[varchar]:3
+WARNING: COMMIT
+WARNING: BEGIN
+WARNING: COMMIT
+WARNING: BEGIN
+WARNING: tuple is: id[int4]:10 somedata[int4]:4 somenum[varchar]:1
+WARNING: COMMIT
+WARNING: BEGIN
+WARNING: COMMIT
+WARNING: BEGIN
+WARNING: tuple is: id[int4]:11 somedata[int4]:5 somenum[int4]:1
+WARNING: COMMIT
+
+---------------------------
+
+[[tx-reassembly]]
+=== TX reassembly ===
+
+In order to make usage of the decoded stream easy we want to present the user
+level code with a correctly ordered image of individual transactions at once
+because otherwise every user will have to reassemble transactions themselves.
+
+Transaction reassembly needs to solve several problems:
+
+1. changes inside a transaction can be interspersed with other transactions
+1. a top level transaction only knows which subtransactions belong to it when
+it reads the commit record
+1. individual user level actions can be smeared over multiple records (TOAST)
+
+Our proposed module solves 1) and 2) by building individual streams of records
+split by xid. While not fully implemented yet we plan to spill those individual
+xid streams to disk after a certain amount of memory is used. This can be
+implemented without any change in the external interface.
+
+As all the individual streams are already sorted by LSN by definition - we
+build them from the wal in a FIFO manner, and the position in the WAL is the
+definition of the LSN footnote:[the LSN is just the byte position in the WAL
+stream] - the individual changes can be merged efficiently by a k-way merge
+(without sorting!) by keeping the individual streams in a binary heap.
+
+To manipulate the binary heap a generic implementation is proposed. Several
+independent implementations of binary heaps already exist in the postgres code,
+but none of them is generic. The patch is available at
+http://archives.postgresql.org/message-id/1347669575-14371-2-git-send-email-andres@2ndquadrant.com[Add
+minimal binary heap implementation].
+
+[NOTE]
+============
+The reassembly component was previously coined ApplyCache because it was
+proposed to run on replication consumers just before applying changes. This is
+not the case anymore.
+
+It is still called that way in the source of the patch recently submitted.
+============
+
+[[snapbuilder]]
+=== Snapshot building ===
+
+To decode the contents of wal records describing data changes we need to decode
+and transform their contents. A single tuple is stored in a data structure
+called HeapTuple. As stored on disk that structure doesn't contain any
+information about the format of its contents.
+
+The basic problem is twofold:
+
+1. The wal records only contain the relfilenode not the relation oid of a table
+11. The relfilenode changes when an action performing a full table rewrite is performed
+1. To interpret a HeapTuple correctly the exact schema definition from back
+when the wal record was inserted into the wal stream needs to be available
+
+We chose to implement timetraveling access to the system catalog using
+postgres' MVCC nature & implementation because of the following advantages:
+
+* low amount of additional data in wal
+* genericity
+* similarity of implementation to Hot Standby, quite a bit of the infrastructure is reusable
+* all kinds of DDL can be handled in reliable manner
+* extensibility to user defined catalog like tables
+
+Timetravel access to the catalog means that we are able to look at the catalog
+just as it looked when changes were generated. That allows us to get the
+correct information about the contents of the aforementioned HeapTuple's so we
+can decode them reliably.
+
+Other solutions we thought about that fell through:
+
+* catalog-only proxy instances that apply schema changes exactly to the point
+  we're decoding, using ``old fashioned'' wal replay
+* do the decoding on a 2nd machine, replicating all DDL exactly, rely on the catalog there
+* do not allow DDL at all
+* always add enough data into the WAL to allow decoding
+* build a fully versioned catalog
+
+The email thread available under
+http://archives.postgresql.org/message-id/201206211341.25322.andres@2ndquadrant.com[Catalog/Metadata
+consistency during changeset extraction from WAL] contains some details,
+advantages and disadvantages about the different possible implementations.
+
+How we build snapshots is somewhat intricate and complicated and seems to be
+out of scope for this document. We will provide a second document discussing
+the implementation in detail. Let's just assume it is possible from here on.
+
+[NOTE]
+Some details are already available in comments inside 'src/backend/replication/logical/snapbuild.{c,h}'.
+
+=== Output Plugin ===
+
+As already mentioned previously our aim is to make the implementation of output
+plugins as simple and non-redundant as possible as we expect several different
+ones with different use cases to emerge quickly. See <<basic-schema>> for a
+list of possible output plugins that we think might emerge.
+
+Although for now we only plan to tackle logical replication, and based on that
+a multi-master implementation in the near future, we definitely aim to provide
+all use-cases with something easily usable!
+
+To decode and translate local transactions an output plugin needs to be able to
+transform transactions as a whole so they can be applied as meaningful
+transactions on the other side.
+
+To provide that, every time we find a transaction commit, and thus have
+completed reassembling the transaction, we start to pass the individual
+changes to the output plugin. It currently only has to fill out 3
+callbacks:
+[options="header"]
+|=====================================================================================================================================
+|Callback |Passed Parameters |Called per TX | Use
+|begin |xid |once |Begin of a reassembled transaction
+|change |xid, subxid, change, mvcc snapshot |every change |Gets passed every change so it can transform it to the target format
+|commit |xid |once |End of a reassembled transaction
+|=====================================================================================================================================
+
+During each of those callbacks an appropriate timetraveling SnapshotNow
+snapshot is set up so the callbacks can perform all read-only catalog accesses
+they need, including using the sys/rel/catcache. For obvious reasons only read
+access is allowed.
+
+The snapshot guarantees that the results of lookups are the same as they
+were/would have been when the change was originally created.
+
+Additionally they get passed an MVCC snapshot, e.g. to run SQL queries on
+catalogs or similar.
+
+[IMPORTANT]
+============
+At the moment none of these snapshots can be used to access normal user
+tables. Adding additional tables to the allowed set is easy
+implementation-wise, but every transaction changing such tables incurs a
+noticeably higher overhead.
+============
+
+For now transactions won't be decoded/output in parallel. There are ideas to
+improve on this, but we don't think the complexity is appropriate for the first
+release of this feature.
+
+This is an adoption barrier for databases where large amounts of data get
+loaded/written in one transaction.
+
+=== Setup of replication nodes ===
+
+When setting up a new standby/consumer of a primary, some problems exist
+independently of the implementation of the consumer. The gist is that when
+making a base backup and starting to stream all changes since that point,
+transactions that were running during all this cannot be included:
+
+* Transactions that have not committed before starting to dump a database are
+  invisible to the dumping process
+
+* Transactions that began before the point from which the WAL is being
+  decoded are incomplete and cannot be replayed
+
+Our proposal for a solution to this is to detect points in the WAL stream where we can provide:
+
+. A snapshot exported similarly to pg_export_snapshot() footnote:[http://www.postgresql.org/docs/devel/static/functions-admin.html#FUNCTIONS-SNAPSHOT-SYNCHRONIZATION] that can be imported with +SET TRANSACTION SNAPSHOT+ footnote:[http://www.postgresql.org/docs/devel/static/sql-set-transaction.html]
+. A stream of changes that will include the complete data of all transactions seen as running by the snapshot generated in 1)
+
+See the diagram.
+
+[[setup-schema]]
+.Control flow during setup of a new node
+["ditaa",scaling="0.7"]
+------------------------------------------------------------------------------
++----------------+
+| Walsender | | +------------+
+| v | | Consumer |
++-------------+ |<--IDENTIFY_SYSTEM-------------| |
+| WAL | | | |
+| decoding | |----....---------------------->| |
++------+------/ | | |
+| | | | |
+| v | | |
++-------------+ |<--INIT_LOGICAL $PLUGIN--------| |
+| TX | | | |
+| reassembly | |---FOUND_STARTING %X/%X------->| |
++-------------/ | | |
+| | |---FOUND_CONSISTENT %X/%X----->| |
+| v |---pg_dump snapshot----------->| |
++-------------+ |---replication slot %P-------->| |
+| Output | | | |
+| Plugin | | ^ | |
++-------------/ | | | |
+| | +-run pg_dump separately --| |
+| | | |
+| |<--STREAM_DATA-----------------| |
+| | | |
+| |---data ---------------------->| |
+| | | |
+| | | |
+| | ---- SHUTDOWN ------------- | |
+| | | |
+| | | |
+| |<--RESTART_LOGICAL $PLUGIN %P--| |
+| | | |
+| |---data----------------------->| |
+| | | |
+| | | |
++----------------+ +------------+
+
+------------------------------------------------------------------------------
+
+=== Disadvantages of the approach ===
+
+* somewhat intricate code for snapshot timetravel
+* output plugins/walsenders need to work per database as they access the catalog
+* when sending to multiple standbys some work is done multiple times
+* decoding/applying multiple transactions in parallel is hard
+
+=== Unfinished/Undecided issues ===
+
+* declaration of user ``catalog'' tables (e.g. userspace enums)
+* finishing different parts of the implementation
+ * spill to disk during transaction reassembly
+ * mixed catalog/data transactions
+ * snapshot refcounting
+ * snapshot exporting
+ * snapshot serialization
diff --git a/src/backend/replication/logical/Makefile b/src/backend/replication/logical/Makefile
index cf040ef..92b4508 100644
--- a/src/backend/replication/logical/Makefile
+++ b/src/backend/replication/logical/Makefile
@@ -17,3 +17,9 @@ override CPPFLAGS := -I$(srcdir) $(CPPFLAGS)
OBJS = decode.o logicalfuncs.o reorderbuffer.o snapbuild.o
include $(top_srcdir)/src/backend/common.mk
+
+DESIGN.pdf: DESIGN.txt
+ a2x -v --fop -f pdf -D $(shell pwd) $<
+
+README.SNAPBUILD.pdf: README.SNAPBUILD.txt
+ a2x -v --fop -f pdf -D $(shell pwd) $<
diff --git a/src/backend/replication/logical/README.SNAPBUILD.txt b/src/backend/replication/logical/README.SNAPBUILD.txt
new file mode 100644
index 0000000..70e142f
--- /dev/null
+++ b/src/backend/replication/logical/README.SNAPBUILD.txt
@@ -0,0 +1,298 @@
+= Snapshot Building =
+:author: Andres Freund, 2ndQuadrant Ltd
+
+== Why do we need timetravel catalog access ==
+
+When doing WAL decoding (see DESIGN.txt for reasons to do so) we need to know
+how the catalog looked at the point a record was inserted into the WAL,
+because without that information we don't know much more about the record than
+its length. It's just an arbitrary bunch of bytes without further information.
+Unfortunately, due to the possibility of the table definition changing, we
+cannot just access a newer version of the catalog and assume the table
+definition is the same.
+
+If only the type information were required it might be enough to annotate the
+wal records with a bit more information (table oid, table name, column name,
+column type) but as we want to be able to convert the output to more useful
+formats like text we need to be able to call output functions. Those need a
+normal environment including the usual caches and normal catalog access to
+lookup operators, functions and other types.
+
+Our solution to this is to add the capability to access the catalog in a way
+that makes it look like it did when the record was inserted into the WAL. The
+locking used during WAL generation guarantees the catalog is/was in a consistent
+state at that point.
+
+Interesting cases include:
+- enums
+- composite types
+- extension types
+- non-C functions
+- relfilenode to table oid mapping
+
+Due to postgres' MVCC nature, regular modifications of a table's contents are
+theoretically non-destructive. The problem is that there is no way to access
+arbitrary points in time even if the data for them is there.
+
+This module adds the capability to do so in the very limited set of
+circumstances we need it in for wal decoding. It does *not* provide a facility
+to do so in general.
+
+A 'Snapshot' is the datastructure used in postgres to describe which tuples are
+visible and which are not. We need to build a Snapshot which can be used to
+access the catalog the way it looked when the wal record was inserted.
+
+Restrictions:
+* Only works for catalog tables
+* Snapshot modifications are somewhat expensive
+* it cannot build initial visibility information for every point in time, it
+ needs a specific set of circumstances for that
+* limited window in which we can build snapshots
+
+== How do we build timetravel snapshots ==
+
+Hot Standby added infrastructure to build snapshots from WAL during recovery in
+the 9.0 release. Most of that can be reused for our purposes.
+
+We cannot reuse all of the HS infrastructure because:
+* we are not in recovery
+* we need to look *inside* transactions
+* we need the capability to have multiple different snapshots around at the same time
+
+We need to provide two kinds of snapshots that are implemented rather
+differently in their plain postgres incarnation:
+* SnapshotNow
+* SnapshotMVCC
+
+We need both because if any operators use normal functions they will get
+executed with SnapshotMVCC semantics while the catcache and related things will
+rely on SnapshotNow semantics. Note that SnapshotNow here cannot be a normal
+SnapshotNow because we wouldn't access the old version of the catalog in that
+case. Instead something like an MVCC snapshot with the correct visibility
+information. That also means that snapshot won't have some race issues normal
+SnapshotNow has.
+
+Every time a transaction that changed the catalog commits, all other
+transactions will need a new snapshot that marks that transaction (and its
+subtransactions) as visible.
+
+Our snapshot representation is a bit different from normal snapshots, but we
+still reuse the normal SnapshotData struct:
+* Snapshot->xip contains all transactions we consider committed
+* Snapshot->subxip contains all transactions belonging to our transaction,
+ including the toplevel one
+
+The meaning of ->xip is inverted in comparison with non-timetravel snapshots
+because usually only a tiny percentage of committed transactions will have
+modified the catalog between xmin and xmax. It also makes subtransaction
+handling easier (we cannot query pg_subtrans).
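A sketch of a visibility check under this inverted ->xip convention follows. The struct layout and names are illustrative, not the real SnapshotData:

```c
/* Inverted ->xip semantics: the snapshot tracks the (few) catalog-modifying
 * transactions that committed inside the (xmin, xmax) window, so within the
 * window an xid is treated as committed iff it appears in the xip array. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct DecodingSnapshot
{
	TransactionId xmin;			/* everything below committed & visible */
	TransactionId xmax;			/* everything at/above is invisible */
	TransactionId *xip;			/* catalog-modifying xids that committed */
	int			xcnt;
} DecodingSnapshot;

static bool
xid_visible(const DecodingSnapshot *snap, TransactionId xid)
{
	if (xid < snap->xmin)
		return true;			/* committed before the window */
	if (xid >= snap->xmax)
		return false;			/* not yet decoded */
	for (int i = 0; i < snap->xcnt; i++)
		if (snap->xip[i] == xid)
			return true;		/* explicitly marked committed */
	return false;				/* in window but absent from xip: invisible */
}
```

Because only catalog-modifying transactions need to be listed, the array stays small even across a large xmin/xmax window.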
+
+== Building of initial snapshot ==
+
+We can start building an initial snapshot as soon as we find either an
+XLOG_RUNNING_XACTS or an XLOG_CHECKPOINT_SHUTDOWN record because both allow us
+to know how many transactions are running.
+
+We need to know which transactions were running when we start to build a
+snapshot/start decoding, since we don't have enough information about them:
+they could have done catalog modifications before we started watching. We also
+wouldn't have the complete contents of those transactions, as we started
+reading after they began. The latter is also important for building snapshots
+which can be used to create a consistent initial clone.
+
+There also is the problem that XLOG_RUNNING_XACTS records can be
+'suboverflowed', which means there were more running subtransactions than
+would fit into shared memory. In that case we use the same incremental
+building trick HS uses, which is either
+1) wait until further XLOG_RUNNING_XACTS records have a running->oldestRunningXid
+after the initial xl_running_xacts->nextXid, or
+2) wait for a further XLOG_RUNNING_XACTS record that is not overflowed or
+an XLOG_CHECKPOINT_SHUTDOWN
+
+XXX: we probably don't need to care about ->suboverflowed at all as we only
+need to know about committed XIDs and we get enough information about
+subtransactions at commit. More thinking needed.
+
+When we start building a snapshot we are in the 'SNAPBUILD_START' state. As
+soon as we find any visibility information, even if incomplete, we change to
+SNAPBUILD_INITIAL_POINT.
+
+When we have collected enough information to decode any transaction starting
+after that point in time we switch over to SNAPBUILD_FULL_SNAPSHOT. If those
+transactions commit before the next state is reached we throw their complete
+content away.
+
+When all transactions that were running when we switched over to FULL_SNAPSHOT
+have committed, we change into the 'SNAPBUILD_CONSISTENT' state. Every transaction
+that commits from now on gets handed to the output plugin.
+When doing the switch to CONSISTENT we optionally export a snapshot which makes
+all transactions visible that committed up to this point. That exported
+snapshot allows the user to run pg_dump on it and replay all changes received
+on a restored dump to get a consistent new clone.
+
+["ditaa",scaling="0.8"]
+---------------
+
++-------------------------+
+|SNAPBUILD_START |-----------------------+
+| |-----------+ |
++-------------------------+ | |
+ | | |
+ XLOG_RUNNING_XACTS suboverflowed | saved snapshot
+ | | |
+ | | |
+ | | |
+ v | |
++-------------------------+ v v
+|SNAPBUILD_INITIAL |---------------------->+
+| |---------->+ |
++-------------------------+ | |
+ | | |
+ oldestRunningXid past initialNextXid | |
+ | | |
+ | XLOG_RUNNING_XACTS !suboverflowed |
+ v | |
++-------------------------+ | |
+|SNAPBUILD_FULL_SNAPSHOT |<----------+ v
+| |---------------------->+
++-------------------------+ |
+ | |
+ | XLOG_CHECKPOINT_SHUTDOWN
+ any running txn's finished |
+ | |
+ v |
++-------------------------+ |
+|SNAPBUILD_CONSISTENT |<----------------------+
+| |
++-------------------------+
+
+---------------
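The state progression shown above can also be sketched as a plain transition function. The state names follow the text; the event names are simplified stand-ins for the WAL records and conditions that trigger each transition:

```c
/* Sketch of the snapshot-builder state machine described above. */
#include <assert.h>

typedef enum SnapBuildState
{
	SNAPBUILD_START,
	SNAPBUILD_INITIAL_POINT,
	SNAPBUILD_FULL_SNAPSHOT,
	SNAPBUILD_CONSISTENT
} SnapBuildState;

typedef enum SnapBuildEvent
{
	EVT_RUNNING_XACTS,			/* XLOG_RUNNING_XACTS, possibly suboverflowed */
	EVT_RUNNING_XACTS_CLEAN,	/* XLOG_RUNNING_XACTS, !suboverflowed */
	EVT_ALL_RUNNING_FINISHED,	/* txns running at FULL_SNAPSHOT committed */
	EVT_CHECKPOINT_SHUTDOWN		/* XLOG_CHECKPOINT_SHUTDOWN */
} SnapBuildEvent;

static SnapBuildState
snapbuild_step(SnapBuildState s, SnapBuildEvent e)
{
	switch (s)
	{
		case SNAPBUILD_START:
			if (e == EVT_RUNNING_XACTS || e == EVT_RUNNING_XACTS_CLEAN)
				return SNAPBUILD_INITIAL_POINT;		/* first visibility info */
			if (e == EVT_CHECKPOINT_SHUTDOWN)
				return SNAPBUILD_CONSISTENT;		/* nothing was running */
			break;
		case SNAPBUILD_INITIAL_POINT:
			if (e == EVT_RUNNING_XACTS_CLEAN)
				return SNAPBUILD_FULL_SNAPSHOT;
			if (e == EVT_CHECKPOINT_SHUTDOWN)
				return SNAPBUILD_CONSISTENT;
			break;
		case SNAPBUILD_FULL_SNAPSHOT:
			if (e == EVT_ALL_RUNNING_FINISHED || e == EVT_CHECKPOINT_SHUTDOWN)
				return SNAPBUILD_CONSISTENT;
			break;
		case SNAPBUILD_CONSISTENT:
			break;				/* terminal for our purposes */
	}
	return s;					/* no transition */
}
```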
+
+== Snapshot Management ==
+
+Whenever a transaction is detected as having started during decoding after
+SNAPBUILD_FULL_SNAPSHOT is reached, we distribute the currently maintained
+snapshot to it (i.e. call ApplyCacheAddBaseSnapshot). This serves as its
+initial SnapshotNow and SnapshotMVCC. Unless there are concurrent catalog
+changes, that snapshot won't ever change.
+
+Whenever a transaction that had catalog changes commits, we iterate over all
+concurrently active transactions and add a new SnapshotNow to each of them
+(ApplyCacheAddBaseSnapshot(current_lsn)). This is required because any row
+written from that point on will have used the changed catalog contents. This
+can occur even with correct locking.
+
+SnapshotNow snapshots need to be set up globally so the syscache and other
+pieces access them transparently. This is done using two new tqual.h
+functions: SetupDecodingSnapshots() and RevertFromDecodingSnapshots().
+
+== Catalog/User Table Detection ==
+
+To detect whether a record/transaction does catalog modifications - which we
+need to do for memory/performance reasons - we need to resolve the
+RelFileNodes in xlog records back to the original tables. Unfortunately
+RelFileNodes only contain the table's relfilenode, not its table oid. We can
+only do catalog access once we have reached FULL_SNAPSHOT; before that we can
+use some heuristics, but otherwise we have to assume that every record changes
+the catalog.
+
+The heuristics we can use are:
+* relfilenode->spcNode == GLOBALTABLESPACE_OID
+* relfilenode->relNode <= FirstNormalObjectId
+* RelationMapFilenodeToOid(relfilenode->relNode, false) != InvalidOid
+
+Those detect some catalog tables but not all (think VACUUM FULL), but if they
+detect one they are correct.
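A self-contained sketch of these heuristics follows. The constants carry PostgreSQL's usual values but are redefined locally to keep the sketch standalone, and the relmapper lookup is stubbed out:

```c
/* Sketch of the catalog-detection heuristics. A true result means the
 * relation is certainly a catalog table; false means "unknown", requiring
 * a RELFILENODE syscache lookup (or, before FULL_SNAPSHOT, the pessimistic
 * assumption that the record changes the catalog). */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Oid;

/* values as in PostgreSQL, redefined here for self-containment */
#define GLOBALTABLESPACE_OID 1664
#define FirstNormalObjectId  16384
#define InvalidOid           ((Oid) 0)

typedef struct RelFileNode
{
	Oid			spcNode;		/* tablespace oid */
	Oid			dbNode;			/* database oid */
	Oid			relNode;		/* relfilenode */
} RelFileNode;

/* stand-in for RelationMapFilenodeToOid(): pretend nothing is mapped */
static Oid
relmap_filenode_to_oid(Oid relNode)
{
	(void) relNode;
	return InvalidOid;
}

static bool
heuristic_is_catalog(const RelFileNode *node)
{
	return node->spcNode == GLOBALTABLESPACE_OID ||
		node->relNode <= FirstNormalObjectId ||
		relmap_filenode_to_oid(node->relNode) != InvalidOid;
}
```

Note the asymmetry: a catalog table rewritten by VACUUM FULL gets a relfilenode above FirstNormalObjectId and slips past all three checks, which is exactly why a false result only means "unknown".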
+
+After reaching FULL_SNAPSHOT we can do catalog access if our heuristics tell us
+a table might not be a catalog table. For that we use the new RELFILENODE
+syscache with (spcNode, relNode).
+
+XXX: Note that that syscache is a bit problematic because it is not actually
+unique: shared/nailed catalogs store a 0 as relfilenode (they are stored in
+the relmapper). Those are never looked up though, so it might be
+ok. Unfortunately it doesn't seem to be possible to use a partial index (WHERE
+relfilenode != 0) here.
+
+XXX: For some use cases it would be useful to treat some user-specified tables
+as catalog tables
+
+== System Table Rewrite Handling ==
+
+XXX, expand, XXX
+
+NOTES:
+* always using newest relmapper, use newest invalidations
+* old tuples are preserved across rewrites, that's fine
+* REINDEX/CLUSTER pg_class; in a transaction
+
+== mixed DDL/DML transaction handling ==
+
+When a transaction uses DDL and DML in the same transaction, things get a bit
+more complicated because we need to handle CommandIds and ComboCids, as we need
+to use the correct version of the catalog when decoding the individual tuples.
+
+CommandId handling itself is relatively simple, we can figure out the current
+CommandId relatively easily by looking at the currently used one in
+changes. The problematic part is that those CommandIds frequently will not be
+actual cmin or cmax values but ComboCids. Those are used to minimize space in
+the heap. During normal operation cmin/cmax values are only used within the
+backend emitting those rows and only during one toplevel transaction, so
+instead of storing cmin/cmax only a reference to an in-memory value is stored
+that contains both. Whenever we see a new CommandId we call
+ApplyCacheAddNewCommandId.
+
+To resolve this problem during heap_* whenever we generate a new combocid
+(detected via a new parameter to HeapTupleHeaderAdjustCmax) in a catalog table
+we log the new XLOG_HEAP2_NEW_COMBOCID record containing the mapping. During
+decoding this ComboCid is added to the applycache
+(ApplyCacheAddNewComboCid). They are only guaranteed to be visible within a
+single transaction, so we cannot simply set up all of them globally. Before
+calling the output plugin ComboCids are temporarily set up and torn down
+afterwards.
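The per-transaction mapping rebuilt from those XLOG_HEAP2_NEW_COMBOCID records can be sketched as a simple lookup table. The names and the fixed-size array are illustrative only:

```c
/* Sketch of the per-transaction ComboCid mapping rebuilt during decoding:
 * each logged record maps one combo id back to its real (cmin, cmax) pair. */
#include <assert.h>
#include <stdint.h>

typedef uint32_t CommandId;

typedef struct ComboCidEntry
{
	CommandId	combocid;
	CommandId	cmin;
	CommandId	cmax;
} ComboCidEntry;

#define MAX_COMBOCIDS 64

static ComboCidEntry combocids[MAX_COMBOCIDS];
static int	ncombocids = 0;

/* replay of one NEW_COMBOCID record: remember the mapping */
static void
apply_new_combocid(CommandId combocid, CommandId cmin, CommandId cmax)
{
	combocids[ncombocids].combocid = combocid;
	combocids[ncombocids].cmin = cmin;
	combocids[ncombocids].cmax = cmax;
	ncombocids++;
}

/* resolve a combo id seen in a tuple; returns 1 if it was a known combo id */
static int
resolve_combocid(CommandId combocid, CommandId *cmin, CommandId *cmax)
{
	for (int i = 0; i < ncombocids; i++)
	{
		if (combocids[i].combocid == combocid)
		{
			*cmin = combocids[i].cmin;
			*cmax = combocids[i].cmax;
			return 1;
		}
	}
	return 0;					/* not a combo cid: plain cmin or cmax */
}
```

As the text notes, such a mapping is only valid within the single transaction that generated it, so it is set up before calling the output plugin and torn down afterwards.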
+
+All this only needs to happen in the transaction performing the DDL.
+
+== Cache Handling ==
+
+As we allow usage of the normal {sys,cat,rel,..}cache we also need to integrate
+cache invalidation. For transactions without DDL that's easy, as everything is
+already provided by HS. Every time we read a commit record we apply the sinval
+messages contained therein.
+
+For transactions that contain DDL and DML, cache invalidation needs to happen
+more frequently because we need to tear down all caches that just got
+modified. To do that we simply take all invalidation messages that were
+collected at the end of the transaction and apply them after every single
+change. At some point this can get optimized by generating new local
+invalidation messages, but that seems too complicated for now.
+
+XXX: think/talk about syscache invalidation of relmapper/pg_class changes.
+
+== xmin Horizon Handling ==
+
+Reusing MVCC for timetravel access has one obvious major problem:
+VACUUM. Obviously we cannot keep data in the catalog indefinitely. Also
+obviously, we want autovacuum/manual vacuum to work as before.
+
+The idea here is to reuse the infrastructure built for hot_standby_feedback,
+which allows us to keep the xmin horizon of a walsender backend artificially
+low. We keep it low enough so we can restart decoding from the last location
+the client has confirmed to be safely received. That means we keep it low
+enough to contain the last checkpoint's oldestXid value.
+
+That also means we need to make that value persist across restarts/crashes in
+a very similar manner to twophase.c's. That infrastructure is actually also
+useful to make hot_standby_feedback work properly across primary restarts.
+
+== Restartable Decoding ==
+
+As we want to generate a consistent stream of changes we need to have the
+ability to start from a previously decoded location without going through the
+whole multi-phase setup, because that would make it very hard to calculate up
+to where we need to keep information.
+
+To make that easier, every time a decoding process finds an online checkpoint
+record it exclusively takes a global lwlock and checks whether visibility
+information has already been written out for that checkpoint, and does so if
+not. We only need to do that once, as visibility information is the same
+between all decoding backends.
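The "serialize once per checkpoint" logic reduces to a check-under-lock pattern, which can be sketched as follows. The names are illustrative, and the lwlock acquisition is reduced to comments:

```c
/* Sketch of writing visibility information out exactly once per checkpoint:
 * the first decoding backend to see a checkpoint serializes the snapshot,
 * later ones find it already present and skip the work. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

#define MAX_SERIALIZED 32

static XLogRecPtr serialized_at[MAX_SERIALIZED];
static int	nserialized = 0;

static bool
snapshot_already_serialized(XLogRecPtr lsn)
{
	for (int i = 0; i < nserialized; i++)
		if (serialized_at[i] == lsn)
			return true;
	return false;
}

/* returns true iff this call actually wrote the snapshot out */
static bool
serialize_snapshot_once(XLogRecPtr checkpoint_lsn)
{
	/* real code: take the global lwlock exclusively here */
	if (snapshot_already_serialized(checkpoint_lsn))
		return false;			/* another decoder already did the work */
	serialized_at[nserialized++] = checkpoint_lsn;	/* "write it out" */
	return true;
	/* real code: release the lwlock */
}
```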
On 2012-11-15 01:27:46 +0100, Andres Freund wrote:
In response to this you will soon find the 14 patches that currently
implement $subject.
As it's not very wieldy to send around that many/big patches all the
time, until the next "major" version I will just update the git tree at:
Git:
git clone git://git.postgresql.org/git/users/andresfreund/postgres.git xlog-decoding-rebasing-cf3
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
The current logical walsender integration looks like the following:
=# INIT_LOGICAL_REPLICATION 'text';
WARNING: Initiating logical rep
WARNING: reached consistent point, stopping!
replication_id | consistent_point | snapshot_name | plugin
----------------+------------------+---------------+--------
id-2 | 3/CACBDF98 | 0xDEADBEEF | text
(1 row)
=# START_LOGICAL_REPLICATION 'id-2' 3/CACBDF98;
...
So the current protocol is:
INIT_LOGICAL_REPLICATION '$plugin';
returns
* slot
* first consistent point
* snapshot id
START_LOGICAL_REPLICATION '$slot' $last_received_lsn;
streams changes, each wrapped in a 'w' message with (start, end) set to
the same value. The content of the data is completely free-format and
only depends on the output plugin.
Feedback is provided from the client via the normal 'r' messages.
I think that's not a bad start, but we probably can improve it a bit:
INIT_LOGICAL_REPLICATION '$slot' '$plugin' ($value = $key, ...);
START_LOGICAL_REPLICATION '$slot' $last_received_lsn;
STOP_LOGICAL_REPLICATION '$slot';
The option to INIT_LOGICAL_REPLICATION would then get passed to the
'pg_decode_init' output plugin function (i.e. a function of that name
would get dlsym()'ed using the pg infrastructure for that).
Does that look good to you? Any suggestions?
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 11/14/12 4:27 PM, Andres Freund wrote:
Hi,
In response to this you will soon find the 14 patches that currently
implement $subject. I'll go over each one after showing off for a bit:
Lemme be the first to say, "wow". Impressive work.
Now the debugging starts ...
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Looks like cool stuff @-@
I might be interested in looking at that a bit as I think I will hopefully
be able to grab some time in the next couple of weeks.
Are some of those patches already submitted to a CF?
--
Michael Paquier
http://michael.otacoo.com
Hi,
On Thursday, November 15, 2012 05:08:26 AM Michael Paquier wrote:
Looks like cool stuff @-@
I might be interested in looking at that a bit as I think I will hopefully
be able to grab some time in the next couple of weeks.
Are some of those patches already submitted to a CF?
I added the patchset as one entry to the CF this time; it seems to me they are
too hard to judge individually to make them really separately reviewable.
I can split it off there, but really all the complicated stuff is in one patch
anyway...
Greetings,
Andres
On 14 November 2012 22:17, Andres Freund <andres@2ndquadrant.com> wrote:
To avoid complicating logic we store both, the toplevel and the subxids, in
->xip, first ->xcnt toplevel ones, and then ->subxcnt subxids.
That looks good, not much change. Will apply in next few days. Please
add me as committer and mark ready.
Also skip logging any subxids if the snapshot is suboverflowed, they aren't
useful in that case anyway.
This allows to make some operations cheaper and it allows faster startup for
the future logical decoding feature because that doesn't care about
subtransactions/suboverflow'edness.
...but please don't add extra touches of Andres magic along the way.
Doing that will just slow down patch acceptance and it's not important.
I suggest to keep note of things like that and come back to them
later.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-11-15 09:07:23 -0300, Simon Riggs wrote:
On 14 November 2012 22:17, Andres Freund <andres@2ndquadrant.com> wrote:
To avoid complicating logic we store both, the toplevel and the subxids, in
->xip, first ->xcnt toplevel ones, and then ->subxcnt subxids.
That looks good, not much change. Will apply in next few days. Please
add me as committer and mark ready.
Cool. Will do.
Also skip logging any subxids if the snapshot is suboverflowed, they aren't
useful in that case anyway.
This allows to make some operations cheaper and it allows faster startup for
the future logical decoding feature because that doesn't care about
subtransactions/suboverflow'edness.
...but please don't add extra touches of Andres magic along the way.
Doing that will just slow down patch acceptance and it's not important.
I suggest to keep note of things like that and come back to them
later.
Which magic are you talking about?
Only two parts changed in comparison to the previous situation. One is
that the following in ProcArrayApplyRecoveryInfo only applies to
toplevel transactions by virtue of ->xcnt now only containing the
toplevel transaction count:
+	/*
+	 * Remove stale locks, if any.
+	 *
+	 * Locks are always assigned to the toplevel xid so we don't need to care
+	 * about subxcnt/subxids (and by extension not about ->suboverflowed).
+	 */
 	StandbyReleaseOldLocks(running->xcnt, running->xids);
Note that there was no code change, just a change in meaning.
The other part is:
+	/*
+	 * Spin over procArray collecting all subxids, but only if there hasn't
+	 * been a suboverflow.
+	 */
+	if (!suboverflowed)
Well, that's something that basically had to be decided either way when
writing the patch...
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 15.11.2012 03:17, Andres Freund wrote:
Features:
- streaming reading/writing
- filtering
- reassembly of records
Reusing the ReadRecord infrastructure in situations where the code that wants
to do so is not tightly integrated into xlog.c is rather hard and would require
changes to rather integral parts of the recovery code which doesn't seem to be
a good idea.
Missing:
- "compressing" the stream when removing uninteresting records
- writing out correct CRCs
- separating reader/writer
I'm disappointed to see that there has been no progress on this patch
since last commitfest. I thought we agreed on the approach I championed
for here:
http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php. There
wasn't much work left to finish that, I believe.
Are you going to continue working on this?
- Heikki
On 2012-11-15 16:22:56 +0200, Heikki Linnakangas wrote:
On 15.11.2012 03:17, Andres Freund wrote:
Features:
- streaming reading/writing
- filtering
- reassembly of records
Reusing the ReadRecord infrastructure in situations where the code that wants
to do so is not tightly integrated into xlog.c is rather hard and would require
changes to rather integral parts of the recovery code which doesn't seem to be
a good idea.
Missing:
- "compressing" the stream when removing uninteresting records
- writing out correct CRCs
- separating reader/writer
I'm disappointed to see that there has been no progress on this patch since
last commitfest. I thought we agreed on the approach I championed for here:
http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php. There
wasn't much work left to finish that, I believe.
While I still think my approach is superior at this point I have
accepted that I haven't convinced anybody of that. I plan to port over
what I have submitted to Alvaro's version of your patch.
I have actually started that but I simply couldn't finish it in
time. The approach for porting I took didn't work all that well and I
plan to restart doing that after doing some review work.
Are you going to continue working on this?
"this" being my version of XlogReader? No. The patch above is unchanged
except for some very minor rebasing to recent WAL changes by Tom. The reason
it's included in the series is simply that I haven't gotten rid of it yet
and the subsequent patches needed it. I do plan to continue working on a
rebased xlogdump version if nobody beats me to it (please do beat me!).
Ok?
The cover letter said:
* Add support for a generic wal reading facility dubbed XLogReader
There's some discussion about what's the best way to implement this in a
separate CF topic.
(unchanged)
I should have folded that in into the patch description, sorry.
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Heikki Linnakangas wrote:
On 15.11.2012 03:17, Andres Freund wrote:
Features:
- streaming reading/writing
- filtering
- reassembly of records
Reusing the ReadRecord infrastructure in situations where the code that wants
to do so is not tightly integrated into xlog.c is rather hard and would require
changes to rather integral parts of the recovery code which doesn't seem to be
a good idea.
Missing:
- "compressing" the stream when removing uninteresting records
- writing out correct CRCs
- separating reader/writer
I'm disappointed to see that there has been no progress on this
patch since last commitfest. I thought we agreed on the approach I
championed for here:
http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php.
There wasn't much work left to finish that, I believe.
Are you going to continue working on this?
I worked a bit more on that patch of yours, but I neglected to submit
it. Did you have something in particular that you wanted changed in it?
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-11-15 11:50:37 -0300, Alvaro Herrera wrote:
Heikki Linnakangas wrote:
On 15.11.2012 03:17, Andres Freund wrote:
Features:
- streaming reading/writing
- filtering
- reassembly of records
Reusing the ReadRecord infrastructure in situations where the code that wants
to do so is not tightly integrated into xlog.c is rather hard and would require
changes to rather integral parts of the recovery code which doesn't seem to be
a good idea.
Missing:
- "compressing" the stream when removing uninteresting records
- writing out correct CRCs
- separating reader/writer
I'm disappointed to see that there has been no progress on this
patch since last commitfest. I thought we agreed on the approach I
championed for here:
http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php.
There wasn't much work left to finish that, I believe.
Are you going to continue working on this?
I worked a bit more on that patch of yours, but I neglected to submit
it. Did you have something in particular that you wanted changed in it?
Could you push your newest version to your git repository or similar?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
On 2012-11-15 11:50:37 -0300, Alvaro Herrera wrote:
I worked a bit more on that patch of yours, but I neglected to submit
it. Did you have something in particular that you wanted changed in it?
Could you push your newest version to your git repository or similar?
Sadly, I cannot, because I had it on my laptop only and its screen died
this morning (well, actually it doesn't boot at all, so I can't use the
external screen either). I'm trying to get it fixed as soon as possible
but obviously I have no idea when I will be able to get it back. Most
likely I will have to go out and buy a 2.5" drive enclosure to get the
valuable stuff out of it.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 15.11.2012 16:50, Alvaro Herrera wrote:
Heikki Linnakangas wrote:
On 15.11.2012 03:17, Andres Freund wrote:
Features:
- streaming reading/writing
- filtering
- reassembly of records
Reusing the ReadRecord infrastructure in situations where the code that wants
to do so is not tightly integrated into xlog.c is rather hard and would require
changes to rather integral parts of the recovery code which doesn't seem to be
a good idea.
Missing:
- "compressing" the stream when removing uninteresting records
- writing out correct CRCs
- separating reader/writer
I'm disappointed to see that there has been no progress on this
patch since last commitfest. I thought we agreed on the approach I
championed for here:
http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php.
There wasn't much work left to finish that, I believe.
Are you going to continue working on this?
I worked a bit more on that patch of yours, but I neglected to submit
it. Did you have something in particular that you wanted changed in it?
Off the top of my head, there were two open items with the patch as I
submitted it:
1. Need to make sure it's easy to compile outside backend code. So that
it's suitable for using in an xlogdump contrib module, for example.
2. do something about error reporting. In particular, xlogreader.c
should not call emode_for_corrupt_record(), but we need to provide for
that functionality somehow. I think I'd prefer xlogreader.c to not
ereport() on a corrupt record. Instead, it would return an error string
to the caller, which could then decide what to do with it. Translating
the messages needs some thought, though.
Other than those, and cleanup of any obsoleted comments etc. and adding
docs, I think it was good to go.
- Heikki
Heikki Linnakangas wrote:
On 15.11.2012 16:50, Alvaro Herrera wrote:
I worked a bit more on that patch of yours, but I neglected to submit
it. Did you have something in particular that you wanted changed in it?
Off the top of my head, there were two open items with the patch
as I submitted it:1. Need to make sure it's easy to compile outside backend code. So
that it's suitable for using in an xlogdump contrib module, for
example.
2. do something about error reporting. In particular, xlogreader.c
should not call emode_for_corrupt_record(), but we need to provide
for that functionality somehow. I think I'd prefer xlogreader.c to
not ereport() on a corrupt record. Instead, it would return an error
string to the caller, which could then decide what to do with it.
Translating the messages needs some thought, though.
Other than those, and cleanup of any obsoleted comments etc. and
adding docs, I think it was good to go.
Thanks. I was toying with the idea that xlogreader.c should return a
status code to the caller, and additionally an error string; not all
error cases are equal.
Most of what I did (other than general cleanup) was moving some xlog.c
global vars into a private_data struct for xlogreader.c to pass around;
one problem I had was deciding what to do with curFileTLI and
LastFileTLI (IIRC), because they are used outside of the reader module
(they were examined after recovery finished).
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 11/14/12 8:17 PM, Andres Freund wrote:
diff --git a/src/bin/Makefile b/src/bin/Makefile
index b4dfdba..9992f7a 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -14,7 +14,7 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS = initdb pg_ctl pg_dump \
-	psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup
+	psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup xlogdump
should be pg_xlogdump
ifeq ($(PORTNAME), win32)
SUBDIRS += pgevent

diff --git a/src/bin/xlogdump/Makefile b/src/bin/xlogdump/Makefile
new file mode 100644
index 0000000..d54640a
--- /dev/null
+++ b/src/bin/xlogdump/Makefile
@@ -0,0 +1,25 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/xlogdump
+#
+# Copyright (c) 1998-2012, PostgreSQL Global Development Group
+#
+# src/bin/pg_resetxlog/Makefile
fix that
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "xlogdump"
+PGAPPICON=win32
+
+subdir = src/bin/xlogdump
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS= xlogdump.o \
+	$(WIN32RES)
+
+all: xlogdump
+
+
+xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name objfiles.txt|xargs cat|tr -s " " "\012"|grep -v /main.o|sed 's/^/..\/..\/..\//')
+	$(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
This looks pretty evil, and there is no documentation about what it is
supposed to do.
Windows build support needs some thought.
diff --git a/src/bin/xlogdump/xlogdump.c b/src/bin/xlogdump/xlogdump.c
new file mode 100644
index 0000000..0f984e4
--- /dev/null
+++ b/src/bin/xlogdump/xlogdump.c
@@ -0,0 +1,468 @@
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xlogreader.h"
+#include "access/rmgr.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+
+#include "getopt_long.h"
+
+/*
+ * needs to be declared because otherwise its defined in main.c which we cannot
+ * link from here.
+ */
+const char *progname = "xlogdump";
Which may be a reason not to link with main.o. We generally don't want
to hardcode the program name inside the program.
+static void
+usage(void)
+{
+	printf(_("%s reads/writes postgres transaction logs for debugging.\n\n"),
+		   progname);
+	printf(_("Usage:\n"));
+	printf(_("  %s [OPTION]...\n"), progname);
+	printf(_("\nOptions:\n"));
+	printf(_("  -v, --version  output version information, then exit\n"));
+	printf(_("  -h, --help     show this help, then exit\n"));
+	printf(_("  -s, --start    from where recptr onwards to read\n"));
+	printf(_("  -e, --end      up to which recptr to read\n"));
+	printf(_("  -t, --timeline which timeline do we want to read\n"));
+	printf(_("  -i, --inpath   from where do we want to read? cwd/pg_xlog is the default\n"));
+	printf(_("  -o, --output   where to write [start, end]\n"));
+	printf(_("  -f, --file     wal file to parse\n"));
+}
Options list should be in alphabetic order (or some other less random
order). Most of these descriptions are not very intelligible (at least
without additional documentation).
+
+int main(int argc, char **argv)
+{
+	uint32		xlogid;
+	uint32		xrecoff;
+	XLogReaderState *xlogreader_state;
+	XLogDumpPrivateData private;
+	XLogRecPtr	from = InvalidXLogRecPtr;
+	XLogRecPtr	to = InvalidXLogRecPtr;
+	bool		bad_argument = false;
+
+	static struct option long_options[] = {
+		{"help", no_argument, NULL, 'h'},
+		{"version", no_argument, NULL, 'v'},
Standard letters for help and version are ? and V.
+		{"start", required_argument, NULL, 's'},
+		{"end", required_argument, NULL, 'e'},
+		{"timeline", required_argument, NULL, 't'},
+		{"inpath", required_argument, NULL, 'i'},
+		{"outpath", required_argument, NULL, 'o'},
+		{"file", required_argument, NULL, 'f'},
+		{NULL, 0, NULL, 0}
+	};
+	int			c;
+	int			option_index;
+
+	memset(&private, 0, sizeof(XLogDumpPrivateData));
+
+	while ((c = getopt_long(argc, argv, "hvs:e:t:i:o:f:",
This could also be in a less random order.
+						long_options, &option_index)) != -1)
+	{
+		switch (c)
+		{
+			case 'h':
+				usage();
+				exit(0);
+				break;
+			case 'v':
+				printf("Version: 0.1\n");
+				exit(0);
+				break;
This should be the PostgreSQL version.
also:
no man page
no nls.mk
On 2012-11-15 11:31:55 -0500, Peter Eisentraut wrote:
On 11/14/12 8:17 PM, Andres Freund wrote:
diff --git a/src/bin/Makefile b/src/bin/Makefile
index b4dfdba..9992f7a 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -14,7 +14,7 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS = initdb pg_ctl pg_dump \
-	psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup
+	psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup xlogdump
should be pg_xlogdump
Good point.
ifeq ($(PORTNAME), win32)
SUBDIRS += pgevent

diff --git a/src/bin/xlogdump/Makefile b/src/bin/xlogdump/Makefile
new file mode 100644
index 0000000..d54640a
--- /dev/null
+++ b/src/bin/xlogdump/Makefile
@@ -0,0 +1,25 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/xlogdump
+#
+# Copyright (c) 1998-2012, PostgreSQL Global Development Group
+#
+# src/bin/pg_resetxlog/Makefile
fix that
Ditto.
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "xlogdump"
+PGAPPICON=win32
+
+subdir = src/bin/xlogdump
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS= xlogdump.o \
+	$(WIN32RES)
+
+all: xlogdump
+
+
+xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name objfiles.txt|xargs cat|tr -s " " "\012"|grep -v /main.o|sed 's/^/..\/..\/..\//')
+	$(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
This looks pretty evil, and there is no documentation about what it is
supposed to do.
There has been some talk about this before and this clearly isn't an
acceptable solution. The previously stated idea was to split off the
_desc routines so we don't need to link with the whole backend.
Alvaro started to work on that a bit:
http://archives.postgresql.org/message-id/1346268803-sup-9854%40alvh.no-ip.org
(What the above does is simply collect all backend object files, remove
main.o from that list, and list them as dependencies.)
Windows build support needs some thought.
I don't have the slightest clue how the windows build environment works,
is there still a problem if we only link to a very selected list of
backend object files? Or do we need to link them to some external
location?
diff --git a/src/bin/xlogdump/xlogdump.c b/src/bin/xlogdump/xlogdump.c
new file mode 100644
index 0000000..0f984e4
--- /dev/null
+++ b/src/bin/xlogdump/xlogdump.c
@@ -0,0 +1,468 @@
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xlogreader.h"
+#include "access/rmgr.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+
+#include "getopt_long.h"
+
+/*
+ * needs to be declared because otherwise its defined in main.c which we cannot
+ * link from here.
+ */
+const char *progname = "xlogdump";
Which may be a reason not to link with main.o.
Well, the problem is caused by us not linking to main.o, but yes:
really fixing this is definitely the goal, it's just not really possible yet.
+static void
+usage(void)
+{
+	printf(_("%s reads/writes postgres transaction logs for debugging.\n\n"),
+		   progname);
+	printf(_("Usage:\n"));
+	printf(_("  %s [OPTION]...\n"), progname);
+	printf(_("\nOptions:\n"));
+	printf(_("  -v, --version  output version information, then exit\n"));
+	printf(_("  -h, --help     show this help, then exit\n"));
+	printf(_("  -s, --start    from where recptr onwards to read\n"));
+	printf(_("  -e, --end      up to which recptr to read\n"));
+	printf(_("  -t, --timeline which timeline do we want to read\n"));
+	printf(_("  -i, --inpath   from where do we want to read? cwd/pg_xlog is the default\n"));
+	printf(_("  -o, --output   where to write [start, end]\n"));
+	printf(_("  -f, --file     wal file to parse\n"));
+}
Options list should be in alphabetic order (or some other less random
order). Most of these descriptions are not very intelligible (at
least without additional documentation).
True, it's noticeable that this mostly was a development tool. But it
shouldn't stay that way. There have been several bug reports of late
where a bin/pg_xlogdump would have been very helpful...
This should be the PostgreSQL version.
also:
no man page
no nls.mk
Will try to provide an actually submittable version once the
xlogreader situation is finalized and the _desc routines are split out...
Thanks!
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 14, 2012 at 5:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
---
src/bin/Makefile | 2 +-
src/bin/xlogdump/Makefile | 25 +++
src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 494 insertions(+), 1 deletion(-)
create mode 100644 src/bin/xlogdump/Makefile
create mode 100644 src/bin/xlogdump/xlogdump.c
Is this intended to be the successor of
https://github.com/snaga/xlogdump which will then be deprecated?
Thanks,
Jeff
On 2012-11-15 09:06:23 -0800, Jeff Janes wrote:
On Wed, Nov 14, 2012 at 5:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
---
src/bin/Makefile | 2 +-
src/bin/xlogdump/Makefile | 25 +++
src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 494 insertions(+), 1 deletion(-)
create mode 100644 src/bin/xlogdump/Makefile
create mode 100644 src/bin/xlogdump/xlogdump.c
Is this intended to be the successor of
https://github.com/snaga/xlogdump which will then be deprecated?
As-is this is just a development tool which was sorely needed for the
development of this patchset. But yes, I think that once ready
(xlogreader infrastructure, *_desc routines split out) it should
definitely be able to do most of what the above xlogdump can do, and it
should live in bin/. I think mostly some filtering is missing.
That doesn't really "deprecate" the above though.
Does that answer your question?
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Heikki Linnakangas wrote:
I'm disappointed to see that there has been no progress on this
patch since last commitfest. I thought we agreed on the approach I
championed for here:
http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php.
There wasn't much work left to finish that, I believe.
Are you going to continue working on this?
Here's what I have right now. It's your patch, plus some tweaks such as
changing the timing for allocating readRecordBuf; I also added a struct
to contain XLogReadPage's private data, instead of using global
variables. (The main conclusion I get from this is that it's
relatively easy to split out reading of XLog out of xlog.c; there are
some global variables still remaining, but AFAICS that should be
relatively simple to fix).
There is no consensus on the way to handle error reporting. Tom
suggests having the hypothetical client-side code redefine ereport()
somehow; as far as I can see that means we would have to reimplement
errstart, errfinish, etc. That doesn't sound all that nice to me.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
xlogreader-heikki-2.patch (text/x-diff; charset=us-ascii)
*** a/src/backend/access/transam/Makefile
--- b/src/backend/access/transam/Makefile
***************
*** 14,20 **** include $(top_builddir)/src/Makefile.global
OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
timeline.o twophase.o twophase_rmgr.o xlog.o xlogarchive.o xlogfuncs.o \
! xlogutils.o
include $(top_srcdir)/src/backend/common.mk
--- 14,20 ----
OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
timeline.o twophase.o twophase_rmgr.o xlog.o xlogarchive.o xlogfuncs.o \
! xlogreader.o xlogutils.o
include $(top_srcdir)/src/backend/common.mk
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 30,35 ****
--- 30,36 ----
#include "access/twophase.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
+ #include "access/xlogreader.h"
#include "access/xlogutils.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
***************
*** 192,205 **** static bool LocalHotStandbyActive = false;
*/
static int LocalXLogInsertAllowed = -1;
! /* Are we recovering using offline XLOG archives? (only valid in the startup process) */
! bool InArchiveRecovery = false;
/* Was the last xlog file restored from archive, or local? */
static bool restoredFromArchive = false;
/* options taken from recovery.conf for archive recovery */
! char *recoveryRestoreCommand = NULL;
static char *recoveryEndCommand = NULL;
static char *archiveCleanupCommand = NULL;
static RecoveryTargetType recoveryTarget = RECOVERY_TARGET_UNSET;
--- 193,209 ----
*/
static int LocalXLogInsertAllowed = -1;
! /*
! * Are we recovering using offline XLOG archives? (only valid in the startup
! * process)
! */
! bool InArchiveRecovery = false;
/* Was the last xlog file restored from archive, or local? */
static bool restoredFromArchive = false;
/* options taken from recovery.conf for archive recovery */
! char *recoveryRestoreCommand = NULL;
static char *recoveryEndCommand = NULL;
static char *archiveCleanupCommand = NULL;
static RecoveryTargetType recoveryTarget = RECOVERY_TARGET_UNSET;
***************
*** 210,216 **** static TimestampTz recoveryTargetTime;
static char *recoveryTargetName;
/* options taken from recovery.conf for XLOG streaming */
! bool StandbyMode = false;
static char *PrimaryConnInfo = NULL;
static char *TriggerFile = NULL;
--- 214,220 ----
static char *recoveryTargetName;
/* options taken from recovery.conf for XLOG streaming */
! bool StandbyMode = false;
static char *PrimaryConnInfo = NULL;
static char *TriggerFile = NULL;
***************
*** 389,395 **** typedef struct XLogCtlData
uint32 ckptXidEpoch; /* nextXID & epoch of latest checkpoint */
TransactionId ckptXid;
XLogRecPtr asyncXactLSN; /* LSN of newest async commit/abort */
! XLogSegNo lastRemovedSegNo; /* latest removed/recycled XLOG segment */
/* Protected by WALWriteLock: */
XLogCtlWrite Write;
--- 393,400 ----
uint32 ckptXidEpoch; /* nextXID & epoch of latest checkpoint */
TransactionId ckptXid;
XLogRecPtr asyncXactLSN; /* LSN of newest async commit/abort */
! XLogSegNo lastRemovedSegNo; /* latest removed/recycled XLOG
! * segment */
/* Protected by WALWriteLock: */
XLogCtlWrite Write;
***************
*** 530,554 **** static XLogSegNo openLogSegNo = 0;
static uint32 openLogOff = 0;
/*
! * These variables are used similarly to the ones above, but for reading
* the XLOG. Note, however, that readOff generally represents the offset
* of the page just read, not the seek position of the FD itself, which
* will be just past that page. readLen indicates how much of the current
* page has been read into readBuf, and readSource indicates where we got
* the currently open file from.
*/
! static int readFile = -1;
! static XLogSegNo readSegNo = 0;
! static uint32 readOff = 0;
! static uint32 readLen = 0;
! static bool readFileHeaderValidated = false;
! static int readSource = 0; /* XLOG_FROM_* code */
!
! /*
! * Keeps track of which sources we've tried to read the current WAL
! * record from and failed.
! */
! static int failedSources = 0; /* OR of XLOG_FROM_* codes */
/*
* These variables track when we last obtained some WAL data to process,
--- 535,563 ----
static uint32 openLogOff = 0;
/*
! * Status data for XLogPageRead.
! *
! * The first three are used similarly to the ones above, but for reading
* the XLOG. Note, however, that readOff generally represents the offset
* of the page just read, not the seek position of the FD itself, which
* will be just past that page. readLen indicates how much of the current
* page has been read into readBuf, and readSource indicates where we got
* the currently open file from.
+ *
+ * failedSources keeps track of which sources we've tried to read the current
+ * WAL record from and failed.
*/
! typedef struct XLogPageReadPrivate
! {
! int readFile;
! XLogSegNo readSegNo;
! uint32 readOff;
! uint32 readLen;
! bool readFileHeaderValidated;
! bool fetching_ckpt; /* are we fetching a checkpoint record? */
! int readSource; /* XLOG_FROM_* code */
! int failedSources; /* OR of XLOG_FROM_* codes */
! } XLogPageReadPrivate;
/*
* These variables track when we last obtained some WAL data to process,
***************
*** 559,571 **** static int failedSources = 0; /* OR of XLOG_FROM_* codes */
static TimestampTz XLogReceiptTime = 0;
static int XLogReceiptSource = 0; /* XLOG_FROM_* code */
- /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
- static char *readBuf = NULL;
-
- /* Buffer for current ReadRecord result (expandable) */
- static char *readRecordBuf = NULL;
- static uint32 readRecordBufSize = 0;
-
/* State information for XLOG reading */
static XLogRecPtr ReadRecPtr; /* start of last record read */
static XLogRecPtr EndRecPtr; /* end+1 of last record read */
--- 568,573 ----
***************
*** 608,614 **** typedef struct xl_restore_point
static void readRecoveryCommandFile(void);
! static void exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo);
static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
static void recoveryPausesHere(void);
static void SetLatestXTime(TimestampTz xtime);
--- 610,617 ----
static void readRecoveryCommandFile(void);
! static void exitArchiveRecovery(XLogPageReadPrivate *private, TimeLineID endTLI,
! XLogSegNo endLogSegNo);
static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
static void recoveryPausesHere(void);
static void SetLatestXTime(TimestampTz xtime);
***************
*** 627,640 **** static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
bool find_free, int *max_advance,
bool use_lock);
! static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
! int source, bool notexistOk);
! static int XLogFileReadAnyTLI(XLogSegNo segno, int emode, int sources);
! static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
! bool randAccess);
! static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
! bool fetching_ckpt);
! static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
static void XLogFileClose(void);
static void PreallocXlogFiles(XLogRecPtr endptr);
static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr);
--- 630,643 ----
static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
bool find_free, int *max_advance,
bool use_lock);
! static int XLogFileRead(XLogPageReadPrivate *private, XLogSegNo segno,
! int emode, TimeLineID tli, int source, bool notexistOk);
! static int XLogFileReadAnyTLI(XLogPageReadPrivate *private, XLogSegNo segno,
! int emode, int sources);
! static bool XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
! int emode, bool randAccess, char *readBuf, void *private_data);
! static bool WaitForWALToBecomeAvailable(XLogPageReadPrivate *private,
! XLogRecPtr RecPtr, bool randAccess);
static void XLogFileClose(void);
static void PreallocXlogFiles(XLogRecPtr endptr);
static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr);
***************
*** 642,653 **** static void UpdateLastRemovedPtr(char *filename);
static void ValidateXLOGDirectoryStructure(void);
static void CleanupBackupHistory(void);
static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
! static XLogRecord *ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt);
! static void CheckRecoveryConsistency(void);
! static bool ValidXLogPageHeader(XLogPageHeader hdr, int emode);
! static bool ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record,
! int emode, bool randAccess);
! static XLogRecord *ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt);
static bool rescanLatestTimeLine(void);
static void WriteControlFile(void);
static void ReadControlFile(void);
--- 645,657 ----
static void ValidateXLOGDirectoryStructure(void);
static void CleanupBackupHistory(void);
static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
! static XLogRecord *ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
! int emode, bool fetching_ckpt);
! static void CheckRecoveryConsistency(XLogRecPtr EndRecPtr);
! static bool ValidXLogPageHeader(XLogSegNo segno, uint32 offset, int source,
! XLogPageHeader hdr, int emode);
! static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
! XLogRecPtr RecPtr, int whichChkpt);
static bool rescanLatestTimeLine(void);
static void WriteControlFile(void);
static void ReadControlFile(void);
***************
*** 1514,1520 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
*/
if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
! (uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
(uint32) (XLogCtl->xlblocks[curridx] >> 32),
(uint32) XLogCtl->xlblocks[curridx]);
--- 1518,1524 ----
*/
if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
! (uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
(uint32) (XLogCtl->xlblocks[curridx] >> 32),
(uint32) XLogCtl->xlblocks[curridx]);
***************
*** 1580,1588 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
if (lseek(openLogFile, (off_t) startoffset, SEEK_SET) < 0)
ereport(PANIC,
(errcode_for_file_access(),
! errmsg("could not seek in log file %s to offset %u: %m",
! XLogFileNameP(ThisTimeLineID, openLogSegNo),
! startoffset)));
openLogOff = startoffset;
}
--- 1584,1592 ----
if (lseek(openLogFile, (off_t) startoffset, SEEK_SET) < 0)
ereport(PANIC,
(errcode_for_file_access(),
! errmsg("could not seek in log file %s to offset %u: %m",
! XLogFileNameP(ThisTimeLineID, openLogSegNo),
! startoffset)));
openLogOff = startoffset;
}
***************
*** 1823,1829 **** UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
if (!force && XLByteLT(newMinRecoveryPoint, lsn))
elog(WARNING,
"xlog min recovery request %X/%X is past current point %X/%X",
! (uint32) (lsn >> 32) , (uint32) lsn,
(uint32) (newMinRecoveryPoint >> 32),
(uint32) newMinRecoveryPoint);
--- 1827,1833 ----
if (!force && XLByteLT(newMinRecoveryPoint, lsn))
elog(WARNING,
"xlog min recovery request %X/%X is past current point %X/%X",
! (uint32) (lsn >> 32), (uint32) lsn,
(uint32) (newMinRecoveryPoint >> 32),
(uint32) newMinRecoveryPoint);
***************
*** 1877,1883 **** XLogFlush(XLogRecPtr record)
elog(LOG, "xlog flush request %X/%X; write %X/%X; flush %X/%X",
(uint32) (record >> 32), (uint32) record,
(uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
! (uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
#endif
START_CRIT_SECTION();
--- 1881,1887 ----
elog(LOG, "xlog flush request %X/%X; write %X/%X; flush %X/%X",
(uint32) (record >> 32), (uint32) record,
(uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
! (uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
#endif
START_CRIT_SECTION();
***************
*** 1941,1948 **** XLogFlush(XLogRecPtr record)
/*
* Sleep before flush! By adding a delay here, we may give further
* backends the opportunity to join the backlog of group commit
! * followers; this can significantly improve transaction throughput, at
! * the risk of increasing transaction latency.
*
* We do not sleep if enableFsync is not turned on, nor if there are
* fewer than CommitSiblings other backends with active transactions.
--- 1945,1952 ----
/*
* Sleep before flush! By adding a delay here, we may give further
* backends the opportunity to join the backlog of group commit
! * followers; this can significantly improve transaction throughput,
! * at the risk of increasing transaction latency.
*
* We do not sleep if enableFsync is not turned on, nor if there are
* fewer than CommitSiblings other backends with active transactions.
***************
*** 1957,1963 **** XLogFlush(XLogRecPtr record)
XLogCtlInsert *Insert = &XLogCtl->Insert;
uint32 freespace = INSERT_FREESPACE(Insert);
! if (freespace == 0) /* buffer is full */
WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
else
{
--- 1961,1967 ----
XLogCtlInsert *Insert = &XLogCtl->Insert;
uint32 freespace = INSERT_FREESPACE(Insert);
! if (freespace == 0) /* buffer is full */
WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
else
{
***************
*** 2010,2016 **** XLogFlush(XLogRecPtr record)
elog(ERROR,
"xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
(uint32) (record >> 32), (uint32) record,
! (uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
}
/*
--- 2014,2020 ----
elog(ERROR,
"xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
(uint32) (record >> 32), (uint32) record,
! (uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
}
/*
***************
*** 2089,2095 **** XLogBackgroundFlush(void)
elog(LOG, "xlog bg flush request %X/%X; write %X/%X; flush %X/%X",
(uint32) (WriteRqstPtr >> 32), (uint32) WriteRqstPtr,
(uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
! (uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
#endif
START_CRIT_SECTION();
--- 2093,2099 ----
elog(LOG, "xlog bg flush request %X/%X; write %X/%X; flush %X/%X",
(uint32) (WriteRqstPtr >> 32), (uint32) WriteRqstPtr,
(uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
! (uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
#endif
START_CRIT_SECTION();
***************
*** 2329,2335 **** XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
if (fd < 0)
ereport(ERROR,
(errcode_for_file_access(),
! errmsg("could not open file \"%s\": %m", path)));
elog(DEBUG2, "done creating and filling new WAL file");
--- 2333,2339 ----
if (fd < 0)
ereport(ERROR,
(errcode_for_file_access(),
! errmsg("could not open file \"%s\": %m", path)));
elog(DEBUG2, "done creating and filling new WAL file");
***************
*** 2568,2575 **** XLogFileOpen(XLogSegNo segno)
* Otherwise, it's assumed to be already available in pg_xlog.
*/
static int
! XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
! int source, bool notfoundOk)
{
char xlogfname[MAXFNAMELEN];
char activitymsg[MAXFNAMELEN + 16];
--- 2572,2579 ----
* Otherwise, it's assumed to be already available in pg_xlog.
*/
static int
! XLogFileRead(XLogPageReadPrivate *private, XLogSegNo segno, int emode,
! TimeLineID tli, int source, bool notfoundOk)
{
char xlogfname[MAXFNAMELEN];
char activitymsg[MAXFNAMELEN + 16];
***************
*** 2616,2624 **** XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
XLogFilePath(xlogfpath, tli, segno);
if (stat(xlogfpath, &statbuf) == 0)
{
! char oldpath[MAXPGPATH];
#ifdef WIN32
static unsigned int deletedcounter = 1;
/*
* On Windows, if another process (e.g a walsender process) holds
* the file open in FILE_SHARE_DELETE mode, unlink will succeed,
--- 2620,2630 ----
XLogFilePath(xlogfpath, tli, segno);
if (stat(xlogfpath, &statbuf) == 0)
{
! char oldpath[MAXPGPATH];
!
#ifdef WIN32
static unsigned int deletedcounter = 1;
+
/*
* On Windows, if another process (e.g a walsender process) holds
* the file open in FILE_SHARE_DELETE mode, unlink will succeed,
***************
*** 2685,2698 **** XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
set_ps_display(activitymsg, false);
/* Track source of data in assorted state variables */
! readSource = source;
XLogReceiptSource = source;
/* In FROM_STREAM case, caller tracks receipt time, not me */
if (source != XLOG_FROM_STREAM)
XLogReceiptTime = GetCurrentTimestamp();
/* The file header needs to be validated on first access */
! readFileHeaderValidated = false;
return fd;
}
--- 2691,2704 ----
set_ps_display(activitymsg, false);
/* Track source of data in assorted state variables */
! private->readSource = source;
XLogReceiptSource = source;
/* In FROM_STREAM case, caller tracks receipt time, not me */
if (source != XLOG_FROM_STREAM)
XLogReceiptTime = GetCurrentTimestamp();
/* The file header needs to be validated on first access */
! private->readFileHeaderValidated = false;
return fd;
}
***************
*** 2709,2715 **** XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
* This version searches for the segment with any TLI listed in expectedTLIs.
*/
static int
! XLogFileReadAnyTLI(XLogSegNo segno, int emode, int sources)
{
char path[MAXPGPATH];
ListCell *cell;
--- 2715,2722 ----
* This version searches for the segment with any TLI listed in expectedTLIs.
*/
static int
! XLogFileReadAnyTLI(XLogPageReadPrivate *private, XLogSegNo segno, int emode,
! int sources)
{
char path[MAXPGPATH];
ListCell *cell;
***************
*** 2734,2740 **** XLogFileReadAnyTLI(XLogSegNo segno, int emode, int sources)
if (sources & XLOG_FROM_ARCHIVE)
{
! fd = XLogFileRead(segno, emode, tli, XLOG_FROM_ARCHIVE, true);
if (fd != -1)
{
elog(DEBUG1, "got WAL segment from archive");
--- 2741,2748 ----
if (sources & XLOG_FROM_ARCHIVE)
{
! fd = XLogFileRead(private, segno, emode, tli,
! XLOG_FROM_ARCHIVE, true);
if (fd != -1)
{
elog(DEBUG1, "got WAL segment from archive");
***************
*** 2744,2750 **** XLogFileReadAnyTLI(XLogSegNo segno, int emode, int sources)
if (sources & XLOG_FROM_PG_XLOG)
{
! fd = XLogFileRead(segno, emode, tli, XLOG_FROM_PG_XLOG, true);
if (fd != -1)
return fd;
}
--- 2752,2759 ----
if (sources & XLOG_FROM_PG_XLOG)
{
! fd = XLogFileRead(private, segno, emode, tli,
! XLOG_FROM_PG_XLOG, true);
if (fd != -1)
return fd;
}
***************
*** 3177,3278 **** RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index,
}
/*
- * CRC-check an XLOG record. We do not believe the contents of an XLOG
- * record (other than to the minimal extent of computing the amount of
- * data to read in) until we've checked the CRCs.
- *
- * We assume all of the record (that is, xl_tot_len bytes) has been read
- * into memory at *record. Also, ValidXLogRecordHeader() has accepted the
- * record's header, which means in particular that xl_tot_len is at least
- * SizeOfXlogRecord, so it is safe to fetch xl_len.
- */
- static bool
- RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
- {
- pg_crc32 crc;
- int i;
- uint32 len = record->xl_len;
- BkpBlock bkpb;
- char *blk;
- size_t remaining = record->xl_tot_len;
-
- /* First the rmgr data */
- if (remaining < SizeOfXLogRecord + len)
- {
- /* ValidXLogRecordHeader() should've caught this already... */
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("invalid record length at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- remaining -= SizeOfXLogRecord + len;
- INIT_CRC32(crc);
- COMP_CRC32(crc, XLogRecGetData(record), len);
-
- /* Add in the backup blocks, if any */
- blk = (char *) XLogRecGetData(record) + len;
- for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
- {
- uint32 blen;
-
- if (!(record->xl_info & XLR_BKP_BLOCK(i)))
- continue;
-
- if (remaining < sizeof(BkpBlock))
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("invalid backup block size in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- memcpy(&bkpb, blk, sizeof(BkpBlock));
-
- if (bkpb.hole_offset + bkpb.hole_length > BLCKSZ)
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("incorrect hole size in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- blen = sizeof(BkpBlock) + BLCKSZ - bkpb.hole_length;
-
- if (remaining < blen)
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("invalid backup block size in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- remaining -= blen;
- COMP_CRC32(crc, blk, blen);
- blk += blen;
- }
-
- /* Check that xl_tot_len agrees with our calculation */
- if (remaining != 0)
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("incorrect total length in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
-
- /* Finally include the record header */
- COMP_CRC32(crc, (char *) record, offsetof(XLogRecord, xl_crc));
- FIN_CRC32(crc);
-
- if (!EQ_CRC32(record->xl_crc, crc))
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("incorrect resource manager data checksum in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
-
- return true;
- }
-
- /*
* Attempt to read an XLOG record.
*
* If RecPtr is not NULL, try to read a record at that position. Otherwise
--- 3186,3191 ----
***************
*** 3285,3605 **** RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
* the returned record pointer always points there.
*/
static XLogRecord *
! ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
{
XLogRecord *record;
! XLogRecPtr tmpRecPtr = EndRecPtr;
! bool randAccess = false;
! uint32 len,
! total_len;
! uint32 targetRecOff;
! uint32 pageHeaderSize;
! bool gotheader;
!
! if (readBuf == NULL)
! {
! /*
! * First time through, permanently allocate readBuf. We do it this
! * way, rather than just making a static array, for two reasons: (1)
! * no need to waste the storage in most instantiations of the backend;
! * (2) a static char array isn't guaranteed to have any particular
! * alignment, whereas malloc() will provide MAXALIGN'd storage.
! */
! readBuf = (char *) malloc(XLOG_BLCKSZ);
! Assert(readBuf != NULL);
! }
! if (RecPtr == NULL)
! {
! RecPtr = &tmpRecPtr;
!
! /*
! * RecPtr is pointing to end+1 of the previous WAL record. If
! * we're at a page boundary, no more records can fit on the current
! * page. We must skip over the page header, but we can't do that
! * until we've read in the page, since the header size is variable.
! */
! }
! else
! {
! /*
! * In this case, the passed-in record pointer should already be
! * pointing to a valid record starting position.
! */
! if (!XRecOffIsValid(*RecPtr))
! ereport(PANIC,
! (errmsg("invalid record offset at %X/%X",
! (uint32) (*RecPtr >> 32), (uint32) *RecPtr)));
!
! /*
! * Since we are going to a random position in WAL, forget any prior
! * state about what timeline we were in, and allow it to be any
! * timeline in expectedTLIs. We also set a flag to allow curFileTLI
! * to go backwards (but we can't reset that variable right here, since
! * we might not change files at all).
! */
lastPageTLI = 0; /* see comment in ValidXLogPageHeader */
- randAccess = true; /* allow curFileTLI to go backwards too */
- }
! /* This is the first try to read this page. */
! failedSources = 0;
! retry:
! /* Read the page containing the record */
! if (!XLogPageRead(RecPtr, emode, fetching_ckpt, randAccess))
! return NULL;
!
! pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
! targetRecOff = (*RecPtr) % XLOG_BLCKSZ;
! if (targetRecOff == 0)
! {
! /*
! * At page start, so skip over page header. The Assert checks that
! * we're not scribbling on caller's record pointer; it's OK because we
! * can only get here in the continuing-from-prev-record case, since
! * XRecOffIsValid rejected the zero-page-offset case otherwise.
! */
! Assert(RecPtr == &tmpRecPtr);
! (*RecPtr) += pageHeaderSize;
! targetRecOff = pageHeaderSize;
! }
! else if (targetRecOff < pageHeaderSize)
! {
! ereport(emode_for_corrupt_record(emode, *RecPtr),
! (errmsg("invalid record offset at %X/%X",
! (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
! goto next_record_is_invalid;
! }
! if ((((XLogPageHeader) readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
! targetRecOff == pageHeaderSize)
! {
! ereport(emode_for_corrupt_record(emode, *RecPtr),
! (errmsg("contrecord is requested by %X/%X",
! (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
! goto next_record_is_invalid;
! }
! /*
! * Read the record length.
! *
! * NB: Even though we use an XLogRecord pointer here, the whole record
! * header might not fit on this page. xl_tot_len is the first field of
! * the struct, so it must be on this page (the records are MAXALIGNed),
! * but we cannot access any other fields until we've verified that we
! * got the whole header.
! */
! record = (XLogRecord *) (readBuf + (*RecPtr) % XLOG_BLCKSZ);
! total_len = record->xl_tot_len;
!
! /*
! * If the whole record header is on this page, validate it immediately.
! * Otherwise do just a basic sanity check on xl_tot_len, and validate the
! * rest of the header after reading it from the next page. The xl_tot_len
! * check is necessary here to ensure that we enter the "Need to reassemble
! * record" code path below; otherwise we might fail to apply
! * ValidXLogRecordHeader at all.
! */
! if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
! {
! if (!ValidXLogRecordHeader(RecPtr, record, emode, randAccess))
! goto next_record_is_invalid;
! gotheader = true;
! }
! else
! {
! if (total_len < SizeOfXLogRecord)
! {
! ereport(emode_for_corrupt_record(emode, *RecPtr),
! (errmsg("invalid record length at %X/%X",
! (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
! goto next_record_is_invalid;
! }
! gotheader = false;
! }
!
! /*
! * Allocate or enlarge readRecordBuf as needed. To avoid useless small
! * increases, round its size to a multiple of XLOG_BLCKSZ, and make sure
! * it's at least 4*Max(BLCKSZ, XLOG_BLCKSZ) to start with. (That is
! * enough for all "normal" records, but very large commit or abort records
! * might need more space.)
! */
! if (total_len > readRecordBufSize)
! {
! uint32 newSize = total_len;
!
! newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
! newSize = Max(newSize, 4 * Max(BLCKSZ, XLOG_BLCKSZ));
! if (readRecordBuf)
! free(readRecordBuf);
! readRecordBuf = (char *) malloc(newSize);
! if (!readRecordBuf)
! {
! readRecordBufSize = 0;
! /* We treat this as a "bogus data" condition */
! ereport(emode_for_corrupt_record(emode, *RecPtr),
! (errmsg("record length %u at %X/%X too long",
! total_len, (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
! goto next_record_is_invalid;
! }
! readRecordBufSize = newSize;
! }
!
! len = XLOG_BLCKSZ - (*RecPtr) % XLOG_BLCKSZ;
! if (total_len > len)
{
! /* Need to reassemble record */
! char *contrecord;
! XLogPageHeader pageHeader;
! XLogRecPtr pagelsn;
! char *buffer;
! uint32 gotlen;
!
! /* Initialize pagelsn to the beginning of the page this record is on */
! pagelsn = ((*RecPtr) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
!
! /* Copy the first fragment of the record from the first page. */
! memcpy(readRecordBuf, readBuf + (*RecPtr) % XLOG_BLCKSZ, len);
! buffer = readRecordBuf + len;
! gotlen = len;
!
! do
{
! /* Calculate pointer to beginning of next page */
! XLByteAdvance(pagelsn, XLOG_BLCKSZ);
! /* Wait for the next page to become available */
! if (!XLogPageRead(&pagelsn, emode, false, false))
! return NULL;
!
! /* Check that the continuation on next page looks valid */
! pageHeader = (XLogPageHeader) readBuf;
! if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
! {
! ereport(emode_for_corrupt_record(emode, *RecPtr),
! (errmsg("there is no contrecord flag in log segment %s, offset %u",
! XLogFileNameP(curFileTLI, readSegNo),
! readOff)));
! goto next_record_is_invalid;
! }
! /*
! * Cross-check that xlp_rem_len agrees with how much of the record
! * we expect there to be left.
! */
! if (pageHeader->xlp_rem_len == 0 ||
! total_len != (pageHeader->xlp_rem_len + gotlen))
! {
! ereport(emode_for_corrupt_record(emode, *RecPtr),
! (errmsg("invalid contrecord length %u in log segment %s, offset %u",
! pageHeader->xlp_rem_len,
! XLogFileNameP(curFileTLI, readSegNo),
! readOff)));
! goto next_record_is_invalid;
! }
! /* Append the continuation from this page to the buffer */
! pageHeaderSize = XLogPageHeaderSize(pageHeader);
! contrecord = (char *) readBuf + pageHeaderSize;
! len = XLOG_BLCKSZ - pageHeaderSize;
! if (pageHeader->xlp_rem_len < len)
! len = pageHeader->xlp_rem_len;
! memcpy(buffer, (char *) contrecord, len);
! buffer += len;
! gotlen += len;
!
! /* If we just reassembled the record header, validate it. */
! if (!gotheader)
{
! record = (XLogRecord *) readRecordBuf;
! if (!ValidXLogRecordHeader(RecPtr, record, emode, randAccess))
! goto next_record_is_invalid;
! gotheader = true;
}
! } while (pageHeader->xlp_rem_len > len);
!
! record = (XLogRecord *) readRecordBuf;
! if (!RecordIsValid(record, *RecPtr, emode))
! goto next_record_is_invalid;
! pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
! XLogSegNoOffsetToRecPtr(
! readSegNo,
! readOff + pageHeaderSize + MAXALIGN(pageHeader->xlp_rem_len),
! EndRecPtr);
! ReadRecPtr = *RecPtr;
! }
! else
! {
! /* Record does not cross a page boundary */
! if (!RecordIsValid(record, *RecPtr, emode))
! goto next_record_is_invalid;
! EndRecPtr = *RecPtr + MAXALIGN(total_len);
!
! ReadRecPtr = *RecPtr;
! memcpy(readRecordBuf, record, total_len);
! }
!
! /*
! * Special processing if it's an XLOG SWITCH record
! */
! if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
! {
! /* Pretend it extends to end of segment */
! EndRecPtr += XLogSegSize - 1;
! EndRecPtr -= EndRecPtr % XLogSegSize;
- /*
- * Pretend that readBuf contains the last page of the segment. This is
- * just to avoid Assert failure in StartupXLOG if XLOG ends with this
- * segment.
- */
- readOff = XLogSegSize - XLOG_BLCKSZ;
- }
return record;
-
- next_record_is_invalid:
- failedSources |= readSource;
-
- if (readFile >= 0)
- {
- close(readFile);
- readFile = -1;
- }
-
- /* In standby-mode, keep trying */
- if (StandbyMode)
- goto retry;
- else
- return NULL;
}
/*
* Check whether the xlog header of a page just read in looks valid.
*
* This is just a convenience subroutine to avoid duplicated code in
! * ReadRecord. It's not intended for use from anywhere else.
*/
static bool
! ValidXLogPageHeader(XLogPageHeader hdr, int emode)
{
XLogRecPtr recaddr;
! XLogSegNoOffsetToRecPtr(readSegNo, readOff, recaddr);
if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
{
! ereport(emode_for_corrupt_record(emode, recaddr),
! (errmsg("invalid magic number %04X in log segment %s, offset %u",
! hdr->xlp_magic,
! XLogFileNameP(curFileTLI, readSegNo),
! readOff)));
return false;
}
if ((hdr->xlp_info & ~XLP_ALL_FLAGS) != 0)
{
! ereport(emode_for_corrupt_record(emode, recaddr),
(errmsg("invalid info bits %04X in log segment %s, offset %u",
hdr->xlp_info,
! XLogFileNameP(curFileTLI, readSegNo),
! readOff)));
return false;
}
if (hdr->xlp_info & XLP_LONG_HEADER)
--- 3198,3268 ----
* the returned record pointer always points there.
*/
static XLogRecord *
! ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
! bool fetching_ckpt)
{
XLogRecord *record;
! XLogPageReadPrivate *private =
! (XLogPageReadPrivate *) xlogreader->private_data;
! if (!XLogRecPtrIsInvalid(RecPtr))
lastPageTLI = 0; /* see comment in ValidXLogPageHeader */
! /* Set flag for XLogPageRead */
! private->fetching_ckpt = fetching_ckpt;
! /* This is the first try to read this page. */
! private->failedSources = 0;
! do
{
! record = XLogReadRecord(xlogreader, RecPtr, emode);
! ReadRecPtr = xlogreader->ReadRecPtr;
! EndRecPtr = xlogreader->EndRecPtr;
! if (record == NULL)
{
! private->failedSources |= private->readSource;
! if (private->readFile >= 0)
{
! close(private->readFile);
! private->readFile = -1;
}
! }
! } while (StandbyMode && record == NULL);
return record;
}
/*
* Check whether the xlog header of a page just read in looks valid.
*
* This is just a convenience subroutine to avoid duplicated code in
! * XLogPageRead. It's not intended for use from anywhere else.
*/
static bool
! ValidXLogPageHeader(XLogSegNo segno, uint32 offset, int source,
! XLogPageHeader hdr, int emode)
{
XLogRecPtr recaddr;
! XLogSegNoOffsetToRecPtr(segno, offset, recaddr);
if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
{
! ereport(emode_for_corrupt_record(emode, source, recaddr),
! (errmsg("invalid magic number %04X in log segment %s, offset %u",
! hdr->xlp_magic,
! XLogFileNameP(curFileTLI, segno),
! offset)));
return false;
}
if ((hdr->xlp_info & ~XLP_ALL_FLAGS) != 0)
{
! ereport(emode_for_corrupt_record(emode, source, recaddr),
(errmsg("invalid info bits %04X in log segment %s, offset %u",
hdr->xlp_info,
! XLogFileNameP(curFileTLI, segno),
! offset)));
return false;
}
if (hdr->xlp_info & XLP_LONG_HEADER)
***************
*** 3619,3625 **** ValidXLogPageHeader(XLogPageHeader hdr, int emode)
longhdr->xlp_sysid);
snprintf(sysident_str, sizeof(sysident_str), UINT64_FORMAT,
ControlFile->system_identifier);
! ereport(emode_for_corrupt_record(emode, recaddr),
(errmsg("WAL file is from different database system"),
errdetail("WAL file database system identifier is %s, pg_control database system identifier is %s.",
fhdrident_str, sysident_str)));
--- 3282,3288 ----
longhdr->xlp_sysid);
snprintf(sysident_str, sizeof(sysident_str), UINT64_FORMAT,
ControlFile->system_identifier);
! ereport(emode_for_corrupt_record(emode, source, recaddr),
(errmsg("WAL file is from different database system"),
errdetail("WAL file database system identifier is %s, pg_control database system identifier is %s.",
fhdrident_str, sysident_str)));
***************
*** 3627,3663 **** ValidXLogPageHeader(XLogPageHeader hdr, int emode)
}
if (longhdr->xlp_seg_size != XLogSegSize)
{
! ereport(emode_for_corrupt_record(emode, recaddr),
(errmsg("WAL file is from different database system"),
errdetail("Incorrect XLOG_SEG_SIZE in page header.")));
return false;
}
if (longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ)
{
! ereport(emode_for_corrupt_record(emode, recaddr),
(errmsg("WAL file is from different database system"),
errdetail("Incorrect XLOG_BLCKSZ in page header.")));
return false;
}
}
! else if (readOff == 0)
{
/* hmm, first page of file doesn't have a long header? */
! ereport(emode_for_corrupt_record(emode, recaddr),
(errmsg("invalid info bits %04X in log segment %s, offset %u",
hdr->xlp_info,
! XLogFileNameP(curFileTLI, readSegNo),
! readOff)));
return false;
}
if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
{
! ereport(emode_for_corrupt_record(emode, recaddr),
! (errmsg("unexpected pageaddr %X/%X in log segment %s, offset %u",
! (uint32) (hdr->xlp_pageaddr >> 32), (uint32) hdr->xlp_pageaddr,
! XLogFileNameP(curFileTLI, readSegNo),
! readOff)));
return false;
}
--- 3290,3326 ----
}
if (longhdr->xlp_seg_size != XLogSegSize)
{
! ereport(emode_for_corrupt_record(emode, source, recaddr),
(errmsg("WAL file is from different database system"),
errdetail("Incorrect XLOG_SEG_SIZE in page header.")));
return false;
}
if (longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ)
{
! ereport(emode_for_corrupt_record(emode, source, recaddr),
(errmsg("WAL file is from different database system"),
errdetail("Incorrect XLOG_BLCKSZ in page header.")));
return false;
}
}
! else if (offset == 0)
{
/* hmm, first page of file doesn't have a long header? */
! ereport(emode_for_corrupt_record(emode, source, recaddr),
(errmsg("invalid info bits %04X in log segment %s, offset %u",
hdr->xlp_info,
! XLogFileNameP(curFileTLI, segno),
! offset)));
return false;
}
if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
{
! ereport(emode_for_corrupt_record(emode, source, recaddr),
! (errmsg("unexpected pageaddr %X/%X in log segment %s, offset %u",
! (uint32) (hdr->xlp_pageaddr >> 32), (uint32) hdr->xlp_pageaddr,
! XLogFileNameP(curFileTLI, segno),
! offset)));
return false;
}
***************
*** 3666,3676 **** ValidXLogPageHeader(XLogPageHeader hdr, int emode)
*/
if (!list_member_int(expectedTLIs, (int) hdr->xlp_tli))
{
! ereport(emode_for_corrupt_record(emode, recaddr),
! (errmsg("unexpected timeline ID %u in log segment %s, offset %u",
! hdr->xlp_tli,
! XLogFileNameP(curFileTLI, readSegNo),
! readOff)));
return false;
}
--- 3329,3339 ----
*/
if (!list_member_int(expectedTLIs, (int) hdr->xlp_tli))
{
! ereport(emode_for_corrupt_record(emode, source, recaddr),
! (errmsg("unexpected timeline ID %u in log segment %s, offset %u",
! hdr->xlp_tli,
! XLogFileNameP(curFileTLI, segno),
! offset)));
return false;
}
***************
*** 3685,3695 **** ValidXLogPageHeader(XLogPageHeader hdr, int emode)
*/
if (hdr->xlp_tli < lastPageTLI)
{
! ereport(emode_for_corrupt_record(emode, recaddr),
(errmsg("out-of-sequence timeline ID %u (after %u) in log segment %s, offset %u",
hdr->xlp_tli, lastPageTLI,
! XLogFileNameP(curFileTLI, readSegNo),
! readOff)));
return false;
}
lastPageTLI = hdr->xlp_tli;
--- 3348,3358 ----
*/
if (hdr->xlp_tli < lastPageTLI)
{
! ereport(emode_for_corrupt_record(emode, source, recaddr),
(errmsg("out-of-sequence timeline ID %u (after %u) in log segment %s, offset %u",
hdr->xlp_tli, lastPageTLI,
! XLogFileNameP(curFileTLI, segno),
! offset)));
return false;
}
lastPageTLI = hdr->xlp_tli;
***************
*** 3697,3784 **** ValidXLogPageHeader(XLogPageHeader hdr, int emode)
}
/*
- * Validate an XLOG record header.
- *
- * This is just a convenience subroutine to avoid duplicated code in
- * ReadRecord. It's not intended for use from anywhere else.
- */
- static bool
- ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record, int emode,
- bool randAccess)
- {
- /*
- * xl_len == 0 is bad data for everything except XLOG SWITCH, where it is
- * required.
- */
- if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
- {
- if (record->xl_len != 0)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid xlog switch record at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- }
- else if (record->xl_len == 0)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("record with zero length at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
- record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
- XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid record length at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- if (record->xl_rmid > RM_MAX_ID)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid resource manager ID %u at %X/%X",
- record->xl_rmid, (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- if (randAccess)
- {
- /*
- * We can't exactly verify the prev-link, but surely it should be less
- * than the record's own address.
- */
- if (!XLByteLT(record->xl_prev, *RecPtr))
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("record with incorrect prev-link %X/%X at %X/%X",
- (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- }
- else
- {
- /*
- * Record's prev-link should exactly match our previous location. This
- * check guards against torn WAL pages where a stale but valid-looking
- * WAL record starts on a sector boundary.
- */
- if (!XLByteEQ(record->xl_prev, ReadRecPtr))
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("record with incorrect prev-link %X/%X at %X/%X",
- (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- }
-
- return true;
- }
-
- /*
* Scan for new timelines that might have appeared in the archive since we
* started recovery.
*
--- 3360,3365 ----
***************
*** 4739,4745 **** readRecoveryCommandFile(void)
* Exit archive-recovery state
*/
static void
! exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo)
{
char recoveryPath[MAXPGPATH];
char xlogpath[MAXPGPATH];
--- 4320,4327 ----
* Exit archive-recovery state
*/
static void
! exitArchiveRecovery(XLogPageReadPrivate *private, TimeLineID endTLI,
! XLogSegNo endLogSegNo)
{
char recoveryPath[MAXPGPATH];
char xlogpath[MAXPGPATH];
***************
*** 4758,4767 **** exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo)
* If the ending log segment is still open, close it (to avoid problems on
* Windows with trying to rename or delete an open file).
*/
! if (readFile >= 0)
{
! close(readFile);
! readFile = -1;
}
/*
--- 4340,4349 ----
* If the ending log segment is still open, close it (to avoid problems on
* Windows with trying to rename or delete an open file).
*/
! if (private->readFile >= 0)
{
! close(private->readFile);
! private->readFile = -1;
}
/*
***************
*** 5196,5201 **** StartupXLOG(void)
--- 4778,4785 ----
bool backupEndRequired = false;
bool backupFromStandby = false;
DBState dbstate_at_startup;
+ XLogReaderState *xlogreader;
+ XLogPageReadPrivate *private;
/*
* Read control file and check XLOG status looks valid.
***************
*** 5329,5334 **** StartupXLOG(void)
--- 4913,4923 ----
if (StandbyMode)
OwnLatch(&XLogCtl->recoveryWakeupLatch);
+ private = malloc(sizeof(XLogPageReadPrivate));
+ MemSet(private, 0, sizeof(XLogPageReadPrivate));
+ private->readFile = -1;
+ xlogreader = XLogReaderAllocate(InvalidXLogRecPtr, &XLogPageRead, private);
+
if (read_backup_label(&checkPointLoc, &backupEndRequired,
&backupFromStandby))
{
***************
*** 5336,5349 **** StartupXLOG(void)
* When a backup_label file is present, we want to roll forward from
* the checkpoint it identifies, rather than using pg_control.
*/
! record = ReadCheckpointRecord(checkPointLoc, 0);
if (record != NULL)
{
memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
wasShutdown = (record->xl_info == XLOG_CHECKPOINT_SHUTDOWN);
ereport(DEBUG1,
(errmsg("checkpoint record is at %X/%X",
! (uint32) (checkPointLoc >> 32), (uint32) checkPointLoc)));
InRecovery = true; /* force recovery even if SHUTDOWNED */
/*
--- 4925,4938 ----
* When a backup_label file is present, we want to roll forward from
* the checkpoint it identifies, rather than using pg_control.
*/
! record = ReadCheckpointRecord(xlogreader, checkPointLoc, 0);
if (record != NULL)
{
memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
wasShutdown = (record->xl_info == XLOG_CHECKPOINT_SHUTDOWN);
ereport(DEBUG1,
(errmsg("checkpoint record is at %X/%X",
! (uint32) (checkPointLoc >> 32), (uint32) checkPointLoc)));
InRecovery = true; /* force recovery even if SHUTDOWNED */
/*
***************
*** 5354,5360 **** StartupXLOG(void)
*/
if (XLByteLT(checkPoint.redo, checkPointLoc))
{
! if (!ReadRecord(&(checkPoint.redo), LOG, false))
ereport(FATAL,
(errmsg("could not find redo location referenced by checkpoint record"),
errhint("If you are not restoring from a backup, try removing the file \"%s/backup_label\".", DataDir)));
--- 4943,4949 ----
*/
if (XLByteLT(checkPoint.redo, checkPointLoc))
{
! if (!ReadRecord(xlogreader, checkPoint.redo, LOG, false))
ereport(FATAL,
(errmsg("could not find redo location referenced by checkpoint record"),
errhint("If you are not restoring from a backup, try removing the file \"%s/backup_label\".", DataDir)));
***************
*** 5378,5389 **** StartupXLOG(void)
*/
checkPointLoc = ControlFile->checkPoint;
RedoStartLSN = ControlFile->checkPointCopy.redo;
! record = ReadCheckpointRecord(checkPointLoc, 1);
if (record != NULL)
{
ereport(DEBUG1,
(errmsg("checkpoint record is at %X/%X",
! (uint32) (checkPointLoc >> 32), (uint32) checkPointLoc)));
}
else if (StandbyMode)
{
--- 4967,4978 ----
*/
checkPointLoc = ControlFile->checkPoint;
RedoStartLSN = ControlFile->checkPointCopy.redo;
! record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1);
if (record != NULL)
{
ereport(DEBUG1,
(errmsg("checkpoint record is at %X/%X",
! (uint32) (checkPointLoc >> 32), (uint32) checkPointLoc)));
}
else if (StandbyMode)
{
***************
*** 5397,5408 **** StartupXLOG(void)
else
{
checkPointLoc = ControlFile->prevCheckPoint;
! record = ReadCheckpointRecord(checkPointLoc, 2);
if (record != NULL)
{
ereport(LOG,
(errmsg("using previous checkpoint record at %X/%X",
! (uint32) (checkPointLoc >> 32), (uint32) checkPointLoc)));
InRecovery = true; /* force recovery even if SHUTDOWNED */
}
else
--- 4986,4997 ----
else
{
checkPointLoc = ControlFile->prevCheckPoint;
! record = ReadCheckpointRecord(xlogreader, checkPointLoc, 2);
if (record != NULL)
{
ereport(LOG,
(errmsg("using previous checkpoint record at %X/%X",
! (uint32) (checkPointLoc >> 32), (uint32) checkPointLoc)));
InRecovery = true; /* force recovery even if SHUTDOWNED */
}
else
***************
*** 5417,5423 **** StartupXLOG(void)
ereport(DEBUG1,
(errmsg("redo record is at %X/%X; shutdown %s",
! (uint32) (checkPoint.redo >> 32), (uint32) checkPoint.redo,
wasShutdown ? "TRUE" : "FALSE")));
ereport(DEBUG1,
(errmsg("next transaction ID: %u/%u; next OID: %u",
--- 5006,5012 ----
ereport(DEBUG1,
(errmsg("redo record is at %X/%X; shutdown %s",
! (uint32) (checkPoint.redo >> 32), (uint32) checkPoint.redo,
wasShutdown ? "TRUE" : "FALSE")));
ereport(DEBUG1,
(errmsg("next transaction ID: %u/%u; next OID: %u",
***************
*** 5698,5704 **** StartupXLOG(void)
* Allow read-only connections immediately if we're consistent
* already.
*/
! CheckRecoveryConsistency();
/*
* Find the first record that logically follows the checkpoint --- it
--- 5287,5293 ----
* Allow read-only connections immediately if we're consistent
* already.
*/
! CheckRecoveryConsistency(EndRecPtr);
/*
* Find the first record that logically follows the checkpoint --- it
***************
*** 5707,5718 **** StartupXLOG(void)
if (XLByteLT(checkPoint.redo, RecPtr))
{
/* back up to find the record */
! record = ReadRecord(&(checkPoint.redo), PANIC, false);
}
else
{
/* just have to read next record after CheckPoint */
! record = ReadRecord(NULL, LOG, false);
}
if (record != NULL)
--- 5296,5307 ----
if (XLByteLT(checkPoint.redo, RecPtr))
{
/* back up to find the record */
! record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);
}
else
{
/* just have to read next record after CheckPoint */
! record = ReadRecord(xlogreader, InvalidXLogRecPtr, LOG, false);
}
if (record != NULL)
***************
*** 5727,5733 **** StartupXLOG(void)
ereport(LOG,
(errmsg("redo starts at %X/%X",
! (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
/*
* main redo apply loop
--- 5316,5322 ----
ereport(LOG,
(errmsg("redo starts at %X/%X",
! (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
/*
* main redo apply loop
***************
*** 5743,5750 **** StartupXLOG(void)
initStringInfo(&buf);
appendStringInfo(&buf, "REDO @ %X/%X; LSN %X/%X: ",
! (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr,
! (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr);
xlog_outrec(&buf, record);
appendStringInfo(&buf, " - ");
RmgrTable[record->xl_rmid].rm_desc(&buf,
--- 5332,5339 ----
initStringInfo(&buf);
appendStringInfo(&buf, "REDO @ %X/%X; LSN %X/%X: ",
! (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr,
! (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr);
xlog_outrec(&buf, record);
appendStringInfo(&buf, " - ");
RmgrTable[record->xl_rmid].rm_desc(&buf,
***************
*** 5759,5765 **** StartupXLOG(void)
HandleStartupProcInterrupts();
/* Allow read-only connections if we're consistent now */
! CheckRecoveryConsistency();
/*
* Have we reached our recovery target?
--- 5348,5354 ----
HandleStartupProcInterrupts();
/* Allow read-only connections if we're consistent now */
! CheckRecoveryConsistency(EndRecPtr);
/*
* Have we reached our recovery target?
***************
*** 5863,5869 **** StartupXLOG(void)
LastRec = ReadRecPtr;
! record = ReadRecord(NULL, LOG, false);
} while (record != NULL && recoveryContinue);
/*
--- 5452,5458 ----
LastRec = ReadRecPtr;
! record = ReadRecord(xlogreader, InvalidXLogRecPtr, LOG, false);
} while (record != NULL && recoveryContinue);
/*
***************
*** 5872,5878 **** StartupXLOG(void)
ereport(LOG,
(errmsg("redo done at %X/%X",
! (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
xtime = GetLatestXTime();
if (xtime)
ereport(LOG,
--- 5461,5467 ----
ereport(LOG,
(errmsg("redo done at %X/%X",
! (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
xtime = GetLatestXTime();
if (xtime)
ereport(LOG,
***************
*** 5913,5919 **** StartupXLOG(void)
* Re-fetch the last valid or last applied record, so we can identify the
* exact endpoint of what we consider the valid portion of WAL.
*/
! record = ReadRecord(&LastRec, PANIC, false);
EndOfLog = EndRecPtr;
XLByteToPrevSeg(EndOfLog, endLogSegNo);
--- 5502,5508 ----
* Re-fetch the last valid or last applied record, so we can identify the
* exact endpoint of what we consider the valid portion of WAL.
*/
! record = ReadRecord(xlogreader, LastRec, PANIC, false);
EndOfLog = EndRecPtr;
XLByteToPrevSeg(EndOfLog, endLogSegNo);
***************
*** 5976,5982 **** StartupXLOG(void)
*/
if (InArchiveRecovery)
{
! char reason[200];
ThisTimeLineID = findNewestTimeLine(recoveryTargetTLI) + 1;
ereport(LOG,
--- 5565,5571 ----
*/
if (InArchiveRecovery)
{
! char reason[200];
ThisTimeLineID = findNewestTimeLine(recoveryTargetTLI) + 1;
ereport(LOG,
***************
*** 6017,6023 **** StartupXLOG(void)
* we will use that below.)
*/
if (InArchiveRecovery)
! exitArchiveRecovery(curFileTLI, endLogSegNo);
/*
* Prepare to write WAL starting at EndOfLog position, and init xlog
--- 5606,5612 ----
* we will use that below.)
*/
if (InArchiveRecovery)
! exitArchiveRecovery(private, curFileTLI, endLogSegNo);
/*
* Prepare to write WAL starting at EndOfLog position, and init xlog
***************
*** 6036,6043 **** StartupXLOG(void)
* record spans, not the one it starts in. The last block is indeed the
* one we want to use.
*/
! Assert(readOff == (XLogCtl->xlblocks[0] - XLOG_BLCKSZ) % XLogSegSize);
! memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
Insert->currpos = (char *) Insert->currpage +
(EndOfLog + XLOG_BLCKSZ - XLogCtl->xlblocks[0]);
--- 5625,5639 ----
* record spans, not the one it starts in. The last block is indeed the
* one we want to use.
*/
! if (EndOfLog % XLOG_BLCKSZ == 0)
! {
! memset(Insert->currpage, 0, XLOG_BLCKSZ);
! }
! else
! {
! Assert(private->readOff == (XLogCtl->xlblocks[0] - XLOG_BLCKSZ) % XLogSegSize);
! memcpy((char *) Insert->currpage, xlogreader->readBuf, XLOG_BLCKSZ);
! }
Insert->currpos = (char *) Insert->currpage +
(EndOfLog + XLOG_BLCKSZ - XLogCtl->xlblocks[0]);
***************
*** 6189,6210 **** StartupXLOG(void)
ShutdownRecoveryTransactionEnvironment();
/* Shut down readFile facility, free space */
! if (readFile >= 0)
! {
! close(readFile);
! readFile = -1;
! }
! if (readBuf)
{
! free(readBuf);
! readBuf = NULL;
! }
! if (readRecordBuf)
! {
! free(readRecordBuf);
! readRecordBuf = NULL;
! readRecordBufSize = 0;
}
/*
* If any of the critical GUCs have changed, log them before we allow
--- 5785,5799 ----
ShutdownRecoveryTransactionEnvironment();
/* Shut down readFile facility, free space */
! private = (XLogPageReadPrivate *) xlogreader->private_data;
! if (private->readFile >= 0)
{
! close(private->readFile);
! private->readFile = -1;
}
+ if (xlogreader->private_data)
+ free(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
/*
* If any of the critical GUCs have changed, log them before we allow
***************
*** 6235,6241 **** StartupXLOG(void)
* that it can start accepting read-only connections.
*/
static void
! CheckRecoveryConsistency(void)
{
/*
* During crash recovery, we don't reach a consistent state until we've
--- 5824,5830 ----
* that it can start accepting read-only connections.
*/
static void
! CheckRecoveryConsistency(XLogRecPtr EndRecPtr)
{
/*
* During crash recovery, we don't reach a consistent state until we've
***************
*** 6415,6421 **** LocalSetXLogInsertAllowed(void)
* 1 for "primary", 2 for "secondary", 0 for "other" (backup_label)
*/
static XLogRecord *
! ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt)
{
XLogRecord *record;
--- 6004,6010 ----
* 1 for "primary", 2 for "secondary", 0 for "other" (backup_label)
*/
static XLogRecord *
! ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int whichChkpt)
{
XLogRecord *record;
***************
*** 6439,6445 **** ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt)
return NULL;
}
! record = ReadRecord(&RecPtr, LOG, true);
if (record == NULL)
{
--- 6028,6034 ----
return NULL;
}
! record = ReadRecord(xlogreader, RecPtr, LOG, true);
if (record == NULL)
{
***************
*** 6667,6673 **** GetRecoveryTargetTLI(void)
{
/* use volatile pointer to prevent code rearrangement */
volatile XLogCtlData *xlogctl = XLogCtl;
! TimeLineID result;
SpinLockAcquire(&xlogctl->info_lck);
result = xlogctl->RecoveryTargetTLI;
--- 6256,6262 ----
{
/* use volatile pointer to prevent code rearrangement */
volatile XLogCtlData *xlogctl = XLogCtl;
! TimeLineID result;
SpinLockAcquire(&xlogctl->info_lck);
result = xlogctl->RecoveryTargetTLI;
***************
*** 6952,6958 **** CreateCheckPoint(int flags)
XLogRecPtr curInsert;
INSERT_RECPTR(curInsert, Insert, Insert->curridx);
! if (curInsert == ControlFile->checkPoint +
MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) &&
ControlFile->checkPoint == ControlFile->checkPointCopy.redo)
{
--- 6541,6547 ----
XLogRecPtr curInsert;
INSERT_RECPTR(curInsert, Insert, Insert->curridx);
! if (curInsert == ControlFile->checkPoint +
MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) &&
ControlFile->checkPoint == ControlFile->checkPointCopy.redo)
{
***************
*** 7382,7388 **** CreateRestartPoint(int flags)
{
ereport(DEBUG2,
(errmsg("skipping restartpoint, already performed at %X/%X",
! (uint32) (lastCheckPoint.redo >> 32), (uint32) lastCheckPoint.redo)));
UpdateMinRecoveryPoint(InvalidXLogRecPtr, true);
if (flags & CHECKPOINT_IS_SHUTDOWN)
--- 6971,6977 ----
{
ereport(DEBUG2,
(errmsg("skipping restartpoint, already performed at %X/%X",
! (uint32) (lastCheckPoint.redo >> 32), (uint32) lastCheckPoint.redo)));
UpdateMinRecoveryPoint(InvalidXLogRecPtr, true);
if (flags & CHECKPOINT_IS_SHUTDOWN)
***************
*** 7492,7498 **** CreateRestartPoint(int flags)
xtime = GetLatestXTime();
ereport((log_checkpoints ? LOG : DEBUG2),
(errmsg("recovery restart point at %X/%X",
! (uint32) (lastCheckPoint.redo >> 32), (uint32) lastCheckPoint.redo),
xtime ? errdetail("last completed transaction was at log time %s",
timestamptz_to_str(xtime)) : 0));
--- 7081,7087 ----
xtime = GetLatestXTime();
ereport((log_checkpoints ? LOG : DEBUG2),
(errmsg("recovery restart point at %X/%X",
! (uint32) (lastCheckPoint.redo >> 32), (uint32) lastCheckPoint.redo),
xtime ? errdetail("last completed transaction was at log time %s",
timestamptz_to_str(xtime)) : 0));
***************
*** 8017,8023 **** xlog_desc(StringInfo buf, uint8 xl_info, char *rec)
appendStringInfo(buf, "checkpoint: redo %X/%X; "
"tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
"oldest xid %u in DB %u; oldest running xid %u; %s",
! (uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
checkpoint->ThisTimeLineID,
checkpoint->fullPageWrites ? "true" : "false",
checkpoint->nextXidEpoch, checkpoint->nextXid,
--- 7606,7612 ----
appendStringInfo(buf, "checkpoint: redo %X/%X; "
"tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
"oldest xid %u in DB %u; oldest running xid %u; %s",
! (uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
checkpoint->ThisTimeLineID,
checkpoint->fullPageWrites ? "true" : "false",
checkpoint->nextXidEpoch, checkpoint->nextXid,
***************
*** 8198,8204 **** assign_xlog_sync_method(int new_sync_method, void *extra)
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not fsync log segment %s: %m",
! XLogFileNameP(ThisTimeLineID, openLogSegNo))));
if (get_sync_bit(sync_method) != get_sync_bit(new_sync_method))
XLogFileClose();
}
--- 7787,7793 ----
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not fsync log segment %s: %m",
! XLogFileNameP(ThisTimeLineID, openLogSegNo))));
if (get_sync_bit(sync_method) != get_sync_bit(new_sync_method))
XLogFileClose();
}
***************
*** 8229,8236 **** issue_xlog_fsync(int fd, XLogSegNo segno)
if (pg_fsync_writethrough(fd) != 0)
ereport(PANIC,
(errcode_for_file_access(),
! errmsg("could not fsync write-through log file %s: %m",
! XLogFileNameP(ThisTimeLineID, segno))));
break;
#endif
#ifdef HAVE_FDATASYNC
--- 7818,7825 ----
if (pg_fsync_writethrough(fd) != 0)
ereport(PANIC,
(errcode_for_file_access(),
! errmsg("could not fsync write-through log file %s: %m",
! XLogFileNameP(ThisTimeLineID, segno))));
break;
#endif
#ifdef HAVE_FDATASYNC
***************
*** 8259,8264 **** char *
--- 7848,7854 ----
XLogFileNameP(TimeLineID tli, XLogSegNo segno)
{
char *result = palloc(MAXFNAMELEN);
+
XLogFileName(result, tli, segno);
return result;
}
***************
*** 8504,8512 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
"%Y-%m-%d %H:%M:%S %Z",
pg_localtime(&stamp_time, log_timezone));
appendStringInfo(&labelfbuf, "START WAL LOCATION: %X/%X (file %s)\n",
! (uint32) (startpoint >> 32), (uint32) startpoint, xlogfilename);
appendStringInfo(&labelfbuf, "CHECKPOINT LOCATION: %X/%X\n",
! (uint32) (checkpointloc >> 32), (uint32) checkpointloc);
appendStringInfo(&labelfbuf, "BACKUP METHOD: %s\n",
exclusive ? "pg_start_backup" : "streamed");
appendStringInfo(&labelfbuf, "BACKUP FROM: %s\n",
--- 8094,8102 ----
"%Y-%m-%d %H:%M:%S %Z",
pg_localtime(&stamp_time, log_timezone));
appendStringInfo(&labelfbuf, "START WAL LOCATION: %X/%X (file %s)\n",
! (uint32) (startpoint >> 32), (uint32) startpoint, xlogfilename);
appendStringInfo(&labelfbuf, "CHECKPOINT LOCATION: %X/%X\n",
! (uint32) (checkpointloc >> 32), (uint32) checkpointloc);
appendStringInfo(&labelfbuf, "BACKUP METHOD: %s\n",
exclusive ? "pg_start_backup" : "streamed");
appendStringInfo(&labelfbuf, "BACKUP FROM: %s\n",
***************
*** 8854,8860 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
errmsg("could not create file \"%s\": %m",
histfilepath)));
fprintf(fp, "START WAL LOCATION: %X/%X (file %s)\n",
! (uint32) (startpoint >> 32), (uint32) startpoint, startxlogfilename);
fprintf(fp, "STOP WAL LOCATION: %X/%X (file %s)\n",
(uint32) (stoppoint >> 32), (uint32) stoppoint, stopxlogfilename);
/* transfer remaining lines from label to history file */
--- 8444,8450 ----
errmsg("could not create file \"%s\": %m",
histfilepath)));
fprintf(fp, "START WAL LOCATION: %X/%X (file %s)\n",
! (uint32) (startpoint >> 32), (uint32) startpoint, startxlogfilename);
fprintf(fp, "STOP WAL LOCATION: %X/%X (file %s)\n",
(uint32) (stoppoint >> 32), (uint32) stoppoint, stopxlogfilename);
/* transfer remaining lines from label to history file */
***************
*** 9262,9288 **** CancelBackup(void)
* sleep and retry.
*/
static bool
! XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
! bool randAccess)
{
uint32 targetPageOff;
uint32 targetRecOff;
XLogSegNo targetSegNo;
! XLByteToSeg(*RecPtr, targetSegNo);
! targetPageOff = (((*RecPtr) % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
! targetRecOff = (*RecPtr) % XLOG_BLCKSZ;
/* Fast exit if we have read the record in the current buffer already */
! if (failedSources == 0 && targetSegNo == readSegNo &&
! targetPageOff == readOff && targetRecOff < readLen)
return true;
/*
* See if we need to switch to a new segment because the requested record
* is not in the currently open one.
*/
! if (readFile >= 0 && !XLByteInSeg(*RecPtr, readSegNo))
{
/*
* Request a restartpoint if we've replayed too much xlog since the
--- 8852,8879 ----
* sleep and retry.
*/
static bool
! XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
! bool randAccess, char *readBuf, void *private_data)
{
+ XLogPageReadPrivate *private = (XLogPageReadPrivate *) private_data;
uint32 targetPageOff;
uint32 targetRecOff;
XLogSegNo targetSegNo;
! XLByteToSeg(RecPtr, targetSegNo);
! targetPageOff = ((RecPtr % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
! targetRecOff = RecPtr % XLOG_BLCKSZ;
/* Fast exit if we have read the record in the current buffer already */
! if (private->failedSources == 0 && targetSegNo == private->readSegNo &&
! targetPageOff == private->readOff && targetRecOff < private->readLen)
return true;
/*
* See if we need to switch to a new segment because the requested record
* is not in the currently open one.
*/
! if (private->readFile >= 0 && !XLByteInSeg(RecPtr, private->readSegNo))
{
/*
* Request a restartpoint if we've replayed too much xlog since the
***************
*** 9290,9325 **** XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
*/
if (StandbyMode && bgwriterLaunched)
{
! if (XLogCheckpointNeeded(readSegNo))
{
(void) GetRedoRecPtr();
! if (XLogCheckpointNeeded(readSegNo))
RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
}
}
! close(readFile);
! readFile = -1;
! readSource = 0;
}
! XLByteToSeg(*RecPtr, readSegNo);
retry:
/* See if we need to retrieve more data */
! if (readFile < 0 ||
! (readSource == XLOG_FROM_STREAM && !XLByteLT(*RecPtr, receivedUpto)))
{
if (StandbyMode)
{
! if (!WaitForWALToBecomeAvailable(*RecPtr, randAccess,
! fetching_ckpt))
goto triggered;
}
else
{
/* In archive or crash recovery. */
! if (readFile < 0)
{
int sources;
--- 8881,8916 ----
*/
if (StandbyMode && bgwriterLaunched)
{
! if (XLogCheckpointNeeded(private->readSegNo))
{
(void) GetRedoRecPtr();
! if (XLogCheckpointNeeded(private->readSegNo))
RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
}
}
! close(private->readFile);
! private->readFile = -1;
! private->readSource = 0;
}
! XLByteToSeg(RecPtr, private->readSegNo);
retry:
/* See if we need to retrieve more data */
! if (private->readFile < 0 ||
! (private->readSource == XLOG_FROM_STREAM &&
! !XLByteLT(RecPtr, receivedUpto)))
{
if (StandbyMode)
{
! if (!WaitForWALToBecomeAvailable(private, RecPtr, randAccess))
goto triggered;
}
else
{
/* In archive or crash recovery. */
! if (private->readFile < 0)
{
int sources;
***************
*** 9331,9338 **** retry:
if (InArchiveRecovery)
sources |= XLOG_FROM_ARCHIVE;
! readFile = XLogFileReadAnyTLI(readSegNo, emode, sources);
! if (readFile < 0)
return false;
}
}
--- 8922,8931 ----
if (InArchiveRecovery)
sources |= XLOG_FROM_ARCHIVE;
! private->readFile =
! XLogFileReadAnyTLI(private, private->readSegNo, emode,
! sources);
! if (private->readFile < 0)
return false;
}
}
***************
*** 9342,9348 **** retry:
* At this point, we have the right segment open and if we're streaming we
* know the requested record is in it.
*/
! Assert(readFile != -1);
/*
* If the current segment is being streamed from master, calculate how
--- 8935,8941 ----
* At this point, we have the right segment open and if we're streaming we
* know the requested record is in it.
*/
! Assert(private->readFile != -1);
/*
* If the current segment is being streamed from master, calculate how
***************
*** 9350,9368 **** retry:
* requested record has been received, but this is for the benefit of
* future calls, to allow quick exit at the top of this function.
*/
! if (readSource == XLOG_FROM_STREAM)
{
! if (((*RecPtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
{
! readLen = XLOG_BLCKSZ;
}
else
! readLen = receivedUpto % XLogSegSize - targetPageOff;
}
else
! readLen = XLOG_BLCKSZ;
! if (!readFileHeaderValidated && targetPageOff != 0)
{
/*
* Whenever switching to a new WAL segment, we read the first page of
--- 8943,8961 ----
* requested record has been received, but this is for the benefit of
* future calls, to allow quick exit at the top of this function.
*/
! if (private->readSource == XLOG_FROM_STREAM)
{
! if (((RecPtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
{
! private->readLen = XLOG_BLCKSZ;
}
else
! private->readLen = receivedUpto % XLogSegSize - targetPageOff;
}
else
! private->readLen = XLOG_BLCKSZ;
! if (!private->readFileHeaderValidated && targetPageOff != 0)
{
/*
* Whenever switching to a new WAL segment, we read the first page of
***************
*** 9371,9432 **** retry:
* identification info that is present in the first page's "long"
* header.
*/
! readOff = 0;
! if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
{
! char fname[MAXFNAMELEN];
! XLogFileName(fname, curFileTLI, readSegNo);
! ereport(emode_for_corrupt_record(emode, *RecPtr),
(errcode_for_file_access(),
! errmsg("could not read from log segment %s, offset %u: %m",
! fname, readOff)));
goto next_record_is_invalid;
}
! if (!ValidXLogPageHeader((XLogPageHeader) readBuf, emode))
goto next_record_is_invalid;
}
/* Read the requested page */
! readOff = targetPageOff;
! if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
{
! char fname[MAXFNAMELEN];
! XLogFileName(fname, curFileTLI, readSegNo);
! ereport(emode_for_corrupt_record(emode, *RecPtr),
(errcode_for_file_access(),
! errmsg("could not seek in log segment %s to offset %u: %m",
! fname, readOff)));
goto next_record_is_invalid;
}
! if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
{
! char fname[MAXFNAMELEN];
! XLogFileName(fname, curFileTLI, readSegNo);
! ereport(emode_for_corrupt_record(emode, *RecPtr),
(errcode_for_file_access(),
! errmsg("could not read from log segment %s, offset %u: %m",
! fname, readOff)));
goto next_record_is_invalid;
}
! if (!ValidXLogPageHeader((XLogPageHeader) readBuf, emode))
goto next_record_is_invalid;
! readFileHeaderValidated = true;
! Assert(targetSegNo == readSegNo);
! Assert(targetPageOff == readOff);
! Assert(targetRecOff < readLen);
return true;
next_record_is_invalid:
! failedSources |= readSource;
! if (readFile >= 0)
! close(readFile);
! readFile = -1;
! readLen = 0;
! readSource = 0;
/* In standby-mode, keep trying */
if (StandbyMode)
--- 8964,9032 ----
* identification info that is present in the first page's "long"
* header.
*/
! private->readOff = 0;
! if (read(private->readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
{
! char fname[MAXFNAMELEN];
!
! XLogFileName(fname, curFileTLI, private->readSegNo);
! ereport(emode_for_corrupt_record(emode, private->readSource, RecPtr),
(errcode_for_file_access(),
! errmsg("could not read from log segment %s, offset %u: %m",
! fname, private->readOff)));
goto next_record_is_invalid;
}
! if (!ValidXLogPageHeader(private->readSegNo, private->readOff,
! private->readSource, (XLogPageHeader) readBuf,
! emode))
goto next_record_is_invalid;
}
/* Read the requested page */
! private->readOff = targetPageOff;
! if (lseek(private->readFile, (off_t) private->readOff, SEEK_SET) < 0)
{
! char fname[MAXFNAMELEN];
!
! XLogFileName(fname, curFileTLI, private->readSegNo);
! ereport(emode_for_corrupt_record(emode, private->readSource, RecPtr),
(errcode_for_file_access(),
! errmsg("could not seek in log segment %s to offset %u: %m",
! fname, private->readOff)));
goto next_record_is_invalid;
}
! if (read(private->readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
{
! char fname[MAXFNAMELEN];
!
! XLogFileName(fname, curFileTLI, private->readSegNo);
! ereport(emode_for_corrupt_record(emode, private->readSource, RecPtr),
(errcode_for_file_access(),
! errmsg("could not read from log segment %s, offset %u: %m",
! fname, private->readOff)));
goto next_record_is_invalid;
}
! if (!ValidXLogPageHeader(private->readSegNo, private->readOff,
! private->readSource, (XLogPageHeader) readBuf,
! emode))
goto next_record_is_invalid;
! private->readFileHeaderValidated = true;
! Assert(targetSegNo == private->readSegNo);
! Assert(targetPageOff == private->readOff);
! Assert(targetRecOff < private->readLen);
return true;
next_record_is_invalid:
! private->failedSources |= private->readSource;
! if (private->readFile >= 0)
! close(private->readFile);
! private->readFile = -1;
! private->readLen = 0;
! private->readSource = 0;
/* In standby-mode, keep trying */
if (StandbyMode)
***************
*** 9435,9445 **** next_record_is_invalid:
return false;
triggered:
! if (readFile >= 0)
! close(readFile);
! readFile = -1;
! readLen = 0;
! readSource = 0;
return false;
}
--- 9035,9045 ----
return false;
triggered:
! if (private->readFile >= 0)
! close(private->readFile);
! private->readFile = -1;
! private->readLen = 0;
! private->readSource = 0;
return false;
}
***************
*** 9456,9463 **** triggered:
* false.
*/
static bool
! WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
! bool fetching_ckpt)
{
static pg_time_t last_fail_time = 0;
--- 9056,9063 ----
* false.
*/
static bool
! WaitForWALToBecomeAvailable(XLogPageReadPrivate *private, XLogRecPtr RecPtr,
! bool randAccess)
{
static pg_time_t last_fail_time = 0;
***************
*** 9476,9482 **** WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* the archive should be identical to what was streamed, so it's
* unlikely that it helps, but one can hope...
*/
! if (failedSources & XLOG_FROM_STREAM)
{
ShutdownWalRcv();
continue;
--- 9076,9082 ----
* the archive should be identical to what was streamed, so it's
* unlikely that it helps, but one can hope...
*/
! if (private->failedSources & XLOG_FROM_STREAM)
{
ShutdownWalRcv();
continue;
***************
*** 9515,9535 **** WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (havedata)
{
/*
! * Great, streamed far enough. Open the file if it's not open
* already. Use XLOG_FROM_STREAM so that source info is set
* correctly and XLogReceiptTime isn't changed.
*/
! if (readFile < 0)
{
! readFile = XLogFileRead(readSegNo, PANIC,
! recoveryTargetTLI,
! XLOG_FROM_STREAM, false);
! Assert(readFile >= 0);
}
else
{
/* just make sure source info is correct... */
! readSource = XLOG_FROM_STREAM;
XLogReceiptSource = XLOG_FROM_STREAM;
}
break;
--- 9115,9136 ----
if (havedata)
{
/*
! * Great, streamed far enough. Open the file if it's not open
* already. Use XLOG_FROM_STREAM so that source info is set
* correctly and XLogReceiptTime isn't changed.
*/
! if (private->readFile < 0)
{
! private->readFile =
! XLogFileRead(private, private->readSegNo, PANIC,
! recoveryTargetTLI,
! XLOG_FROM_STREAM, false);
! Assert(private->readFile >= 0);
}
else
{
/* just make sure source info is correct... */
! private->readSource = XLOG_FROM_STREAM;
XLogReceiptSource = XLOG_FROM_STREAM;
}
break;
***************
*** 9558,9567 **** WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
int sources;
pg_time_t now;
! if (readFile >= 0)
{
! close(readFile);
! readFile = -1;
}
/* Reset curFileTLI if random fetch. */
if (randAccess)
--- 9159,9168 ----
int sources;
pg_time_t now;
! if (private->readFile >= 0)
{
! close(private->readFile);
! private->readFile = -1;
}
/* Reset curFileTLI if random fetch. */
if (randAccess)
***************
*** 9572,9583 **** WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* from pg_xlog.
*/
sources = XLOG_FROM_ARCHIVE | XLOG_FROM_PG_XLOG;
! if (!(sources & ~failedSources))
{
/*
* We've exhausted all options for retrieving the file. Retry.
*/
! failedSources = 0;
/*
* Before we sleep, re-scan for possible new timelines if we
--- 9173,9184 ----
* from pg_xlog.
*/
sources = XLOG_FROM_ARCHIVE | XLOG_FROM_PG_XLOG;
! if (!(sources & ~private->failedSources))
{
/*
* We've exhausted all options for retrieving the file. Retry.
*/
! private->failedSources = 0;
/*
* Before we sleep, re-scan for possible new timelines if we
***************
*** 9606,9635 **** WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* stream the missing WAL, before retrying to restore from
* archive/pg_xlog.
*
! * If fetching_ckpt is TRUE, RecPtr points to the initial
! * checkpoint location. In that case, we use RedoStartLSN as
! * the streaming start position instead of RecPtr, so that
! * when we later jump backwards to start redo at RedoStartLSN,
! * we will have the logs streamed already.
*/
if (PrimaryConnInfo)
{
! XLogRecPtr ptr = fetching_ckpt ? RedoStartLSN : RecPtr;
RequestXLogStreaming(ptr, PrimaryConnInfo);
continue;
}
}
/* Don't try to read from a source that just failed */
! sources &= ~failedSources;
! readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, sources);
! if (readFile >= 0)
break;
/*
* Nope, not found in archive and/or pg_xlog.
*/
! failedSources |= sources;
/*
* Check to see if the trigger file exists. Note that we do this
--- 9207,9238 ----
* stream the missing WAL, before retrying to restore from
* archive/pg_xlog.
*
! * If we're fetching a checkpoint record, RecPtr points to the
! * initial checkpoint location. In that case, we use
! * RedoStartLSN as the streaming start position instead of
! * RecPtr, so that when we later jump backwards to start redo
! * at RedoStartLSN, we will have the logs streamed already.
*/
if (PrimaryConnInfo)
{
! XLogRecPtr ptr = private->fetching_ckpt ?
! RedoStartLSN : RecPtr;
RequestXLogStreaming(ptr, PrimaryConnInfo);
continue;
}
}
/* Don't try to read from a source that just failed */
! sources &= ~private->failedSources;
! private->readFile = XLogFileReadAnyTLI(private, private->readSegNo,
! DEBUG2, sources);
! if (private->readFile >= 0)
break;
/*
* Nope, not found in archive and/or pg_xlog.
*/
! private->failedSources |= sources;
/*
* Check to see if the trigger file exists. Note that we do this
***************
*** 9669,9680 **** WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* you are about to ereport(), or you might cause a later message to be
* erroneously suppressed.
*/
! static int
! emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
{
static XLogRecPtr lastComplaint = 0;
! if (readSource == XLOG_FROM_PG_XLOG && emode == LOG)
{
if (XLByteEQ(RecPtr, lastComplaint))
emode = DEBUG1;
--- 9272,9283 ----
* you are about to ereport(), or you might cause a later message to be
* erroneously suppressed.
*/
! int
! emode_for_corrupt_record(int emode, int source, XLogRecPtr RecPtr)
{
static XLogRecPtr lastComplaint = 0;
! if (source == XLOG_FROM_PG_XLOG && emode == LOG)
{
if (XLByteEQ(RecPtr, lastComplaint))
emode = DEBUG1;
*** /dev/null
--- b/src/backend/access/transam/xlogreader.c
***************
*** 0 ****
--- 1,532 ----
+ /*-------------------------------------------------------------------------
+ *
+ * xlogreader.c
+ * Generic xlog reading facility
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogreader.c
+ *
+ * NOTES
+ * Documentation about how to use this interface can be found in
+ * xlogreader.h, more specifically in the definition of the
+ * XLogReaderState struct where all parameters are documented.
+ *
+ * TODO:
+ * * usable without backend code around
+ *-------------------------------------------------------------------------
+ */
+
+ #include "postgres.h"
+
+ #include "access/transam.h"
+ #include "access/xlog_internal.h"
+ #include "access/xlogreader.h"
+ #include "catalog/pg_control.h"
+
+ static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
+ static bool ValidXLogRecordHeader(XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr,
+ XLogRecord *record, int emode, bool randAccess);
+ static bool RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode);
+
+ /*
+ * Initialize a new xlog reader
+ */
+ XLogReaderState *
+ XLogReaderAllocate(XLogRecPtr startpoint,
+ XLogPageReadCB pagereadfunc, void *private_data)
+ {
+ XLogReaderState *state;
+
+ state = (XLogReaderState *) palloc0(sizeof(XLogReaderState));
+
+ /*
+ * Permanently allocate readBuf. We do it this way, rather than just
+ * making a static array, for two reasons: (1) no need to waste the
+ * storage in most instantiations of the backend; (2) a static char array
+ * isn't guaranteed to have any particular alignment, whereas malloc()
+ * will provide MAXALIGN'd storage.
+ */
+ state->readBuf = (char *) malloc(XLOG_BLCKSZ);
+
+ state->read_page = pagereadfunc;
+ state->private_data = private_data;
+ state->EndRecPtr = startpoint;
+
+ /*
+ * Allocate an initial readRecordBuf of minimal size, which can later be
+ * enlarged if necessary.
+ */
+ if (!allocate_recordbuf(state, 0))
+ {
+ free(state->readBuf);
+ pfree(state);
+ return NULL;
+ }
+
+ return state;
+ }
+
+ void
+ XLogReaderFree(XLogReaderState *state)
+ {
+ if (state->readRecordBuf)
+ free(state->readRecordBuf);
+ free(state->readBuf);
+ pfree(state);
+ }
+
+ /*
+ * Allocate readRecordBuf to fit a record of at least the given length.
+ * Returns true if successful, false if out of memory.
+ *
+ * readRecordBufSize is set to the new buffer size.
+ *
+ * To avoid useless small increases, round its size to a multiple of
+ * XLOG_BLCKSZ, and make sure it's at least 4*Max(BLCKSZ, XLOG_BLCKSZ) to start
+ * with. (That is enough for all "normal" records, but very large commit or
+ * abort records might need more space.)
+ */
+ static bool
+ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
+ {
+ uint32 newSize = reclength;
+
+ newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
+ newSize = Max(newSize, 4 * Max(BLCKSZ, XLOG_BLCKSZ));
+
+ if (state->readRecordBuf)
+ free(state->readRecordBuf);
+ state->readRecordBuf = (char *) malloc(newSize);
+ if (!state->readRecordBuf)
+ {
+ state->readRecordBufSize = 0;
+ return false;
+ }
+
+ state->readRecordBufSize = newSize;
+ return true;
+ }
+
+ /*
+ * Attempt to read an XLOG record.
+ *
+ * If RecPtr is valid, try to read a record at that position. Otherwise
+ * try to read a record just after the last one previously read.
+ *
+ * If no valid record is available, returns NULL, or fails if emode is PANIC.
+ * (emode must be either PANIC or LOG.)
+ *
+ * The record is copied into readRecordBuf, so that on successful return,
+ * the returned record pointer always points there.
+ */
+ XLogRecord *
+ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, int emode)
+ {
+ XLogRecord *record;
+ XLogRecPtr tmpRecPtr = state->EndRecPtr;
+ bool randAccess = false;
+ uint32 len,
+ total_len;
+ uint32 targetRecOff;
+ uint32 pageHeaderSize;
+ bool gotheader;
+
+ if (RecPtr == InvalidXLogRecPtr)
+ {
+ RecPtr = tmpRecPtr;
+
+ /*
+ * RecPtr is pointing to end+1 of the previous WAL record. If we're
+ * at a page boundary, no more records can fit on the current page. We
+ * must skip over the page header, but we can't do that until we've
+ * read in the page, since the header size is variable.
+ */
+ }
+ else
+ {
+ /*
+ * In this case, the passed-in record pointer should already be
+ * pointing to a valid record starting position.
+ */
+ if (!XRecOffIsValid(RecPtr))
+ ereport(PANIC,
+ (errmsg("invalid record offset at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ randAccess = true; /* allow curFileTLI to go backwards too */
+ }
+
+ /* Read the page containing the record */
+ if (!state->read_page(state, RecPtr, emode, randAccess, state->readBuf,
+ state->private_data))
+ return NULL;
+
+ pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
+ targetRecOff = RecPtr % XLOG_BLCKSZ;
+ if (targetRecOff == 0)
+ {
+ /*
+ * At page start, so skip over page header. The Assert checks that
+ * we're not scribbling on caller's record pointer; it's OK because we
+ * can only get here in the continuing-from-prev-record case, since
+ * XRecOffIsValid rejected the zero-page-offset case otherwise. XXX:
+ * does this assert make sense now that RecPtr is not a pointer?
+ */
+ Assert(RecPtr == tmpRecPtr);
+ RecPtr += pageHeaderSize;
+ targetRecOff = pageHeaderSize;
+ }
+ else if (targetRecOff < pageHeaderSize)
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("invalid record offset at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return NULL;
+ }
+ if ((((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
+ targetRecOff == pageHeaderSize)
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("contrecord is requested by %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return NULL;
+ }
+
+ /*
+ * Read the record length.
+ *
+ * NB: Even though we use an XLogRecord pointer here, the whole record
+ * header might not fit on this page. xl_tot_len is the first field of the
+ * struct, so it must be on this page (the records are MAXALIGNed), but we
+ * cannot access any other fields until we've verified that we got the
+ * whole header.
+ */
+ record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+ total_len = record->xl_tot_len;
+
+ /*
+ * If the whole record header is on this page, validate it immediately.
+ * Otherwise do just a basic sanity check on xl_tot_len, and validate the
+ * rest of the header after reading it from the next page. The xl_tot_len
+ * check is necessary here to ensure that we enter the "Need to reassemble
+ * record" code path below; otherwise we might fail to apply
+ * ValidXLogRecordHeader at all.
+ */
+ if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
+ {
+ if (!ValidXLogRecordHeader(RecPtr, state->ReadRecPtr, record, emode,
+ randAccess))
+ return NULL;
+ gotheader = true;
+ }
+ else
+ {
+ if (total_len < SizeOfXLogRecord)
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("invalid record length at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return NULL;
+ }
+ gotheader = false;
+ }
+
+ /*
+ * Enlarge readRecordBuf as needed.
+ */
+ if (total_len > state->readRecordBufSize &&
+ !allocate_recordbuf(state, total_len))
+ {
+ /* We treat this as a "bogus data" condition */
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("record length %u at %X/%X too long",
+ total_len, (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return NULL;
+ }
+
+ len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
+ if (total_len > len)
+ {
+ /* Need to reassemble record */
+ char *contrecord;
+ XLogPageHeader pageHeader;
+ XLogRecPtr pagelsn;
+ char *buffer;
+ uint32 gotlen;
+
+ /* Initialize pagelsn to the beginning of the page this record is on */
+ pagelsn = (RecPtr / XLOG_BLCKSZ) * XLOG_BLCKSZ;
+
+ /* Copy the first fragment of the record from the first page. */
+ memcpy(state->readRecordBuf,
+ state->readBuf + RecPtr % XLOG_BLCKSZ, len);
+ buffer = state->readRecordBuf + len;
+ gotlen = len;
+
+ do
+ {
+ /* Calculate pointer to beginning of next page */
+ XLByteAdvance(pagelsn, XLOG_BLCKSZ);
+ /* Wait for the next page to become available */
+ if (!state->read_page(state, pagelsn, emode, false, state->readBuf,
+ state->private_data))
+ return NULL;
+
+ /* Check that the continuation on next page looks valid */
+ pageHeader = (XLogPageHeader) state->readBuf;
+ if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("there is no contrecord flag at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return NULL;
+ }
+
+ /*
+ * Cross-check that xlp_rem_len agrees with how much of the record
+ * we expect there to be left.
+ */
+ if (pageHeader->xlp_rem_len == 0 ||
+ total_len != (pageHeader->xlp_rem_len + gotlen))
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("invalid contrecord length %u at %X/%X",
+ pageHeader->xlp_rem_len,
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return NULL;
+ }
+
+ /* Append the continuation from this page to the buffer */
+ pageHeaderSize = XLogPageHeaderSize(pageHeader);
+ contrecord = (char *) state->readBuf + pageHeaderSize;
+ len = XLOG_BLCKSZ - pageHeaderSize;
+ if (pageHeader->xlp_rem_len < len)
+ len = pageHeader->xlp_rem_len;
+ memcpy(buffer, (char *) contrecord, len);
+ buffer += len;
+ gotlen += len;
+
+ /* If we just reassembled the record header, validate it. */
+ if (!gotheader)
+ {
+ record = (XLogRecord *) state->readRecordBuf;
+ if (!ValidXLogRecordHeader(RecPtr, state->ReadRecPtr, record,
+ emode, randAccess))
+ return NULL;
+ gotheader = true;
+ }
+ } while (pageHeader->xlp_rem_len > len);
+
+ record = (XLogRecord *) state->readRecordBuf;
+ if (!RecordIsValid(record, RecPtr, emode))
+ return NULL;
+ pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
+ state->ReadRecPtr = RecPtr;
+ state->EndRecPtr = pagelsn + pageHeaderSize + MAXALIGN(pageHeader->xlp_rem_len);
+ }
+ else
+ {
+ /* Record does not cross a page boundary */
+ if (!RecordIsValid(record, RecPtr, emode))
+ return NULL;
+ state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+
+ state->ReadRecPtr = RecPtr;
+ memcpy(state->readRecordBuf, record, total_len);
+ }
+
+ /*
+ * Special processing if it's an XLOG SWITCH record
+ */
+ if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
+ {
+ /* Pretend it extends to end of segment */
+ state->EndRecPtr += XLogSegSize - 1;
+ state->EndRecPtr -= state->EndRecPtr % XLogSegSize;
+ }
+
+ return record;
+ }
+
+ /*
+ * Validate an XLOG record header.
+ *
+ * This is just a convenience subroutine to avoid duplicated code in
+ * XLogReadRecord. It's not intended for use from anywhere else.
+ */
+ static bool
+ ValidXLogRecordHeader(XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr,
+ XLogRecord *record, int emode, bool randAccess)
+ {
+ /*
+ * xl_len == 0 is bad data for everything except XLOG SWITCH, where it is
+ * required.
+ */
+ if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
+ {
+ if (record->xl_len != 0)
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("invalid xlog switch record at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ }
+ else if (record->xl_len == 0)
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("record with zero length at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
+ record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
+ XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("invalid record length at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ if (record->xl_rmid > RM_MAX_ID)
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("invalid resource manager ID %u at %X/%X",
+ record->xl_rmid, (uint32) (RecPtr >> 32),
+ (uint32) RecPtr)));
+ return false;
+ }
+ if (randAccess)
+ {
+ /*
+ * We can't exactly verify the prev-link, but surely it should be less
+ * than the record's own address.
+ */
+ if (!XLByteLT(record->xl_prev, RecPtr))
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("record with incorrect prev-link %X/%X at %X/%X",
+ (uint32) (record->xl_prev >> 32),
+ (uint32) record->xl_prev,
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ }
+ else
+ {
+ /*
+ * Record's prev-link should exactly match our previous location. This
+ * check guards against torn WAL pages where a stale but valid-looking
+ * WAL record starts on a sector boundary.
+ */
+ if (!XLByteEQ(record->xl_prev, PrevRecPtr))
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("record with incorrect prev-link %X/%X at %X/%X",
+ (uint32) (record->xl_prev >> 32),
+ (uint32) record->xl_prev,
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ }
+
+ return true;
+ }
+
+
+ /*
+ * CRC-check an XLOG record. We do not believe the contents of an XLOG
+ * record (other than to the minimal extent of computing the amount of
+ * data to read in) until we've checked the CRCs.
+ *
+ * We assume all of the record (that is, xl_tot_len bytes) has been read
+ * into memory at *record. Also, ValidXLogRecordHeader() has accepted the
+ * record's header, which means in particular that xl_tot_len is at least
+ * SizeOfXLogRecord, so it is safe to fetch xl_len.
+ */
+ static bool
+ RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
+ {
+ pg_crc32 crc;
+ int i;
+ uint32 len = record->xl_len;
+ BkpBlock bkpb;
+ char *blk;
+ size_t remaining = record->xl_tot_len;
+
+ /* First the rmgr data */
+ if (remaining < SizeOfXLogRecord + len)
+ {
+ /* ValidXLogRecordHeader() should've caught this already... */
+ ereport(emode_for_corrupt_record(emode, 0, recptr),
+ (errmsg("invalid record length at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+ remaining -= SizeOfXLogRecord + len;
+ INIT_CRC32(crc);
+ COMP_CRC32(crc, XLogRecGetData(record), len);
+
+ /* Add in the backup blocks, if any */
+ blk = (char *) XLogRecGetData(record) + len;
+ for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
+ {
+ uint32 blen;
+
+ if (!(record->xl_info & XLR_BKP_BLOCK(i)))
+ continue;
+
+ if (remaining < sizeof(BkpBlock))
+ {
+ ereport(emode_for_corrupt_record(emode, 0, recptr),
+ (errmsg("invalid backup block size in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+ memcpy(&bkpb, blk, sizeof(BkpBlock));
+
+ if (bkpb.hole_offset + bkpb.hole_length > BLCKSZ)
+ {
+ ereport(emode_for_corrupt_record(emode, 0, recptr),
+ (errmsg("incorrect hole size in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+ blen = sizeof(BkpBlock) + BLCKSZ - bkpb.hole_length;
+
+ if (remaining < blen)
+ {
+ ereport(emode_for_corrupt_record(emode, 0, recptr),
+ (errmsg("invalid backup block size in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+ remaining -= blen;
+ COMP_CRC32(crc, blk, blen);
+ blk += blen;
+ }
+
+ /* Check that xl_tot_len agrees with our calculation */
+ if (remaining != 0)
+ {
+ ereport(emode_for_corrupt_record(emode, 0, recptr),
+ (errmsg("incorrect total length in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+
+ /* Finally include the record header */
+ COMP_CRC32(crc, (char *) record, offsetof(XLogRecord, xl_crc));
+ FIN_CRC32(crc);
+
+ if (!EQ_CRC32(record->xl_crc, crc))
+ {
+ ereport(emode_for_corrupt_record(emode, 0, recptr),
+ (errmsg("incorrect resource manager data checksum in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+
+ return true;
+ }
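The trickiest part of the code above is the loop that reassembles a record spanning several pages. As a rough, self-contained sketch of that technique (not the real code: PAGESZ, HDRSZ and the page layout are invented stand-ins for XLOG_BLCKSZ and XLogPageHeader, and all validation other than the xlp_rem_len cross-check is omitted):

```c
#include <stdint.h>
#include <string.h>

#define PAGESZ 32   /* stand-in for XLOG_BLCKSZ */
#define HDRSZ  4    /* stand-in page header: just a 4-byte rem_len field */

/*
 * Reassemble a record of total_len bytes that starts at byte offset
 * 'start' within pages[0].  Each continuation page begins with a 4-byte
 * remaining-length field, mirroring xlp_rem_len.  Returns 1 on success,
 * 0 if a continuation page disagrees about the remaining length.
 */
static int
reassemble(char pages[][PAGESZ], size_t start, size_t total_len, char *out)
{
	size_t		len = PAGESZ - start;	/* room left on the first page */
	size_t		got;
	int			pageno = 0;

	if (len > total_len)
		len = total_len;
	memcpy(out, pages[0] + start, len);
	got = len;

	while (got < total_len)
	{
		uint32_t	rem;

		pageno++;
		memcpy(&rem, pages[pageno], sizeof(rem));
		if (rem != total_len - got)		/* cross-check, like xlp_rem_len */
			return 0;
		len = PAGESZ - HDRSZ;			/* payload capacity of this page */
		if (rem < len)
			len = rem;
		memcpy(out + got, pages[pageno] + HDRSZ, len);
		got += len;
	}
	return 1;
}
```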
*** a/src/include/access/xlog_internal.h
--- b/src/include/access/xlog_internal.h
***************
*** 231,236 **** extern XLogRecPtr RequestXLogSwitch(void);
--- 231,244 ----
extern void GetOldestRestartPoint(XLogRecPtr *oldrecptr, TimeLineID *oldtli);
+
+ /*
+ * Exported so that xlogreader.c can call this. TODO: Should be refactored
+ * into a callback, or just have xlogreader return the error string and have
+ * the caller of XLogReadRecord() do the ereport() call.
+ */
+ extern int emode_for_corrupt_record(int emode, int readSource, XLogRecPtr RecPtr);
+
/*
* Exported for the functions in timeline.c and xlogarchive.c. Only valid
* in the startup process.
*** /dev/null
--- b/src/include/access/xlogreader.h
***************
*** 0 ****
--- 1,97 ----
+ /*-------------------------------------------------------------------------
+ *
+ * xlogreader.h
+ *
+ * Generic xlog reading facility.
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/xlogreader.h
+ *
+ * NOTES
+ * Check the definition of the XLogReaderState struct for instructions on
+ * how to use the XLogReader infrastructure.
+ *
+ * The basic idea is to allocate an XLogReaderState via
+ * XLogReaderAllocate, and call XLogReadRecord() until it returns NULL.
+ *-------------------------------------------------------------------------
+ */
+ #ifndef XLOGREADER_H
+ #define XLOGREADER_H
+
+ #include "access/xlog_internal.h"
+
+ struct XLogReaderState;
+
+ /*
+ * The callbacks are explained in more detail inside the XLogReaderState
+ * struct.
+ */
+ typedef bool (*XLogPageReadCB) (struct XLogReaderState *state,
+ XLogRecPtr RecPtr, int emode,
+ bool randAccess,
+ char *readBuf,
+ void *private_data);
+
+ typedef struct XLogReaderState
+ {
+ /* ----------------------------------------
+ * Public parameters
+ * ----------------------------------------
+ */
+
+ /*
+ * Data input callback (mandatory).
+ *
+ * This callback shall read XLOG_BLCKSZ bytes, from the location 'RecPtr',
+ * into the memory pointed to by the 'readBuf' parameter. The callback
+ * shall return true on success, false if the page could not be read.
+ */
+ XLogPageReadCB read_page;
+
+ /*
+ * Opaque data for callbacks to use. Not used by XLogReader.
+ */
+ void *private_data;
+
+ /*
+ * From where to where are we reading
+ */
+ XLogRecPtr ReadRecPtr; /* start of last record read */
+ XLogRecPtr EndRecPtr; /* end+1 of last record read */
+
+ /* ----------------------------------------
+ * private/internal state
+ * ----------------------------------------
+ */
+
+ /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
+ char *readBuf;
+
+ /* Buffer for current ReadRecord result (expandable) */
+ char *readRecordBuf;
+ uint32 readRecordBufSize;
+ } XLogReaderState;
+
+ /*
+ * Get a new XLogReader
+ *
+ * At least the read_page callback has to be set before the reader can
+ * be used.
+ */
+ extern XLogReaderState *XLogReaderAllocate(XLogRecPtr startpoint,
+ XLogPageReadCB pagereadfunc, void *private_data);
+
+ /*
+ * Free an XLogReader
+ */
+ extern void XLogReaderFree(XLogReaderState *state);
+
+ /*
+ * Read the next record from xlog. Returns NULL on end-of-WAL or on failure.
+ */
+ extern XLogRecord *XLogReadRecord(XLogReaderState *state, XLogRecPtr ptr,
+ int emode);
+
+ #endif /* XLOGREADER_H */
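As a concrete illustration of the intended usage pattern (a reader that owns a page buffer and pulls pages through a user-supplied callback, per the NOTES above), here is a minimal self-contained mock of the callback-driven design. All names here (MockReader, mem_read_page, MemWal, count_pages) are invented for illustration only; they are not the real xlogreader API:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MOCK_BLCKSZ 16

typedef struct MockReader MockReader;

/* Page-read callback type, in the spirit of XLogPageReadCB. */
typedef bool (*MockPageReadCB) (MockReader *state, uint64_t pageptr,
								char *readBuf, void *private_data);

struct MockReader
{
	MockPageReadCB read_page;	/* mandatory data-input callback */
	void	   *private_data;	/* opaque, for the callback's own use */
	char		readBuf[MOCK_BLCKSZ];	/* buffer for the current page */
};

/* Serve pages from an in-memory "WAL"; fail past the end, like end-of-WAL. */
typedef struct
{
	const char *data;
	size_t		len;
} MemWal;

static bool
mem_read_page(MockReader *state, uint64_t pageptr, char *readBuf,
			  void *private_data)
{
	MemWal	   *wal = (MemWal *) private_data;

	(void) state;
	if (pageptr + MOCK_BLCKSZ > wal->len)
		return false;
	memcpy(readBuf, wal->data + pageptr, MOCK_BLCKSZ);
	return true;
}

/* Read loop in the style of XLogReadRecord(): pull consecutive pages
 * through the callback until it reports there is nothing more to read. */
static int
count_pages(MockReader *state)
{
	int			n = 0;
	uint64_t	ptr = 0;

	while (state->read_page(state, ptr, state->readBuf, state->private_data))
	{
		n++;
		ptr += MOCK_BLCKSZ;
	}
	return n;
}
```

The point of the indirection is that the same read loop works no matter where the pages come from: a file, a network stream, or (as here) a plain memory buffer handed in via private_data.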
Do you have a git repository or something where all the 14 patches are
applied? I would like to test the feature globally.
Sorry, I recall that you put a link somewhere but I cannot remember in which
email...
On Thu, Nov 15, 2012 at 6:34 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On Thursday, November 15, 2012 05:08:26 AM Michael Paquier wrote:
Looks like cool stuff @-@
I might be interested in looking at that a bit, as I think I will hopefully
be able to grab some time in the next couple of weeks.
Are some of those patches already submitted to a CF?
I added the patchset as one entry to the CF this time; it seems to me they
are too hard to judge individually to make them really separately
reviewable. I can split it off there, but really all the complicated stuff
is in one patch anyway...
Greetings,
Andres
--
Michael Paquier
http://michael.otacoo.com
Hi,
This patch looks OK.
I got 3 comments:
1) Why changing the OID of pg_class_tblspc_relfilenode_index from 3171 to
3455? It does not look necessary.
2) You should perhaps change the header comment of RelationMapFilenodeToOid so as
not to describe it as the opposite operation of RelationMapOidToFilenode,
but as an operation that looks up the OID of a relation based on its
relfilenode. The two functions are opposites, but independent.
3) Both functions are doing similar operations. Could it be possible to
wrap them in the same central function?
On Thu, Nov 15, 2012 at 10:17 AM, Andres Freund <andres@2ndquadrant.com>wrote:
---
src/backend/utils/cache/relmapper.c | 53
+++++++++++++++++++++++++++++++++++++
src/include/catalog/indexing.h | 4 +--
src/include/utils/relmapper.h | 2 ++
3 files changed, 57 insertions(+), 2 deletions(-)--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
--
Michael Paquier
http://michael.otacoo.com
On 16/11/2012 05:34, Michael Paquier wrote:
Do you have a git repository or something where all the 14 patches are applied? I would like to test the feature globally.
Sorry I recall that you put a link somewhere but I cannot remember its email...
http://archives.postgresql.org/pgsql-hackers/2012-11/msg00686.php
Andrea
Hi,
On 2012-11-16 13:44:45 +0900, Michael Paquier wrote:
This patch looks OK.
I got 3 comments:
1) Why changing the OID of pg_class_tblspc_relfilenode_index from 3171 to
3455? It does not look necessary.
It's a mismerge and should have happened in "Add a new RELFILENODE
syscache to fetch a pg_class entry via (reltablespace, relfilenode)" but
it seems I squashed the wrong two commits.
I had originally used 3171 but that since got used up for lo_tell64...
2) You should perhaps change the header of RelationMapFilenodeToOid so as
not mentionning it as the opposite operation of RelationMapOidToFilenode
but as an operation that looks for the OID of a relation based on its
relfilenode. Both functions are opposite but independent.
I described it as the opposite because RelationMapOidToFilenode is the
relmapper's stated goal and RelationMapFilenodeToOid is just some
side-business.
3) Both functions are doing similar operations. Could it be possible
to wrap them in the same central function?
I don't really see how without making both quite a bit more
complicated. The amount of if's needed seems to be too large to me.
Thanks,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres,
On 11/15/2012 01:27 AM, Andres Freund wrote:
In response to this you will soon find the 14 patches that currently
implement $subject.
Congratulations on that piece of work.
I'd like to provide a comparison of the proposed change set format to
the one used in Postgres-R. I hope for this comparison to shed some
light on the similarities and differences of the two projects. As the
author of Postgres-R, I'm obviously biased, but I try to be as neutral
as I can.
Let's start with the representation: I so far considered the Postgres-R
change set format to be an implementation detail and I don't intend it
to be readable by humans or third party tools. It's thus binary only and
doesn't offer a textual representation. The approach presented here
seems to target different formats for different audiences, including
binary representations. More general, less specific.
Next, contents: this proposal is more verbose. In the textual
representation shown, it provides (usually redundant) information about
attribute names and types. Postgres-R doesn't ever transmit attribute
name or type information for INSERT, UPDATE or DELETE operations.
Instead, it relies on attribute numbers and pg_attributes being at some
known consistent state.
Let's compare by example:
table "replication_example": INSERT: id[int4]:1 somedata[int4]:1 text[varchar]:1
table "replication_example": UPDATE: id[int4]:1 somedata[int4]:-1 text[varchar]:1
table "replication_example": DELETE (pkey): id[int4]:1
In Postgres-R, the change sets for these same operations would carry the
following information, in a binary representation:
table "replication_example": INSERT: VALUES (1, 1, '1')
table "replication_example": UPDATE: PKEY(1) COID(77) MODS('nyn') VALUES(-1)
table "replication_example": DELETE: PKEY(1) COID(78)
You may have noticed that there's an additional COID field. This is an
identifier for the transaction that last changed this tuple. Together
with the primary key, it effectively identifies the exact version of a
tuple (during its lifetime, for example before vs after an UPDATE). This
in turn is used by Postgres-R to detect conflicts.
It may be possible to add that to the proposed format as well, for it to
be able to implement a Postgres-R-like algorithm.
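As a toy illustration of the MODS('nyn') idea from the example above, such a mask can be computed by comparing old and new rows column by column. The fixed all-integer row layout below is an invented simplification for the three-column example table, not how Postgres-R actually encodes tuples:

```c
#define NCOLS 3		/* id, somedata, text in the example table */

/*
 * Fill mask[0..NCOLS-1] with 'y' where old and new differ and 'n' where
 * they match, NUL-terminating the mask.  Returns the number of changed
 * columns.
 */
static int
compute_mods(const int oldrow[NCOLS], const int newrow[NCOLS],
			 char mask[NCOLS + 1])
{
	int			changed = 0;
	int			i;

	for (i = 0; i < NCOLS; i++)
	{
		if (oldrow[i] != newrow[i])
		{
			mask[i] = 'y';
			changed++;
		}
		else
			mask[i] = 'n';
	}
	mask[NCOLS] = '\0';
	return changed;
}
```

For the UPDATE in the example (only somedata is negated), this yields exactly the 'nyn' mask shown, so only the one changed value needs to travel with the change set.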
To finish off this comparison, let's take a look at how and where the
change sets are generated: in Postgres-R the change set stream is
constructed directly from the heap modification routines, i.e. in
heapam.c's heap_{insert,update,delete}() methods, whereas the patches
proposed here parse the WAL to reconstruct the modifications and add the
required meta information.
To me, going via the WAL first sounded like a step that unnecessarily
complicates matters. I recently talked to Andres and brought that up.
Here's my current view of things:
The Postgres-R approach is independent of WAL and its format, whereas
the approach proposed here clearly is not. Either way, there is a
certain overhead - however minimal it is - which the former adds to the
transaction processing itself, while the latter postpones it to a
separate XLogReader process.
might reduce latency in case of asynchronous replication, but can only
increase latency in the synchronous case. As far as I understood Andres,
it was easier to collect the additional meta data from within the
separate process.
In summary, I'd say that Postgres-R is an approach specifically
targeting and optimized for multi-master replication between Postgres
nodes, whereas the proposed patches are kept more general.
I hope you found this to be an insightful and fair comparison.
Regards
Markus Wanner
Hi Markus,
On 2012-11-16 14:46:39 +0100, Markus Wanner wrote:
On 11/15/2012 01:27 AM, Andres Freund wrote:
In response to this you will soon find the 14 patches that currently
implement $subject.
Congratulations on that piece of work.
Thanks.
I'd like to provide a comparison of the proposed change set format to
the one used in Postgres-R.
Uh, sorry to interrupt you right here, but that's not the "proposed
format" ;) That's just an example output plugin that people wished
for. For the use-case we're after, we (as in 2ndQ) also want to use binary
data. It's also rather useful for debugging and such.
I generally agree that the presented format is too verbose for actual
replication, but it seems fine enough for showing off ;)
If you look at Patch 12/14 "Add a simple decoding module in contrib
named 'test_decoding'" you can see that adding a different output format
should be pretty straight forward.
Which output plugin is used is determined by the initial
INIT_LOGICAL_REPLICATION '$plugin'; command in a replication connection.
To finish off this comparison, let's take a look at how and where the
change sets are generated: in Postgres-R the change set stream is
constructed directly from the heap modification routines, i.e. in
heapam.c's heap_{insert,update,delete}() methods, whereas the patches
proposed here parse the WAL to reconstruct the modifications and add the
required meta information.
To me, going via the WAL first sounded like a step that unnecessarily
complicates matters. I recently talked to Andres and brought that up.
Here's my current view of things:
The Postgres-R approach is independent of WAL and its format, whereas
the approach proposed here clearly is not. Either way, there is a
certain overhead - however minimal it is - which the former adds to the
transaction processing itself, while the latter postpones it to a
separate XLogReader process. If there's any noticeable difference, it
might reduce latency in case of asynchronous replication, but can only
increase latency in the synchronous case. As far as I understood Andres,
it was easier to collect the additional meta data from within the
separate process.
There also is the point that if you do the processing inside heap_* you
need to make sure the replication-targeted data is safely received &
fsynced away; in "our" case that's not necessary, as WAL already provides
crash safety, so should the replication connection break you can simply
start from the location last confirmed as being safely sent.
As we want to provide asynchronous replication, that's a rather major
point.
In summary, I'd say that Postgres-R is an approach specifically
targeting and optimized for multi-master replication between Postgres
nodes, whereas the proposed patches are kept more general.
One major aim definitely was to optionally be able to replicate into just
about any target system, so yes, I certainly agree.
I hope you found this to be an insightful and fair comparison.
Yes, input in general and especially from other replication providers is
certainly interesting and important!
Thanks,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2012-11-16 14:46:39 +0100, Markus Wanner wrote:
You may have noticed that there's an additional COID field. This is an
identifier for the transaction that last changed this tuple. Together
with the primary key, it effectively identifies the exact version of a
tuple (during its lifetime, for example before vs after an UPDATE). This
in turn is used by Postgres-R to detect conflicts.
What's the data type of the "COID" in -R?
In the patchset the output plugin has enough data to get the old xid and
the new xid in the case of updates (not in the case of deletes, but
that's a small bug and should be fixable with a single line of code), and
it has enough information to extract the primary key without problems.
I wonder whether we also should track the xid epoch...
It may be possible to add that to the proposed format as well, for it to
be able to implement a Postgres-R-like algorithm.
I don't know the exact Postgres-R algorithm (but I queued reading some
papers you referred to when we talked), but I guess what we have in mind
is roughly similar - it's just not even remotely part of this patchset ;)
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 12-11-14 08:17 PM, Andres Freund wrote:
For the regular satisfies routines this is needed in preparation for logical
decoding. I changed the non-regular ones for consistency as well.
The naming between htup, tuple and similar is rather confused; I could not find
any consistent naming anywhere.
This is preparatory work for the logical decoding feature, which needs to be
able to get to a valid relfilenode when checking the visibility of a
tuple.
I have taken a look at this patch. The patch does what it says: it
changes a bunch of HeapTupleSatisfiesXXXX routines to take a HeapTuple
structure instead of a HeapTupleHeader.
It also sets the HeapTuple.t_tableOid value before calling these routines.
The patch does not modify these routines to actually do anything useful
with the additional data fields, though it does add some assertions to
make sure t_tableOid is set.
The patch compiles cleanly and the unit tests pass.
This patch does not seem to depend on any of the other patches in this
set and applies cleanly against master. The patch doesn't actually add
any functionality on its own; unless someone sees a reason for complaining
about this that I don't see, I think it can be committed.
Steve
---
contrib/pgrowlocks/pgrowlocks.c | 2 +-
src/backend/access/heap/heapam.c | 13 ++++++----
src/backend/access/heap/pruneheap.c | 16 ++++++++++--
src/backend/catalog/index.c | 2 +-
src/backend/commands/analyze.c | 3 ++-
src/backend/commands/cluster.c | 2 +-
src/backend/commands/vacuumlazy.c | 3 ++-
src/backend/storage/lmgr/predicate.c | 2 +-
src/backend/utils/time/tqual.c | 50 +++++++++++++++++++++++++++++-------
src/include/utils/snapshot.h | 4 +--
src/include/utils/tqual.h | 20 +++++++--------
11 files changed, 83 insertions(+), 34 deletions(-)
On 11/16/2012 03:05 PM, Andres Freund wrote:
I'd like to provide a comparison of the proposed change set format to
the one used in Postgres-R.
Uh, sorry to interrupt you right here, but that's not the "proposed
format" ;)
Understood. Sorry, I didn't mean to imply that. It's pretty obvious to
me that this is more of a human readable format and that others,
including binary formats, can be implemented. I apologize for the bad
wording of a "proposed format", which doesn't make that clear.
The Postgres-R approach is independent of WAL and its format, whereas
the approach proposed here clearly is not. Either way, there is a
certain overhead - however minimal it is - which the former adds to the
transaction processing itself, while the latter postpones it to a
separate XLogReader process. If there's any noticeable difference, it
might reduce latency in case of asynchronous replication, but can only
increase latency in the synchronous case. As far as I understood Andres,
it was easier to collect the additional meta data from within the
separate process.
There also is the point that if you do the processing inside heap_* you
need to make sure the replication-targeted data is safely received &
fsynced away; in "our" case that's not necessary, as WAL already provides
crash safety, so should the replication connection break you can simply
start from the location last confirmed as being safely sent.
In the case of Postgres-R, the "safely received" part isn't really
handled at the change set level at all. And regarding the fsync
guarantee: you can well use the WAL to provide that, without basing
change set generation on it. In that regard, Postgres-R is probably the
more general approach: you can run Postgres-R with WAL turned off
entirely - which may well make sense if you take into account the vast
amount of cloud resources available, which don't have a BBWC. Instead of
WAL, you can add more nodes at more different locations. And no, you
don't want your database to ever go down, anyway :-)
In summary, I'd say that Postgres-R is an approach specifically
targeting and optimized for multi-master replication between Postgres
nodes, whereas the proposed patches are kept more general.
One major aim definitely was to optionally be able to replicate into just
about any target system, so yes, I certainly agree.
I'm glad I got that correct ;-)
Regards
Markus Wanner
On 11/16/2012 03:14 PM, Andres Freund wrote:
Whats the data type of the "COID" in -R?
It's short for CommitOrderId, a 32-bit global transaction identifier
that wraps around, very much like TransactionIds do. (In that sense,
it's global, but unique only for a certain amount of time.)
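For readers unfamiliar with such wrapped-around counters: two IDs can only be ordered while they lie within 2^31 of each other, using modular signed arithmetic in the style of PostgreSQL's TransactionIdPrecedes(). A sketch (coid_precedes is an invented name, not Postgres-R code):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Order two wrapped-around 32-bit IDs: a "precedes" b iff the signed
 * modular difference is negative.  This only gives a meaningful answer
 * while the two IDs are within 2^31 of each other, which is why such
 * counters are unique only for a certain amount of time.
 */
static bool
coid_precedes(uint32_t a, uint32_t b)
{
	return (int32_t) (a - b) < 0;
}
```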
In the patchset the output plugin has enough data to get the old xid and
the new xid in the case of updates (not in the case of deletes, but
thats a small bug and should be fixable with a single line of code), and
it has enough information to extract the primary key without problems.
It's the xmin of the old tuple that Postgres-R needs to get the COID.
Regards
Markus Wanner
On Fri, Nov 16, 2012 at 7:58 PM, Andres Freund <andres@2ndquadrant.com>wrote:
Hi,
On 2012-11-16 13:44:45 +0900, Michael Paquier wrote:
This patch looks OK.
I got 3 comments:
1) Why changing the OID of pg_class_tblspc_relfilenode_index from 3171 to
3455? It does not look necessary.
It's a mismerge and should have happened in "Add a new RELFILENODE
syscache to fetch a pg_class entry via (reltablespace, relfilenode)" but
it seems I squashed the wrong two commits.
I had originally used 3171 but that since got used up for lo_tell64...
2) You should perhaps change the header comment of RelationMapFilenodeToOid so as
not to describe it as the opposite operation of RelationMapOidToFilenode,
but as an operation that looks up the OID of a relation based on its
relfilenode. The two functions are opposites, but independent.
I described it as the opposite because RelationMapOidToFilenode is the
relmapper's stated goal and RelationMapFilenodeToOid is just some
side-business.
3) Both functions are doing similar operations. Could it be possible
to wrap them in the same central function?
I don't really see how without making both quite a bit more
complicated. The amount of if's needed seems to be too large to me.
OK, thanks for your answers.
As this patch only adds a new function and is not that complicated, I
think there is no problem in committing it. The only thing to remove is the
diff in indexing.h. Could someone take care of that?
If other people have additional comments on the ability to perform a
relfileoid->reloid operation for cached maps, of course go ahead.
--
Michael Paquier
http://michael.otacoo.com
On 11/16/2012 02:46 PM, Markus Wanner wrote:
Andres,
On 11/15/2012 01:27 AM, Andres Freund wrote:
In response to this you will soon find the 14 patches that currently
implement $subject.
Congratulations on that piece of work.
I'd like to provide a comparison of the proposed change set format to
the one used in Postgres-R. I hope for this comparison to shed some
light on the similarities and differences of the two projects. As the
author of Postgres-R, I'm obviously biased, but I try to be as neutral
as I can.
...
Let's compare by example:
table "replication_example": INSERT: id[int4]:1 somedata[int4]:1 text[varchar]:1
table "replication_example": UPDATE: id[int4]:1 somedata[int4]:-1 text[varchar]:1
table "replication_example": DELETE (pkey): id[int4]:1
In Postgres-R, the change sets for these same operations would carry the
following information, in a binary representation:
table "replication_example": INSERT: VALUES (1, 1, '1')
table "replication_example": UPDATE: PKEY(1) COID(77) MODS('nyn') VALUES(-1)
table "replication_example": DELETE: PKEY(1) COID(78)
Is it possible to replicate UPDATEs and DELETEs without a primary key in
Postgres-R?
Hannu
On 11/17/2012 02:30 PM, Hannu Krosing wrote:
Is it possible to replicate UPDATEs and DELETEs without a primary key in
Postgres-R?
No. There must be some way to logically identify the tuple. Note,
though, that theoretically any (unconditional) unique key would suffice.
In practice, that usually doesn't matter, as you rarely have one or more
unique keys without a primary.
Also note that the underlying index is useful for remote application of
change sets (except perhaps for very small tables).
In some cases, for example for n:m linking tables, you need to add a
uniqueness key that spans all columns (as opposed to a simple index on
one of the columns that's usually required, anyway). I hope for
index-only scans eventually mitigating this issue.
Alternatively, I've been thinking about the ability to add a hidden
column, which can then be used as a PRIMARY KEY without breaking legacy
applications that rely on SELECT * not returning that primary key.
Are there other reasons to want tables without primary keys that I'm
missing?
Regards
Markus Wanner
On 11/17/2012 03:00 PM, Markus Wanner wrote:
On 11/17/2012 02:30 PM, Hannu Krosing wrote:
Is it possible to replicate UPDATEs and DELETEs without a primary key in
Postgres-R?
No. There must be some way to logically identify the tuple.
It can be done by selecting on _all_ attributes and updating/deleting
just the first matching row:
create cursor ...
select from t ... where t.* = (....)
fetch one ...
delete where current of ...
This is on the distant (round 3 or 4) roadmap for this work; I was just
interested whether you had found any better way of doing this :)
Hannu
On 11/17/2012 03:00 PM, Markus Wanner wrote:
On 11/17/2012 02:30 PM, Hannu Krosing wrote:
Is it possible to replicate UPDATEs and DELETEs without a primary key in
Postgres-R?
No. There must be some way to logically identify the tuple. Note,
though, that theoretically any (unconditional) unique key would suffice.
In practice, that usually doesn't matter, as you rarely have one or more
unique keys without a primary.
...
Are there other reasons to want tables without primary keys that I'm
missing?
Nope. The only place a table without a primary key would be needed is a
log table, but as these are (supposed to be) INSERT-only this is not a
problem for them.
Hannu
Hannu,
On 11/17/2012 03:40 PM, Hannu Krosing wrote:
On 11/17/2012 03:00 PM, Markus Wanner wrote:
On 11/17/2012 02:30 PM, Hannu Krosing wrote:
Is it possible to replicate UPDATEs and DELETEs without a primary key in
PostgreSQL-R?
No. There must be some way to logically identify the tuple.
It can be done as selecting on _all_ attributes and updating/deleting
just the first matching row
create cursor ...
select from t ... where t.* = (....)
fetch one ...
delete where current of ...
That doesn't sound like it could possibly work for Postgres-R. At least
not when there can be multiple rows with all the same attributes, i.e.
without a unique key constraint over all columns.
Otherwise, some nodes could detect two concurrent UPDATES as a conflict,
while other nodes select different rows and don't handle it as a conflict.
Regards
Markus Wanner
First, you can add me to the list of people saying 'wow', I'm impressed.
The approach I am taking to reviewing this is to try and answer the
following questions:
1) How might a future version of slony be able to use logical
replication as described by your patch and design documents
and what would that look like.
2) What functionality is missing from the patch set that would stop me
from implementing or prototyping the above.
Connecting slon to the remote postgresql
========================
Today the slony remote listener thread queries a bunch of events from
sl_event for a batch of SYNC events. Then the remote helper thread
queries data from sl_log_1 and sl_log_2. I see this changing, instead
the slony remote listener thread would connect to the remote system and
get a logical replication stream.
1) Would slony connect as a normal client connection and call
something like 'select pg_slony_process_xlog(...)' to get a bunch of
logical replication change records to process.
OR
2) Would slony connect as a replication connection similar to how the
pg_receivelog program does today and then process the logical changeset
outputs, instead of writing them to a file (as pg_receivelog does)?
It seems that the second approach is what is encouraged. I think we
would put a lot of the pg_receivelog functionality into slon and it
would issue a command like INIT_LOGICAL_REPLICATION 'slony' to use the
slony logical replication plugin. Slon would also have to provide
feedback to the walsender about what it has processed so the origin
database knows what catalog snapshots can be expired. Based on
eyeballing pg_receivelog.c it seems that about half the code in the
700-line file is related to command line arguments etc, and the other half is
related to looping over the copy out stream, sending feedback and other
things that we would need to duplicate in slon.
pg_receivelog.c has comment:
/*
* We have to use postgres.h not postgres_fe.h here, because there's so much
* backend-only stuff in the XLOG include files we need. But we need a
* frontend-ish environment otherwise. Hence this ugly hack.
*/
This looks like more of a carryover from pg_receivexlog.c. From what I
can tell we can eliminate the postgres.h include if we also eliminate
the utils/datetime.h and utils/timestamp.h and instead add in:
#include "postgres_fe.h"
#define POSTGRES_EPOCH_JDATE 2451545
#define UNIX_EPOCH_JDATE 2440588
#define SECS_PER_DAY 86400
#define USECS_PER_SEC INT64CONST(1000000)
typedef int64 XLogRecPtr;
#define InvalidXLogRecPtr 0
If there is a better way of getting these defines someone should speak
up. I recall that in the past slon actually did include postgres.h and
it caused some issues (I think with MSVC win32 builds). Since
pg_receivelog.c will be used as a starting point/sample for third
parties to write client programs it would be better if it didn't
encourage client programs to include postgres.h
The Slony Output Plugin
=====================
Once we've modified slon to connect as a logical replication client we
will need to write a slony plugin.
As I understand the plugin API:
* A walsender is processing through WAL records, each time it sees a
COMMIT WAL record it will call my plugins
.begin
.change (for each change in the transaction)
.commit
* The plugin for a particular stream/replication client will see one
transaction at a time passed to it in commit order. It won't see
.change(t1) followed by .change (t2), followed by a second .change(t1).
The reorder buffer code hides me from all that complexity (yah)
From a slony point of view I think the output of the plugin will be
rows, suitable to be passed to COPY IN of the form:
origin_id, table_namespace,table_name,command_type,
cmd_updatencols,command_args
This is basically the Slony 2.2 sl_log format minus a few columns we no
longer need (txid, actionseq).
command_args is a postgresql text array of column=value pairs. Ie [
{id=1},{name='steve'},{project='slony'}]
I don't think our output plugin will be much more complicated than the
test_decoding plugin. I suspect we will want to give it the ability to
filter out non-replicated tables. We will also have to filter out
change records that didn't originate on the local node and that aren't
part of a cascaded subscription. Remember that in a two node cluster slony
will have connections from A-->B and from B--->A even if user tables
only flow one way. Data that is replicated from A into B will show up in
the WAL stream for B.
Exactly how we do this filtering is an open question, I think the
output plugin will at a minimum need to know:
a) What the slony node id is of the node it is running on. This is easy
to figure out if the output plugin is able/allowed to query its
database. Will this be possible? I would expect to be able to query the
database as it exists now(at plugin invocation time) not as it existed
in the past when the WAL was generated. In addition to the node ID I
can see us wanting to be able to query other slony tables
(sl_table,sl_set etc...)
b) What the slony node id is of the node we are streaming too. It
would be nice if we could pass extra, arbitrary data/parameters to the
output plugins that could include that, or other things. At the moment
the start_logical_replication rule in repl_gram.y doesn't allow for that
but I don't see why we couldn't make it do so.
I still see some open questions about exactly how we would filter out
data in this stage.
<editorial> Everything above deals with the postgresql side of things,
ie the patch in question or the plugin API we would have to work with.
Much of what is below deals with slony side changes and might be of
limited interest to some on pgsql-hackers
</editorial>
Slon Applying Changes
================
The next task we will have is to make slon and the replica instance be
able to apply these changes. In slony 2.2 we do a COPY from sl_log and
apply that stream to a table on the replica with COPY. We then have
triggers on the replica that decode the command_args and apply the
changes as
INSERT/UPDATE/DELETE statements on the user tables. I see this
continuing to work in this fashion, but there are a few special cases:
1) Changes made to sl_event on the origin will result in records in the
logical replication stream that change sl_event. In many cases we won't
just be inserting records into sl_event but we will need to instead do
the logic in remote_worker.c for processing the different types of
events. Worst case we could parse the change records we receive from
our version of pg_receivellog and split the sl_event records out into an
sl_event stream and an sl_log stream. Another approach might be to have
the slony apply trigger build up a list of events that the slon
remote_worker code can then process.
2) Slony is normally bi-directional even if user data only replicates
one way. Confirm (sl_confirm) entries go from a replica back to an
origin. In a two node origin->replica scenario for data, the way I see
this working is that the slon for the origin would connect to the
replica (like it does today).
It would receive the logical replication records, but since it isn't
subscribed to any tables it won't receive/process the WAL for
user-tables but it will still receive/process sl_confirm rows. It will
then insert the rows in sl_confirm that it 'replicated' from the remote
node.
With what I have described so far, Slony would then be receiving a
stream of events that look like
t1 - insert into foo [id=1, name='steve']
t1 - insert into bar [id=1, something='somethingelse']
t1 - commit
t2 - insert into foo [....]
t2 - commit
t3 - insert into sl_event [ev_type=SYNC, ev_origin=1, ev_seqno=12345]
t3 - commit
Even though, from a data-correctness point of view, slony could commit
the transaction on the replica after it sees the t1 commit, we won't
want it to do commits other than on a SYNC boundary. This means that
the replicas will continue to move between consistent SYNC snapshots and
that we can still track the state/progress of replication by knowing
what events (SYNC or otherwise) have been confirmed.
This also means that slony should only provide feedback to the
walsender on SYNC boundaries after the transaction has committed on the
receiver. I don't see this as being an issue.
Setting up Subscriptions
===================
At first we have a slon cluster with just 1 node, life is good. When a
second node is created and a path(or pair of paths) are defined between
the nodes I think they will each:
1. Connect to the remote node with a normal libpq connection.
a. Get the current xlog recptr,
b. Query any non-sync events of interest from sl_event.
2. Connect to the remote node with a logical replication connection and
start streaming logical replication changes, starting at the recptr we
retrieved above.
Slon will then receive any future events from the remote sl_event as
part of the logical replication stream. It won't receive any user
tables because it isn't yet subscribed to any.
When a subscription is started, the SUBSCRIBE_SET and
ENABLE_SUBSCRIPTION events will go through sl_event and the INSERT INTO
sl_event will be part of a change record in the replication stream and
be picked up by the subscriber's slon remote_worker.
The remote_worker:copy_set will then need to get a consistent COPY of
the tables in the replication set such that any changes made to the
tables after the copy is started get included in the replication
stream. The approach proposed in the DESIGN.TXT file with exporting a
snapshot sounds okay for this. I *think* slony could get by with
something less fancy as well but it would be ugly.
1. Make sure that the origin starts including change records for the
tables in the set
2. have the slon(copy_set) wait until any transactions on the origin,
that started prior to the ENABLE_SUBSCRIPTION, are committed.
Slony does this today as part of the copy_set logic.
3. Get/remember the snapshot visibility information for the COPY's
transaction
4. When we start to process change records we need to filter out
records for transactions that were already visible by the copy.
Steps 1-3 are similar to how slony works today, but step 4 will be a bit
awkward/ugly. This isn't an issue today because we are already using
the transaction visibility information for selecting from sl_log so it
works, but above I had proposed stripping the xid from the logical
change records.
Cascading Replication
=================
A-->B--->C
The slon for B will insert records from A into B's tables. This insert
will generate WAL records on B. The slon for C should be able to pull
the data it needs (both sl_event entries with ev_origin=A, and user
table data originating on A) from B's logical replication stream. I
don't see any issues here nor do I see a need to 'cache' the data in an
sl_log type of table on B.
Reshaping Replication
=================
In Slony replication is reshaped by two types events, a MOVE SET and a
FAILOVER.
Move Set:
A replication set might be subscribed in a cascaded fashion like
A--->B--->C
When a MOVE SET is issued, node A will stop accepting new write
transactions for tables in the set. A MOVE_SET(1,A,B) event is then put
into sl_event on node A.
Node B receives the MOVE_SET command in the proper order, after it has
processed the last SYNC generated on A when A was still accepting write
transactions to those tables. When Node B processes the MOVE_SET event
then node B starts accepting write transactions on the tables. Node B
will also generate an ACCEPT_SET event. Node C will then receive the
MOVE_SET (ev_origin=A) and the ACCEPT_SET (ev_origin=B) events (after
all SYNC events from A with data changes to the set) and then knows that
it should start receiving data for those tables from B.
I don't see any of this changing with logical replication acting as the
data source.
FAILOVER:
---------------
A---->B
| .
v .
C
Today with slony, if B is a valid failover target then it is a
forwarding node of the set. This means that B keeps a record in sl_log
of any changes originating on A until B knows that node C has received
those changes. In the event of a failover, if node C is far behind, it
can just get the missing data from sl_log on node B (the failover
target/new origin).
I see a problem with what I have discussed above, B won't explicitly
store the data from A in sl_log, a cascaded node would depend on B's WAL
stream.
The problem is that at FAILOVER time, B might have processed some
changes from A. Node C might also be processing Node B's WAL stream for
events (or data from another set). Node C will discard/not receive the
data for A's tables since it isn't subscribed to those tables from B.
What happens then if at some later point B and C receive the FAILOVER event?
What does node C do? It can't get the missing data from node A because
node A has failed, and it can't get it from node B because node C has
already processed the WAL changes from node B that included the data but
it ignored/discarded it. Maybe node C could reprocess older WAL from
node B? Maybe this forces us to keep an sl_log type structure around?
Is it complete enough to build a prototype?
==========================
I think so; the incomplete areas I see are the ones mentioned in
the patch submission, including:
* Snapshot exporting for the initial COPY
* Spilling the reorder buffer to disk
I think it would be possible to build a prototype without those even
though we'd need them before I could build a production system.
Conclusions
=============
I like this design much better than the original design from the spring
that would have required keeping a catalog proxy on the decoding
machine. Based on what I've seen it should be possible to make slony
use logical replication as a source for events instead of triggers
populating sl_log.
My thinking is that we want a way for logreceiver programs to pass
arguments/parameters to the output plugins. Beyond that this looks like
something slony can use.
Hi Steve!
On 2012-11-17 22:50:35 -0500, Steve Singer wrote:
First, you can add me to the list of people saying 'wow', I'm impressed.
Thanks!
The approach I am taking to reviewing this is to try and answer the following
questions:
1) How might a future version of slony be able to use logical replication as
described by your patch and design documents, and what would that look like.
2) What functionality is missing from the patch set that would stop me from
implementing or prototyping the above.
Sounds like a good plan to me.
Connecting slon to the remote postgresql
========================
Today the slony remote listener thread queries a bunch of events from
sl_event for a batch of SYNC events. Then the remote helper thread queries
data from sl_log_1 and sl_log_2. I see this changing, instead the slony
remote listener thread would connect to the remote system and get a logical
replication stream.
1) Would slony connect as a normal client connection and call something
like 'select pg_slony_process_xlog(...)' to get a bunch of logical replication
change records to process.
OR
2) Would slony connect as a replication connection similar to how the
pg_receivelog program does today and then process the logical changeset
outputs, instead of writing them to a file (as pg_receivelog does)?
It would need to be the latter. We need the feedback messages it sends
for several purposes:
- increasing the lowered xmin
- implementing optionally synchronous replication at some point
- using 1) would mean having transactions open...
It seems that the second approach is what is encouraged. I think we would
put a lot of the pg_receivelog functionality into slon and it would issue a
command like INIT_LOGICAL_REPLICATION 'slony' to use the slony logical
replication plugin. Slon would also have to provide feedback to the
walsender about what it has processed so the origin database knows what
catalog snapshots can be expired. Based on eyeballing pg_receivelog.c it
seems that about half the code in the 700-line file is related to command
line arguments etc, and the other half is related to looping over the copy
out stream, sending feedback and other things that we would need to
duplicate in slon.
I think we should provide some glue code to do this, otherwise people
will start replicating all the bugs I hacked into this... More
seriously: I think we should have support code here, no user will want
to learn the intricacies of feedback messages and such. Where that would
live? No idea.
pg_receivelog.c has comment:
(its pg_receivellog btw. ;))
/*
* We have to use postgres.h not postgres_fe.h here, because there's so much
* backend-only stuff in the XLOG include files we need. But we need a
* frontend-ish environment otherwise. Hence this ugly hack.
*/
This looks like more of a carryover from pg_receivexlog.c. From what I can
tell we can eliminate the postgres.h include if we also eliminate the
utils/datetime.h and utils/timestamp.h and instead add in:
#include "postgres_fe.h"
#define POSTGRES_EPOCH_JDATE 2451545
#define UNIX_EPOCH_JDATE 2440588
#define SECS_PER_DAY 86400
#define USECS_PER_SEC INT64CONST(1000000)
typedef int64 XLogRecPtr;
#define InvalidXLogRecPtr 0
If there is a better way of getting these defines someone should speak up.
I recall that in the past slon actually did include postgres.h and it caused
some issues (I think with MSVC win32 builds). Since pg_receivelog.c will be
used as a starting point/sample for third parties to write client programs
it would be better if it didn't encourage client programs to include
postgres.h
I wholeheartedly agree. It should also be cleaned up a fair bit before
others copy it, should we not go for having some client side library.
Imo the library could very roughly be something like:
state = SetupStreamingLLog(replication-slot, ...);
while((message = StreamingLLogNextMessage(state))
{
write(outfd, message->data, message->length);
if (received_100_messages)
{
fsync(outfd);
StreamingLLogConfirm(message);
}
}
Although I guess that's not good enough because StreamingLLogNextMessage
would be blocking, but that shouldn't be too hard to work around.
The Slony Output Plugin
=====================
Once we've modified slon to connect as a logical replication client we will
need to write a slony plugin.
As I understand the plugin API:
* A walsender is processing through WAL records, each time it sees a COMMIT
WAL record it will call my plugins
.begin
.change (for each change in the transaction)
.commit
* The plugin for a particular stream/replication client will see one
transaction at a time passed to it in commit order. It won't see
.change(t1) followed by .change (t2), followed by a second .change(t1). The
reorder buffer code hides me from all that complexity (yah)
Correct.
From a slony point of view I think the output of the plugin will be rows,
suitable to be passed to COPY IN of the form:
origin_id, table_namespace, table_name, command_type,
cmd_updatencols, command_args
This is basically the Slony 2.2 sl_log format minus a few columns we no
longer need (txid, actionseq).
command_args is a postgresql text array of column=value pairs. Ie [
{id=1},{name='steve'},{project='slony'}]
It seems to me that that makes escaping unnecessarily complicated, but
given you already have all the code... ;)
I don't think our output plugin will be much more complicated than the
test_decoding plugin.
Good. That's the idea ;). Are you ok with the interface as it is now or
would you like to change something?
I suspect we will want to give it the ability to
filter out non-replicated tables. We will also have to filter out change
records that didn't originate on the local-node that aren't part of a
cascaded subscription. Remember that in a two node cluster slony will have
connections from A-->B and from B--->A even if user tables only flow one
way. Data that is replicated from A into B will show up in the WAL stream
for B.
Yes. We will also need something like that. If you remember the first
prototype we sent to the list, it included the concept of an
'origin_node' in the wal record. I think you actually reviewed that one ;)
That was exactly aimed at something like this...
Since then my thoughts about how the origin_id looks like have changed a
bit:
- origin id is internally still represented as an uint32/Oid
- never visible outside of wal/system catalogs
- externally visible it gets
- assigned an uuid
- optionally assigned a user defined name
- user settable (permissions?) origin when executing sql:
- SET change_origin_uuid = 'uuid';
- SET change_origin_name = 'user-settable-name';
- defaults to the local node
- decoding callbacks get passed the origin of a change
- txn->{origin_uuid, origin_name, origin_internal?}
- the init decoding callback can setup an array of interesting origins,
so the others don't even get the ReorderBuffer treatment
I have to thank the discussion on -hackers and a march through prague
with Marko here...
Exactly how we do this filtering is an open question, I think the output
plugin will at a minimum need to know:
a) What the slony node id is of the node it is running on. This is easy to
figure out if the output plugin is able/allowed to query its database. Will
this be possible? I would expect to be able to query the database as it
exists now (at plugin invocation time) not as it existed in the past when the
WAL was generated. In addition to the node ID I can see us wanting to be
able to query other slony tables (sl_table, sl_set etc...)
Hm. There is no fundamental reason not to allow normal database access
to the current database but it won't be all that cheap, so doing it
frequently is not a good idea.
The reason it's not cheap is that you basically need to tear down the
postgres internal caches if you switch the timestream in which you are
working.
Would go something like:
TransactionContext = AllocSetCreate(...);
RevertFromDecodingSnapshot();
InvalidateSystemCaches();
StartTransactionCommand();
/* do database work */
CommitTransactionCommand();
/* cleanup memory*/
SetupDecodingSnapshot(snapshot, data);
InvalidateSystemCaches();
Why do you need to be able to query the present? I thought it might be
necessary to allow additional tables to be accessed in a timetraveling
manner, but not this way round.
I guess an initial round of querying during plugin initialization won't
be good enough?
b) What the slony node id is of the node we are streaming too. It would be
nice if we could pass extra, arbitrary data/parameters to the output plugins
that could include that, or other things. At the moment the
start_logical_replication rule in repl_gram.y doesn't allow for that but I
don't see why we couldn't make it do so.
Yes, I think we want something like that. I even asked for input on that
recently ;):
http://archives.postgresql.org/message-id/20121115014250.GA5844@awork2.anarazel.de
Input welcome!
Even though, from a data-correctness point of view, slony could commit the
transaction on the replica after it sees the t1 commit, we won't want it to
do commits other than on a SYNC boundary. This means that the replicas will
continue to move between consistent SYNC snapshots and that we can still
track the state/progress of replication by knowing what events (SYNC or
otherwise) have been confirmed.
I don't know enough about slony internals, but: why? This will prohibit
you from ever doing (per-transaction) synchronous replication...
This also means that slony should only provide feedback to the walsender on
SYNC boundaries after the transaction has committed on the receiver. I don't
see this as being an issue.
Yes, that's no problem. You need to give feedback more frequently
(otherwise walsender kicks you off), but you don't have to increase the
confirmed flush location.
Setting up Subscriptions
===================
At first we have a slon cluster with just 1 node, life is good. When a
second node is created and a path(or pair of paths) are defined between the
nodes I think they will each:
1. Connect to the remote node with a normal libpq connection.
a. Get the current xlog recptr,
b. Query any non-sync events of interest from sl_event.
2. Connect to the remote node with a logical replication connection and
start streaming logical replication changes, starting at the recptr we
retrieved above.
Note that INIT_LOGICAL_REPLICATION can take some time to get to the
initial consistent state (especially if there are long-running
transactions). So you should do the init in 1), query all the events in
the snapshot that returns and then go over to 2).
The remote_worker:copy_set will then need to get a consistent COPY of the
tables in the replication set such that any changes made to the tables after
the copy is started get included in the replication stream. The approach
proposed in the DESIGN.TXT file with exporting a snapshot sounds okay for
this. I *think* slony could get by with something less fancy as well but
it would be ugly.
The snapshot exporting isn't really that much additional work as we
already need to support most of it for keeping state across restarts.
FAILOVER:
---------------
A---->B
| .
v .
C
Today with slony, if B is a valid failover target then it is a forwarding
node of the set. This means that B keeps a record in sl_log of any changes
originating on A until B knows that node C has received those changes. In
the event of a failover, if node C is far behind, it can just get the
missing data from sl_log on node B (the failover target/new origin).
I see a problem with what I have discussed above, B won't explicitly store
the data from A in sl_log, a cascaded node would depend on B's WAL stream.
The problem is that at FAILOVER time, B might have processed some changes
from A. Node C might also be processing Node B's WAL stream for events (or
data from another set). Node C will discard/not receive the data for A's
tables since it isn't subscribed to those tables from B. What happens then
if at some later point B and C receive the FAILOVER event.
What does node C do? It can't get the missing data from node A because node
A has failed, and it can't get it from node B because node C has already
processed the WAL changes from node B that included the data but it
ignored/discarded it. Maybe node C could reprocess older WAL from node B?
Maybe this forces us to keep an sl_log type structure around?
I fear you've left me behind here, sorry, can't give you any input.
Is it complete enough to build a prototype?
==========================
I think so, the incomplete areas I see are the ones that mentioned in the
patch submission including:
* Snapshot exporting for the initial COPY
* Spilling the reorder buffer to disk
I think it would be possible to build a prototype without those even though
we'd need them before I could build a production system.
Conclusions
=============
I like this design much better than the original design from the spring that
would have required keeping a catalog proxy on the decoding machine. Based
on what I've seen it should be possible to make slony use logical
replication as a source for events instead of triggers populating sl_log.
My thinking is that we want a way for logreceiver programs to pass
arguments/parameters to the output plugins. Beyond that this looks like
something slony can use.
Cool!
Don't hesitate to mention anything that you think would make your life
easier, chances are that you're not the only one who could benefit from
it...
Thanks,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-11-17 19:14:06 +0900, Michael Paquier wrote:
On Fri, Nov 16, 2012 at 7:58 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Hi,
On 2012-11-16 13:44:45 +0900, Michael Paquier wrote:
This patch looks OK.
I got 3 comments:
1) Why changing the OID of pg_class_tblspc_relfilenode_index from 3171 to
3455? It does not look necessary.
It's a mismerge and should have happened in "Add a new RELFILENODE
syscache to fetch a pg_class entry via (reltablespace, relfilenode)" but
it seems I squashed the wrong two commits.
I had originally used 3171 but that since got used up for lo_tell64...
2) You should perhaps change the header of RelationMapFilenodeToOid so as
not mentioning it as the opposite operation of RelationMapOidToFilenode
but as an operation that looks for the OID of a relation based on its
relfilenode. Both functions are opposite but independent.
I described it as the opposite because RelationMapOidToFilenode is the
relmapper's stated goal and RelationMapFilenodeToOid is just some
side-business.
3) Both functions are doing similar operations. Could it be possible
to wrap them in the same central function?
I don't really see how without making both quite a bit more
complicated. The amount of if's needed seems to be too large to me.
OK thanks for your answers.
As this patch only adds a new function and is not that complicated, I
think there is no problem in committing it. The only thing to remove is the
diff in indexing.h. Could someone take care of that?
I pushed a rebase to the git repository that fixed it...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Nov 16, 2012 at 5:16 PM, Andrea Suisani <sickpig@opinioni.net> wrote:
Il 16/11/2012 05:34, Michael Paquier ha scritto:
Do you have a git repository or something where all the 14 patches are
applied? I would like to test the feature globally.
Sorry I recall that you put a link somewhere but I cannot remember its
email...
http://archives.postgresql.org/pgsql-hackers/2012-11/msg00686.php
Thanks Andrea.
I am pretty sure I will be able to provide some feedback by Friday.
--
Michael Paquier
http://michael.otacoo.com
Hi,
I am just looking at this patch and will provide some comments.
By the way, you forgot the installation part of pg_receivellog, please see
patch attached.
Thanks,
On Thu, Nov 15, 2012 at 10:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:
---
src/bin/pg_basebackup/Makefile | 7 +-
src/bin/pg_basebackup/pg_receivellog.c | 717 +++++++++++++++++++++++++++++++++
src/bin/pg_basebackup/streamutil.c | 3 +-
src/bin/pg_basebackup/streamutil.h | 1 +
4 files changed, 725 insertions(+), 3 deletions(-)
create mode 100644 src/bin/pg_basebackup/pg_receivellog.c
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
--
Michael Paquier
http://michael.otacoo.com
Attachments:
20121119_pg_receivellog_install.patch (application/octet-stream)
diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
index 3775c44..d0474b3 100644
--- a/src/bin/pg_basebackup/Makefile
+++ b/src/bin/pg_basebackup/Makefile
@@ -34,6 +34,7 @@ pg_receivellog: pg_receivellog.o $(OBJS) | submake-libpq submake-libpgport
install: all installdirs
$(INSTALL_PROGRAM) pg_basebackup$(X) '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
$(INSTALL_PROGRAM) pg_receivexlog$(X) '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
+ $(INSTALL_PROGRAM) pg_receivellog$(X) '$(DESTDIR)$(bindir)/pg_receivellog$(X)'
installdirs:
$(MKDIR_P) '$(DESTDIR)$(bindir)'
@@ -41,6 +42,7 @@ installdirs:
uninstall:
rm -f '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
rm -f '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
+ rm -f '$(DESTDIR)$(bindir)/pg_receivellog$(X)'
clean distclean maintainer-clean:
- rm -f pg_basebackup$(X) pg_receivexlog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o pg_receivellog.o
+ rm -f pg_basebackup$(X) pg_receivexlog$(X) pg_receivellog$(X) $(OBJS) pg_basebackup.o pg_receivexlog.o pg_receivellog.o
Hi Andres,
I have been able to fetch your code (thanks Andrea!) and compile it. For the
time being I am spending some time reading the code and understanding the
whole set of features you are trying to implement inside core, even if I
got some background from what you presented at PGCon and from the hackers
ML.
Btw, as a first approach, I tried to run the logical log receiver plugged
on a postgres server, and I am not able to make it work.
Well, I am using settings similar to yours.
# Run master
rm -r ~/bin/pgsql/master/
initdb -D ~/bin/pgsql/master/
echo "local replication $USER trust" >> ~/bin/pgsql/master/pg_hba.conf
postgres -D ~/bin/pgsql/master \
-c wal_level=logical \
-c max_wal_senders=10 \
-c max_logical_slots=10 \
-c wal_keep_segments=100 \
-c log_line_prefix="[%p %x] "
# Logical log receiver
pg_receivellog -f $HOME/output.txt -d postgres -v
After launching some SQLs, the logical receiver is stuck just after sending
INIT_LOGICAL_REPLICATION, please see bt of process waiting:
(gdb) bt
#0 0x00007f1bbc13b170 in __poll_nocancel () from /usr/lib/libc.so.6
#1 0x00007f1bbc43072d in pqSocketPoll (sock=3, forRead=1, forWrite=0,
end_time=-1) at fe-misc.c:1089
#2 0x00007f1bbc43060d in pqSocketCheck (conn=0x1dd0290, forRead=1,
forWrite=0, end_time=-1) at fe-misc.c:1031
#3 0x00007f1bbc4304dd in pqWaitTimed (forRead=1, forWrite=0,
conn=0x1dd0290, finish_time=-1) at fe-misc.c:963
#4 0x00007f1bbc4304af in pqWait (forRead=1, forWrite=0, conn=0x1dd0290) at
fe-misc.c:946
#5 0x00007f1bbc42c64c in PQgetResult (conn=0x1dd0290) at fe-exec.c:1709
#6 0x00007f1bbc42cd62 in PQexecFinish (conn=0x1dd0290) at fe-exec.c:1974
#7 0x00007f1bbc42c9c8 in PQexec (conn=0x1dd0290, query=0x406c60
"INIT_LOGICAL_REPLICATION 'test_decoding'") at fe-exec.c:1808
#8 0x0000000000402370 in StreamLog () at pg_receivellog.c:263
#9 0x00000000004030c9 in main (argc=6, argv=0x7fff44edb288) at
pg_receivellog.c:694
So I am not able to output any results using pg_receivellog.
Also, I noticed 2 errors in your set of tests.
On Thu, Nov 15, 2012 at 9:27 AM, Andres Freund <andres@anarazel.de> wrote:
-- wrapped in a transaction
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT
currval('replication_example_id_seq'));
In SET clause, the column name is *somedata* and not *somedate*
-- dont write out aborted data
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (2, 1);
UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT
currval('replication_example_id_seq'));
Same error here, *somedata* and not *somedate*. Not a big deal; it made the
transactions fail, though.
--
Michael Paquier
http://michael.otacoo.com
Hi Michael,
On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
I have been able to fetch your code (thanks Andrea!) and compile it. For the
time being I am spending some time reading the code and understanding the
whole set of features you are trying to implement inside core, even if I
got some background from what you presented at PGCon and from the hackers
ML.
Cool.
Btw, as a first approach, I tried to run the logical log receiver plugged
on a postgres server, and I am not able to make it work.
Well, I am using settings similar to yours.
# Run master
rm -r ~/bin/pgsql/master/
initdb -D ~/bin/pgsql/master/
echo "local replication $USER trust" >> ~/bin/pgsql/master/pg_hba.conf
postgres -D ~/bin/pgsql/master \
-c wal_level=logical \
-c max_wal_senders=10 \
-c max_logical_slots=10 \
-c wal_keep_segments=100 \
-c log_line_prefix="[%p %x] "
# Logical log receiver
pg_receivellog -f $HOME/output.txt -d postgres -v
After launching some SQLs, the logical receiver is stuck just after sending
INIT_LOGICAL_REPLICATION, please see bt of process waiting:
It's waiting till it sees an initial xl_running_xacts record. The
easiest way to get one is to manually issue a CHECKPOINT. Sorry, I should
have included that in the description.
Otherwise you can wait till the next routine checkpoint comes around...
I plan to cause more xl_running_xacts records to be logged in the
future. I think the timing of those currently is non-optimal; you have
the same problem in normal streaming replication as well :(
-- wrapped in a transaction
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT
currval('replication_example_id_seq'));
In the SET clause, the column name is *somedata* and not *somedate*
Crap. Sorry for that. I wrote the example in the mail client and then
executed it, and I seem to have forgotten to put back some of the fixes...
I am just looking at this patch and will provide some comments.
By the way, you forgot the installation part of pg_receivellog, please see
patch attached.
That actually was somewhat intended. I thought people wouldn't like the
name and I didn't want a binary that's going to be replaced anyway lying
around ;)
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Nov 19, 2012 at 5:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Hi Michael,
On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
I have been able to fetch your code (thanks Andrea!) and compile it. For the
time being I am spending some time reading the code and understanding the
whole set of features you are trying to implement inside core, even if I
got some background from what you presented at PGCon and from the hackers
ML.
Cool.
Btw, as a first approach, I tried to run the logical log receiver plugged
on a postgres server, and I am not able to make it work.
Well, I am using settings similar to yours.
# Run master
rm -r ~/bin/pgsql/master/
initdb -D ~/bin/pgsql/master/
echo "local replication $USER trust" >> ~/bin/pgsql/master/pg_hba.conf
postgres -D ~/bin/pgsql/master \
-c wal_level=logical \
-c max_wal_senders=10 \
-c max_logical_slots=10 \
-c wal_keep_segments=100 \
-c log_line_prefix="[%p %x] "
# Logical log receiver
pg_receivellog -f $HOME/output.txt -d postgres -v
After launching some SQLs, the logical receiver is stuck just after
sending
INIT_LOGICAL_REPLICATION, please see bt of process waiting:
It's waiting till it sees an initial xl_running_xacts record. The
easiest way to get one is to manually issue a CHECKPOINT. Sorry, I should
have included that in the description.
Otherwise you can wait till the next routine checkpoint comes around...
I plan to cause more xl_running_xacts records to be logged in the
future. I think the timing of those currently is non-optimal; you have
the same problem in normal streaming replication as well :(
I am just looking at this patch and will provide some comments.
By the way, you forgot the installation part of pg_receivellog, please see
patch attached.
That actually was somewhat intended, I thought people wouldn't like the
name and I didn't want a binary that's going to be replaced anyway lying
around ;)
OK no problem. For sure this is going to happen, I was wondering myself if
it could be possible to merge pg_receivexlog and pg_receivellog into a
single utility with multiple modes :)
Btw, here are some extra comments based on my progress, hope it will be
useful for other people playing around with your patches.
1) It is necessary to install the contrib module test_decoding on the server
side or the test case will not work.
2) I got the following logs on the server:
LOG: forced to assume catalog changes for xid 1370 because it was running
to early
WARNING: ABORT 1370
Actually I saw that there are many warnings like this.
3) Assertion failure while running pgbench, I was just curious to see how
it reacted when logical replication was put under a little bit of load.
TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
((snapstate->xmin_running) >= ((TransactionId) 3)))", File: "snapbuild.c",
Line: 877)
=> pgbench -i postgres; pgbench -T 500 -c 8 postgres
4) As mentioned by Andres above, logical replication begins only once there
is an xl_running_xacts record. I just enforced a checkpoint manually.
With all those things done, I have been able to set up the system; for
example these queries:
postgres=# create table ac (a int);
CREATE TABLE
postgres=# insert into ac values (1);
INSERT 0 1
created the expected output:
BEGIN 32135
COMMIT 32135
BEGIN 32136
table "ac": INSERT: a[int4]:1
COMMIT 32136
Now it is time to dig into the code...
--
Michael Paquier
http://michael.otacoo.com
On 12-11-18 11:07 AM, Andres Freund wrote:
Hi Steve!
I think we should provide some glue code to do this, otherwise people
will start replicating all the bugs I hacked into this... More
seriously: I think we should have support code here, no user will want
to learn the intricacies of feedback messages and such. Where that would
live? No idea.
libpglogicalrep.so ?
I wholeheartedly agree. It should also be cleaned up a fair bit before
others copy it, should we not go for having some client-side library.
Imo the library could very roughly be something like:
state = SetupStreamingLLog(replication-slot, ...);
while((message = StreamingLLogNextMessage(state))
{
write(outfd, message->data, message->length);
if (received_100_messages)
{
fsync(outfd);
StreamingLLogConfirm(message);
}
}
Although I guess that's not good enough because StreamingLLogNextMessage
would be blocking, but that shouldn't be too hard to work around.
How about we pass a timeout value to StreamingLLogNextMessage (..) where
it returns if no data is available after the timeout to give the caller
a chance to do something else.
This is basically the Slony 2.2 sl_log format minus a few columns we no
longer need (txid, actionseq).
command_args is a postgresql text array of column=value pairs. Ie [
{id=1},{name='steve'},{project='slony'}]
It seems to me that that makes escaping unnecessarily complicated, but
given you already have all the code... ;)
When I look at the actual code/representation we picked it is closer to
{column1,value1,column2,value2...}
I don't think our output plugin will be much more complicated than the
test_decoding plugin.
Good. That's the idea ;). Are you ok with the interface as it is now or
would you like to change something?
I'm going to think about this some more and maybe try to write an
example plugin before I can say anything with confidence.
Yes. We will also need something like that. If you remember the first
prototype we sent to the list, it included the concept of an
'origin_node' in wal record. I think you actually reviewed that one ;)
That was exactly aimed at something like this...
Since then my thoughts about how the origin_id looks like have changed a
bit:
- origin id is internally still represented as an uint32/Oid
- never visible outside of wal/system catalogs
- externally visible it gets
- assigned an uuid
- optionally assigned a user defined name
- user settable (permissions?) origin when executing sql:
- SET change_origin_uuid = 'uuid';
- SET change_origin_name = 'user-settable-name';
- defaults to the local node
- decoding callbacks get passed the origin of a change
- txn->{origin_uuid, origin_name, origin_internal?}
- the init decoding callback can setup an array of interesting origins,
so the others don't even get the ReorderBuffer treatment
I have to thank the discussion on -hackers and a march through Prague
with Marko here...
So would the uuid and optional name assignment be done in the output
plugin or somewhere else?
When/how does the uuid get generated and where do we store it so the
same uuid gets returned when postgres restarts. Slony today stores all
this type of stuff in user-level tables and user-level functions
(because it has no other choice). What is the connection between
these values and the 'slot-id' in your proposal for the init arguments?
Does the slot-id need to be the external uuid of the other end or is
there no direct connection?
Today slony allows us to replicate between two databases in the same
postgresql cluster (I use this for testing all the time)
Slony also allows for two different 'slony clusters' to be setup in the
same database (or so I'm told, I don't think I have ever tried this myself).
plugin functions that let me query the local database and then return
the uuid and origin_name would work in this model.
+1 on being able to mark the 'change origin' in a SET command when the
replication process is pushing data into the replica.
Exactly how we do this filtering is an open question. I think the output
plugin will at a minimum need to know:
a) What the slony node id is of the node it is running on. This is easy to
figure out if the output plugin is able/allowed to query its database. Will
this be possible? I would expect to be able to query the database as it
exists now(at plugin invocation time) not as it existed in the past when the
WAL was generated. In addition to the node ID I can see us wanting to be
able to query other slony tables (sl_table, sl_set etc...)
Hm. There is no fundamental reason not to allow normal database access
to the current database but it won't be all that cheap, so doing it
frequently is not a good idea.
The reason it's not cheap is that you basically need to tear down the
postgres internal caches if you switch the timestream in which you are
working.
Would go something like:
TransactionContext = AllocSetCreate(...);
RevertFromDecodingSnapshot();
InvalidateSystemCaches();
StartTransactionCommand();
/* do database work */
CommitTransactionCommand();
/* cleanup memory*/
SetupDecodingSnapshot(snapshot, data);
InvalidateSystemCaches();
Why do you need to be able to query the present? I thought it might be
necessary to allow additional tables to be accessed in a timetraveling
manner, but not this way round.
I guess an initial round of querying during plugin initialization won't
be good enough?
For example my output plugin would want the list of replicated tables
(or the list of tables replicated to a particular replica). This list
can change over time. As administrators issue commands to add or remove
tables to replication or otherwise reshape the cluster the output plugin
will need to know about this. I MIGHT be able to get away with having
slon disconnect and reconnect on reconfiguration events so only the
init() call would need this data, but I am not sure.
One of the ways slony allows you to shoot your foot off is by changing
certain configuration things (like dropping a table from a set) while a
subscription is in progress. Being able to timetravel the slony
configuration tables might make this type of foot-gun a lot harder to
encounter but that might be asking for too much.
b) What the slony node id is of the node we are streaming too. It would be
nice if we could pass extra, arbitrary data/parameters to the output plugins
that could include that, or other things. At the moment the
start_logical_replication rule in repl_gram.y doesn't allow for that but I
don't see why we couldn't make it do so.
Yes, I think we want something like that. I even asked for input on that
recently ;):
http://archives.postgresql.org/message-id/20121115014250.GA5844@awork2.anarazel.de
Input welcome!
How flexible will the datatypes for the arguments be? If I wanted to
pass in a list of tables (ie an array?) could I?
Above I talked about having the init() or change() methods query the
local database. Another option might be to make the slon build up this
data (by querying the database over a normal psql connection) and just
passing the data in. However that might mean passing in a list of a
few thousand table names, which doesn't sound like a good idea.
Even though, from a data-correctness point of view, slony could commit the
transaction on the replica after it sees the t1 commit, we won't want it to
do commits other than on a SYNC boundary. This means that the replicas will
continue to move between consistent SYNC snapshots and that we can still
track the state/progress of replication by knowing what events (SYNC or
otherwise) have been confirmed.
I don't know enough about slony internals, but: why? This will prohibit
you from ever doing (per-transaction) synchronous replication...
A lot of this has to do with the stuff I discuss in the section below on
cluster reshaping that you didn't understand. Slony depends on knowing
what data has, or hasn't, been sent to a replica at a particular event
id. If 'some' transactions in between two SYNC events have committed
but not others then slony has no idea what data it needs to get
elsewhere on a FAILOVER type event. There might be a way to make this
work otherwise but I'm not sure what that is and how long it will take
to debug out the issues.
Cool! Don't hesitate to mention anything that you think would make your
life easier, chances are that you're not the only one who could benefit
from it...
Thanks,
Andres
On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
On Mon, Nov 19, 2012 at 5:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
I am just looking at this patch and will provide some comments.
By the way, you forgot the installation part of pg_receivellog, please see
patch attached.
That actually was somewhat intended, I thought people wouldn't like the
name and I didn't want a binary that's going to be replaced anyway lying
around ;)
OK no problem. For sure this is going to happen, I was wondering myself if
it could be possible to merge pg_receivexlog and pg_receivellog into a
single utility with multiple modes :)
Don't really see that, the differences already are significant and imo
are bound to get bigger. Shouldn't live in pg_basebackup/ either..
Btw, here are some extra comments based on my progress, hope it will be
useful for other people playing around with your patches.
1) It is necessary to install the contrib module test_decoding on the server
side or the test case will not work.
2) I got the following logs on the server:
LOG: forced to assume catalog changes for xid 1370 because it was running
to early
WARNING: ABORT 1370
Actually I saw that there are many warnings like this.
Those aren't unexpected. Perhaps I should not make it a warning then...
A short explanation:
We can only decode tuples we see in the WAL when we already have a
timetravel catalog snapshot before that transaction started. To build
such a snapshot we need to collect information about committed transactions
which changed the catalog. Unfortunately we can't diagnose whether a txn
changed the catalog without a snapshot, so we just assume all committed
ones do - it just costs a bit of memory. That's the background of the
"forced to assume catalog changes for ..." message.
The reason for the ABORTs is related but different. We start out in the
"SNAPBUILD_START" state when we try to build a snapshot. When we find
initial information about running transactions (i.e. xl_running_xacts)
we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
decode all changes in transactions that start *after* the current
lsn. Earlier transactions might have tuples on a catalog state we can't
query.
Only when all transactions we observed as running before the
FULL_SNAPSHOT state have finished do we switch to SNAPBUILD_CONSISTENT.
As we want a consistent/reproducible set of transactions to produce
output via the logstream we only pass transactions to the output plugin
if they commit *after* CONSISTENT (they can start earlier though!). This
allows us to produce a pg_dump compatible snapshot in the moment we get
consistent that contains exactly the changes we won't stream out.
Makes sense?
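The state progression described above can be visualized with a toy model (the state names come from this explanation, not from the actual snapbuild.c code; this is just the transitions as explained):

```python
# Toy model of the snapshot-builder state progression described above:
# START -> FULL_SNAPSHOT on the first xl_running_xacts record, then
# FULL_SNAPSHOT -> CONSISTENT once every initially-running xact finishes.

START, FULL_SNAPSHOT, CONSISTENT = "START", "FULL_SNAPSHOT", "CONSISTENT"

class SnapBuild:
    def __init__(self):
        self.state = START
        self.running = set()  # xids observed running at the switch point

    def on_running_xacts(self, running_xids):
        # First xl_running_xacts record seen: we can now decode changes in
        # transactions that start after this point.
        if self.state == START:
            self.running = set(running_xids)
            self.state = FULL_SNAPSHOT if self.running else CONSISTENT

    def on_commit(self, xid):
        # When all initially-running transactions have finished, the
        # snapshot becomes consistent and decoding output can be streamed.
        self.running.discard(xid)
        if self.state == FULL_SNAPSHOT and not self.running:
            self.state = CONSISTENT

b = SnapBuild()
b.on_running_xacts([7, 9])
assert b.state == FULL_SNAPSHOT
b.on_commit(7)
b.on_commit(9)
assert b.state == CONSISTENT
```

This also illustrates why a manual CHECKPOINT unsticks pg_receivellog: it forces the xl_running_xacts record that triggers the first transition.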
3) Assertion failure while running pgbench, I was just curious to see how
it reacted when logical replication was put under a little bit of load.
TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
((snapstate->xmin_running) >= ((TransactionId) 3)))", File: "snapbuild.c",
Line: 877)
=> pgbench -i postgres; pgbench -T 500 -c 8 postgres
Can you reproduce this one? I would be interested in log output. Because
I did run pgbench for quite some time and I haven't seen that one after
fixing some issues last week.
It implies that snapstate->nrrunning has lost touch with reality...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2012-11-19 19:50:32 -0500, Steve Singer wrote:
On 12-11-18 11:07 AM, Andres Freund wrote:
I think we should provide some glue code to do this, otherwise people
will start replicating all the bugs I hacked into this... More
seriously: I think we should have support code here, no user will want
to learn the intricacies of feedback messages and such. Where that would
live? No idea.
libpglogicalrep.so ?
Yea. We don't really have the infrastructure for that yet
though... Robert and I were just talking about that recently...
I wholeheartedly agree. It should also be cleaned up a fair bit before
others copy it, should we not go for having some client-side library.
Imo the library could very roughly be something like:
state = SetupStreamingLLog(replication-slot, ...);
while((message = StreamingLLogNextMessage(state))
{
write(outfd, message->data, message->length);
if (received_100_messages)
{
fsync(outfd);
StreamingLLogConfirm(message);
}
}
Although I guess that's not good enough because StreamingLLogNextMessage
would be blocking, but that shouldn't be too hard to work around.
How about we pass a timeout value to StreamingLLogNextMessage (..) where it
returns if no data is available after the timeout to give the caller a
chance to do something else.
Doesn't really integrate into the sort of loop that's often built around
poll(2), select(2) and similar. It probably should return NULL if
there's nothing there yet and we should have a
StreamingLLogWaitForMessage() or such.
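A minimal sketch of that split between a non-blocking fetch and a separate wait call, modeled in Python for brevity (the class and method names here are illustrative stand-ins for the hypothetical C API discussed in the thread, with an in-memory buffer in place of a real connection):

```python
from collections import deque

class LLogStream:
    """Toy stand-in for the hypothetical client-library API discussed above."""
    def __init__(self, messages):
        self._buf = deque(messages)

    def next_message(self):
        # Non-blocking: return None when nothing is available yet, so the
        # caller's poll()/select() loop stays in control.
        return self._buf.popleft() if self._buf else None

    def wait_for_message(self):
        # Real code would poll the connection socket here (with a timeout).
        return bool(self._buf)

stream = LLogStream([b"BEGIN 1", b"INSERT ...", b"COMMIT 1"])
received = []
while stream.wait_for_message():
    msg = stream.next_message()
    if msg is None:
        continue
    received.append(msg)
    # every N messages: fsync the output and send a confirmation upstream

assert received == [b"BEGIN 1", b"INSERT ...", b"COMMIT 1"]
```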
This is basically the Slony 2.2 sl_log format minus a few columns we no
longer need (txid, actionseq).
command_args is a postgresql text array of column=value pairs. Ie [
{id=1},{name='steve'},{project='slony'}]
It seems to me that that makes escaping unnecessarily complicated, but
given you already have all the code... ;)
When I look at the actual code/representation we picked it is closer to
{column1,value1,column2,value2...}
Still means you need to escape and later parse columnN, valueN
values. I would have expected something like (length:data, length:data)+
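The length-prefixed framing suggested here could work roughly as follows (an illustrative sketch only, not Slony's or the patch's actual wire format):

```python
# Sketch of a "(length:data)+" framing: each column name and value is sent
# as <length>:<raw bytes>, so payloads may contain commas, braces, quotes,
# etc. without any escaping at all.

def encode(pairs):
    out = bytearray()
    for col, val in pairs:
        for field in (col, val):
            raw = field.encode()
            out += b"%d:" % len(raw) + raw
    return bytes(out)

def decode(buf):
    fields, i = [], 0
    while i < len(buf):
        j = buf.index(b":", i)          # end of the length prefix
        n = int(buf[i:j])               # payload length in bytes
        fields.append(buf[j + 1 : j + 1 + n].decode())
        i = j + 1 + n
    return list(zip(fields[::2], fields[1::2]))

msg = encode([("id", "1"), ("name", "ste,ve")])  # comma needs no escaping
assert decode(msg) == [("id", "1"), ("name", "ste,ve")]
```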
I don't think our output plugin will be much more complicated than the
test_decoding plugin.
Good. That's the idea ;). Are you ok with the interface as it is now or
would you like to change something?
I'm going to think about this some more and maybe try to write an example
plugin before I can say anything with confidence.
That would be very good.
Yes. We will also need something like that. If you remember the first
prototype we sent to the list, it included the concept of an
'origin_node' in wal record. I think you actually reviewed that one ;)
That was exactly aimed at something like this...
Since then my thoughts about how the origin_id looks like have changed a
bit:
- origin id is internally still represented as an uint32/Oid
- never visible outside of wal/system catalogs
- externally visible it gets
- assigned an uuid
- optionally assigned a user defined name
- user settable (permissions?) origin when executing sql:
- SET change_origin_uuid = 'uuid';
- SET change_origin_name = 'user-settable-name';
- defaults to the local node
- decoding callbacks get passed the origin of a change
- txn->{origin_uuid, origin_name, origin_internal?}
- the init decoding callback can setup an array of interesting origins,
so the others don't even get the ReorderBuffer treatment
I have to thank the discussion on -hackers and a march through Prague
with Marko here...
So would the uuid and optional name assignment be done in the output plugin
or somewhere else?
That would be postgres infrastructure. The output plugin would get
passed at least uuid and name and potentially the internal name as well
(might be useful to build some internal caching of information).
When/how does the uuid get generated and where do we store it so the same
uuid gets returned when postgres restarts. Slony today stores all this type
of stuff in user-level tables and user-level functions (because it has no
other choice).
Would need to be its own system catalog.
What is the connection between these values and the
'slot-id' in your proposal for the init arguments? Does the slot-id need to
be the external uuid of the other end or is there no direct connection?
None really. The "slot-id" really is only an identifier for a
replication connection (which should live longer than a single
postmaster run) which contains information about the point up to which
you replicated. We need to manage some local resources based on that.
Today slony allows us to replicate between two databases in the same
postgresql cluster (I use this for testing all the time)
Slony also allows for two different 'slony clusters' to be setup in the same
database (or so I'm told, I don't think I have ever tried this myself).
Yuck. I haven't thought about this very much. I honestly don't see
support for the first case right now. The second shouldn't be too hard,
we already have the database oid available everywhere we need it.
plugin functions that let me query the local database and then return the
uuid and origin_name would work in this model.
Should be possible.
+1 on being able to mark the 'change origin' in a SET command when the
replication process is pushing data into the replica.
Good.
Exactly how we do this filtering is an open question. I think the output
plugin will at a minimum need to know:
a) What the slony node id is of the node it is running on. This is easy to
figure out if the output plugin is able/allowed to query its database. Will
this be possible? I would expect to be able to query the database as it
exists now(at plugin invocation time) not as it existed in the past when the
WAL was generated. In addition to the node ID I can see us wanting to be
able to query other slony tables (sl_table, sl_set etc...)
Hm. There is no fundamental reason not to allow normal database access
to the current database but it won't be all that cheap, so doing it
frequently is not a good idea.
The reason it's not cheap is that you basically need to tear down the
postgres internal caches if you switch the timestream in which you are
working.
Would go something like:
TransactionContext = AllocSetCreate(...);
RevertFromDecodingSnapshot();
InvalidateSystemCaches();
StartTransactionCommand();
/* do database work */
CommitTransactionCommand();
/* cleanup memory*/
SetupDecodingSnapshot(snapshot, data);
InvalidateSystemCaches();
Why do you need to be able to query the present? I thought it might be
necessary to allow additional tables to be accessed in a timetraveling
manner, but not this way round.
I guess an initial round of querying during plugin initialization won't
be good enough?
For example my output plugin would want the list of replicated tables (or
the list of tables replicated to a particular replica). This list can change
over time. As administrators issue commands to add or remove tables to
replication or otherwise reshape the cluster the output plugin will need to
know about this. I MIGHT be able to get away with having slon disconnect
and reconnect on reconfiguration events so only the init() call would need
this data, but I am not sure.
One of the ways slony allows you to shoot your foot off is by changing
certain configuration things (like dropping a table from a set) while a
subscription is in progress. Being able to timetravel the slony
configuration tables might make this type of foot-gun a lot harder to
encounter but that might be asking for too much.
Actually timetravel access to those tables is considerably
easier/faster. I wanted to provide such tables anyway (because you need
them to safely write your own pg_enum alike types). It means that you
log slightly (32 + sizeof(XLogRecord) afair) more per modified row.
b) What the slony node id is of the node we are streaming too. It would be
nice if we could pass extra, arbitrary data/parameters to the output plugins
that could include that, or other things. At the moment the
start_logical_replication rule in repl_gram.y doesn't allow for that but I
don't see why we couldn't make it do so.
Yes, I think we want something like that. I even asked for input on that
recently ;):
http://archives.postgresql.org/message-id/20121115014250.GA5844@awork2.anarazel.de
Input welcome!
How flexible will the datatypes for the arguments be? If I wanted to pass in
a list of tables (ie an array?) could I?
I was thinking of just a textual (key = value, ...) style list, similar
to options to EXPLAIN, COPY et al.
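Such a textual option list might be handled roughly like this (purely illustrative; neither the option names nor the exact syntax were settled in this discussion):

```python
# Sketch of parsing a "(key = value, ...)" option list, in the style of
# options to EXPLAIN/COPY. Keys and values here are made up for illustration.

def parse_options(s):
    s = s.strip()
    assert s.startswith("(") and s.endswith(")")
    opts = {}
    for item in s[1:-1].split(","):
        key, _, val = item.partition("=")
        opts[key.strip()] = val.strip().strip("'")
    return opts

assert parse_options("(tables = 'public.*', node = '3')") == {
    "tables": "public.*",
    "node": "3",
}
```

Note this simple scheme cannot carry a list of thousands of table names comfortably, which matches the conclusion above that such data is better queried than passed in.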
Above I talked about having the init() or change() methods query the local
database. Another option might be to make the slon build up this data (by
querying the database over a normal psql connection) and just passing the
data in. However that might mean passing in a list of a few thousand table
names, which doesn't sound like a good idea.
No, it certainly doesn't.
Even though, from a data-correctness point of view, slony could commit the
transaction on the replica after it sees the t1 commit, we won't want it to
do commits other than on a SYNC boundary. This means that the replicas will
continue to move between consistent SYNC snapshots and that we can still
track the state/progress of replication by knowing what events (SYNC or
otherwise) have been confirmed.
I don't know enough about slony internals, but: why? This will prohibit
you from ever doing (per-transaction) synchronous replication...
A lot of this has to do with the stuff I discuss in the section below on
cluster reshaping that you didn't understand. Slony depends on knowing what
data has, or hasn't, been sent to a replica at a particular event id. If
'some' transactions in between two SYNC events have committed but not others
then slony has no idea what data it needs to get elsewhere on a FAILOVER
type event. There might be a way to make this work otherwise but I'm not
sure what that is and how long it will take to debug out the issues.
Ah, it starts to make sense.
The way I solved that issue in the prototype from around PGCon was that
I included the LSN from the original commit record in the remote
transaction into the commit record of the local transaction (with the
origin_id set to the remote side). That allowed to trivially restore the
exact state of replication after a crash even with
synchronous_commit=off as during replay you could simply ensure the
replication-surely-received-lsn of every remote side was up to date.
Then you can simply do a START_LOGICAL_REPLICATION 'slot'
just-recovered/lsn; and restart applying (*not*
INIT_LOGICAL_REPLICATION).
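The recovery idea can be sketched as a toy model (field names are illustrative, not the actual commit-record layout): because each local commit carries the origin's LSN, the maximum replayed LSN per origin after a crash tells us where to restart each START_LOGICAL_REPLICATION stream.

```python
# Toy model: local commit records tagged with the remote origin and the LSN
# of the original commit on that origin.
local_commits = [
    {"origin": "nodeA", "origin_lsn": 0x1000},
    {"origin": "nodeA", "origin_lsn": 0x1450},
    {"origin": "nodeB", "origin_lsn": 0x0200},
]

def recovered_start_points(commits):
    # After crash recovery, the highest origin LSN seen per origin is the
    # point from which to resume applying (even with synchronous_commit=off).
    progress = {}
    for c in commits:
        progress[c["origin"]] = max(progress.get(c["origin"], 0),
                                    c["origin_lsn"])
    return progress

assert recovered_start_points(local_commits) == {"nodeA": 0x1450,
                                                 "nodeB": 0x0200}
```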
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Those aren't unexpected. Perhaps I should not make it a warning then...
A short explanation:
We can only decode tuples we see in the WAL when we already have a
timetravel catalog snapshot before that transaction started. To build
such a snapshot we need to collect information about committed transactions
which changed the catalog. Unfortunately we can't diagnose whether a txn
changed the catalog without a snapshot, so we just assume all committed
ones do - it just costs a bit of memory. That's the background of the
"forced to assume catalog changes for ..." message.
The reason for the ABORTs is related but different. We start out in the
"SNAPBUILD_START" state when we try to build a snapshot. When we find
initial information about running transactions (i.e. xl_running_xacts)
we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
decode all changes in transactions that start *after* the current
lsn. Earlier transactions might have tuples on a catalog state we can't
query.
Only when all transactions we observed as running before the
FULL_SNAPSHOT state have finished do we switch to SNAPBUILD_CONSISTENT.
As we want a consistent/reproducible set of transactions to produce
output via the logstream, we only pass transactions to the output plugin
if they commit *after* CONSISTENT (they can start earlier though!). This
allows us to produce a pg_dump compatible snapshot at the moment we become
consistent that contains exactly the changes we won't stream out.
Makes sense?
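To make the state progression concrete, here is a toy model of the three states (deliberately simplified - not the actual snapbuild.c code, which tracks xid ranges rather than a bare counter):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the snapshot builder states described above. */
typedef enum
{
    SNAPBUILD_START,          /* no information about running xacts yet */
    SNAPBUILD_FULL_SNAPSHOT,  /* can decode txns starting after this point */
    SNAPBUILD_CONSISTENT      /* all initially-running txns have finished */
} SnapBuildState;

typedef struct
{
    SnapBuildState state;
    int nrrunning;  /* txns seen running at the xl_running_xacts record */
} SnapBuild;

/* Seeing xl_running_xacts gives us the initial running-xact set. */
static void
saw_running_xacts(SnapBuild *sb, int xcnt)
{
    if (sb->state == SNAPBUILD_START)
    {
        sb->nrrunning = xcnt;
        sb->state = (xcnt == 0) ? SNAPBUILD_CONSISTENT
                                : SNAPBUILD_FULL_SNAPSHOT;
    }
}

/* Each initially-running txn that ends brings us closer to CONSISTENT. */
static void
initial_txn_finished(SnapBuild *sb)
{
    if (sb->state == SNAPBUILD_FULL_SNAPSHOT && --sb->nrrunning == 0)
        sb->state = SNAPBUILD_CONSISTENT;
}

/* Only txns committing after CONSISTENT reach the output plugin. */
static bool
pass_to_output_plugin(const SnapBuild *sb)
{
    return sb->state == SNAPBUILD_CONSISTENT;
}
```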
3) Assertion failure while running pgbench, I was just curious to see
how
it reacted when logical replication was put under a little bit of load.
TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
((snapstate->xmin_running) >= ((TransactionId) 3)))", File:"snapbuild.c",
Line: 877)
=> pgbench -i postgres; pgbench -T 500 -c 8 postgres
Can you reproduce this one? I would be interested in log output. Because
I did run pgbench for quite some time and I haven't seen that one after
fixing some issues last week.
It implies that snapstate->nrrunning has lost touch with reality...
Yes, I can reproduce in 10-20 seconds in one of my linux boxes. I haven't
outputted anything in the logs, but here is the backtrace of the core file
produced.
#2 0x0000000000865145 in ExceptionalCondition (conditionName=0xa15100
"!(((xid) >= ((TransactionId) 3)) && ((snapstate->xmin_running) >=
((TransactionId) 3)))", errorType=0xa14f3b "FailedAssertion",
fileName=0xa14ed0 "snapbuild.c", lineNumber=877) at assert.c:54
#3 0x000000000070c409 in SnapBuildTxnIsRunning (snapstate=0x19e4f10,
xid=0) at snapbuild.c:877
#4 0x000000000070b8e4 in SnapBuildProcessChange (reorder=0x19e4e80,
snapstate=0x19e4f10, xid=0, buf=0x1a0a368, relfilenode=0x1a0a450) at
snapbuild.c:388
#5 0x000000000070c088 in SnapBuildDecodeCallback (reorder=0x19e4e80,
snapstate=0x19e4f10, buf=0x1a0a368) at snapbuild.c:732
#6 0x00000000007080b9 in DecodeRecordIntoReorderBuffer (reader=0x1a08300,
state=0x19e4e20, buf=0x1a0a368) at decode.c:84
#7 0x0000000000708cad in replay_finished_record (state=0x1a08300,
buf=0x1a0a368) at logicalfuncs.c:54
#8 0x00000000004d8033 in XLogReaderRead (state=0x1a08300) at
xlogreader.c:965
#9 0x000000000070f7c3 in XLogSendLogical (caughtup=0x7fffb22c35fb "") at
walsender.c:1721
#10 0x000000000070ea05 in WalSndLoop (send_data=0x70f6e2 <XLogSendLogical>)
at walsender.c:1184
#11 0x000000000070e0eb in StartLogicalReplication (cmd=0x190eaa8) at
walsender.c:726
#12 0x000000000070e3ac in exec_replication_command (cmd_string=0x19a65c8
"START_LOGICAL_REPLICATION 'id-0' 0/7E1855C") at walsender.c:853
#13 0x0000000000753ee0 in PostgresMain (argc=2, argv=0x18f63d8,
username=0x18f62a8 "michael") at postgres.c:3974
#14 0x00000000006f13ea in BackendRun (port=0x1912600) at postmaster.c:3668
#15 0x00000000006f0b76 in BackendStartup (port=0x1912600) at
postmaster.c:3352
#16 0x00000000006ed900 in ServerLoop () at postmaster.c:1431
#17 0x00000000006ed208 in PostmasterMain (argc=13, argv=0x18f40a0) at
postmaster.c:1180
#18 0x0000000000657517 in main (argc=13, argv=0x18f40a0) at main.c:197
I'm keeping this core and the binary btw.
--
Michael Paquier
http://michael.otacoo.com
On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
On Mon, Nov 19, 2012 at 5:50 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
I am just looking at this patch and will provide some comments.
By the way, you forgot the installation part of pg_receivellog, please see
patch attached.
That actually was somewhat intended, I thought people wouldn't like the
name and I didn't want a binary that's going to be replaced anyway lying
around ;)
OK no problem. For sure this is going to happen, I was wondering myself
if
it could be possible to merge pg_receivexlog and pg_receivellog into a
single utility with multiple modes :)
Don't really see that, the differences already are significant and imo
are bound to get bigger. Shouldn't live in pg_basebackup/ either...
I am sure that this will be the object of many future discussions.
Btw, here are some extra comments based on my progress, hope it will be
useful for other people playing around with your patches.
1) Necessary to install the contrib module test_decoding on the server side or
the test case will not work.
2) I got the following logs on the server:
LOG: forced to assume catalog changes for xid 1370 because it was running
too early
WARNING: ABORT 1370
Actually I saw that there are many warnings like this.
Those aren't unexpected. Perhaps I should not make it a warning then...
A NOTICE would be more adapted, a WARNING means that something that may
endanger the system has happened, but as far as I understand from your
explanation this is not the case.
A short explanation:
We can only decode tuples we see in the WAL when we already have a
timetravel catalog snapshot before that transaction started. To build
such a snapshot we need to collect information about committed transactions
which changed the catalog. Unfortunately we can't diagnose whether a txn
changed the catalog without a snapshot, so we just assume all committed
ones do - it just costs a bit of memory. That's the background of the
"forced to assume catalog changes for ..." message.
OK, so this snapshot only needs to include the XIDs of transactions that
have modified the catalogs. Do I get it right? This way you are able to
fetch the correct relation definition for replication decoding.
Just thinking, but... it looks like a waste to store the XIDs
of all the committed transactions; on the other hand there is no way to
track the XIDs of transactions that modified a catalog in the current core
code. So yes, this approach is better, as refining the transaction XID
tracking for snapshot reconstruction is something that could be improved
later. Those are only thoughts though...
The reason for the ABORTs is related but different. We start out in the
"SNAPBUILD_START" state when we try to build a snapshot. When we find
initial information about running transactions (i.e. xl_running_xacts)
we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
decode all changes in transactions that start *after* the current
lsn. Earlier transactions might have tuples on a catalog state we can't
query.
Just to be clear, lsn means the log-sequence number associated with each xlog
record?
Only when all transactions we observed as running before the
FULL_SNAPSHOT state have finished we switch to SNAPBUILD_CONSISTENT.
As we want a consistent/reproducible set of transactions to produce
output via the logstream we only pass transactions to the output plugin
if they commit *after* CONSISTENT (they can start earlier though!). This
allows us to produce a pg_dump compatible snapshot in the moment we get
consistent that contains exactly the changes we won't stream out.
Makes sense?
OK got it thanks for your explanation.
So, once again coming to it, we need in the snapshot built only the XIDs of
transactions that modified the catalogs to get a consistent view of
relation info for decoding.
Really, I think that refining the XID tracking to minimize the size of the
snapshot built for decoding would be key for performance
improvement, especially for OLTP-type applications (lots of transactions
involved, few of them involving catalogs).
--
Michael Paquier
http://michael.otacoo.com
On 2012-11-21 15:28:30 +0900, Michael Paquier wrote:
On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
Btw, here are some extra comments based on my progress, hope it will be
useful for other people playing around with your patches.
1) Necessary to install the contrib module test_decoding on the server side or
the test case will not work.
2) I got the following logs on the server:
LOG: forced to assume catalog changes for xid 1370 because it was running
too early
WARNING: ABORT 1370
Actually I saw that there are many warnings like this.
Those aren't unexpected. Perhaps I should not make it a warning then...
A NOTICE would be more adapted, a WARNING means that something that may
endanger the system has happened, but as far as I understand from your
explanation this is not the case.
I think it should go DEBUG2 or so once we're a bit more confident about
the code.
A short explanation:
We can only decode tuples we see in the WAL when we already have a
timetravel catalog snapshot before that transaction started. To build
such a snapshot we need to collect information about committed transactions
which changed the catalog. Unfortunately we can't diagnose whether a txn
changed the catalog without a snapshot, so we just assume all committed
ones do - it just costs a bit of memory. That's the background of the
"forced to assume catalog changes for ..." message.
OK, so this snapshot only needs to include the XIDs of transactions that
have modified the catalogs. Do I get it right? This way you are able to
fetch the correct relation definition for replication decoding.
Yes. We only carry those between (recentXmin, newestCatalogModifyingTxn),
so it's not all of them. Normal snapshots carry all in-progress
transaction ids instead of the committed ones, but that would have been
far more in our case (only a minority of txns touch the catalog) and it
has problems with subtransaction tracking.
Just thinking but... It looks to be a waste to store the transactions XIDs
of all the committed transactions, but on the other hand there is no way to
track the XIDs of transactions that modified a catalog in current core
code. So yes this approach is better as refining the transaction XID
tracking for snapshot reconstruction is something that could be improved
later. Those are only thoughts though...
We actually only track xids of catalog modifying transactions once we
hit the CONSISTENT state. Before the initial snapshot we can't detect
that.
The reason for the ABORTs is related but different. We start out in the
"SNAPBUILD_START" state when we try to build a snapshot. When we find
initial information about running transactions (i.e. xl_running_xacts)
we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
decode all changes in transactions that start *after* the current
lsn. Earlier transactions might have tuples on a catalog state we can't
query.
Just to be clear, lsn means the log-sequence number associated with each xlog
record?
Yes. And that number is just the position in the stream.
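For the archives, since LSNs like 0/178C1B8 show up throughout this thread: the textual form is just that 64-bit stream position printed as two 32-bit hex halves. A small standalone sketch of the conversion (PostgreSQL has its own types and macros for this; the helper names below are made up):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * An LSN like "0/178C1B8" is a 64-bit byte position in the WAL stream,
 * printed as high 32 bits, a slash, low 32 bits - both in hex.
 */
static uint64_t
lsn_from_string(const char *str)
{
    unsigned int hi, lo;

    if (sscanf(str, "%X/%X", &hi, &lo) != 2)
        return 0;
    return ((uint64_t) hi << 32) | lo;
}

static void
lsn_to_string(uint64_t lsn, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%X/%X",
             (unsigned int) (lsn >> 32), (unsigned int) lsn);
}
```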
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Those aren't unexpected. Perhaps I should not make it a warning then...
A short explanation:
We can only decode tuples we see in the WAL when we already have a
timetravel catalog snapshot before that transaction started. To build
such a snapshot we need to collect information about committed transactions
which changed the catalog. Unfortunately we can't diagnose whether a txn
changed the catalog without a snapshot, so we just assume all committed
ones do - it just costs a bit of memory. That's the background of the
"forced to assume catalog changes for ..." message.
The reason for the ABORTs is related but different. We start out in the
"SNAPBUILD_START" state when we try to build a snapshot. When we find
initial information about running transactions (i.e. xl_running_xacts)
we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
decode all changes in transactions that start *after* the current
lsn. Earlier transactions might have tuples on a catalog state we can't
query.
Only when all transactions we observed as running before the
FULL_SNAPSHOT state have finished we switch to SNAPBUILD_CONSISTENT.
As we want a consistent/reproducible set of transactions to produce
output via the logstream we only pass transactions to the output plugin
if they commit *after* CONSISTENT (they can start earlier though!). This
allows us to produce a pg_dump compatible snapshot in the moment we get
consistent that contains exactly the changes we won't stream out.
Makes sense?
3) Assertion failure while running pgbench, I was just curious to see
how
it reacted when logical replication was put under a little bit of load.
TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
((snapstate->xmin_running) >= ((TransactionId) 3)))", File:"snapbuild.c",
Line: 877)
=> pgbench -i postgres; pgbench -T 500 -c 8 postgres
Can you reproduce this one? I would be interested in log output. Because
I did run pgbench for quite some time and I haven't seen that one after
fixing some issues last week.
It implies that snapstate->nrrunning has lost touch with reality...
Yes, I can reproduce in 10-20 seconds in one of my linux boxes. I haven't
outputted anything in the logs, but here is the backtrace of the core file
produced.
Could you run it with log_level=DEBUG2?
Do you run pgbench after you've reached a consistent state (by issuing a
manual checkpoint)?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Those aren't unexpected. Perhaps I should not make it a warning then...
A short explanation:
We can only decode tuples we see in the WAL when we already have a
timetravel catalog snapshot before that transaction started. To build
such a snapshot we need to collect information about committed transactions
which changed the catalog. Unfortunately we can't diagnose whether a txn
changed the catalog without a snapshot, so we just assume all committed
ones do - it just costs a bit of memory. That's the background of the
"forced to assume catalog changes for ..." message.
The reason for the ABORTs is related but different. We start out in the
"SNAPBUILD_START" state when we try to build a snapshot. When we find
initial information about running transactions (i.e. xl_running_xacts)
we switch to the "SNAPBUILD_FULL_SNAPSHOT" state which means we can
decode all changes in transactions that start *after* the current
lsn. Earlier transactions might have tuples on a catalog state we can't
query.
Only when all transactions we observed as running before the
FULL_SNAPSHOT state have finished we switch to SNAPBUILD_CONSISTENT.
As we want a consistent/reproducible set of transactions to produce
output via the logstream we only pass transactions to the output plugin
if they commit *after* CONSISTENT (they can start earlier though!). This
allows us to produce a pg_dump compatible snapshot in the moment we get
consistent that contains exactly the changes we won't stream out.
Makes sense?
3) Assertion failure while running pgbench, I was just curious to see
how
it reacted when logical replication was put under a little bit of load.
TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
((snapstate->xmin_running) >= ((TransactionId) 3)))", File:"snapbuild.c",
Line: 877)
=> pgbench -i postgres; pgbench -T 500 -c 8 postgres
Can you reproduce this one? I would be interested in log output. Because
I did run pgbench for quite some time and I haven't seen that one after
fixing some issues last week.
It implies that snapstate->nrrunning has lost touch with reality...
Yes, I can reproduce in 10-20 seconds in one of my linux boxes. I haven't
outputted anything in the logs, but here is the backtrace of the core file
produced.
Ah, I see. Could you try the following diff?
diff --git a/src/backend/replication/logical/snapbuild.c
b/src/backend/replication/logical/snapbuild.c
index df24b33..797a126 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -471,6 +471,7 @@ SnapBuildDecodeCallback(ReorderBuffer *reorder,
Snapstate *snapstate,
*/
snapstate->transactions_after = buf->origptr;
+ snapstate->nrrunning = running->xcnt;
snapstate->xmin_running = InvalidTransactionId;
snapstate->xmax_running = InvalidTransactionId;
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 21, 2012 at 4:31 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
It implies that snapstate->nrrunning has lost touch with reality...
Yes, I can reproduce in 10-20 seconds in one of my linux boxes. I haven't
outputted anything in the logs, but here is the backtrace of the core file
produced.
Could you run it with log_level=DEBUG2?
Let me try.
Do you run pgbench after you've reached a consistent state (by issuing a
manual checkpoint)?
Yes. I issue a manual checkpoint to initialize the replication.
--
Michael Paquier
http://michael.otacoo.com
On Wed, Nov 21, 2012 at 4:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-11-21 15:28:30 +0900, Michael Paquier wrote:
On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
Btw, here are some extra comments based on my progress, hope it will
be
useful for other people playing around with your patches.
1) Necessary to install the contrib module test_decoding on the server side
or
the test case will not work.
2) I got the following logs on the server:
LOG: forced to assume catalog changes for xid 1370 because it was running
too early
WARNING: ABORT 1370
Actually I saw that there are many warnings like this.
Those aren't unexpected. Perhaps I should not make it a warning then...
A NOTICE would be more adapted, a WARNING means that something that may
endanger the system has happened, but as far as I understand from your
explanation this is not the case.
I think it should go DEBUG2 or so once we're a bit more confident about
the code.
A short explanation:
We can only decode tuples we see in the WAL when we already have a
timetravel catalog snapshot before that transaction started. To build
such a snapshot we need to collect information about committed transactions
which changed the catalog. Unfortunately we can't diagnose whether a txn
changed the catalog without a snapshot, so we just assume all committed
ones do - it just costs a bit of memory. That's the background of the
"forced to assume catalog changes for ..." message.
OK, so this snapshot only needs to include the XIDs of transactions that
have modified the catalogs. Do I get it right? This way you are able to
fetch the correct relation definition for replication decoding.
Yes. We only carry those between (recentXmin, newestCatalogModifyingTxn),
so it's not all of them. Normal snapshots carry all in-progress
transaction ids instead of the committed ones, but that would have been
far more in our case (only a minority of txns touch the catalog) and it
has problems with subtransaction tracking.
Hum. I might have missed something, but what is the variable tracking the
newest XID that modified catalogs?
I can see of course recentXmin in snapmgr.c but nothing related to what you
describe.
Just thinking but... It looks to be a waste to store the transactions
XIDs
of all the committed transactions, but on the other hand there is no way
to
track the XIDs of transactions that modified a catalog in current core
code. So yes this approach is better as refining the transaction XID
tracking for snapshot reconstruction is something that could be improved
later. Those are only thoughts though...
We actually only track xids of catalog modifying transactions once we
hit the CONSISTENT state. Before the initial snapshot we can't detect
that.
How do you track them? I think I need to go deeper in the code before
asking more...
--
Michael Paquier
http://michael.otacoo.com
On 2012-11-21 16:47:11 +0900, Michael Paquier wrote:
On Wed, Nov 21, 2012 at 4:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-11-21 15:28:30 +0900, Michael Paquier wrote:
On Tue, Nov 20, 2012 at 8:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-11-20 09:30:40 +0900, Michael Paquier wrote:
Btw, here are some extra comments based on my progress, hope it will
be
useful for other people playing around with your patches.
1) Necessary to install the contrib module test_decoding on the server side
or
the test case will not work.
2) I got the following logs on the server:
LOG: forced to assume catalog changes for xid 1370 because it was running
too early
WARNING: ABORT 1370
Actually I saw that there are many warnings like this.
Those aren't unexpected. Perhaps I should not make it a warning then...
A NOTICE would be more adapted, a WARNING means that something that may
endanger the system has happened, but as far as I understand from your
explanation this is not the case.
I think it should go DEBUG2 or so once we're a bit more confident about
the code.
A short explanation:
We can only decode tuples we see in the WAL when we already have a
timetravel catalog snapshot before that transaction started. To build
such a snapshot we need to collect information about committed transactions
which changed the catalog. Unfortunately we can't diagnose whether a txn
changed the catalog without a snapshot, so we just assume all committed
ones do - it just costs a bit of memory. That's the background of the
"forced to assume catalog changes for ..." message.
OK, so this snapshot only needs to include the XIDs of transactions that
have modified the catalogs. Do I get it right? This way you are able to
fetch the correct relation definition for replication decoding.
Yes. We only carry those between (recentXmin, newestCatalogModifyingTxn),
so it's not all of them. Normal snapshots carry all in-progress
transaction ids instead of the committed ones, but that would have been
far more in our case (only a minority of txns touch the catalog) and it
has problems with subtransaction tracking.
Hum. I might have missed something, but what is the variable tracking the
newest XID that modified catalogs?
I can see of course recentXmin in snapmgr.c but nothing related to what you
describe.
We determine that ourselves.
SnapBuildCommitTxn(Snapstate *snapstate, ReorderBuffer *reorder,
                   XLogRecPtr lsn, TransactionId xid,
                   int nsubxacts, TransactionId *subxacts)
{
    ...
    if (forced_timetravel || top_does_timetravel || sub_does_timetravel)
    {
        if (!TransactionIdIsValid(snapstate->xmax) ||
            NormalTransactionIdFollows(xid, snapstate->xmax))
        {
            snapstate->xmax = xid;
            TransactionIdAdvance(snapstate->xmax);
        }
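In other words, xmax is kept one past the newest catalog-modifying committed xid. A toy restatement of that excerpt (ignoring xid wraparound, which TransactionIdAdvance handles in the real code, and collapsing the follows-check into a plain comparison):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/*
 * Toy model of the xmax maintenance in the SnapBuildCommitTxn excerpt:
 * after each catalog-modifying commit, xmax ends up one past the newest
 * such xid, analogous to a snapshot's xmax.
 */
static void
advance_xmax(TransactionId *xmax, TransactionId committed_xid)
{
    if (*xmax == InvalidTransactionId || committed_xid >= *xmax)
        *xmax = committed_xid + 1;   /* i.e. TransactionIdAdvance */
}
```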
Just thinking but... It looks to be a waste to store the transactions
XIDs
of all the committed transactions, but on the other hand there is no way
to
track the XIDs of transactions that modified a catalog in current core
code. So yes this approach is better as refining the transaction XID
tracking for snapshot reconstruction is something that could be improved
later. Those are only thoughts though...
We actually only track xids of catalog modifying transactions once we
hit the CONSISTENT state. Before the initial snapshot we can't detect
that.
How do you track them? I think I need to go deeper in the code before
asking more...
You mean, how do I detect that they are catalog modifying? By asking the
reorderbuffer (ReorderBufferXidDoesTimetravel(...)). That one knows
because we told it so (ReorderBufferXidSetTimetravel()) and we do that
by looking at the type of records we've seen incoming for that xid
(HEAP_INPLACE, HEAP2_NEW_CID tell us it's doing timetravel).
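A toy model of that flagging logic (the types and record names below are stand-ins for illustration, not the real reorderbuffer structures):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record kinds; in the real WAL these are heap/heap2
 * record info bits, shown here only to illustrate the idea. */
typedef enum { REC_HEAP_INSERT, REC_HEAP_INPLACE, REC_HEAP2_NEW_CID } RecType;

#define MAX_TRACKED 64

/* Toy stand-in for the reorderbuffer's per-xid timetravel flag
 * (ReorderBufferXidSetTimetravel / ReorderBufferXidDoesTimetravel). */
typedef struct
{
    uint32_t xids[MAX_TRACKED];
    bool timetravel[MAX_TRACKED];
    int n;
} ReorderBufferToy;

static int
find_or_add(ReorderBufferToy *rb, uint32_t xid)
{
    for (int i = 0; i < rb->n; i++)
        if (rb->xids[i] == xid)
            return i;
    rb->xids[rb->n] = xid;
    rb->timetravel[rb->n] = false;
    return rb->n++;
}

/* Called while decoding: certain record types imply catalog changes,
 * so the xid that produced them needs a timetravel snapshot. */
static void
process_record(ReorderBufferToy *rb, uint32_t xid, RecType type)
{
    int i = find_or_add(rb, xid);

    if (type == REC_HEAP_INPLACE || type == REC_HEAP2_NEW_CID)
        rb->timetravel[i] = true;
}

static bool
xid_does_timetravel(ReorderBufferToy *rb, uint32_t xid)
{
    for (int i = 0; i < rb->n; i++)
        if (rb->xids[i] == xid)
            return rb->timetravel[i];
    return false;
}
```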
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 21, 2012 at 4:34 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
Ah, I see. Could you try the following diff?
diff --git a/src/backend/replication/logical/snapbuild.c
b/src/backend/replication/logical/snapbuild.c
index df24b33..797a126 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -471,6 +471,7 @@ SnapBuildDecodeCallback(ReorderBuffer *reorder,
Snapstate *snapstate,
*/
snapstate->transactions_after = buf->origptr;
+ snapstate->nrrunning = running->xcnt;
snapstate->xmin_running = InvalidTransactionId;
snapstate->xmax_running = InvalidTransactionId;
I am still getting the same assertion failure even with this diff included.
--
Michael Paquier
http://michael.otacoo.com
On Thu, Nov 15, 2012 at 9:13 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-11-15 09:06:23 -0800, Jeff Janes wrote:
On Wed, Nov 14, 2012 at 5:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
---
src/bin/Makefile | 2 +-
src/bin/xlogdump/Makefile | 25 +++
src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 494 insertions(+), 1 deletion(-)
create mode 100644 src/bin/xlogdump/Makefile
create mode 100644 src/bin/xlogdump/xlogdump.c
Is this intended to be the successor of
https://github.com/snaga/xlogdump which will then be deprecated?
As-is this is just a development tool which was sorely needed for the
development of this patchset. But yes I think that once ready
(xlogreader infrastructure, *_desc routines split out) it should
definitely be able to do most of what the above xlogdump can do and it
should live in bin/. I think mostly some filtering is missing.
That doesn't really "deprecate" the above though.
Does that answer your question?
Yes, I think so. Thanks.
(I've just recently gotten the original xlogdump to work for me in
9.2, and I had been wondering if back-porting yours to 9.2 would have
been an easier way to go.)
Cheers,
Jeff
On 2012-11-21 14:57:14 -0800, Jeff Janes wrote:
On Thu, Nov 15, 2012 at 9:13 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-11-15 09:06:23 -0800, Jeff Janes wrote:
On Wed, Nov 14, 2012 at 5:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
---
src/bin/Makefile | 2 +-
src/bin/xlogdump/Makefile | 25 +++
src/bin/xlogdump/xlogdump.c | 468 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 494 insertions(+), 1 deletion(-)
create mode 100644 src/bin/xlogdump/Makefile
create mode 100644 src/bin/xlogdump/xlogdump.c
Is this intended to be the successor of
https://github.com/snaga/xlogdump which will then be deprecated?
As-is this is just a development tool which was sorely needed for the
development of this patchset. But yes I think that once ready
(xlogreader infrastructure, *_desc routines split out) it should
definitely be able to do most of what the above xlogdump can do and it
should live in bin/. I think mostly some filtering is missing.
That doesn't really "deprecate" the above though.
Does that answer your question?
Yes, I think so. Thanks.
(I've just recently gotten the original xlogdump to work for me in
9.2, and I had been wondering if back-porting yours to 9.2 would have
been an easier way to go.)
I don't think you would have much fun doing so - the WAL format changes
between 9.2 and 9.3 make this larger than one might think. I had a
version that worked with the previous format but there have been some
interface changes since then...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-11-21 18:35:34 +0900, Michael Paquier wrote:
On Wed, Nov 21, 2012 at 4:34 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-11-21 14:57:08 +0900, Michael Paquier wrote:
Ah, I see. Could you try the following diff?
diff --git a/src/backend/replication/logical/snapbuild.c
b/src/backend/replication/logical/snapbuild.c
index df24b33..797a126 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -471,6 +471,7 @@ SnapBuildDecodeCallback(ReorderBuffer *reorder,
Snapstate *snapstate,
*/
snapstate->transactions_after = buf->origptr;
+ snapstate->nrrunning = running->xcnt;
snapstate->xmin_running = InvalidTransactionId;
snapstate->xmax_running = InvalidTransactionId;
I am still getting the same assertion failure even with this diff included.
I really don't understand what's going on here then. You've said you made
sure that there is a catalog snapshot. Which means you would need
something like:
WARNING: connecting to postgres
WARNING: Initiating logical rep
LOG: computed new xmin: 16566894
LOG: start reading from 3/E62457C0, scrolled back to 3/E6244000
LOG: found initial snapshot (via running xacts). Done: 1
WARNING: reached consistent point, stopping!
WARNING: Starting logical replication
LOG: start reading from 3/E62457C0, scrolled back to 3/E6244000
LOG: found initial snapshot (via running xacts). Done: 1
in the log *and* it means that snapbuild->state has to be
CONSISTENT. But the backtrace you've posted:
#3 0x000000000070c409 in SnapBuildTxnIsRunning (snapstate=0x19e4f10,xid=0) at snapbuild.c:877
#4 0x000000000070b8e4 in SnapBuildProcessChange (reorder=0x19e4e80,snapstate=0x19e4f10, xid=0, buf=0x1a0a368, relfilenode=0x1a0a450) at snapbuild.c:388
#5 0x000000000070c088 in SnapBuildDecodeCallback (reorder=0x19e4e80,snapstate=0x19e4f10, buf=0x1a0a368) at snapbuild.c:732
shows pretty clearly that snapstate *can't* be consistent because line 387ff is:
else if (snapstate->state < SNAPBUILD_CONSISTENT &&
SnapBuildTxnIsRunning(snapstate, xid))
;
so #3 #4 can't happen at those line numbers with state == CONSISTENT.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 22, 2012 at 8:25 AM, Andres Freund <andres@2ndquadrant.com> wrote:
I really don't understand what's going on here then. You've said you made
sure that there is a catalog snapshot. Which means you would need
something like:
WARNING: connecting to postgres
WARNING: Initiating logical rep
LOG: computed new xmin: 16566894
LOG: start reading from 3/E62457C0, scrolled back to 3/E6244000
LOG: found initial snapshot (via running xacts). Done: 1
WARNING: reached consistent point, stopping!
WARNING: Starting logical replication
LOG: start reading from 3/E62457C0, scrolled back to 3/E6244000
LOG: found initial snapshot (via running xacts). Done: 1
in the log *and* it means that snapbuild->state has to be
CONSISTENT. But the backtrace you've posted:
#3 0x000000000070c409 in SnapBuildTxnIsRunning
(snapstate=0x19e4f10,xid=0) at snapbuild.c:877
#4 0x000000000070b8e4 in SnapBuildProcessChange
(reorder=0x19e4e80,snapstate=0x19e4f10, xid=0, buf=0x1a0a368,
relfilenode=0x1a0a450) at snapbuild.c:388
#5 0x000000000070c088 in SnapBuildDecodeCallback
(reorder=0x19e4e80, snapstate=0x19e4f10, buf=0x1a0a368) at snapbuild.c:732
shows pretty clearly that snapstate *can't* be consistent because line
387ff is:
else if (snapstate->state < SNAPBUILD_CONSISTENT &&
SnapBuildTxnIsRunning(snapstate, xid))
;
so #3 #4 can't happen at those line numbers with state == CONSISTENT.
Still this *impossible* thing happens.
Here are some more information on the logs I get on server side:
Yes I have the logical replication correctly initialized:
[629 0] LOG: database system was shut down at 2012-11-22 09:02:42 JST
[628 0] LOG: database system is ready to accept connections
[633 0] LOG: autovacuum launcher started
[648 0] WARNING: connecting to postgres
[648 0] WARNING: Initiating logical rep
[648 0] LOG: computed new xmin: 684
[648 0] LOG: start reading from 0/178C1B8, scrolled back to 0/178C000
And I am also getting logs of this type with pg_receivellog:
BEGIN 698
table "pgbench_accounts": UPDATE: aid[int4]:759559 bid[int4]:8
abalance[int4]:-3641 filler[bpchar]:
table "pgbench_tellers": UPDATE: tid[int4]:93 bid[int4]:10
tbalance[int4]:-3641 filler[bpchar]:(null)
table "pgbench_branches": UPDATE: bid[int4]:10 bbalance[int4]:-3641
filler[bpchar]:(null)
table "pgbench_history": INSERT: tid[int4]:93 bid[int4]:10 aid[int4]:759559
delta[int4]:-3641 mtime[timestamp]:2012-11-22 09:05:34.535651
filler[bpchar]:(null)
COMMIT 698
Until the assertion failure:
TRAP: FailedAssertion("!(((xid) >= ((TransactionId) 3)) &&
((snapstate->xmin_running) >= ((TransactionId) 3)))", File: "snapbuild.c",
Line: 878)
I still have the core file and its binary at hand, so I can send them if
you want.
I have not been able to read your code yet, but there is probably
something you are missing.
Thanks,
--
Michael Paquier
http://michael.otacoo.com
On 2012-11-22 09:13:30 +0900, Michael Paquier wrote:
On Thu, Nov 22, 2012 at 8:25 AM, Andres Freund <andres@2ndquadrant.com> wrote:
I really don't understand what's going on here then. You've said you made
sure that there is a catalog snapshot, which means you would need
something like:
WARNING: connecting to postgres
WARNING: Initiating logical rep
LOG: computed new xmin: 16566894
LOG: start reading from 3/E62457C0, scrolled back to 3/E6244000
LOG: found initial snapshot (via running xacts). Done: 1
WARNING: reached consistent point, stopping!
WARNING: Starting logical replication
LOG: start reading from 3/E62457C0, scrolled back to 3/E6244000
LOG: found initial snapshot (via running xacts). Done: 1
in the log *and* it means that snapbuild->state has to be
CONSISTENT. But the backtrace you've posted:
#3 0x000000000070c409 in SnapBuildTxnIsRunning
(snapstate=0x19e4f10,xid=0) at snapbuild.c:877
#4 0x000000000070b8e4 in SnapBuildProcessChange
(reorder=0x19e4e80,snapstate=0x19e4f10, xid=0, buf=0x1a0a368,
relfilenode=0x1a0a450) at snapbuild.c:388
#5 0x000000000070c088 in SnapBuildDecodeCallback
(reorder=0x19e4e80, snapstate=0x19e4f10, buf=0x1a0a368) at snapbuild.c:732
shows pretty clearly that snapstate *can't* be consistent, because line
387ff is:
else if (snapstate->state < SNAPBUILD_CONSISTENT &&
SnapBuildTxnIsRunning(snapstate, xid))
;
so #3 #4 can't happen at those line numbers with state == CONSISTENT.
Still this *impossible* thing happens.
Here is some more information from the logs I get on the server side:
Yes, I have logical replication correctly initialized:
[629 0] LOG: database system was shut down at 2012-11-22 09:02:42 JST
[628 0] LOG: database system is ready to accept connections
[633 0] LOG: autovacuum launcher started
[648 0] WARNING: connecting to postgres
[648 0] WARNING: Initiating logical rep
[648 0] LOG: computed new xmin: 684
[648 0] LOG: start reading from 0/178C1B8, scrolled back to 0/178C000
Ok, so you've not yet reached a consistent point.
Which means this shouldn't yet be written out:
And I am also getting logs of this type with pg_receivellog:
BEGIN 698
table "pgbench_accounts": UPDATE: aid[int4]:759559 bid[int4]:8
abalance[int4]:-3641 filler[bpchar]:
table "pgbench_tellers": UPDATE: tid[int4]:93 bid[int4]:10
tbalance[int4]:-3641 filler[bpchar]:(null)
table "pgbench_branches": UPDATE: bid[int4]:10 bbalance[int4]:-3641
filler[bpchar]:(null)
table "pgbench_history": INSERT: tid[int4]:93 bid[int4]:10 aid[int4]:759559
delta[int4]:-3641 mtime[timestamp]:2012-11-22 09:05:34.535651
filler[bpchar]:(null)
COMMIT 698
That could already be a good enough hint; let me check tomorrow.
I still have the core file and its binary at hand, so I can send them if
you want.
If those aren't too big, it's worth a try...
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
This is very much the same as the previous patch, except it has been
rebased to the latest master.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
xlogreader-heikki-3.patchtext/x-diff; charset=us-asciiDownload
*** a/src/backend/access/transam/Makefile
--- b/src/backend/access/transam/Makefile
***************
*** 14,20 **** include $(top_builddir)/src/Makefile.global
OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
timeline.o twophase.o twophase_rmgr.o xlog.o xlogarchive.o xlogfuncs.o \
! xlogutils.o
include $(top_srcdir)/src/backend/common.mk
--- 14,20 ----
OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
timeline.o twophase.o twophase_rmgr.o xlog.o xlogarchive.o xlogfuncs.o \
! xlogreader.o xlogutils.o
include $(top_srcdir)/src/backend/common.mk
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 30,35 ****
--- 30,36 ----
#include "access/twophase.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
+ #include "access/xlogreader.h"
#include "access/xlogutils.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
***************
*** 192,205 **** static bool LocalHotStandbyActive = false;
*/
static int LocalXLogInsertAllowed = -1;
! /* Are we recovering using offline XLOG archives? (only valid in the startup process) */
! bool InArchiveRecovery = false;
/* Was the last xlog file restored from archive, or local? */
static bool restoredFromArchive = false;
/* options taken from recovery.conf for archive recovery */
! char *recoveryRestoreCommand = NULL;
static char *recoveryEndCommand = NULL;
static char *archiveCleanupCommand = NULL;
static RecoveryTargetType recoveryTarget = RECOVERY_TARGET_UNSET;
--- 193,209 ----
*/
static int LocalXLogInsertAllowed = -1;
! /*
! * Are we recovering using offline XLOG archives? (only valid in the startup
! * process)
! */
! bool InArchiveRecovery = false;
/* Was the last xlog file restored from archive, or local? */
static bool restoredFromArchive = false;
/* options taken from recovery.conf for archive recovery */
! char *recoveryRestoreCommand = NULL;
static char *recoveryEndCommand = NULL;
static char *archiveCleanupCommand = NULL;
static RecoveryTargetType recoveryTarget = RECOVERY_TARGET_UNSET;
***************
*** 210,216 **** static TimestampTz recoveryTargetTime;
static char *recoveryTargetName;
/* options taken from recovery.conf for XLOG streaming */
! bool StandbyMode = false;
static char *PrimaryConnInfo = NULL;
static char *TriggerFile = NULL;
--- 214,220 ----
static char *recoveryTargetName;
/* options taken from recovery.conf for XLOG streaming */
! bool StandbyMode = false;
static char *PrimaryConnInfo = NULL;
static char *TriggerFile = NULL;
***************
*** 389,395 **** typedef struct XLogCtlData
uint32 ckptXidEpoch; /* nextXID & epoch of latest checkpoint */
TransactionId ckptXid;
XLogRecPtr asyncXactLSN; /* LSN of newest async commit/abort */
! XLogSegNo lastRemovedSegNo; /* latest removed/recycled XLOG segment */
/* Protected by WALWriteLock: */
XLogCtlWrite Write;
--- 393,400 ----
uint32 ckptXidEpoch; /* nextXID & epoch of latest checkpoint */
TransactionId ckptXid;
XLogRecPtr asyncXactLSN; /* LSN of newest async commit/abort */
! XLogSegNo lastRemovedSegNo; /* latest removed/recycled XLOG
! * segment */
/* Protected by WALWriteLock: */
XLogCtlWrite Write;
***************
*** 530,554 **** static XLogSegNo openLogSegNo = 0;
static uint32 openLogOff = 0;
/*
! * These variables are used similarly to the ones above, but for reading
* the XLOG. Note, however, that readOff generally represents the offset
* of the page just read, not the seek position of the FD itself, which
* will be just past that page. readLen indicates how much of the current
* page has been read into readBuf, and readSource indicates where we got
* the currently open file from.
*/
! static int readFile = -1;
! static XLogSegNo readSegNo = 0;
! static uint32 readOff = 0;
! static uint32 readLen = 0;
! static bool readFileHeaderValidated = false;
! static int readSource = 0; /* XLOG_FROM_* code */
!
! /*
! * Keeps track of which sources we've tried to read the current WAL
! * record from and failed.
! */
! static int failedSources = 0; /* OR of XLOG_FROM_* codes */
/*
* These variables track when we last obtained some WAL data to process,
--- 535,563 ----
static uint32 openLogOff = 0;
/*
! * Status data for XLogPageRead.
! *
! * The first three are used similarly to the ones above, but for reading
* the XLOG. Note, however, that readOff generally represents the offset
* of the page just read, not the seek position of the FD itself, which
* will be just past that page. readLen indicates how much of the current
* page has been read into readBuf, and readSource indicates where we got
* the currently open file from.
+ *
+ * failedSources keeps track of which sources we've tried to read the current
+ * WAL record from and failed.
*/
! typedef struct XLogPageReadPrivate
! {
! int readFile;
! XLogSegNo readSegNo;
! uint32 readOff;
! uint32 readLen;
! bool readFileHeaderValidated;
! bool fetching_ckpt; /* are we fetching a checkpoint record? */
! int readSource; /* XLOG_FROM_* code */
! int failedSources; /* OR of XLOG_FROM_* codes */
! } XLogPageReadPrivate;
/*
* These variables track when we last obtained some WAL data to process,
***************
*** 559,571 **** static int failedSources = 0; /* OR of XLOG_FROM_* codes */
static TimestampTz XLogReceiptTime = 0;
static int XLogReceiptSource = 0; /* XLOG_FROM_* code */
- /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
- static char *readBuf = NULL;
-
- /* Buffer for current ReadRecord result (expandable) */
- static char *readRecordBuf = NULL;
- static uint32 readRecordBufSize = 0;
-
/* State information for XLOG reading */
static XLogRecPtr ReadRecPtr; /* start of last record read */
static XLogRecPtr EndRecPtr; /* end+1 of last record read */
--- 568,573 ----
***************
*** 609,615 **** typedef struct xl_restore_point
static void readRecoveryCommandFile(void);
! static void exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo);
static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
static void recoveryPausesHere(void);
static void SetLatestXTime(TimestampTz xtime);
--- 611,618 ----
static void readRecoveryCommandFile(void);
! static void exitArchiveRecovery(XLogPageReadPrivate *private, TimeLineID endTLI,
! XLogSegNo endLogSegNo);
static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
static void recoveryPausesHere(void);
static void SetLatestXTime(TimestampTz xtime);
***************
*** 628,641 **** static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
bool find_free, int *max_advance,
bool use_lock);
! static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
! int source, bool notexistOk);
! static int XLogFileReadAnyTLI(XLogSegNo segno, int emode, int sources);
! static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
! bool randAccess);
! static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
! bool fetching_ckpt);
! static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
static void XLogFileClose(void);
static void PreallocXlogFiles(XLogRecPtr endptr);
static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr);
--- 631,644 ----
static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
bool find_free, int *max_advance,
bool use_lock);
! static int XLogFileRead(XLogPageReadPrivate *private, XLogSegNo segno,
! int emode, TimeLineID tli, int source, bool notexistOk);
! static int XLogFileReadAnyTLI(XLogPageReadPrivate *private, XLogSegNo segno,
! int emode, int sources);
! static bool XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
! int emode, bool randAccess, char *readBuf, void *private_data);
! static bool WaitForWALToBecomeAvailable(XLogPageReadPrivate *private,
! XLogRecPtr RecPtr, bool randAccess);
static void XLogFileClose(void);
static void PreallocXlogFiles(XLogRecPtr endptr);
static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr);
***************
*** 643,654 **** static void UpdateLastRemovedPtr(char *filename);
static void ValidateXLOGDirectoryStructure(void);
static void CleanupBackupHistory(void);
static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
! static XLogRecord *ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt);
! static void CheckRecoveryConsistency(void);
! static bool ValidXLogPageHeader(XLogPageHeader hdr, int emode, bool segmentonly);
! static bool ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record,
! int emode, bool randAccess);
! static XLogRecord *ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt);
static bool rescanLatestTimeLine(void);
static void WriteControlFile(void);
static void ReadControlFile(void);
--- 646,658 ----
static void ValidateXLOGDirectoryStructure(void);
static void CleanupBackupHistory(void);
static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
! static XLogRecord *ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
! int emode, bool fetching_ckpt);
! static void CheckRecoveryConsistency(XLogRecPtr EndRecPtr);
! static bool ValidXLogPageHeader(XLogSegNo segno, uint32 offset, int source,
! XLogPageHeader hdr, int emode, bool segmentonly);
! static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
! XLogRecPtr RecPtr, int whichChkpt);
static bool rescanLatestTimeLine(void);
static void WriteControlFile(void);
static void ReadControlFile(void);
***************
*** 1515,1521 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
*/
if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
! (uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
(uint32) (XLogCtl->xlblocks[curridx] >> 32),
(uint32) XLogCtl->xlblocks[curridx]);
--- 1519,1525 ----
*/
if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
! (uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
(uint32) (XLogCtl->xlblocks[curridx] >> 32),
(uint32) XLogCtl->xlblocks[curridx]);
***************
*** 1581,1589 **** XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
if (lseek(openLogFile, (off_t) startoffset, SEEK_SET) < 0)
ereport(PANIC,
(errcode_for_file_access(),
! errmsg("could not seek in log file %s to offset %u: %m",
! XLogFileNameP(ThisTimeLineID, openLogSegNo),
! startoffset)));
openLogOff = startoffset;
}
--- 1585,1593 ----
if (lseek(openLogFile, (off_t) startoffset, SEEK_SET) < 0)
ereport(PANIC,
(errcode_for_file_access(),
! errmsg("could not seek in log file %s to offset %u: %m",
! XLogFileNameP(ThisTimeLineID, openLogSegNo),
! startoffset)));
openLogOff = startoffset;
}
***************
*** 1824,1830 **** UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
if (!force && XLByteLT(newMinRecoveryPoint, lsn))
elog(WARNING,
"xlog min recovery request %X/%X is past current point %X/%X",
! (uint32) (lsn >> 32) , (uint32) lsn,
(uint32) (newMinRecoveryPoint >> 32),
(uint32) newMinRecoveryPoint);
--- 1828,1834 ----
if (!force && XLByteLT(newMinRecoveryPoint, lsn))
elog(WARNING,
"xlog min recovery request %X/%X is past current point %X/%X",
! (uint32) (lsn >> 32), (uint32) lsn,
(uint32) (newMinRecoveryPoint >> 32),
(uint32) newMinRecoveryPoint);
***************
*** 1878,1884 **** XLogFlush(XLogRecPtr record)
elog(LOG, "xlog flush request %X/%X; write %X/%X; flush %X/%X",
(uint32) (record >> 32), (uint32) record,
(uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
! (uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
#endif
START_CRIT_SECTION();
--- 1882,1888 ----
elog(LOG, "xlog flush request %X/%X; write %X/%X; flush %X/%X",
(uint32) (record >> 32), (uint32) record,
(uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
! (uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
#endif
START_CRIT_SECTION();
***************
*** 1942,1949 **** XLogFlush(XLogRecPtr record)
/*
* Sleep before flush! By adding a delay here, we may give further
* backends the opportunity to join the backlog of group commit
! * followers; this can significantly improve transaction throughput, at
! * the risk of increasing transaction latency.
*
* We do not sleep if enableFsync is not turned on, nor if there are
* fewer than CommitSiblings other backends with active transactions.
--- 1946,1953 ----
/*
* Sleep before flush! By adding a delay here, we may give further
* backends the opportunity to join the backlog of group commit
! * followers; this can significantly improve transaction throughput,
! * at the risk of increasing transaction latency.
*
* We do not sleep if enableFsync is not turned on, nor if there are
* fewer than CommitSiblings other backends with active transactions.
***************
*** 1958,1964 **** XLogFlush(XLogRecPtr record)
XLogCtlInsert *Insert = &XLogCtl->Insert;
uint32 freespace = INSERT_FREESPACE(Insert);
! if (freespace == 0) /* buffer is full */
WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
else
{
--- 1962,1968 ----
XLogCtlInsert *Insert = &XLogCtl->Insert;
uint32 freespace = INSERT_FREESPACE(Insert);
! if (freespace == 0) /* buffer is full */
WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
else
{
***************
*** 2011,2017 **** XLogFlush(XLogRecPtr record)
elog(ERROR,
"xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
(uint32) (record >> 32), (uint32) record,
! (uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
}
/*
--- 2015,2021 ----
elog(ERROR,
"xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
(uint32) (record >> 32), (uint32) record,
! (uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
}
/*
***************
*** 2090,2096 **** XLogBackgroundFlush(void)
elog(LOG, "xlog bg flush request %X/%X; write %X/%X; flush %X/%X",
(uint32) (WriteRqstPtr >> 32), (uint32) WriteRqstPtr,
(uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
! (uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
#endif
START_CRIT_SECTION();
--- 2094,2100 ----
elog(LOG, "xlog bg flush request %X/%X; write %X/%X; flush %X/%X",
(uint32) (WriteRqstPtr >> 32), (uint32) WriteRqstPtr,
(uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
! (uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
#endif
START_CRIT_SECTION();
***************
*** 2330,2336 **** XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
if (fd < 0)
ereport(ERROR,
(errcode_for_file_access(),
! errmsg("could not open file \"%s\": %m", path)));
elog(DEBUG2, "done creating and filling new WAL file");
--- 2334,2340 ----
if (fd < 0)
ereport(ERROR,
(errcode_for_file_access(),
! errmsg("could not open file \"%s\": %m", path)));
elog(DEBUG2, "done creating and filling new WAL file");
***************
*** 2569,2576 **** XLogFileOpen(XLogSegNo segno)
* Otherwise, it's assumed to be already available in pg_xlog.
*/
static int
! XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
! int source, bool notfoundOk)
{
char xlogfname[MAXFNAMELEN];
char activitymsg[MAXFNAMELEN + 16];
--- 2573,2580 ----
* Otherwise, it's assumed to be already available in pg_xlog.
*/
static int
! XLogFileRead(XLogPageReadPrivate *private, XLogSegNo segno, int emode,
! TimeLineID tli, int source, bool notfoundOk)
{
char xlogfname[MAXFNAMELEN];
char activitymsg[MAXFNAMELEN + 16];
***************
*** 2618,2626 **** XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
XLogFilePath(xlogfpath, tli, segno);
if (stat(xlogfpath, &statbuf) == 0)
{
! char oldpath[MAXPGPATH];
#ifdef WIN32
static unsigned int deletedcounter = 1;
/*
* On Windows, if another process (e.g a walsender process) holds
* the file open in FILE_SHARE_DELETE mode, unlink will succeed,
--- 2622,2632 ----
XLogFilePath(xlogfpath, tli, segno);
if (stat(xlogfpath, &statbuf) == 0)
{
! char oldpath[MAXPGPATH];
!
#ifdef WIN32
static unsigned int deletedcounter = 1;
+
/*
* On Windows, if another process (e.g a walsender process) holds
* the file open in FILE_SHARE_DELETE mode, unlink will succeed,
***************
*** 2687,2700 **** XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
set_ps_display(activitymsg, false);
/* Track source of data in assorted state variables */
! readSource = source;
XLogReceiptSource = source;
/* In FROM_STREAM case, caller tracks receipt time, not me */
if (source != XLOG_FROM_STREAM)
XLogReceiptTime = GetCurrentTimestamp();
/* The file header needs to be validated on first access */
! readFileHeaderValidated = false;
return fd;
}
--- 2693,2706 ----
set_ps_display(activitymsg, false);
/* Track source of data in assorted state variables */
! private->readSource = source;
XLogReceiptSource = source;
/* In FROM_STREAM case, caller tracks receipt time, not me */
if (source != XLOG_FROM_STREAM)
XLogReceiptTime = GetCurrentTimestamp();
/* The file header needs to be validated on first access */
! private->readFileHeaderValidated = false;
return fd;
}
***************
*** 2711,2717 **** XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
* This version searches for the segment with any TLI listed in expectedTLIs.
*/
static int
! XLogFileReadAnyTLI(XLogSegNo segno, int emode, int sources)
{
char path[MAXPGPATH];
ListCell *cell;
--- 2717,2724 ----
* This version searches for the segment with any TLI listed in expectedTLIs.
*/
static int
! XLogFileReadAnyTLI(XLogPageReadPrivate *private, XLogSegNo segno, int emode,
! int sources)
{
char path[MAXPGPATH];
ListCell *cell;
***************
*** 2736,2742 **** XLogFileReadAnyTLI(XLogSegNo segno, int emode, int sources)
if (sources & XLOG_FROM_ARCHIVE)
{
! fd = XLogFileRead(segno, emode, tli, XLOG_FROM_ARCHIVE, true);
if (fd != -1)
{
elog(DEBUG1, "got WAL segment from archive");
--- 2743,2750 ----
if (sources & XLOG_FROM_ARCHIVE)
{
! fd = XLogFileRead(private, segno, emode, tli,
! XLOG_FROM_ARCHIVE, true);
if (fd != -1)
{
elog(DEBUG1, "got WAL segment from archive");
***************
*** 2746,2752 **** XLogFileReadAnyTLI(XLogSegNo segno, int emode, int sources)
if (sources & XLOG_FROM_PG_XLOG)
{
! fd = XLogFileRead(segno, emode, tli, XLOG_FROM_PG_XLOG, true);
if (fd != -1)
return fd;
}
--- 2754,2761 ----
if (sources & XLOG_FROM_PG_XLOG)
{
! fd = XLogFileRead(private, segno, emode, tli,
! XLOG_FROM_PG_XLOG, true);
if (fd != -1)
return fd;
}
***************
*** 3179,3280 **** RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index,
}
/*
- * CRC-check an XLOG record. We do not believe the contents of an XLOG
- * record (other than to the minimal extent of computing the amount of
- * data to read in) until we've checked the CRCs.
- *
- * We assume all of the record (that is, xl_tot_len bytes) has been read
- * into memory at *record. Also, ValidXLogRecordHeader() has accepted the
- * record's header, which means in particular that xl_tot_len is at least
- * SizeOfXlogRecord, so it is safe to fetch xl_len.
- */
- static bool
- RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
- {
- pg_crc32 crc;
- int i;
- uint32 len = record->xl_len;
- BkpBlock bkpb;
- char *blk;
- size_t remaining = record->xl_tot_len;
-
- /* First the rmgr data */
- if (remaining < SizeOfXLogRecord + len)
- {
- /* ValidXLogRecordHeader() should've caught this already... */
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("invalid record length at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- remaining -= SizeOfXLogRecord + len;
- INIT_CRC32(crc);
- COMP_CRC32(crc, XLogRecGetData(record), len);
-
- /* Add in the backup blocks, if any */
- blk = (char *) XLogRecGetData(record) + len;
- for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
- {
- uint32 blen;
-
- if (!(record->xl_info & XLR_BKP_BLOCK(i)))
- continue;
-
- if (remaining < sizeof(BkpBlock))
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("invalid backup block size in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- memcpy(&bkpb, blk, sizeof(BkpBlock));
-
- if (bkpb.hole_offset + bkpb.hole_length > BLCKSZ)
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("incorrect hole size in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- blen = sizeof(BkpBlock) + BLCKSZ - bkpb.hole_length;
-
- if (remaining < blen)
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("invalid backup block size in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- remaining -= blen;
- COMP_CRC32(crc, blk, blen);
- blk += blen;
- }
-
- /* Check that xl_tot_len agrees with our calculation */
- if (remaining != 0)
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("incorrect total length in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
-
- /* Finally include the record header */
- COMP_CRC32(crc, (char *) record, offsetof(XLogRecord, xl_crc));
- FIN_CRC32(crc);
-
- if (!EQ_CRC32(record->xl_crc, crc))
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("incorrect resource manager data checksum in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
-
- return true;
- }
-
- /*
* Attempt to read an XLOG record.
*
* If RecPtr is not NULL, try to read a record at that position. Otherwise
--- 3188,3193 ----
***************
*** 3287,3608 **** RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
* the returned record pointer always points there.
*/
static XLogRecord *
! ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
{
XLogRecord *record;
! XLogRecPtr tmpRecPtr = EndRecPtr;
! bool randAccess = false;
! uint32 len,
! total_len;
! uint32 targetRecOff;
! uint32 pageHeaderSize;
! bool gotheader;
!
! if (readBuf == NULL)
! {
! /*
! * First time through, permanently allocate readBuf. We do it this
! * way, rather than just making a static array, for two reasons: (1)
! * no need to waste the storage in most instantiations of the backend;
! * (2) a static char array isn't guaranteed to have any particular
! * alignment, whereas malloc() will provide MAXALIGN'd storage.
! */
! readBuf = (char *) malloc(XLOG_BLCKSZ);
! Assert(readBuf != NULL);
! }
!
! if (RecPtr == NULL)
! {
! RecPtr = &tmpRecPtr;
!
! /*
! * RecPtr is pointing to end+1 of the previous WAL record. If
! * we're at a page boundary, no more records can fit on the current
! * page. We must skip over the page header, but we can't do that
! * until we've read in the page, since the header size is variable.
! */
! }
! else
! {
! /*
! * In this case, the passed-in record pointer should already be
! * pointing to a valid record starting position.
! */
! if (!XRecOffIsValid(*RecPtr))
! ereport(PANIC,
! (errmsg("invalid record offset at %X/%X",
! (uint32) (*RecPtr >> 32), (uint32) *RecPtr)));
! /*
! * Since we are going to a random position in WAL, forget any prior
! * state about what timeline we were in, and allow it to be any
! * timeline in expectedTLIs. We also set a flag to allow curFileTLI
! * to go backwards (but we can't reset that variable right here, since
! * we might not change files at all).
! */
/* see comment in ValidXLogPageHeader */
! lastPageTLI = lastSegmentTLI = 0;
! randAccess = true; /* allow curFileTLI to go backwards too */
! }
! /* This is the first try to read this page. */
! failedSources = 0;
! retry:
! /* Read the page containing the record */
! if (!XLogPageRead(RecPtr, emode, fetching_ckpt, randAccess))
! return NULL;
!
! pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
! targetRecOff = (*RecPtr) % XLOG_BLCKSZ;
! if (targetRecOff == 0)
! {
! /*
! * At page start, so skip over page header. The Assert checks that
! * we're not scribbling on caller's record pointer; it's OK because we
! * can only get here in the continuing-from-prev-record case, since
! * XRecOffIsValid rejected the zero-page-offset case otherwise.
! */
! Assert(RecPtr == &tmpRecPtr);
! (*RecPtr) += pageHeaderSize;
! targetRecOff = pageHeaderSize;
! }
! else if (targetRecOff < pageHeaderSize)
! {
! ereport(emode_for_corrupt_record(emode, *RecPtr),
! (errmsg("invalid record offset at %X/%X",
! (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
! goto next_record_is_invalid;
! }
! if ((((XLogPageHeader) readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
! targetRecOff == pageHeaderSize)
! {
! ereport(emode_for_corrupt_record(emode, *RecPtr),
! (errmsg("contrecord is requested by %X/%X",
! (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
! goto next_record_is_invalid;
! }
! /*
! * Read the record length.
! *
! * NB: Even though we use an XLogRecord pointer here, the whole record
! * header might not fit on this page. xl_tot_len is the first field of
! * the struct, so it must be on this page (the records are MAXALIGNed),
! * but we cannot access any other fields until we've verified that we
! * got the whole header.
! */
! record = (XLogRecord *) (readBuf + (*RecPtr) % XLOG_BLCKSZ);
! total_len = record->xl_tot_len;
!
! /*
! * If the whole record header is on this page, validate it immediately.
! * Otherwise do just a basic sanity check on xl_tot_len, and validate the
! * rest of the header after reading it from the next page. The xl_tot_len
! * check is necessary here to ensure that we enter the "Need to reassemble
! * record" code path below; otherwise we might fail to apply
! * ValidXLogRecordHeader at all.
! */
! if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
! {
! if (!ValidXLogRecordHeader(RecPtr, record, emode, randAccess))
! goto next_record_is_invalid;
! gotheader = true;
! }
! else
! {
! if (total_len < SizeOfXLogRecord)
! {
! ereport(emode_for_corrupt_record(emode, *RecPtr),
! (errmsg("invalid record length at %X/%X",
! (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
! goto next_record_is_invalid;
! }
! gotheader = false;
! }
!
! /*
! * Allocate or enlarge readRecordBuf as needed. To avoid useless small
! * increases, round its size to a multiple of XLOG_BLCKSZ, and make sure
! * it's at least 4*Max(BLCKSZ, XLOG_BLCKSZ) to start with. (That is
! * enough for all "normal" records, but very large commit or abort records
! * might need more space.)
! */
! if (total_len > readRecordBufSize)
! {
! uint32 newSize = total_len;
!
! newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
! newSize = Max(newSize, 4 * Max(BLCKSZ, XLOG_BLCKSZ));
! if (readRecordBuf)
! free(readRecordBuf);
! readRecordBuf = (char *) malloc(newSize);
! if (!readRecordBuf)
! {
! readRecordBufSize = 0;
! /* We treat this as a "bogus data" condition */
! ereport(emode_for_corrupt_record(emode, *RecPtr),
! (errmsg("record length %u at %X/%X too long",
! total_len, (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
! goto next_record_is_invalid;
! }
! readRecordBufSize = newSize;
! }
!
! len = XLOG_BLCKSZ - (*RecPtr) % XLOG_BLCKSZ;
! if (total_len > len)
{
! /* Need to reassemble record */
! char *contrecord;
! XLogPageHeader pageHeader;
! XLogRecPtr pagelsn;
! char *buffer;
! uint32 gotlen;
!
! /* Initialize pagelsn to the beginning of the page this record is on */
! pagelsn = ((*RecPtr) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
!
! /* Copy the first fragment of the record from the first page. */
! memcpy(readRecordBuf, readBuf + (*RecPtr) % XLOG_BLCKSZ, len);
! buffer = readRecordBuf + len;
! gotlen = len;
!
! do
{
! /* Calculate pointer to beginning of next page */
! XLByteAdvance(pagelsn, XLOG_BLCKSZ);
! /* Wait for the next page to become available */
! if (!XLogPageRead(&pagelsn, emode, false, false))
! return NULL;
!
! /* Check that the continuation on next page looks valid */
! pageHeader = (XLogPageHeader) readBuf;
! if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
! {
! ereport(emode_for_corrupt_record(emode, *RecPtr),
! (errmsg("there is no contrecord flag in log segment %s, offset %u",
! XLogFileNameP(curFileTLI, readSegNo),
! readOff)));
! goto next_record_is_invalid;
! }
! /*
! * Cross-check that xlp_rem_len agrees with how much of the record
! * we expect there to be left.
! */
! if (pageHeader->xlp_rem_len == 0 ||
! total_len != (pageHeader->xlp_rem_len + gotlen))
! {
! ereport(emode_for_corrupt_record(emode, *RecPtr),
! (errmsg("invalid contrecord length %u in log segment %s, offset %u",
! pageHeader->xlp_rem_len,
! XLogFileNameP(curFileTLI, readSegNo),
! readOff)));
! goto next_record_is_invalid;
! }
! /* Append the continuation from this page to the buffer */
! pageHeaderSize = XLogPageHeaderSize(pageHeader);
! contrecord = (char *) readBuf + pageHeaderSize;
! len = XLOG_BLCKSZ - pageHeaderSize;
! if (pageHeader->xlp_rem_len < len)
! len = pageHeader->xlp_rem_len;
! memcpy(buffer, (char *) contrecord, len);
! buffer += len;
! gotlen += len;
!
! /* If we just reassembled the record header, validate it. */
! if (!gotheader)
{
! record = (XLogRecord *) readRecordBuf;
! if (!ValidXLogRecordHeader(RecPtr, record, emode, randAccess))
! goto next_record_is_invalid;
! gotheader = true;
}
! } while (pageHeader->xlp_rem_len > len);
!
! record = (XLogRecord *) readRecordBuf;
! if (!RecordIsValid(record, *RecPtr, emode))
! goto next_record_is_invalid;
! pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
! XLogSegNoOffsetToRecPtr(
! readSegNo,
! readOff + pageHeaderSize + MAXALIGN(pageHeader->xlp_rem_len),
! EndRecPtr);
! ReadRecPtr = *RecPtr;
! }
! else
! {
! /* Record does not cross a page boundary */
! if (!RecordIsValid(record, *RecPtr, emode))
! goto next_record_is_invalid;
! EndRecPtr = *RecPtr + MAXALIGN(total_len);
!
! ReadRecPtr = *RecPtr;
! memcpy(readRecordBuf, record, total_len);
! }
!
! /*
! * Special processing if it's an XLOG SWITCH record
! */
! if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
! {
! /* Pretend it extends to end of segment */
! EndRecPtr += XLogSegSize - 1;
! EndRecPtr -= EndRecPtr % XLogSegSize;
- /*
- * Pretend that readBuf contains the last page of the segment. This is
- * just to avoid Assert failure in StartupXLOG if XLOG ends with this
- * segment.
- */
- readOff = XLogSegSize - XLOG_BLCKSZ;
- }
return record;
-
- next_record_is_invalid:
- failedSources |= readSource;
-
- if (readFile >= 0)
- {
- close(readFile);
- readFile = -1;
- }
-
- /* In standby-mode, keep trying */
- if (StandbyMode)
- goto retry;
- else
- return NULL;
}
/*
* Check whether the xlog header of a page just read in looks valid.
*
* This is just a convenience subroutine to avoid duplicated code in
! * ReadRecord. It's not intended for use from anywhere else.
*/
static bool
! ValidXLogPageHeader(XLogPageHeader hdr, int emode, bool segmentonly)
{
XLogRecPtr recaddr;
! XLogSegNoOffsetToRecPtr(readSegNo, readOff, recaddr);
if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
{
! ereport(emode_for_corrupt_record(emode, recaddr),
! (errmsg("invalid magic number %04X in log segment %s, offset %u",
! hdr->xlp_magic,
! XLogFileNameP(curFileTLI, readSegNo),
! readOff)));
return false;
}
if ((hdr->xlp_info & ~XLP_ALL_FLAGS) != 0)
{
! ereport(emode_for_corrupt_record(emode, recaddr),
(errmsg("invalid info bits %04X in log segment %s, offset %u",
hdr->xlp_info,
! XLogFileNameP(curFileTLI, readSegNo),
! readOff)));
return false;
}
if (hdr->xlp_info & XLP_LONG_HEADER)
--- 3200,3270 ----
* the returned record pointer always points there.
*/
static XLogRecord *
! ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
! bool fetching_ckpt)
{
XLogRecord *record;
! XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
! if (!XLogRecPtrIsInvalid(RecPtr))
/* see comment in ValidXLogPageHeader */
! lastPageTLI = lastSegmentTLI = 0;
! /* Set flag for XLogPageRead */
! private->fetching_ckpt = fetching_ckpt;
! /* This is the first try to read this page. */
! private->failedSources = 0;
! do
{
! record = XLogReadRecord(xlogreader, RecPtr, emode);
! ReadRecPtr = xlogreader->ReadRecPtr;
! EndRecPtr = xlogreader->EndRecPtr;
! if (record == NULL)
{
! private->failedSources |= private->readSource;
! if (private->readFile >= 0)
{
! close(private->readFile);
! private->readFile = -1;
}
! }
! } while (StandbyMode && record == NULL);
return record;
}
/*
* Check whether the xlog header of a page just read in looks valid.
*
* This is just a convenience subroutine to avoid duplicated code in
! * XLogPageRead. It's not intended for use from anywhere else.
*/
static bool
! ValidXLogPageHeader(XLogSegNo segno, uint32 offset, int source,
! XLogPageHeader hdr, int emode, bool segmentonly)
{
XLogRecPtr recaddr;
! XLogSegNoOffsetToRecPtr(segno, offset, recaddr);
if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
{
! ereport(emode_for_corrupt_record(emode, source, recaddr),
! (errmsg("invalid magic number %04X in log segment %s, offset %u",
! hdr->xlp_magic,
! XLogFileNameP(curFileTLI, segno),
! offset)));
return false;
}
if ((hdr->xlp_info & ~XLP_ALL_FLAGS) != 0)
{
! ereport(emode_for_corrupt_record(emode, source, recaddr),
(errmsg("invalid info bits %04X in log segment %s, offset %u",
hdr->xlp_info,
! XLogFileNameP(curFileTLI, segno),
! offset)));
return false;
}
if (hdr->xlp_info & XLP_LONG_HEADER)
***************
*** 3622,3628 **** ValidXLogPageHeader(XLogPageHeader hdr, int emode, bool segmentonly)
longhdr->xlp_sysid);
snprintf(sysident_str, sizeof(sysident_str), UINT64_FORMAT,
ControlFile->system_identifier);
! ereport(emode_for_corrupt_record(emode, recaddr),
(errmsg("WAL file is from different database system"),
errdetail("WAL file database system identifier is %s, pg_control database system identifier is %s.",
fhdrident_str, sysident_str)));
--- 3284,3290 ----
longhdr->xlp_sysid);
snprintf(sysident_str, sizeof(sysident_str), UINT64_FORMAT,
ControlFile->system_identifier);
! ereport(emode_for_corrupt_record(emode, source, recaddr),
(errmsg("WAL file is from different database system"),
errdetail("WAL file database system identifier is %s, pg_control database system identifier is %s.",
fhdrident_str, sysident_str)));
***************
*** 3630,3666 **** ValidXLogPageHeader(XLogPageHeader hdr, int emode, bool segmentonly)
}
if (longhdr->xlp_seg_size != XLogSegSize)
{
! ereport(emode_for_corrupt_record(emode, recaddr),
(errmsg("WAL file is from different database system"),
errdetail("Incorrect XLOG_SEG_SIZE in page header.")));
return false;
}
if (longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ)
{
! ereport(emode_for_corrupt_record(emode, recaddr),
(errmsg("WAL file is from different database system"),
errdetail("Incorrect XLOG_BLCKSZ in page header.")));
return false;
}
}
! else if (readOff == 0)
{
/* hmm, first page of file doesn't have a long header? */
! ereport(emode_for_corrupt_record(emode, recaddr),
(errmsg("invalid info bits %04X in log segment %s, offset %u",
hdr->xlp_info,
! XLogFileNameP(curFileTLI, readSegNo),
! readOff)));
return false;
}
if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
{
! ereport(emode_for_corrupt_record(emode, recaddr),
! (errmsg("unexpected pageaddr %X/%X in log segment %s, offset %u",
! (uint32) (hdr->xlp_pageaddr >> 32), (uint32) hdr->xlp_pageaddr,
! XLogFileNameP(curFileTLI, readSegNo),
! readOff)));
return false;
}
--- 3292,3328 ----
}
if (longhdr->xlp_seg_size != XLogSegSize)
{
! ereport(emode_for_corrupt_record(emode, source, recaddr),
(errmsg("WAL file is from different database system"),
errdetail("Incorrect XLOG_SEG_SIZE in page header.")));
return false;
}
if (longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ)
{
! ereport(emode_for_corrupt_record(emode, source, recaddr),
(errmsg("WAL file is from different database system"),
errdetail("Incorrect XLOG_BLCKSZ in page header.")));
return false;
}
}
! else if (offset == 0)
{
/* hmm, first page of file doesn't have a long header? */
! ereport(emode_for_corrupt_record(emode, source, recaddr),
(errmsg("invalid info bits %04X in log segment %s, offset %u",
hdr->xlp_info,
! XLogFileNameP(curFileTLI, segno),
! offset)));
return false;
}
if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
{
! ereport(emode_for_corrupt_record(emode, source, recaddr),
! (errmsg("unexpected pageaddr %X/%X in log segment %s, offset %u",
! (uint32) (hdr->xlp_pageaddr >> 32), (uint32) hdr->xlp_pageaddr,
! XLogFileNameP(curFileTLI, segno),
! offset)));
return false;
}
***************
*** 3669,3679 **** ValidXLogPageHeader(XLogPageHeader hdr, int emode, bool segmentonly)
*/
if (!list_member_int(expectedTLIs, (int) hdr->xlp_tli))
{
! ereport(emode_for_corrupt_record(emode, recaddr),
! (errmsg("unexpected timeline ID %u in log segment %s, offset %u",
! hdr->xlp_tli,
! XLogFileNameP(curFileTLI, readSegNo),
! readOff)));
return false;
}
--- 3331,3341 ----
*/
if (!list_member_int(expectedTLIs, (int) hdr->xlp_tli))
{
! ereport(emode_for_corrupt_record(emode, source, recaddr),
! (errmsg("unexpected timeline ID %u in log segment %s, offset %u",
! hdr->xlp_tli,
! XLogFileNameP(curFileTLI, segno),
! offset)));
return false;
}
***************
*** 3697,3708 **** ValidXLogPageHeader(XLogPageHeader hdr, int emode, bool segmentonly)
*/
if (hdr->xlp_tli < (segmentonly ? lastSegmentTLI : lastPageTLI))
{
! ereport(emode_for_corrupt_record(emode, recaddr),
(errmsg("out-of-sequence timeline ID %u (after %u) in log segment %s, offset %u",
hdr->xlp_tli,
segmentonly ? lastSegmentTLI : lastPageTLI,
! XLogFileNameP(curFileTLI, readSegNo),
! readOff)));
return false;
}
lastPageTLI = hdr->xlp_tli;
--- 3359,3370 ----
*/
if (hdr->xlp_tli < (segmentonly ? lastSegmentTLI : lastPageTLI))
{
! ereport(emode_for_corrupt_record(emode, source, recaddr),
(errmsg("out-of-sequence timeline ID %u (after %u) in log segment %s, offset %u",
hdr->xlp_tli,
segmentonly ? lastSegmentTLI : lastPageTLI,
! XLogFileNameP(curFileTLI, segno),
! offset)));
return false;
}
lastPageTLI = hdr->xlp_tli;
***************
*** 3713,3800 **** ValidXLogPageHeader(XLogPageHeader hdr, int emode, bool segmentonly)
}
/*
- * Validate an XLOG record header.
- *
- * This is just a convenience subroutine to avoid duplicated code in
- * ReadRecord. It's not intended for use from anywhere else.
- */
- static bool
- ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record, int emode,
- bool randAccess)
- {
- /*
- * xl_len == 0 is bad data for everything except XLOG SWITCH, where it is
- * required.
- */
- if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
- {
- if (record->xl_len != 0)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid xlog switch record at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- }
- else if (record->xl_len == 0)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("record with zero length at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
- record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
- XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid record length at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- if (record->xl_rmid > RM_MAX_ID)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid resource manager ID %u at %X/%X",
- record->xl_rmid, (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- if (randAccess)
- {
- /*
- * We can't exactly verify the prev-link, but surely it should be less
- * than the record's own address.
- */
- if (!XLByteLT(record->xl_prev, *RecPtr))
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("record with incorrect prev-link %X/%X at %X/%X",
- (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- }
- else
- {
- /*
- * Record's prev-link should exactly match our previous location. This
- * check guards against torn WAL pages where a stale but valid-looking
- * WAL record starts on a sector boundary.
- */
- if (!XLByteEQ(record->xl_prev, ReadRecPtr))
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("record with incorrect prev-link %X/%X at %X/%X",
- (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- }
-
- return true;
- }
-
- /*
* Scan for new timelines that might have appeared in the archive since we
* started recovery.
*
--- 3375,3380 ----
***************
*** 4755,4761 **** readRecoveryCommandFile(void)
* Exit archive-recovery state
*/
static void
! exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo)
{
char recoveryPath[MAXPGPATH];
char xlogpath[MAXPGPATH];
--- 4335,4342 ----
* Exit archive-recovery state
*/
static void
! exitArchiveRecovery(XLogPageReadPrivate *private, TimeLineID endTLI,
! XLogSegNo endLogSegNo)
{
char recoveryPath[MAXPGPATH];
char xlogpath[MAXPGPATH];
***************
*** 4774,4783 **** exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo)
* If the ending log segment is still open, close it (to avoid problems on
* Windows with trying to rename or delete an open file).
*/
! if (readFile >= 0)
{
! close(readFile);
! readFile = -1;
}
/*
--- 4355,4364 ----
* If the ending log segment is still open, close it (to avoid problems on
* Windows with trying to rename or delete an open file).
*/
! if (private->readFile >= 0)
{
! close(private->readFile);
! private->readFile = -1;
}
/*
***************
*** 5212,5217 **** StartupXLOG(void)
--- 4793,4800 ----
bool backupEndRequired = false;
bool backupFromStandby = false;
DBState dbstate_at_startup;
+ XLogReaderState *xlogreader;
+ XLogPageReadPrivate *private;
/*
* Read control file and check XLOG status looks valid.
***************
*** 5345,5350 **** StartupXLOG(void)
--- 4928,4942 ----
if (StandbyMode)
OwnLatch(&XLogCtl->recoveryWakeupLatch);
+ private = palloc0(sizeof(XLogPageReadPrivate));
+ private->readFile = -1;
+ xlogreader = XLogReaderAllocate(InvalidXLogRecPtr, &XLogPageRead, private);
+ if (!xlogreader)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating an XLog reading processor")));
+
if (read_backup_label(&checkPointLoc, &backupEndRequired,
&backupFromStandby))
{
***************
*** 5352,5365 **** StartupXLOG(void)
* When a backup_label file is present, we want to roll forward from
* the checkpoint it identifies, rather than using pg_control.
*/
! record = ReadCheckpointRecord(checkPointLoc, 0);
if (record != NULL)
{
memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
wasShutdown = (record->xl_info == XLOG_CHECKPOINT_SHUTDOWN);
ereport(DEBUG1,
(errmsg("checkpoint record is at %X/%X",
! (uint32) (checkPointLoc >> 32), (uint32) checkPointLoc)));
InRecovery = true; /* force recovery even if SHUTDOWNED */
/*
--- 4944,4957 ----
* When a backup_label file is present, we want to roll forward from
* the checkpoint it identifies, rather than using pg_control.
*/
! record = ReadCheckpointRecord(xlogreader, checkPointLoc, 0);
if (record != NULL)
{
memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
wasShutdown = (record->xl_info == XLOG_CHECKPOINT_SHUTDOWN);
ereport(DEBUG1,
(errmsg("checkpoint record is at %X/%X",
! (uint32) (checkPointLoc >> 32), (uint32) checkPointLoc)));
InRecovery = true; /* force recovery even if SHUTDOWNED */
/*
***************
*** 5370,5376 **** StartupXLOG(void)
*/
if (XLByteLT(checkPoint.redo, checkPointLoc))
{
! if (!ReadRecord(&(checkPoint.redo), LOG, false))
ereport(FATAL,
(errmsg("could not find redo location referenced by checkpoint record"),
errhint("If you are not restoring from a backup, try removing the file \"%s/backup_label\".", DataDir)));
--- 4962,4968 ----
*/
if (XLByteLT(checkPoint.redo, checkPointLoc))
{
! if (!ReadRecord(xlogreader, checkPoint.redo, LOG, false))
ereport(FATAL,
(errmsg("could not find redo location referenced by checkpoint record"),
errhint("If you are not restoring from a backup, try removing the file \"%s/backup_label\".", DataDir)));
***************
*** 5394,5405 **** StartupXLOG(void)
*/
checkPointLoc = ControlFile->checkPoint;
RedoStartLSN = ControlFile->checkPointCopy.redo;
! record = ReadCheckpointRecord(checkPointLoc, 1);
if (record != NULL)
{
ereport(DEBUG1,
(errmsg("checkpoint record is at %X/%X",
! (uint32) (checkPointLoc >> 32), (uint32) checkPointLoc)));
}
else if (StandbyMode)
{
--- 4986,4997 ----
*/
checkPointLoc = ControlFile->checkPoint;
RedoStartLSN = ControlFile->checkPointCopy.redo;
! record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1);
if (record != NULL)
{
ereport(DEBUG1,
(errmsg("checkpoint record is at %X/%X",
! (uint32) (checkPointLoc >> 32), (uint32) checkPointLoc)));
}
else if (StandbyMode)
{
***************
*** 5413,5424 **** StartupXLOG(void)
else
{
checkPointLoc = ControlFile->prevCheckPoint;
! record = ReadCheckpointRecord(checkPointLoc, 2);
if (record != NULL)
{
ereport(LOG,
(errmsg("using previous checkpoint record at %X/%X",
! (uint32) (checkPointLoc >> 32), (uint32) checkPointLoc)));
InRecovery = true; /* force recovery even if SHUTDOWNED */
}
else
--- 5005,5016 ----
else
{
checkPointLoc = ControlFile->prevCheckPoint;
! record = ReadCheckpointRecord(xlogreader, checkPointLoc, 2);
if (record != NULL)
{
ereport(LOG,
(errmsg("using previous checkpoint record at %X/%X",
! (uint32) (checkPointLoc >> 32), (uint32) checkPointLoc)));
InRecovery = true; /* force recovery even if SHUTDOWNED */
}
else
***************
*** 5433,5439 **** StartupXLOG(void)
ereport(DEBUG1,
(errmsg("redo record is at %X/%X; shutdown %s",
! (uint32) (checkPoint.redo >> 32), (uint32) checkPoint.redo,
wasShutdown ? "TRUE" : "FALSE")));
ereport(DEBUG1,
(errmsg("next transaction ID: %u/%u; next OID: %u",
--- 5025,5031 ----
ereport(DEBUG1,
(errmsg("redo record is at %X/%X; shutdown %s",
! (uint32) (checkPoint.redo >> 32), (uint32) checkPoint.redo,
wasShutdown ? "TRUE" : "FALSE")));
ereport(DEBUG1,
(errmsg("next transaction ID: %u/%u; next OID: %u",
***************
*** 5714,5720 **** StartupXLOG(void)
* Allow read-only connections immediately if we're consistent
* already.
*/
! CheckRecoveryConsistency();
/*
* Find the first record that logically follows the checkpoint --- it
--- 5306,5312 ----
* Allow read-only connections immediately if we're consistent
* already.
*/
! CheckRecoveryConsistency(EndRecPtr);
/*
* Find the first record that logically follows the checkpoint --- it
***************
*** 5723,5734 **** StartupXLOG(void)
if (XLByteLT(checkPoint.redo, RecPtr))
{
/* back up to find the record */
! record = ReadRecord(&(checkPoint.redo), PANIC, false);
}
else
{
/* just have to read next record after CheckPoint */
! record = ReadRecord(NULL, LOG, false);
}
if (record != NULL)
--- 5315,5326 ----
if (XLByteLT(checkPoint.redo, RecPtr))
{
/* back up to find the record */
! record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);
}
else
{
/* just have to read next record after CheckPoint */
! record = ReadRecord(xlogreader, InvalidXLogRecPtr, LOG, false);
}
if (record != NULL)
***************
*** 5743,5749 **** StartupXLOG(void)
ereport(LOG,
(errmsg("redo starts at %X/%X",
! (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
/*
* main redo apply loop
--- 5335,5341 ----
ereport(LOG,
(errmsg("redo starts at %X/%X",
! (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
/*
* main redo apply loop
***************
*** 5759,5766 **** StartupXLOG(void)
initStringInfo(&buf);
appendStringInfo(&buf, "REDO @ %X/%X; LSN %X/%X: ",
! (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr,
! (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr);
xlog_outrec(&buf, record);
appendStringInfo(&buf, " - ");
RmgrTable[record->xl_rmid].rm_desc(&buf,
--- 5351,5358 ----
initStringInfo(&buf);
appendStringInfo(&buf, "REDO @ %X/%X; LSN %X/%X: ",
! (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr,
! (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr);
xlog_outrec(&buf, record);
appendStringInfo(&buf, " - ");
RmgrTable[record->xl_rmid].rm_desc(&buf,
***************
*** 5775,5781 **** StartupXLOG(void)
HandleStartupProcInterrupts();
/* Allow read-only connections if we're consistent now */
! CheckRecoveryConsistency();
/*
* Have we reached our recovery target?
--- 5367,5373 ----
HandleStartupProcInterrupts();
/* Allow read-only connections if we're consistent now */
! CheckRecoveryConsistency(EndRecPtr);
/*
* Have we reached our recovery target?
***************
*** 5879,5885 **** StartupXLOG(void)
LastRec = ReadRecPtr;
! record = ReadRecord(NULL, LOG, false);
} while (record != NULL && recoveryContinue);
/*
--- 5471,5477 ----
LastRec = ReadRecPtr;
! record = ReadRecord(xlogreader, InvalidXLogRecPtr, LOG, false);
} while (record != NULL && recoveryContinue);
/*
***************
*** 5888,5894 **** StartupXLOG(void)
ereport(LOG,
(errmsg("redo done at %X/%X",
! (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
xtime = GetLatestXTime();
if (xtime)
ereport(LOG,
--- 5480,5486 ----
ereport(LOG,
(errmsg("redo done at %X/%X",
! (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
xtime = GetLatestXTime();
if (xtime)
ereport(LOG,
***************
*** 5929,5935 **** StartupXLOG(void)
* Re-fetch the last valid or last applied record, so we can identify the
* exact endpoint of what we consider the valid portion of WAL.
*/
! record = ReadRecord(&LastRec, PANIC, false);
EndOfLog = EndRecPtr;
XLByteToPrevSeg(EndOfLog, endLogSegNo);
--- 5521,5527 ----
* Re-fetch the last valid or last applied record, so we can identify the
* exact endpoint of what we consider the valid portion of WAL.
*/
! record = ReadRecord(xlogreader, LastRec, PANIC, false);
EndOfLog = EndRecPtr;
XLByteToPrevSeg(EndOfLog, endLogSegNo);
***************
*** 5992,5998 **** StartupXLOG(void)
*/
if (InArchiveRecovery)
{
! char reason[200];
ThisTimeLineID = findNewestTimeLine(recoveryTargetTLI) + 1;
ereport(LOG,
--- 5584,5590 ----
*/
if (InArchiveRecovery)
{
! char reason[200];
ThisTimeLineID = findNewestTimeLine(recoveryTargetTLI) + 1;
ereport(LOG,
***************
*** 6033,6039 **** StartupXLOG(void)
* we will use that below.)
*/
if (InArchiveRecovery)
! exitArchiveRecovery(curFileTLI, endLogSegNo);
/*
* Prepare to write WAL starting at EndOfLog position, and init xlog
--- 5625,5631 ----
* we will use that below.)
*/
if (InArchiveRecovery)
! exitArchiveRecovery(private, curFileTLI, endLogSegNo);
/*
* Prepare to write WAL starting at EndOfLog position, and init xlog
***************
*** 6052,6059 **** StartupXLOG(void)
* record spans, not the one it starts in. The last block is indeed the
* one we want to use.
*/
! Assert(readOff == (XLogCtl->xlblocks[0] - XLOG_BLCKSZ) % XLogSegSize);
! memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
Insert->currpos = (char *) Insert->currpage +
(EndOfLog + XLOG_BLCKSZ - XLogCtl->xlblocks[0]);
--- 5644,5658 ----
* record spans, not the one it starts in. The last block is indeed the
* one we want to use.
*/
! if (EndOfLog % XLOG_BLCKSZ == 0)
! {
! memset(Insert->currpage, 0, XLOG_BLCKSZ);
! }
! else
! {
! Assert(private->readOff == (XLogCtl->xlblocks[0] - XLOG_BLCKSZ) % XLogSegSize);
! memcpy((char *) Insert->currpage, xlogreader->readBuf, XLOG_BLCKSZ);
! }
Insert->currpos = (char *) Insert->currpage +
(EndOfLog + XLOG_BLCKSZ - XLogCtl->xlblocks[0]);
***************
*** 6205,6226 **** StartupXLOG(void)
ShutdownRecoveryTransactionEnvironment();
/* Shut down readFile facility, free space */
! if (readFile >= 0)
! {
! close(readFile);
! readFile = -1;
! }
! if (readBuf)
{
! free(readBuf);
! readBuf = NULL;
! }
! if (readRecordBuf)
! {
! free(readRecordBuf);
! readRecordBuf = NULL;
! readRecordBufSize = 0;
}
/*
* If any of the critical GUCs have changed, log them before we allow
--- 5804,5818 ----
ShutdownRecoveryTransactionEnvironment();
/* Shut down readFile facility, free space */
! private = (XLogPageReadPrivate *) xlogreader->private_data;
! if (private->readFile >= 0)
{
! close(private->readFile);
! private->readFile = -1;
}
+ if (xlogreader->private_data)
+ free(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
/*
* If any of the critical GUCs have changed, log them before we allow
***************
*** 6251,6257 **** StartupXLOG(void)
* that it can start accepting read-only connections.
*/
static void
! CheckRecoveryConsistency(void)
{
/*
* During crash recovery, we don't reach a consistent state until we've
--- 5843,5849 ----
* that it can start accepting read-only connections.
*/
static void
! CheckRecoveryConsistency(XLogRecPtr EndRecPtr)
{
/*
* During crash recovery, we don't reach a consistent state until we've
***************
*** 6431,6437 **** LocalSetXLogInsertAllowed(void)
* 1 for "primary", 2 for "secondary", 0 for "other" (backup_label)
*/
static XLogRecord *
! ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt)
{
XLogRecord *record;
--- 6023,6029 ----
* 1 for "primary", 2 for "secondary", 0 for "other" (backup_label)
*/
static XLogRecord *
! ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int whichChkpt)
{
XLogRecord *record;
***************
*** 6455,6461 **** ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt)
return NULL;
}
! record = ReadRecord(&RecPtr, LOG, true);
if (record == NULL)
{
--- 6047,6053 ----
return NULL;
}
! record = ReadRecord(xlogreader, RecPtr, LOG, true);
if (record == NULL)
{
***************
*** 6683,6689 **** GetRecoveryTargetTLI(void)
{
/* use volatile pointer to prevent code rearrangement */
volatile XLogCtlData *xlogctl = XLogCtl;
! TimeLineID result;
SpinLockAcquire(&xlogctl->info_lck);
result = xlogctl->RecoveryTargetTLI;
--- 6275,6281 ----
{
/* use volatile pointer to prevent code rearrangement */
volatile XLogCtlData *xlogctl = XLogCtl;
! TimeLineID result;
SpinLockAcquire(&xlogctl->info_lck);
result = xlogctl->RecoveryTargetTLI;
***************
*** 6968,6974 **** CreateCheckPoint(int flags)
XLogRecPtr curInsert;
INSERT_RECPTR(curInsert, Insert, Insert->curridx);
! if (curInsert == ControlFile->checkPoint +
MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) &&
ControlFile->checkPoint == ControlFile->checkPointCopy.redo)
{
--- 6560,6566 ----
XLogRecPtr curInsert;
INSERT_RECPTR(curInsert, Insert, Insert->curridx);
! if (curInsert == ControlFile->checkPoint +
MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) &&
ControlFile->checkPoint == ControlFile->checkPointCopy.redo)
{
***************
*** 7398,7404 **** CreateRestartPoint(int flags)
{
ereport(DEBUG2,
(errmsg("skipping restartpoint, already performed at %X/%X",
! (uint32) (lastCheckPoint.redo >> 32), (uint32) lastCheckPoint.redo)));
UpdateMinRecoveryPoint(InvalidXLogRecPtr, true);
if (flags & CHECKPOINT_IS_SHUTDOWN)
--- 6990,6996 ----
{
ereport(DEBUG2,
(errmsg("skipping restartpoint, already performed at %X/%X",
! (uint32) (lastCheckPoint.redo >> 32), (uint32) lastCheckPoint.redo)));
UpdateMinRecoveryPoint(InvalidXLogRecPtr, true);
if (flags & CHECKPOINT_IS_SHUTDOWN)
***************
*** 7508,7514 **** CreateRestartPoint(int flags)
xtime = GetLatestXTime();
ereport((log_checkpoints ? LOG : DEBUG2),
(errmsg("recovery restart point at %X/%X",
! (uint32) (lastCheckPoint.redo >> 32), (uint32) lastCheckPoint.redo),
xtime ? errdetail("last completed transaction was at log time %s",
timestamptz_to_str(xtime)) : 0));
--- 7100,7106 ----
xtime = GetLatestXTime();
ereport((log_checkpoints ? LOG : DEBUG2),
(errmsg("recovery restart point at %X/%X",
! (uint32) (lastCheckPoint.redo >> 32), (uint32) lastCheckPoint.redo),
xtime ? errdetail("last completed transaction was at log time %s",
timestamptz_to_str(xtime)) : 0));
***************
*** 8033,8039 **** xlog_desc(StringInfo buf, uint8 xl_info, char *rec)
appendStringInfo(buf, "checkpoint: redo %X/%X; "
"tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
"oldest xid %u in DB %u; oldest running xid %u; %s",
! (uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
checkpoint->ThisTimeLineID,
checkpoint->fullPageWrites ? "true" : "false",
checkpoint->nextXidEpoch, checkpoint->nextXid,
--- 7625,7631 ----
appendStringInfo(buf, "checkpoint: redo %X/%X; "
"tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
"oldest xid %u in DB %u; oldest running xid %u; %s",
! (uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
checkpoint->ThisTimeLineID,
checkpoint->fullPageWrites ? "true" : "false",
checkpoint->nextXidEpoch, checkpoint->nextXid,
***************
*** 8214,8220 **** assign_xlog_sync_method(int new_sync_method, void *extra)
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not fsync log segment %s: %m",
! XLogFileNameP(ThisTimeLineID, openLogSegNo))));
if (get_sync_bit(sync_method) != get_sync_bit(new_sync_method))
XLogFileClose();
}
--- 7806,7812 ----
ereport(PANIC,
(errcode_for_file_access(),
errmsg("could not fsync log segment %s: %m",
! XLogFileNameP(ThisTimeLineID, openLogSegNo))));
if (get_sync_bit(sync_method) != get_sync_bit(new_sync_method))
XLogFileClose();
}
***************
*** 8245,8252 **** issue_xlog_fsync(int fd, XLogSegNo segno)
if (pg_fsync_writethrough(fd) != 0)
ereport(PANIC,
(errcode_for_file_access(),
! errmsg("could not fsync write-through log file %s: %m",
! XLogFileNameP(ThisTimeLineID, segno))));
break;
#endif
#ifdef HAVE_FDATASYNC
--- 7837,7844 ----
if (pg_fsync_writethrough(fd) != 0)
ereport(PANIC,
(errcode_for_file_access(),
! errmsg("could not fsync write-through log file %s: %m",
! XLogFileNameP(ThisTimeLineID, segno))));
break;
#endif
#ifdef HAVE_FDATASYNC
***************
*** 8275,8280 **** char *
--- 7867,7873 ----
XLogFileNameP(TimeLineID tli, XLogSegNo segno)
{
char *result = palloc(MAXFNAMELEN);
+
XLogFileName(result, tli, segno);
return result;
}
***************
*** 8520,8528 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
"%Y-%m-%d %H:%M:%S %Z",
pg_localtime(&stamp_time, log_timezone));
appendStringInfo(&labelfbuf, "START WAL LOCATION: %X/%X (file %s)\n",
! (uint32) (startpoint >> 32), (uint32) startpoint, xlogfilename);
appendStringInfo(&labelfbuf, "CHECKPOINT LOCATION: %X/%X\n",
! (uint32) (checkpointloc >> 32), (uint32) checkpointloc);
appendStringInfo(&labelfbuf, "BACKUP METHOD: %s\n",
exclusive ? "pg_start_backup" : "streamed");
appendStringInfo(&labelfbuf, "BACKUP FROM: %s\n",
--- 8113,8121 ----
"%Y-%m-%d %H:%M:%S %Z",
pg_localtime(&stamp_time, log_timezone));
appendStringInfo(&labelfbuf, "START WAL LOCATION: %X/%X (file %s)\n",
! (uint32) (startpoint >> 32), (uint32) startpoint, xlogfilename);
appendStringInfo(&labelfbuf, "CHECKPOINT LOCATION: %X/%X\n",
! (uint32) (checkpointloc >> 32), (uint32) checkpointloc);
appendStringInfo(&labelfbuf, "BACKUP METHOD: %s\n",
exclusive ? "pg_start_backup" : "streamed");
appendStringInfo(&labelfbuf, "BACKUP FROM: %s\n",
***************
*** 8870,8876 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
errmsg("could not create file \"%s\": %m",
histfilepath)));
fprintf(fp, "START WAL LOCATION: %X/%X (file %s)\n",
! (uint32) (startpoint >> 32), (uint32) startpoint, startxlogfilename);
fprintf(fp, "STOP WAL LOCATION: %X/%X (file %s)\n",
(uint32) (stoppoint >> 32), (uint32) stoppoint, stopxlogfilename);
/* transfer remaining lines from label to history file */
--- 8463,8469 ----
errmsg("could not create file \"%s\": %m",
histfilepath)));
fprintf(fp, "START WAL LOCATION: %X/%X (file %s)\n",
! (uint32) (startpoint >> 32), (uint32) startpoint, startxlogfilename);
fprintf(fp, "STOP WAL LOCATION: %X/%X (file %s)\n",
(uint32) (stoppoint >> 32), (uint32) stoppoint, stopxlogfilename);
/* transfer remaining lines from label to history file */
***************
*** 9261,9287 **** CancelBackup(void)
* sleep and retry.
*/
static bool
! XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
! bool randAccess)
{
uint32 targetPageOff;
uint32 targetRecOff;
XLogSegNo targetSegNo;
! XLByteToSeg(*RecPtr, targetSegNo);
! targetPageOff = (((*RecPtr) % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
! targetRecOff = (*RecPtr) % XLOG_BLCKSZ;
/* Fast exit if we have read the record in the current buffer already */
! if (failedSources == 0 && targetSegNo == readSegNo &&
! targetPageOff == readOff && targetRecOff < readLen)
return true;
/*
* See if we need to switch to a new segment because the requested record
* is not in the currently open one.
*/
! if (readFile >= 0 && !XLByteInSeg(*RecPtr, readSegNo))
{
/*
* Request a restartpoint if we've replayed too much xlog since the
--- 8854,8881 ----
* sleep and retry.
*/
static bool
! XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
! bool randAccess, char *readBuf, void *private_data)
{
+ XLogPageReadPrivate *private = (XLogPageReadPrivate *) private_data;
uint32 targetPageOff;
uint32 targetRecOff;
XLogSegNo targetSegNo;
! XLByteToSeg(RecPtr, targetSegNo);
! targetPageOff = ((RecPtr % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
! targetRecOff = RecPtr % XLOG_BLCKSZ;
/* Fast exit if we have read the record in the current buffer already */
! if (private->failedSources == 0 && targetSegNo == private->readSegNo &&
! targetPageOff == private->readOff && targetRecOff < private->readLen)
return true;
/*
* See if we need to switch to a new segment because the requested record
* is not in the currently open one.
*/
! if (private->readFile >= 0 && !XLByteInSeg(RecPtr, private->readSegNo))
{
/*
* Request a restartpoint if we've replayed too much xlog since the
***************
*** 9289,9324 **** XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
*/
if (StandbyMode && bgwriterLaunched)
{
! if (XLogCheckpointNeeded(readSegNo))
{
(void) GetRedoRecPtr();
! if (XLogCheckpointNeeded(readSegNo))
RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
}
}
! close(readFile);
! readFile = -1;
! readSource = 0;
}
! XLByteToSeg(*RecPtr, readSegNo);
retry:
/* See if we need to retrieve more data */
! if (readFile < 0 ||
! (readSource == XLOG_FROM_STREAM && !XLByteLT(*RecPtr, receivedUpto)))
{
if (StandbyMode)
{
! if (!WaitForWALToBecomeAvailable(*RecPtr, randAccess,
! fetching_ckpt))
goto triggered;
}
else
{
/* In archive or crash recovery. */
! if (readFile < 0)
{
int sources;
--- 8883,8918 ----
*/
if (StandbyMode && bgwriterLaunched)
{
! if (XLogCheckpointNeeded(private->readSegNo))
{
(void) GetRedoRecPtr();
! if (XLogCheckpointNeeded(private->readSegNo))
RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
}
}
! close(private->readFile);
! private->readFile = -1;
! private->readSource = 0;
}
! XLByteToSeg(RecPtr, private->readSegNo);
retry:
/* See if we need to retrieve more data */
! if (private->readFile < 0 ||
! (private->readSource == XLOG_FROM_STREAM &&
! !XLByteLT(RecPtr, receivedUpto)))
{
if (StandbyMode)
{
! if (!WaitForWALToBecomeAvailable(private, RecPtr, randAccess))
goto triggered;
}
else
{
/* In archive or crash recovery. */
! if (private->readFile < 0)
{
int sources;
***************
*** 9330,9337 **** retry:
if (InArchiveRecovery)
sources |= XLOG_FROM_ARCHIVE;
! readFile = XLogFileReadAnyTLI(readSegNo, emode, sources);
! if (readFile < 0)
return false;
}
}
--- 8924,8933 ----
if (InArchiveRecovery)
sources |= XLOG_FROM_ARCHIVE;
! private->readFile =
! XLogFileReadAnyTLI(private, private->readSegNo, emode,
! sources);
! if (private->readFile < 0)
return false;
}
}
***************
*** 9341,9347 **** retry:
* At this point, we have the right segment open and if we're streaming we
* know the requested record is in it.
*/
! Assert(readFile != -1);
/*
* If the current segment is being streamed from master, calculate how
--- 8937,8943 ----
* At this point, we have the right segment open and if we're streaming we
* know the requested record is in it.
*/
! Assert(private->readFile != -1);
/*
* If the current segment is being streamed from master, calculate how
***************
*** 9349,9367 **** retry:
* requested record has been received, but this is for the benefit of
* future calls, to allow quick exit at the top of this function.
*/
! if (readSource == XLOG_FROM_STREAM)
{
! if (((*RecPtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
{
! readLen = XLOG_BLCKSZ;
}
else
! readLen = receivedUpto % XLogSegSize - targetPageOff;
}
else
! readLen = XLOG_BLCKSZ;
! if (!readFileHeaderValidated && targetPageOff != 0)
{
/*
* Whenever switching to a new WAL segment, we read the first page of
--- 8945,8963 ----
* requested record has been received, but this is for the benefit of
* future calls, to allow quick exit at the top of this function.
*/
! if (private->readSource == XLOG_FROM_STREAM)
{
! if (((RecPtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
{
! private->readLen = XLOG_BLCKSZ;
}
else
! private->readLen = receivedUpto % XLogSegSize - targetPageOff;
}
else
! private->readLen = XLOG_BLCKSZ;
! if (!private->readFileHeaderValidated && targetPageOff != 0)
{
/*
* Whenever switching to a new WAL segment, we read the first page of
***************
*** 9370,9431 **** retry:
* identification info that is present in the first page's "long"
* header.
*/
! readOff = 0;
! if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
{
! char fname[MAXFNAMELEN];
! XLogFileName(fname, curFileTLI, readSegNo);
! ereport(emode_for_corrupt_record(emode, *RecPtr),
(errcode_for_file_access(),
! errmsg("could not read from log segment %s, offset %u: %m",
! fname, readOff)));
goto next_record_is_invalid;
}
! if (!ValidXLogPageHeader((XLogPageHeader) readBuf, emode, true))
goto next_record_is_invalid;
}
/* Read the requested page */
! readOff = targetPageOff;
! if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
{
! char fname[MAXFNAMELEN];
! XLogFileName(fname, curFileTLI, readSegNo);
! ereport(emode_for_corrupt_record(emode, *RecPtr),
(errcode_for_file_access(),
! errmsg("could not seek in log segment %s to offset %u: %m",
! fname, readOff)));
goto next_record_is_invalid;
}
! if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
{
! char fname[MAXFNAMELEN];
! XLogFileName(fname, curFileTLI, readSegNo);
! ereport(emode_for_corrupt_record(emode, *RecPtr),
(errcode_for_file_access(),
! errmsg("could not read from log segment %s, offset %u: %m",
! fname, readOff)));
goto next_record_is_invalid;
}
! if (!ValidXLogPageHeader((XLogPageHeader) readBuf, emode, false))
goto next_record_is_invalid;
! readFileHeaderValidated = true;
! Assert(targetSegNo == readSegNo);
! Assert(targetPageOff == readOff);
! Assert(targetRecOff < readLen);
return true;
next_record_is_invalid:
! failedSources |= readSource;
! if (readFile >= 0)
! close(readFile);
! readFile = -1;
! readLen = 0;
! readSource = 0;
/* In standby-mode, keep trying */
if (StandbyMode)
--- 8966,9034 ----
* identification info that is present in the first page's "long"
* header.
*/
! private->readOff = 0;
! if (read(private->readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
{
! char fname[MAXFNAMELEN];
!
! XLogFileName(fname, curFileTLI, private->readSegNo);
! ereport(emode_for_corrupt_record(emode, private->readSource, RecPtr),
(errcode_for_file_access(),
! errmsg("could not read from log segment %s, offset %u: %m",
! fname, private->readOff)));
goto next_record_is_invalid;
}
! if (!ValidXLogPageHeader(private->readSegNo, private->readOff,
! private->readSource, (XLogPageHeader) readBuf,
! emode, true))
goto next_record_is_invalid;
}
/* Read the requested page */
! private->readOff = targetPageOff;
! if (lseek(private->readFile, (off_t) private->readOff, SEEK_SET) < 0)
{
! char fname[MAXFNAMELEN];
!
! XLogFileName(fname, curFileTLI, private->readSegNo);
! ereport(emode_for_corrupt_record(emode, private->readSource, RecPtr),
(errcode_for_file_access(),
! errmsg("could not seek in log segment %s to offset %u: %m",
! fname, private->readOff)));
goto next_record_is_invalid;
}
! if (read(private->readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
{
! char fname[MAXFNAMELEN];
!
! XLogFileName(fname, curFileTLI, private->readSegNo);
! ereport(emode_for_corrupt_record(emode, private->readSource, RecPtr),
(errcode_for_file_access(),
! errmsg("could not read from log segment %s, offset %u: %m",
! fname, private->readOff)));
goto next_record_is_invalid;
}
! if (!ValidXLogPageHeader(private->readSegNo, private->readOff,
! private->readSource, (XLogPageHeader) readBuf,
! emode, false))
goto next_record_is_invalid;
! private->readFileHeaderValidated = true;
! Assert(targetSegNo == private->readSegNo);
! Assert(targetPageOff == private->readOff);
! Assert(targetRecOff < private->readLen);
return true;
next_record_is_invalid:
! private->failedSources |= private->readSource;
! if (private->readFile >= 0)
! close(private->readFile);
! private->readFile = -1;
! private->readLen = 0;
! private->readSource = 0;
/* In standby-mode, keep trying */
if (StandbyMode)
***************
*** 9434,9444 **** next_record_is_invalid:
return false;
triggered:
! if (readFile >= 0)
! close(readFile);
! readFile = -1;
! readLen = 0;
! readSource = 0;
return false;
}
--- 9037,9047 ----
return false;
triggered:
! if (private->readFile >= 0)
! close(private->readFile);
! private->readFile = -1;
! private->readLen = 0;
! private->readSource = 0;
return false;
}
***************
*** 9455,9462 **** triggered:
* false.
*/
static bool
! WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
! bool fetching_ckpt)
{
static pg_time_t last_fail_time = 0;
--- 9058,9065 ----
* false.
*/
static bool
! WaitForWALToBecomeAvailable(XLogPageReadPrivate *private, XLogRecPtr RecPtr,
! bool randAccess)
{
static pg_time_t last_fail_time = 0;
***************
*** 9475,9481 **** WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* the archive should be identical to what was streamed, so it's
* unlikely that it helps, but one can hope...
*/
! if (failedSources & XLOG_FROM_STREAM)
{
ShutdownWalRcv();
continue;
--- 9078,9084 ----
* the archive should be identical to what was streamed, so it's
* unlikely that it helps, but one can hope...
*/
! if (private->failedSources & XLOG_FROM_STREAM)
{
ShutdownWalRcv();
continue;
***************
*** 9514,9534 **** WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (havedata)
{
/*
! * Great, streamed far enough. Open the file if it's not open
* already. Use XLOG_FROM_STREAM so that source info is set
* correctly and XLogReceiptTime isn't changed.
*/
! if (readFile < 0)
{
! readFile = XLogFileRead(readSegNo, PANIC,
! recoveryTargetTLI,
! XLOG_FROM_STREAM, false);
! Assert(readFile >= 0);
}
else
{
/* just make sure source info is correct... */
! readSource = XLOG_FROM_STREAM;
XLogReceiptSource = XLOG_FROM_STREAM;
}
break;
--- 9117,9138 ----
if (havedata)
{
/*
! * Great, streamed far enough. Open the file if it's not open
* already. Use XLOG_FROM_STREAM so that source info is set
* correctly and XLogReceiptTime isn't changed.
*/
! if (private->readFile < 0)
{
! private->readFile =
! XLogFileRead(private, private->readSegNo, PANIC,
! recoveryTargetTLI,
! XLOG_FROM_STREAM, false);
! Assert(private->readFile >= 0);
}
else
{
/* just make sure source info is correct... */
! private->readSource = XLOG_FROM_STREAM;
XLogReceiptSource = XLOG_FROM_STREAM;
}
break;
***************
*** 9557,9566 **** WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
int sources;
pg_time_t now;
! if (readFile >= 0)
{
! close(readFile);
! readFile = -1;
}
/* Reset curFileTLI if random fetch. */
if (randAccess)
--- 9161,9170 ----
int sources;
pg_time_t now;
! if (private->readFile >= 0)
{
! close(private->readFile);
! private->readFile = -1;
}
/* Reset curFileTLI if random fetch. */
if (randAccess)
***************
*** 9571,9582 **** WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* from pg_xlog.
*/
sources = XLOG_FROM_ARCHIVE | XLOG_FROM_PG_XLOG;
! if (!(sources & ~failedSources))
{
/*
* We've exhausted all options for retrieving the file. Retry.
*/
! failedSources = 0;
/*
* Before we sleep, re-scan for possible new timelines if we
--- 9175,9186 ----
* from pg_xlog.
*/
sources = XLOG_FROM_ARCHIVE | XLOG_FROM_PG_XLOG;
! if (!(sources & ~private->failedSources))
{
/*
* We've exhausted all options for retrieving the file. Retry.
*/
! private->failedSources = 0;
/*
* Before we sleep, re-scan for possible new timelines if we
***************
*** 9605,9634 **** WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* stream the missing WAL, before retrying to restore from
* archive/pg_xlog.
*
! * If fetching_ckpt is TRUE, RecPtr points to the initial
! * checkpoint location. In that case, we use RedoStartLSN as
! * the streaming start position instead of RecPtr, so that
! * when we later jump backwards to start redo at RedoStartLSN,
! * we will have the logs streamed already.
*/
if (PrimaryConnInfo)
{
! XLogRecPtr ptr = fetching_ckpt ? RedoStartLSN : RecPtr;
RequestXLogStreaming(ptr, PrimaryConnInfo);
continue;
}
}
/* Don't try to read from a source that just failed */
! sources &= ~failedSources;
! readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, sources);
! if (readFile >= 0)
break;
/*
* Nope, not found in archive and/or pg_xlog.
*/
! failedSources |= sources;
/*
* Check to see if the trigger file exists. Note that we do this
--- 9209,9240 ----
* stream the missing WAL, before retrying to restore from
* archive/pg_xlog.
*
! * If we're fetching a checkpoint record, RecPtr points to the
! * initial checkpoint location. In that case, we use
! * RedoStartLSN as the streaming start position instead of
! * RecPtr, so that when we later jump backwards to start redo
! * at RedoStartLSN, we will have the logs streamed already.
*/
if (PrimaryConnInfo)
{
! XLogRecPtr ptr = private->fetching_ckpt ?
! RedoStartLSN : RecPtr;
RequestXLogStreaming(ptr, PrimaryConnInfo);
continue;
}
}
/* Don't try to read from a source that just failed */
! sources &= ~private->failedSources;
! private->readFile = XLogFileReadAnyTLI(private, private->readSegNo,
! DEBUG2, sources);
! if (private->readFile >= 0)
break;
/*
* Nope, not found in archive and/or pg_xlog.
*/
! private->failedSources |= sources;
/*
* Check to see if the trigger file exists. Note that we do this
***************
*** 9668,9679 **** WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* you are about to ereport(), or you might cause a later message to be
* erroneously suppressed.
*/
! static int
! emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
{
static XLogRecPtr lastComplaint = 0;
! if (readSource == XLOG_FROM_PG_XLOG && emode == LOG)
{
if (XLByteEQ(RecPtr, lastComplaint))
emode = DEBUG1;
--- 9274,9285 ----
* you are about to ereport(), or you might cause a later message to be
* erroneously suppressed.
*/
! int
! emode_for_corrupt_record(int emode, int source, XLogRecPtr RecPtr)
{
static XLogRecPtr lastComplaint = 0;
! if (source == XLOG_FROM_PG_XLOG && emode == LOG)
{
if (XLByteEQ(RecPtr, lastComplaint))
emode = DEBUG1;
*** /dev/null
--- b/src/backend/access/transam/xlogreader.c
***************
*** 0 ****
--- 1,542 ----
+ /*-------------------------------------------------------------------------
+ *
+ * xlogreader.c
+ * Generic xlog reading facility
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogreader.c
+ *
+ * NOTES
+ * Documentation about how to use this interface can be found in
+ * xlogreader.h, more specifically in the definition of the
+ * XLogReaderState struct where all parameters are documented.
+ *
+ * TODO:
+ * * usable without backend code around
+ *-------------------------------------------------------------------------
+ */
+
+ #include "postgres.h"
+
+ #include "access/transam.h"
+ #include "access/xlog_internal.h"
+ #include "access/xlogreader.h"
+ #include "catalog/pg_control.h"
+
+ static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
+ static bool ValidXLogRecordHeader(XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr,
+ XLogRecord *record, int emode, bool randAccess);
+ static bool RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode);
+
+ /*
+ * Allocate and initialize a new xlog reader
+ *
+ * Returns NULL if the xlogreader couldn't be allocated.
+ */
+ XLogReaderState *
+ XLogReaderAllocate(XLogRecPtr startpoint,
+ XLogPageReadCB pagereadfunc, void *private_data)
+ {
+ XLogReaderState *state;
+
+ state = (XLogReaderState *) malloc(sizeof(XLogReaderState));
+ if (!state)
+ return NULL;
+ MemSet(state, 0, sizeof(XLogReaderState));
+
+ /*
+ * Permanently allocate readBuf. We do it this way, rather than just
+ * making a static array, for two reasons: (1) no need to waste the
+ * storage in most instantiations of the backend; (2) a static char array
+ * isn't guaranteed to have any particular alignment, whereas malloc()
+ * will provide MAXALIGN'd storage.
+ */
+ state->readBuf = (char *) malloc(XLOG_BLCKSZ);
+ if (!state->readBuf)
+ {
+ free(state);
+ return NULL;
+ }
+
+ state->read_page = pagereadfunc;
+ state->private_data = private_data;
+ state->EndRecPtr = startpoint;
+
+ /*
+ * Allocate an initial readRecordBuf of minimal size, which can later be
+ * enlarged if necessary.
+ */
+ if (!allocate_recordbuf(state, 0))
+ {
+ free(state->readBuf);
+ free(state);
+ return NULL;
+ }
+
+ return state;
+ }
+
+ void
+ XLogReaderFree(XLogReaderState *state)
+ {
+ if (state->readRecordBuf)
+ free(state->readRecordBuf);
+ free(state->readBuf);
+ free(state);
+ }
+
+ /*
+ * Allocate readRecordBuf to fit a record of at least the given length.
+ * Returns true if successful, false if out of memory.
+ *
+ * readRecordBufSize is set to the new buffer size.
+ *
+ * To avoid useless small increases, round its size to a multiple of
+ * XLOG_BLCKSZ, and make sure it's at least 4*Max(BLCKSZ, XLOG_BLCKSZ) to start
+ * with. (That is enough for all "normal" records, but very large commit or
+ * abort records might need more space.)
+ */
+ static bool
+ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
+ {
+ uint32 newSize = reclength;
+
+ newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
+ newSize = Max(newSize, 4 * Max(BLCKSZ, XLOG_BLCKSZ));
+
+ if (state->readRecordBuf)
+ free(state->readRecordBuf);
+ state->readRecordBuf = (char *) malloc(newSize);
+ if (!state->readRecordBuf)
+ {
+ state->readRecordBufSize = 0;
+ return false;
+ }
+
+ state->readRecordBufSize = newSize;
+ return true;
+ }
+
+ /*
+ * Attempt to read an XLOG record.
+ *
+ * If RecPtr is not NULL, try to read a record at that position. Otherwise
+ * try to read a record just after the last one previously read.
+ *
+ * If no valid record is available, returns NULL, or fails if emode is PANIC.
+ * (emode must be either PANIC, LOG)
+ *
+ * The record is copied into readRecordBuf, so that on successful return,
+ * the returned record pointer always points there.
+ */
+ XLogRecord *
+ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, int emode)
+ {
+ XLogRecord *record;
+ XLogRecPtr tmpRecPtr = state->EndRecPtr;
+ bool randAccess = false;
+ uint32 len,
+ total_len;
+ uint32 targetRecOff;
+ uint32 pageHeaderSize;
+ bool gotheader;
+
+ if (RecPtr == InvalidXLogRecPtr)
+ {
+ RecPtr = tmpRecPtr;
+
+ /*
+ * RecPtr is pointing to end+1 of the previous WAL record. If we're
+ * at a page boundary, no more records can fit on the current page. We
+ * must skip over the page header, but we can't do that until we've
+ * read in the page, since the header size is variable.
+ */
+ }
+ else
+ {
+ /*
+ * In this case, the passed-in record pointer should already be
+ * pointing to a valid record starting position.
+ */
+ if (!XRecOffIsValid(RecPtr))
+ ereport(PANIC,
+ (errmsg("invalid record offset at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ randAccess = true; /* allow curFileTLI to go backwards too */
+ }
+
+ /* Read the page containing the record */
+ if (!state->read_page(state, RecPtr, emode, randAccess, state->readBuf,
+ state->private_data))
+ return NULL;
+
+ pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
+ targetRecOff = RecPtr % XLOG_BLCKSZ;
+ if (targetRecOff == 0)
+ {
+ /*
+ * At page start, so skip over page header. The Assert checks that
+ * we're not scribbling on caller's record pointer; it's OK because we
+ * can only get here in the continuing-from-prev-record case, since
+ * XRecOffIsValid rejected the zero-page-offset case otherwise. XXX:
+ * does this assert make sense now that RecPtr is not a pointer?
+ */
+ Assert(RecPtr == tmpRecPtr);
+ RecPtr += pageHeaderSize;
+ targetRecOff = pageHeaderSize;
+ }
+ else if (targetRecOff < pageHeaderSize)
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("invalid record offset at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return NULL;
+ }
+ if ((((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
+ targetRecOff == pageHeaderSize)
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("contrecord is requested by %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return NULL;
+ }
+
+ /*
+ * Read the record length.
+ *
+ * NB: Even though we use an XLogRecord pointer here, the whole record
+ * header might not fit on this page. xl_tot_len is the first field of the
+ * struct, so it must be on this page (the records are MAXALIGNed), but we
+ * cannot access any other fields until we've verified that we got the
+ * whole header.
+ */
+ record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+ total_len = record->xl_tot_len;
+
+ /*
+ * If the whole record header is on this page, validate it immediately.
+ * Otherwise do just a basic sanity check on xl_tot_len, and validate the
+ * rest of the header after reading it from the next page. The xl_tot_len
+ * check is necessary here to ensure that we enter the "Need to reassemble
+ * record" code path below; otherwise we might fail to apply
+ * ValidXLogRecordHeader at all.
+ */
+ if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
+ {
+ if (!ValidXLogRecordHeader(RecPtr, state->ReadRecPtr, record, emode,
+ randAccess))
+ return NULL;
+ gotheader = true;
+ }
+ else
+ {
+ if (total_len < SizeOfXLogRecord)
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("invalid record length at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return NULL;
+ }
+ gotheader = false;
+ }
+
+ /*
+ * Enlarge readRecordBuf as needed.
+ */
+ if (total_len > state->readRecordBufSize &&
+ !allocate_recordbuf(state, total_len))
+ {
+ /* We treat this as a "bogus data" condition */
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("record length %u at %X/%X too long",
+ total_len, (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return NULL;
+ }
+
+ len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
+ if (total_len > len)
+ {
+ /* Need to reassemble record */
+ char *contrecord;
+ XLogPageHeader pageHeader;
+ XLogRecPtr pagelsn;
+ char *buffer;
+ uint32 gotlen;
+
+ /* Initialize pagelsn to the beginning of the page this record is on */
+ pagelsn = (RecPtr / XLOG_BLCKSZ) * XLOG_BLCKSZ;
+
+ /* Copy the first fragment of the record from the first page. */
+ memcpy(state->readRecordBuf,
+ state->readBuf + RecPtr % XLOG_BLCKSZ, len);
+ buffer = state->readRecordBuf + len;
+ gotlen = len;
+
+ do
+ {
+ /* Calculate pointer to beginning of next page */
+ XLByteAdvance(pagelsn, XLOG_BLCKSZ);
+ /* Wait for the next page to become available */
+ if (!state->read_page(state, pagelsn, emode, false, state->readBuf,
+ state->private_data))
+ return NULL;
+
+ /* Check that the continuation on next page looks valid */
+ pageHeader = (XLogPageHeader) state->readBuf;
+ if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("there is no contrecord flag at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return NULL;
+ }
+
+ /*
+ * Cross-check that xlp_rem_len agrees with how much of the record
+ * we expect there to be left.
+ */
+ if (pageHeader->xlp_rem_len == 0 ||
+ total_len != (pageHeader->xlp_rem_len + gotlen))
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("invalid contrecord length %u at %X/%X",
+ pageHeader->xlp_rem_len,
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return NULL;
+ }
+
+ /* Append the continuation from this page to the buffer */
+ pageHeaderSize = XLogPageHeaderSize(pageHeader);
+ contrecord = (char *) state->readBuf + pageHeaderSize;
+ len = XLOG_BLCKSZ - pageHeaderSize;
+ if (pageHeader->xlp_rem_len < len)
+ len = pageHeader->xlp_rem_len;
+ memcpy(buffer, (char *) contrecord, len);
+ buffer += len;
+ gotlen += len;
+
+ /* If we just reassembled the record header, validate it. */
+ if (!gotheader)
+ {
+ record = (XLogRecord *) state->readRecordBuf;
+ if (!ValidXLogRecordHeader(RecPtr, state->ReadRecPtr, record,
+ emode, randAccess))
+ return NULL;
+ gotheader = true;
+ }
+ } while (pageHeader->xlp_rem_len > len);
+
+ record = (XLogRecord *) state->readRecordBuf;
+ if (!RecordIsValid(record, RecPtr, emode))
+ return NULL;
+ pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
+ state->ReadRecPtr = RecPtr;
+ state->EndRecPtr = pagelsn + pageHeaderSize + MAXALIGN(pageHeader->xlp_rem_len);
+ }
+ else
+ {
+ /* Record does not cross a page boundary */
+ if (!RecordIsValid(record, RecPtr, emode))
+ return NULL;
+ state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+
+ state->ReadRecPtr = RecPtr;
+ memcpy(state->readRecordBuf, record, total_len);
+ }
+
+ /*
+ * Special processing if it's an XLOG SWITCH record
+ */
+ if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
+ {
+ /* Pretend it extends to end of segment */
+ state->EndRecPtr += XLogSegSize - 1;
+ state->EndRecPtr -= state->EndRecPtr % XLogSegSize;
+ }
+
+ return record;
+ }
+
+ /*
+ * Validate an XLOG record header.
+ *
+ * This is just a convenience subroutine to avoid duplicated code in
+ * XLogReadRecord. It's not intended for use from anywhere else.
+ */
+ static bool
+ ValidXLogRecordHeader(XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr,
+ XLogRecord *record, int emode, bool randAccess)
+ {
+ /*
+ * xl_len == 0 is bad data for everything except XLOG SWITCH, where it is
+ * required.
+ */
+ if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
+ {
+ if (record->xl_len != 0)
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("invalid xlog switch record at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ }
+ else if (record->xl_len == 0)
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("record with zero length at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
+ record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
+ XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("invalid record length at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ if (record->xl_rmid > RM_MAX_ID)
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("invalid resource manager ID %u at %X/%X",
+ record->xl_rmid, (uint32) (RecPtr >> 32),
+ (uint32) RecPtr)));
+ return false;
+ }
+ if (randAccess)
+ {
+ /*
+ * We can't exactly verify the prev-link, but surely it should be less
+ * than the record's own address.
+ */
+ if (!XLByteLT(record->xl_prev, RecPtr))
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("record with incorrect prev-link %X/%X at %X/%X",
+ (uint32) (record->xl_prev >> 32),
+ (uint32) record->xl_prev,
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ }
+ else
+ {
+ /*
+ * Record's prev-link should exactly match our previous location. This
+ * check guards against torn WAL pages where a stale but valid-looking
+ * WAL record starts on a sector boundary.
+ */
+ if (!XLByteEQ(record->xl_prev, PrevRecPtr))
+ {
+ ereport(emode_for_corrupt_record(emode, 0, RecPtr),
+ (errmsg("record with incorrect prev-link %X/%X at %X/%X",
+ (uint32) (record->xl_prev >> 32),
+ (uint32) record->xl_prev,
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ }
+
+ return true;
+ }
+
+
+ /*
+ * CRC-check an XLOG record. We do not believe the contents of an XLOG
+ * record (other than to the minimal extent of computing the amount of
+ * data to read in) until we've checked the CRCs.
+ *
+ * We assume all of the record (that is, xl_tot_len bytes) has been read
+ * into memory at *record. Also, ValidXLogRecordHeader() has accepted the
+ * record's header, which means in particular that xl_tot_len is at least
+ * SizeOfXLogRecord, so it is safe to fetch xl_len.
+ */
+ static bool
+ RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
+ {
+ pg_crc32 crc;
+ int i;
+ uint32 len = record->xl_len;
+ BkpBlock bkpb;
+ char *blk;
+ size_t remaining = record->xl_tot_len;
+
+ /* First the rmgr data */
+ if (remaining < SizeOfXLogRecord + len)
+ {
+ /* ValidXLogRecordHeader() should've caught this already... */
+ ereport(emode_for_corrupt_record(emode, 0, recptr),
+ (errmsg("invalid record length at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+ remaining -= SizeOfXLogRecord + len;
+ INIT_CRC32(crc);
+ COMP_CRC32(crc, XLogRecGetData(record), len);
+
+ /* Add in the backup blocks, if any */
+ blk = (char *) XLogRecGetData(record) + len;
+ for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
+ {
+ uint32 blen;
+
+ if (!(record->xl_info & XLR_BKP_BLOCK(i)))
+ continue;
+
+ if (remaining < sizeof(BkpBlock))
+ {
+ ereport(emode_for_corrupt_record(emode, 0, recptr),
+ (errmsg("invalid backup block size in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+ memcpy(&bkpb, blk, sizeof(BkpBlock));
+
+ if (bkpb.hole_offset + bkpb.hole_length > BLCKSZ)
+ {
+ ereport(emode_for_corrupt_record(emode, 0, recptr),
+ (errmsg("incorrect hole size in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+ blen = sizeof(BkpBlock) + BLCKSZ - bkpb.hole_length;
+
+ if (remaining < blen)
+ {
+ ereport(emode_for_corrupt_record(emode, 0, recptr),
+ (errmsg("invalid backup block size in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+ remaining -= blen;
+ COMP_CRC32(crc, blk, blen);
+ blk += blen;
+ }
+
+ /* Check that xl_tot_len agrees with our calculation */
+ if (remaining != 0)
+ {
+ ereport(emode_for_corrupt_record(emode, 0, recptr),
+ (errmsg("incorrect total length in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+
+ /* Finally include the record header */
+ COMP_CRC32(crc, (char *) record, offsetof(XLogRecord, xl_crc));
+ FIN_CRC32(crc);
+
+ if (!EQ_CRC32(record->xl_crc, crc))
+ {
+ ereport(emode_for_corrupt_record(emode, 0, recptr),
+ (errmsg("incorrect resource manager data checksum in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+
+ return true;
+ }
*** a/src/include/access/xlog_internal.h
--- b/src/include/access/xlog_internal.h
***************
*** 231,236 **** extern XLogRecPtr RequestXLogSwitch(void);
--- 231,244 ----
extern void GetOldestRestartPoint(XLogRecPtr *oldrecptr, TimeLineID *oldtli);
+
+ /*
+ * Exported so that xlogreader.c can call this. TODO: Should be refactored
+ * into a callback, or just have xlogreader return the error string and have
+ * the caller of XLogReadRecord() do the ereport() call.
+ */
+ extern int emode_for_corrupt_record(int emode, int readSource, XLogRecPtr RecPtr);
+
/*
* Exported for the functions in timeline.c and xlogarchive.c. Only valid
* in the startup process.
*** /dev/null
--- b/src/include/access/xlogreader.h
***************
*** 0 ****
--- 1,97 ----
+ /*-------------------------------------------------------------------------
+ *
+ * xlogreader.h
+ *
+ * Generic xlog reading facility.
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/xlogreader.h
+ *
+ * NOTES
+ * Check the definition of the XLogReaderState struct for instructions on
+ * how to use the XLogReader infrastructure.
+ *
+ * The basic idea is to allocate an XLogReaderState via
+ * XLogReaderAllocate, and call XLogReadRecord() until it returns NULL.
+ *-------------------------------------------------------------------------
+ */
+ #ifndef XLOGREADER_H
+ #define XLOGREADER_H
+
+ #include "access/xlog_internal.h"
+
+ struct XLogReaderState;
+
+ /*
+ * The callbacks are explained in more detail inside the XLogReaderState
+ * struct.
+ */
+ typedef bool (*XLogPageReadCB) (struct XLogReaderState *state,
+ XLogRecPtr RecPtr, int emode,
+ bool randAccess,
+ char *readBuf,
+ void *private_data);
+
+ typedef struct XLogReaderState
+ {
+ /* ----------------------------------------
+ * Public parameters
+ * ----------------------------------------
+ */
+
+ /*
+ * Data input callback (mandatory).
+ *
+ * This callback shall read XLOG_BLCKSZ bytes, from the location 'RecPtr',
+ * into memory pointed at by 'readBuf' parameter. The callback shall
+ * return true on success, false if the page could not be read.
+ */
+ XLogPageReadCB read_page;
+
+ /*
+ * Opaque data for callbacks to use. Not used by XLogReader.
+ */
+ void *private_data;
+
+ /*
+ * From where to where are we reading
+ */
+ XLogRecPtr ReadRecPtr; /* start of last record read */
+ XLogRecPtr EndRecPtr; /* end+1 of last record read */
+
+ /* ----------------------------------------
+ * private/internal state
+ * ----------------------------------------
+ */
+
+ /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
+ char *readBuf;
+
+ /* Buffer for current ReadRecord result (expandable) */
+ char *readRecordBuf;
+ uint32 readRecordBufSize;
+ } XLogReaderState;
+
+ /*
+ * Get a new XLogReader
+ *
+ * At least the read_page callback, startptr and endptr have to be set before
+ * the reader can be used.
+ */
+ extern XLogReaderState *XLogReaderAllocate(XLogRecPtr startpoint,
+ XLogPageReadCB pagereadfunc, void *private_data);
+
+ /*
+ * Free an XLogReader
+ */
+ extern void XLogReaderFree(XLogReaderState *state);
+
+ /*
+ * Read the next record from xlog. Returns NULL on end-of-WAL or on failure.
+ */
+ extern XLogRecord *XLogReadRecord(XLogReaderState *state, XLogRecPtr ptr,
+ int emode);
+
+ #endif /* XLOGREADER_H */
On 15 November 2012 12:07, Simon Riggs <simon@2ndquadrant.com> wrote:
On 14 November 2012 22:17, Andres Freund <andres@2ndquadrant.com> wrote:
To avoid complicating the logic we store both the toplevel xids and the subxids in
->xip: first the ->xcnt toplevel ones, then the ->subxcnt subxids.

That looks good, not much change. Will apply in next few days. Please
add me as committer and mark ready.
I tried improving this, but couldn't. So I've committed it as is.
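For illustration, a toy version of that layout: a single xip array holding xcnt toplevel xids followed by subxcnt subxids, with a membership test that scans both halves. Every name here is made up for the sketch, not the committed code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Miniature of the described snapshot layout: the first xcnt entries of
 * xip are toplevel xids, the following subxcnt entries are subxids. */
typedef struct ToySnapshot
{
    TransactionId *xip;
    uint32_t       xcnt;     /* number of toplevel xids at the front */
    int32_t        subxcnt;  /* number of subxids stored right after them */
} ToySnapshot;

/* Is xid listed in the snapshot, either as a toplevel xid or a subxid? */
static bool
toy_xid_in_snapshot(const ToySnapshot *snap, TransactionId xid)
{
    uint32_t total = snap->xcnt + (uint32_t) snap->subxcnt;

    for (uint32_t i = 0; i < total; i++)
        if (snap->xip[i] == xid)
            return true;
    return false;
}
```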
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 12-11-14 08:17 PM, Andres Freund wrote:
I am getting errors like the following when I try to use either your
test_decoding plugin or my own (which does even less than yours)
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
WARNING: connecting to
WARNING: Initiating logical rep
LOG: computed new xmin: 773
LOG: start reading from 0/17F5D58, scrolled back to 0/17F4000
LOG: got new xmin 773 at 25124280
LOG: found initial snapshot (via running xacts). Done: 1
WARNING: reached consistent point, stopping!
WARNING: Starting logical replication
LOG: start reading from 0/17F5D58, scrolled back to 0/17F4000
LOG: found initial snapshot (via running xacts). Done: 1
FATAL: cannot read pg_class without having selected a database
TRAP: FailedAssertion("!(SHMQueueEmpty(&(MyProc->myProcLocks[i])))",
File: "proc.c", Line: 759)
This seems to be happening under the calls at
reorderbuffer.c:832 if (!SnapBuildHasCatalogChanges(NULL, xid,
&change->relnode))
The sequence of events I do is:
1. start pg_receivellog
2. run a checkpoint
3. Attach to the walsender process with gdb
4. Start a new client connection with psql and do 'INSERT INTO a values
(1)' twice.
(skipping step 3 doesn't make a difference)
This introduces several things:
* 'reorderbuffer' module which reassembles transactions from a stream of interspersed changes
* 'snapbuilder' which builds catalog snapshots so that tuples from wal can be understood
* logging more data into wal to facilitate logical decoding
* wal decoding into a reorderbuffer
* shared library output plugins with 5 callbacks
* init
* begin
* change
* commit
* walsender infrastructure to stream out changes and to keep the global xmin low enough
* INIT_LOGICAL_REPLICATION $plugin; waits till a consistent snapshot is built and returns
* initial LSN
* replication slot identifier
* id of a pg_export() style snapshot
* START_LOGICAL_REPLICATION $id $lsn; streams out changes
* uses named output plugins for output specification

Todo:
* testing infrastructure (isolationtester)
* persistence/spilling to disk of built snapshots, long-running
transactions
* user docs
* more frequent lowering of xmins
* more docs about the internals
* support for user declared catalog tables
* actual exporting of initial pg_export snapshots after
INIT_LOGICAL_REPLICATION
* own shared memory segment instead of piggybacking on walsender's
* nicer interface between snapbuild.c, reorderbuffer.c, decode.c and the
outside.
* more frequent xl_running_xid's so xmin can be upped more frequently
* add STOP_LOGICAL_REPLICATION $id
---
src/backend/access/heap/heapam.c | 280 +++++-
src/backend/access/transam/xlog.c | 1 +
src/backend/catalog/index.c | 74 ++
src/backend/replication/Makefile | 2 +
src/backend/replication/logical/Makefile | 19 +
src/backend/replication/logical/decode.c | 496 ++++++++++
src/backend/replication/logical/logicalfuncs.c | 247 +++++
src/backend/replication/logical/reorderbuffer.c | 1156 +++++++++++++++++++++++
src/backend/replication/logical/snapbuild.c | 1144 ++++++++++++++++++++++
src/backend/replication/repl_gram.y | 32 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 566 ++++++++++-
src/backend/storage/ipc/procarray.c | 23 +
src/backend/storage/ipc/standby.c | 8 +-
src/backend/utils/cache/inval.c | 2 +-
src/backend/utils/cache/relcache.c | 3 +-
src/backend/utils/misc/guc.c | 11 +
src/backend/utils/time/tqual.c | 249 +++++
src/bin/pg_controldata/pg_controldata.c | 2 +
src/include/access/heapam_xlog.h | 23 +
src/include/access/transam.h | 5 +
src/include/access/xlog.h | 3 +-
src/include/catalog/index.h | 4 +
src/include/nodes/nodes.h | 2 +
src/include/nodes/replnodes.h | 22 +
src/include/replication/decode.h | 21 +
src/include/replication/logicalfuncs.h | 44 +
src/include/replication/output_plugin.h | 76 ++
src/include/replication/reorderbuffer.h | 284 ++++++
src/include/replication/snapbuild.h | 128 +++
src/include/replication/walsender.h | 1 +
src/include/replication/walsender_private.h | 34 +-
src/include/storage/itemptr.h | 3 +
src/include/storage/sinval.h | 2 +
src/include/utils/tqual.h | 31 +-
35 files changed, 4966 insertions(+), 34 deletions(-)
create mode 100644 src/backend/replication/logical/Makefile
create mode 100644 src/backend/replication/logical/decode.c
create mode 100644 src/backend/replication/logical/logicalfuncs.c
create mode 100644 src/backend/replication/logical/reorderbuffer.c
create mode 100644 src/backend/replication/logical/snapbuild.c
create mode 100644 src/include/replication/decode.h
create mode 100644 src/include/replication/logicalfuncs.h
create mode 100644 src/include/replication/output_plugin.h
create mode 100644 src/include/replication/reorderbuffer.h
create mode 100644 src/include/replication/snapbuild.h
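As a rough sketch of the output-plugin shape described above (the text says five callbacks but names four; every type and name below is an invented stand-in, not the real interface):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callback table mirroring the begin/change/commit flow;
 * init prepares the opaque context. */
typedef struct ToyOutputPlugin
{
    void (*init)  (void *ctx);
    void (*begin) (void *ctx, uint32_t xid);
    void (*change)(void *ctx, const char *desc);
    void (*commit)(void *ctx, uint32_t xid);
} ToyOutputPlugin;

static void append(void *ctx, const char *s) { strcat((char *) ctx, s); }

/* A trivial text-emitting plugin, writing into a caller-supplied buffer. */
static void t_init(void *ctx)                  { ((char *) ctx)[0] = '\0'; }
static void t_begin(void *ctx, uint32_t xid)   { (void) xid; append(ctx, "BEGIN;"); }
static void t_change(void *ctx, const char *d) { append(ctx, d); append(ctx, ";"); }
static void t_commit(void *ctx, uint32_t xid)  { (void) xid; append(ctx, "COMMIT;"); }

/* Replay a reassembled transaction through the plugin, the way the
 * reorderbuffer would do at commit time. */
static void
replay_txn(const ToyOutputPlugin *p, void *ctx, uint32_t xid,
           const char **changes, int nchanges)
{
    p->begin(ctx, xid);
    for (int i = 0; i < nchanges; i++)
        p->change(ctx, changes[i]);
    p->commit(ctx, xid);
}
```

The point of the callback split is that aborted transactions never reach the plugin at all: only committed, fully reassembled transactions are replayed.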
Hi Steve,
On 2012-12-02 21:52:08 -0500, Steve Singer wrote:
On 12-11-14 08:17 PM, Andres Freund wrote:
I am getting errors like the following when I try to use either your
test_decoding plugin or my own (which does even less than yours)

LOG: database system is ready to accept connections
LOG: autovacuum launcher started
WARNING: connecting to
WARNING: Initiating logical rep
LOG: computed new xmin: 773
LOG: start reading from 0/17F5D58, scrolled back to 0/17F4000
LOG: got new xmin 773 at 25124280
LOG: found initial snapshot (via running xacts). Done: 1
WARNING: reached consistent point, stopping!
WARNING: Starting logical replication
LOG: start reading from 0/17F5D58, scrolled back to 0/17F4000
LOG: found initial snapshot (via running xacts). Done: 1
FATAL: cannot read pg_class without having selected a database
TRAP: FailedAssertion("!(SHMQueueEmpty(&(MyProc->myProcLocks[i])))", File:
"proc.c", Line: 759)This seems to be happening under the calls at
reorderbuffer.c:832 if (!SnapBuildHasCatalogChanges(NULL, xid,
&change->relnode))
Two things:
1) Which exact options are you using for pg_receivellog? Not "-d
replication" by any chance?
2) Could you check that you really have a fully clean build? That has
hit me in the past as well.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-03 13:22:12 +0100, Andres Freund wrote:
Hi Steve,
On 2012-12-02 21:52:08 -0500, Steve Singer wrote:
On 12-11-14 08:17 PM, Andres Freund wrote:
I am getting errors like the following when I try to use either your
test_decoding plugin or my own (which does even less than yours)

LOG: database system is ready to accept connections
LOG: autovacuum launcher started
WARNING: connecting to
WARNING: Initiating logical rep
LOG: computed new xmin: 773
LOG: start reading from 0/17F5D58, scrolled back to 0/17F4000
LOG: got new xmin 773 at 25124280
LOG: found initial snapshot (via running xacts). Done: 1
WARNING: reached consistent point, stopping!
WARNING: Starting logical replication
LOG: start reading from 0/17F5D58, scrolled back to 0/17F4000
LOG: found initial snapshot (via running xacts). Done: 1
FATAL: cannot read pg_class without having selected a database
TRAP: FailedAssertion("!(SHMQueueEmpty(&(MyProc->myProcLocks[i])))", File:
"proc.c", Line: 759)This seems to be happening under the calls at
reorderbuffer.c:832 if (!SnapBuildHasCatalogChanges(NULL, xid,
&change->relnode))

Two things:
1) Which exact options are you using for pg_receivellog? Not "-d
replication" by any chance?
This seems to produce exactly that kind of error message. I added some
error checking around that. Pushed.
Thanks!
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 12-12-03 07:42 AM, Andres Freund wrote:
Two things:
1) Which exact options are you using for pg_receivellog? Not "-d
replication" by any chance?
Yes, that is exactly what I'm doing. Using a real database name instead
makes this go away.
Thanks
This seems to produce exactly that kind of error message. I added some
error checking around that. Pushed.

Thanks!
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-03 09:35:55 -0500, Steve Singer wrote:
On 12-12-03 07:42 AM, Andres Freund wrote:
Two things:
1) Which exact options are you using for pg_receivellog? Not "-d
replication" by any chance?

Yes, that is exactly what I'm doing. Using a real database name instead
makes this go away.
Was using "replication" an accident or do you think it makes sense in
some way?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 12-12-03 09:48 AM, Andres Freund wrote:
On 2012-12-03 09:35:55 -0500, Steve Singer wrote:
On 12-12-03 07:42 AM, Andres Freund wrote:
Yes, that is exactly what I'm doing. Using a real database name
instead makes this go away.

Was using "replication" an accident or do you think it makes sense in
some way?
The 'replication' line in pg_hba.conf made me think that the database
name had to be replication for walsender connections.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
I tried to address most (all?) of your comments in the version from
http://archives.postgresql.org/message-id/20121204175212.GB12055%40awork2.anarazel.de
.
On 2012-11-15 11:31:55 -0500, Peter Eisentraut wrote:
+xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name objfiles.txt|xargs cat|tr -s " " "\012"|grep -v /main.o|sed 's/^/..\/..\/..\//')
+	$(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)

This looks pretty evil, and there is no documentation about what it is
supposed to do. Windows build support needs some thought.
Ok, since Alvaro made it possible the build now only has rules like:
xlogreader.c: % : $(top_srcdir)/src/backend/access/transam/%
rm -f $@ && $(LN_S) $< .
clogdesc.c: % : $(top_srcdir)/src/backend/access/rmgrdesc/%
rm -f $@ && $(LN_S) $< .
and
OBJS = \
clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o assert.o \
$(WIN32RES) \
pg_xlogdump.o pqexpbuf_strinfo.o compat.o tables.o xlogreader.o \
pg_xlogdump: $(OBJS) | submake-libpq submake-libpgport
$(CC) $(CFLAGS) $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) $(libpq_pgport) -o $@$(X)
That's easier to integrate into the Windows build?
+static void
+usage(void)
+{
+	printf(_("%s reads/writes postgres transaction logs for debugging.\n\n"), progname);
+	printf(_("Usage:\n"));
+	printf(_("  %s [OPTION]...\n"), progname);
+	printf(_("\nOptions:\n"));
+	printf(_("  -v, --version   output version information, then exit\n"));
+	printf(_("  -h, --help      show this help, then exit\n"));
+	printf(_("  -s, --start     from where recptr onwards to read\n"));
+	printf(_("  -e, --end       up to which recptr to read\n"));
+	printf(_("  -t, --timeline  which timeline do we want to read\n"));
+	printf(_("  -i, --inpath    from where do we want to read? cwd/pg_xlog is the default\n"));
+	printf(_("  -o, --output    where to write [start, end]\n"));
+	printf(_("  -f, --file      wal file to parse\n"));
+}

Options list should be in alphabetic order (or some other less random
order). Most of these descriptions are not very intelligible (at least
without additional documentation).
I tried to improve the help, its now:
pg_xlogdump: reads/writes postgres transaction logs for debugging.
Usage:
pg_xlogdump [OPTION]...
Options:
-b, --bkp-details output detailed information about backup blocks
-e, --end RECPTR read wal up to RECPTR
-f, --file FILE wal file to parse, cannot be specified together with -p
-h, --help show this help, then exit
-p, --path PATH from where do we want to read? cwd/pg_xlog is the default
-s, --start RECPTR read wal in directory indicated by -p starting at RECPTR
-t, --timeline TLI which timeline do we want to read, defaults to 1
-v, --version output version information, then exit
I wonder whether it would make sense to split the help into different
sections? It seems likely we will gain some more options...
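The -s and -e options take an LSN in the familiar %X/%X notation (e.g. 0/17F5D58). A minimal sketch of how such an argument could be parsed into a 64-bit position; the real pg_xlogdump option handling may well differ:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Parse "hi/lo" hex notation into a single 64-bit WAL position.
 * Returns false if the argument doesn't match the %X/%X shape. */
static bool
parse_recptr(const char *arg, uint64_t *lsn)
{
    unsigned int hi, lo;

    if (sscanf(arg, "%X/%X", &hi, &lo) != 2)
        return false;
    *lsn = ((uint64_t) hi << 32) | lo;
    return true;
}
```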
no nls.mk
Do I need to do anything for that besides:
# src/bin/pg_xlogdump/nls.mk
CATALOG_NAME = pg_xlogdump
AVAIL_LANGUAGES =
GETTEXT_FILES = pg_xlogdump.c
?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-11-15 02:26:53 +0100, Andres Freund wrote:
On 2012-11-15 01:27:46 +0100, Andres Freund wrote:
In response to this you will soon find the 14 patches that currently
implement $subject.

As it's not very wieldy to send around that many/big patches all the
time, until the next "major" version I will just update the git tree at:

Git:
git clone git://git.postgresql.org/git/users/andresfreund/postgres.git xlog-decoding-rebasing-cf3
I pushed a new version which
- is rebased on top of master
- is based on top of the new xlogreader (biggest part)
- is based on top of the new binaryheap.h
- some fixes
- some more comments
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-11-15 16:22:56 +0200, Heikki Linnakangas wrote:
On 15.11.2012 03:17, Andres Freund wrote:
Features:
- streaming reading/writing
- filtering
- reassembly of records

Reusing the ReadRecord infrastructure in situations where the code that wants
to do so is not tightly integrated into xlog.c is rather hard and would require
changes to rather integral parts of the recovery code which doesn't seem to be
a good idea.

Missing:
- "compressing" the stream when removing uninteresting records
- writing out correct CRCs
- separating reader/writer

I'm disappointed to see that there has been no progress on this patch since
last commitfest. I thought we agreed on the approach I championed for here:
http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php. There
wasn't much work left to finish that, I believe.

Are you going to continue working on this?
Patch (git repo) is now based on top of my version of your xlogreader
version...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2012-11-19 09:50:30 +0100, Andres Freund wrote:
On 2012-11-19 16:28:55 +0900, Michael Paquier wrote:
After launching some SQLs, the logical receiver is stuck just after sending
INIT_LOGICAL_REPLICATION, please see bt of process waiting:

It's waiting till it sees an initial xl_running_xacts record. The
easiest way to do that is to manually issue a checkpoint. Sorry, should
have included that in the description.
Otherwise you can wait till the next routine checkpoint comes around...

I plan to cause more xl_running_xacts records to be logged in the
future. I think the timing of those currently is non-optimal, you have
the same problem as here in normal streaming replication as well :(
This is "fixed" now with the changes I pushed a second ago. Unless some
long-running transactions are around we now immediately jump to a ready
state.
This is achieved by
1. regularly logging xl_running_xacts (in background writer)
2. logging xl_running_xacts at the beginning of INIT_LOGICAL_REPLICATION
This also has the advantage that the xmin horizon can be increased much
more frequently.
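Point 1 above could be as simple as a time-based throttle in the background writer. A hypothetical sketch, with all names invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Log an xl_running_xacts record only when at least `interval` seconds
 * have passed since the previous one. */
typedef struct RunningXactsThrottle
{
    int64_t last_logged;   /* time of previous record, 0 = never logged */
    int64_t interval;      /* minimum seconds between records */
} RunningXactsThrottle;

static bool
should_log_running_xacts(RunningXactsThrottle *t, int64_t now)
{
    if (t->last_logged != 0 && now - t->last_logged < t->interval)
        return false;
    t->last_logged = now;       /* we decided to log: remember when */
    return true;
}
```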
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
(Offlist)
Just a quick note that I'm working on this patch now. I pushed some
trivial fixes to my git repository at
git://git.postgresql.org/git/users/heikki/postgres.git, xlogreader_v3
branch.
- Heikki
On 10.12.2012 22:22, Heikki Linnakangas wrote:
(Offlist)
Just a quick note that I'm working on this patch now. I pushed some
trivial fixes to my git repository at
git://git.postgresql.org/git/users/heikki/postgres.git, xlogreader_v3
branch.
Oops, wasn't offlist :-). Well, if anyone wants to take a look, feel free.
- Heikki
I've been molding this patch for a while now, here's what I have this
far (also available in my git repository).
The biggest change is in the error reporting. A stand-alone program that
wants to use xlogreader.c no longer has to provide a full-blown
replacement for ereport(). The only thing xlogreader.c used
ereport() for was when it encountered an invalid record. And even there we
had the emode_for_corrupt_record hack. I think it's a much better API
for XLogReadRecord to just return NULL on an invalid record, plus an error
string, and the caller can do what it wants with that. In xlog.c, we'll
pass the error string to ereport(), with the right emode as determined
by emode_for_corrupt_record. xlog.c is no longer concerned with
emode_for_corrupt_record, or error levels in general.
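The proposed contract can be sketched with stub types: the reader returns NULL for a bad record and hands back an error string, leaving the reporting decision (ereport() with the right emode in the backend, plain stderr in a stand-alone tool) entirely to the caller. Everything below is illustrative, not the actual patch:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct ToyRecord { int len; } ToyRecord;

/* Toy reader: on failure, return NULL and set *errormsg; the function
 * itself never reports anything. */
static ToyRecord *
toy_xlog_read_record(const char *input, const char **errormsg)
{
    static ToyRecord rec;

    *errormsg = NULL;
    if (input == NULL || input[0] == '\0')
    {
        *errormsg = "invalid record length";  /* caller decides how loud to be */
        return NULL;
    }
    rec.len = (int) strlen(input);
    return &rec;
}
```

A backend caller would feed the string into ereport() at the emode chosen by emode_for_corrupt_record(); a frontend caller can simply print it.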
We talked about this earlier, and Tom Lane argued that "it's basically
insane to imagine that you can carve out a non-trivial piece of the
backend that doesn't contain any elog calls."
(http://archives.postgresql.org/pgsql-hackers/2012-09/msg00651.php), but
having done just that, it doesn't seem insane to me. xlogreader.c really
is a pretty well contained piece of code. All the complicated stuff that
contains elog calls and pallocs and more is in the callback, which can
freely use all the normal backend infrastructure.
Now, here's some stuff that still needs to be done:
* A stand-alone program using xlogreader.c has to provide an
implementation of tliInHistory(). Need to find a better way to do that.
Perhaps "#ifndef FRONTEND" the tliInHistory checks in xlogreader.
* In xlog.c, some of the variables that used to be statics like
readFile, readOff etc. are now in the XLogPageReadPrivate struct. But
there's still plenty of statics left in there - it would certainly not
work correctly if xlog.c tried to open two xlog files at the same time.
I think it's just confusing to have some stuff in the
XLogPageReadPrivate struct, and others as static, so I think we should
get rid of XLogPageReadPrivate struct altogether and put back the static
variables. At least it would make the diff smaller, which might help
with reviewing. xlog.c probably doesn't need to provide a "private"
struct to xlogreader.c at all, which is okay.
* It's pretty ugly that to use the rm_desc functions, you have to
provide dummy implementations of a bunch of backend functions, including
pfree() and timestamptz_to_str(). Should find a better way to do that.
* It's not clear to me how we'd handle translating the strings in
xlogreader.c, when xlogreader.c is used in a stand-alone program like
pg_xlogdump. Maybe we can just punt on that...
* How about we move pg_xlogdump to contrib? It doesn't feel like the
kind of essential tool that deserves to be in src/bin.
- Heikki
Forgot attachment..
On 11.12.2012 15:55, Heikki Linnakangas wrote:
--
- Heikki
Attachments:
xlogreader-heikki-20121211.patch (text/x-diff)
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index df84054..49cb7ac 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -178,6 +178,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgReceivexlog SYSTEM "pg_receivexlog.sgml">
<!ENTITY pgResetxlog SYSTEM "pg_resetxlog.sgml">
<!ENTITY pgRestore SYSTEM "pg_restore.sgml">
+<!ENTITY pgXlogdump SYSTEM "pg_xlogdump.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY postmaster SYSTEM "postmaster.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_xlogdump.sgml b/doc/src/sgml/ref/pg_xlogdump.sgml
new file mode 100644
index 0000000..7a27c7b
--- /dev/null
+++ b/doc/src/sgml/ref/pg_xlogdump.sgml
@@ -0,0 +1,76 @@
+<!--
+doc/src/sgml/ref/pg_xlogdump.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="APP-PGXLOGDUMP">
+ <refmeta>
+ <refentrytitle><application>pg_xlogdump</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_xlogdump</refname>
+ <refpurpose>Display the write-ahead log of a <productname>PostgreSQL</productname> database cluster</refpurpose>
+ </refnamediv>
+
+ <indexterm zone="app-pgxlogdump">
+ <primary>pg_xlogdump</primary>
+ </indexterm>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_xlogdump</command>
+ <arg choice="opt"><option>-b</option></arg>
+ <arg choice="opt"><option>-e</option> <replaceable class="parameter">xlogrecptr</replaceable></arg>
+ <arg choice="opt"><option>-f</option> <replaceable class="parameter">filename</replaceable></arg>
+ <arg choice="opt"><option>-h</option></arg>
+ <arg choice="opt"><option>-p</option> <replaceable class="parameter">directory</replaceable></arg>
+ <arg choice="opt"><option>-s</option> <replaceable class="parameter">xlogrecptr</replaceable></arg>
+ <arg choice="opt"><option>-t</option> <replaceable class="parameter">timelineid</replaceable></arg>
+ <arg choice="opt"><option>-v</option></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1 id="R1-APP-PGXLOGDUMP-1">
+ <title>Description</title>
+ <para>
+ <command>pg_xlogdump</command> displays the write-ahead log (WAL) and is only
+ useful for debugging or educational purposes.
+ </para>
+
+ <para>
+ This utility can only be run by the user who installed the server, because
+ it requires read access to the data directory. It does not perform any
+ modifications.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ The following command-line options control the location and format of the
+ output.
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-p <replaceable class="parameter">directory</replaceable></option></term>
+ <listitem>
+ <para>
+ Directory to find xlog files in.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+ <para>
+ Can give wrong results when the server is running.
+ </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index 0872168..fed1fdd 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -225,6 +225,7 @@
&pgDumpall;
&pgReceivexlog;
&pgRestore;
+ &pgXlogdump;
&psqlRef;
&reindexdb;
&vacuumdb;
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 700cfd8..eb6cfc5 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -14,7 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
timeline.o twophase.o twophase_rmgr.o xlog.o xlogarchive.o xlogfuncs.o \
- xlogutils.o
+ xlogreader.o xlogutils.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2618c8d..4c36468 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -30,6 +30,7 @@
#include "access/twophase.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
+#include "access/xlogreader.h"
#include "access/xlogutils.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -531,29 +532,33 @@ static XLogSegNo openLogSegNo = 0;
static uint32 openLogOff = 0;
/*
- * These variables are used similarly to the ones above, but for reading
+ * Status data for XLogPageRead.
+ *
+ * The first three are used similarly to the ones above, but for reading
* the XLOG. Note, however, that readOff generally represents the offset
* of the page just read, not the seek position of the FD itself, which
* will be just past that page. readLen indicates how much of the current
* page has been read into readBuf, and readSource indicates where we got
* the currently open file from.
+ *
+ * currentSource keeps track of which source we're currently reading from. This
+ * is different from readSource in that this is always set, even when we don't
+ * currently have a WAL file open. If lastSourceFailed is set, our last attempt
+ * to read from currentSource failed, and we should try another source next.
*/
-static int readFile = -1;
-static XLogSegNo readSegNo = 0;
-static uint32 readOff = 0;
-static uint32 readLen = 0;
-static bool readFileHeaderValidated = false;
-static XLogSource readSource = 0; /* XLOG_FROM_* code */
-
-/*
- * Keeps track of which source we're currently reading from. This is
- * different from readSource in that this is always set, even when we don't
- * currently have a WAL file open. If lastSourceFailed is set, our last
- * attempt to read from currentSource failed, and we should try another source
- * next.
- */
-static XLogSource currentSource = 0; /* XLOG_FROM_* code */
-static bool lastSourceFailed = false;
+typedef struct XLogPageReadPrivate
+{
+ int emode;
+
+ int readFile;
+ XLogSegNo readSegNo;
+ uint32 readOff;
+ uint32 readLen;
+ bool fetching_ckpt; /* are we fetching a checkpoint record? */
+ XLogSource readSource; /* XLOG_FROM_* code */
+ XLogSource currentSource; /* XLOG_FROM_* code */
+ bool lastSourceFailed;
+} XLogPageReadPrivate;
/*
* These variables track when we last obtained some WAL data to process,
@@ -566,18 +571,9 @@ static bool lastSourceFailed = false;
static TimestampTz XLogReceiptTime = 0;
static XLogSource XLogReceiptSource = 0; /* XLOG_FROM_* code */
-/* Buffer for currently read page (XLOG_BLCKSZ bytes) */
-static char *readBuf = NULL;
-
-/* Buffer for current ReadRecord result (expandable) */
-static char *readRecordBuf = NULL;
-static uint32 readRecordBufSize = 0;
-
/* State information for XLOG reading */
static XLogRecPtr ReadRecPtr; /* start of last record read */
static XLogRecPtr EndRecPtr; /* end+1 of last record read */
-static TimeLineID lastPageTLI = 0;
-static TimeLineID lastSegmentTLI = 0;
static XLogRecPtr minRecoveryPoint; /* local copy of
* ControlFile->minRecoveryPoint */
@@ -598,7 +594,8 @@ static bool bgwriterLaunched = false;
static void readRecoveryCommandFile(void);
-static void exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo);
+static void exitArchiveRecovery(XLogPageReadPrivate *private, TimeLineID endTLI,
+ XLogSegNo endLogSegNo);
static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
static void recoveryPausesHere(void);
static void SetLatestXTime(TimestampTz xtime);
@@ -617,14 +614,15 @@ static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
bool find_free, int *max_advance,
bool use_lock);
-static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
- int source, bool notexistOk);
-static int XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
-static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
- bool randAccess);
-static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
- bool fetching_ckpt);
-static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
+static int XLogFileRead(XLogPageReadPrivate *private, XLogSegNo segno,
+ int emode, TimeLineID tli, int source, bool notexistOk);
+static int XLogFileReadAnyTLI(XLogPageReadPrivate *private, XLogSegNo segno,
+ int emode, int source);
+static int XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
+ int reqLen, char *readBuf, TimeLineID *readTLI);
+static bool WaitForWALToBecomeAvailable(XLogPageReadPrivate *private,
+ XLogRecPtr RecPtr);
+static int emode_for_corrupt_record(XLogReaderState *state, XLogRecPtr RecPtr);
static void XLogFileClose(void);
static void PreallocXlogFiles(XLogRecPtr endptr);
static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr);
@@ -632,12 +630,11 @@ static void UpdateLastRemovedPtr(char *filename);
static void ValidateXLOGDirectoryStructure(void);
static void CleanupBackupHistory(void);
static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
-static XLogRecord *ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt);
+static XLogRecord *ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
+ int emode, bool fetching_ckpt);
static void CheckRecoveryConsistency(void);
-static bool ValidXLogPageHeader(XLogPageHeader hdr, int emode, bool segmentonly);
-static bool ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record,
- int emode, bool randAccess);
-static XLogRecord *ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt);
+static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader,
+ XLogRecPtr RecPtr, int whichChkpt);
static bool rescanLatestTimeLine(void);
static void WriteControlFile(void);
static void ReadControlFile(void);
@@ -2577,8 +2574,8 @@ XLogFileOpen(XLogSegNo segno)
* Otherwise, it's assumed to be already available in pg_xlog.
*/
static int
-XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
- int source, bool notfoundOk)
+XLogFileRead(XLogPageReadPrivate *private, XLogSegNo segno, int emode,
+ TimeLineID tli, int source, bool notfoundOk)
{
char xlogfname[MAXFNAMELEN];
char activitymsg[MAXFNAMELEN + 16];
@@ -2695,15 +2692,12 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
set_ps_display(activitymsg, false);
/* Track source of data in assorted state variables */
- readSource = source;
+ private->readSource = source;
XLogReceiptSource = source;
/* In FROM_STREAM case, caller tracks receipt time, not me */
if (source != XLOG_FROM_STREAM)
XLogReceiptTime = GetCurrentTimestamp();
- /* The file header needs to be validated on first access */
- readFileHeaderValidated = false;
-
return fd;
}
if (errno != ENOENT || !notfoundOk) /* unexpected failure? */
@@ -2719,7 +2713,8 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
* This version searches for the segment with any TLI listed in expectedTLEs.
*/
static int
-XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
+XLogFileReadAnyTLI(XLogPageReadPrivate *private, XLogSegNo segno,
+ int emode, int source)
{
char path[MAXPGPATH];
ListCell *cell;
@@ -2744,7 +2739,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
if (source == XLOG_FROM_ANY || source == XLOG_FROM_ARCHIVE)
{
- fd = XLogFileRead(segno, emode, tli, XLOG_FROM_ARCHIVE, true);
+ fd = XLogFileRead(private, segno, emode, tli,
+ XLOG_FROM_ARCHIVE, true);
if (fd != -1)
{
elog(DEBUG1, "got WAL segment from archive");
@@ -2754,7 +2750,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
if (source == XLOG_FROM_ANY || source == XLOG_FROM_PG_XLOG)
{
- fd = XLogFileRead(segno, emode, tli, XLOG_FROM_PG_XLOG, true);
+ fd = XLogFileRead(private, segno, emode, tli,
+ XLOG_FROM_PG_XLOG, true);
if (fd != -1)
return fd;
}
@@ -3187,102 +3184,6 @@ RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index,
}
/*
- * CRC-check an XLOG record. We do not believe the contents of an XLOG
- * record (other than to the minimal extent of computing the amount of
- * data to read in) until we've checked the CRCs.
- *
- * We assume all of the record (that is, xl_tot_len bytes) has been read
- * into memory at *record. Also, ValidXLogRecordHeader() has accepted the
- * record's header, which means in particular that xl_tot_len is at least
- * SizeOfXlogRecord, so it is safe to fetch xl_len.
- */
-static bool
-RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
-{
- pg_crc32 crc;
- int i;
- uint32 len = record->xl_len;
- BkpBlock bkpb;
- char *blk;
- size_t remaining = record->xl_tot_len;
-
- /* First the rmgr data */
- if (remaining < SizeOfXLogRecord + len)
- {
- /* ValidXLogRecordHeader() should've caught this already... */
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("invalid record length at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- remaining -= SizeOfXLogRecord + len;
- INIT_CRC32(crc);
- COMP_CRC32(crc, XLogRecGetData(record), len);
-
- /* Add in the backup blocks, if any */
- blk = (char *) XLogRecGetData(record) + len;
- for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
- {
- uint32 blen;
-
- if (!(record->xl_info & XLR_BKP_BLOCK(i)))
- continue;
-
- if (remaining < sizeof(BkpBlock))
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("invalid backup block size in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- memcpy(&bkpb, blk, sizeof(BkpBlock));
-
- if (bkpb.hole_offset + bkpb.hole_length > BLCKSZ)
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("incorrect hole size in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- blen = sizeof(BkpBlock) + BLCKSZ - bkpb.hole_length;
-
- if (remaining < blen)
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("invalid backup block size in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- remaining -= blen;
- COMP_CRC32(crc, blk, blen);
- blk += blen;
- }
-
- /* Check that xl_tot_len agrees with our calculation */
- if (remaining != 0)
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("incorrect total length in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
-
- /* Finally include the record header */
- COMP_CRC32(crc, (char *) record, offsetof(XLogRecord, xl_crc));
- FIN_CRC32(crc);
-
- if (!EQ_CRC32(record->xl_crc, crc))
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("incorrect resource manager data checksum in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
-
- return true;
-}
-
-/*
* Attempt to read an XLOG record.
*
* If RecPtr is not NULL, try to read a record at that position. Otherwise
@@ -3295,511 +3196,41 @@ RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
* the returned record pointer always points there.
*/
static XLogRecord *
-ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
+ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
+ bool fetching_ckpt)
{
XLogRecord *record;
- XLogRecPtr tmpRecPtr = EndRecPtr;
- bool randAccess = false;
- uint32 len,
- total_len;
- uint32 targetRecOff;
- uint32 pageHeaderSize;
- bool gotheader;
-
- if (readBuf == NULL)
- {
- /*
- * First time through, permanently allocate readBuf. We do it this
- * way, rather than just making a static array, for two reasons: (1)
- * no need to waste the storage in most instantiations of the backend;
- * (2) a static char array isn't guaranteed to have any particular
- * alignment, whereas malloc() will provide MAXALIGN'd storage.
- */
- readBuf = (char *) malloc(XLOG_BLCKSZ);
- Assert(readBuf != NULL);
- }
-
- if (RecPtr == NULL)
- {
- RecPtr = &tmpRecPtr;
-
- /*
- * RecPtr is pointing to end+1 of the previous WAL record. If
- * we're at a page boundary, no more records can fit on the current
- * page. We must skip over the page header, but we can't do that
- * until we've read in the page, since the header size is variable.
- */
- }
- else
- {
- /*
- * In this case, the passed-in record pointer should already be
- * pointing to a valid record starting position.
- */
- if (!XRecOffIsValid(*RecPtr))
- ereport(PANIC,
- (errmsg("invalid record offset at %X/%X",
- (uint32) (*RecPtr >> 32), (uint32) *RecPtr)));
+ XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
- /*
- * Since we are going to a random position in WAL, forget any prior
- * state about what timeline we were in, and allow it to be any
- * timeline in expectedTLEs. We also set a flag to allow curFileTLI
- * to go backwards (but we can't reset that variable right here, since
- * we might not change files at all).
- */
- /* see comment in ValidXLogPageHeader */
- lastPageTLI = lastSegmentTLI = 0;
- randAccess = true; /* allow curFileTLI to go backwards too */
- }
+ /* Pass parameters to XLogPageRead */
+ private->fetching_ckpt = fetching_ckpt;
+ private->emode = emode;
/* This is the first try to read this page. */
- lastSourceFailed = false;
-retry:
- /* Read the page containing the record */
- if (!XLogPageRead(RecPtr, emode, fetching_ckpt, randAccess))
- return NULL;
-
- pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
- targetRecOff = (*RecPtr) % XLOG_BLCKSZ;
- if (targetRecOff == 0)
- {
- /*
- * At page start, so skip over page header. The Assert checks that
- * we're not scribbling on caller's record pointer; it's OK because we
- * can only get here in the continuing-from-prev-record case, since
- * XRecOffIsValid rejected the zero-page-offset case otherwise.
- */
- Assert(RecPtr == &tmpRecPtr);
- (*RecPtr) += pageHeaderSize;
- targetRecOff = pageHeaderSize;
- }
- else if (targetRecOff < pageHeaderSize)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid record offset at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- goto next_record_is_invalid;
- }
- if ((((XLogPageHeader) readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
- targetRecOff == pageHeaderSize)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("contrecord is requested by %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- goto next_record_is_invalid;
- }
-
- /*
- * Read the record length.
- *
- * NB: Even though we use an XLogRecord pointer here, the whole record
- * header might not fit on this page. xl_tot_len is the first field of
- * the struct, so it must be on this page (the records are MAXALIGNed),
- * but we cannot access any other fields until we've verified that we
- * got the whole header.
- */
- record = (XLogRecord *) (readBuf + (*RecPtr) % XLOG_BLCKSZ);
- total_len = record->xl_tot_len;
-
- /*
- * If the whole record header is on this page, validate it immediately.
- * Otherwise do just a basic sanity check on xl_tot_len, and validate the
- * rest of the header after reading it from the next page. The xl_tot_len
- * check is necessary here to ensure that we enter the "Need to reassemble
- * record" code path below; otherwise we might fail to apply
- * ValidXLogRecordHeader at all.
- */
- if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
- {
- if (!ValidXLogRecordHeader(RecPtr, record, emode, randAccess))
- goto next_record_is_invalid;
- gotheader = true;
- }
- else
- {
- if (total_len < SizeOfXLogRecord)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid record length at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- goto next_record_is_invalid;
- }
- gotheader = false;
- }
+ private->lastSourceFailed = false;
- /*
- * Allocate or enlarge readRecordBuf as needed. To avoid useless small
- * increases, round its size to a multiple of XLOG_BLCKSZ, and make sure
- * it's at least 4*Max(BLCKSZ, XLOG_BLCKSZ) to start with. (That is
- * enough for all "normal" records, but very large commit or abort records
- * might need more space.)
- */
- if (total_len > readRecordBufSize)
+ do
{
- uint32 newSize = total_len;
-
- newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
- newSize = Max(newSize, 4 * Max(BLCKSZ, XLOG_BLCKSZ));
- if (readRecordBuf)
- free(readRecordBuf);
- readRecordBuf = (char *) malloc(newSize);
- if (!readRecordBuf)
+ char *errormsg;
+ record = XLogReadRecord(xlogreader, RecPtr, &errormsg);
+ ReadRecPtr = xlogreader->ReadRecPtr;
+ EndRecPtr = xlogreader->EndRecPtr;
+ if (record == NULL)
{
- readRecordBufSize = 0;
- /* We treat this as a "bogus data" condition */
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("record length %u at %X/%X too long",
- total_len, (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- goto next_record_is_invalid;
- }
- readRecordBufSize = newSize;
- }
-
- len = XLOG_BLCKSZ - (*RecPtr) % XLOG_BLCKSZ;
- if (total_len > len)
- {
- /* Need to reassemble record */
- char *contrecord;
- XLogPageHeader pageHeader;
- XLogRecPtr pagelsn;
- char *buffer;
- uint32 gotlen;
-
- /* Initialize pagelsn to the beginning of the page this record is on */
- pagelsn = ((*RecPtr) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
-
- /* Copy the first fragment of the record from the first page. */
- memcpy(readRecordBuf, readBuf + (*RecPtr) % XLOG_BLCKSZ, len);
- buffer = readRecordBuf + len;
- gotlen = len;
+ ereport(emode_for_corrupt_record(xlogreader, RecPtr),
+ (errmsg_internal("%s", errormsg) /* already translated */));
- do
- {
- /* Calculate pointer to beginning of next page */
- XLByteAdvance(pagelsn, XLOG_BLCKSZ);
- /* Wait for the next page to become available */
- if (!XLogPageRead(&pagelsn, emode, false, false))
- return NULL;
-
- /* Check that the continuation on next page looks valid */
- pageHeader = (XLogPageHeader) readBuf;
- if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("there is no contrecord flag in log segment %s, offset %u",
- XLogFileNameP(curFileTLI, readSegNo),
- readOff)));
- goto next_record_is_invalid;
- }
- /*
- * Cross-check that xlp_rem_len agrees with how much of the record
- * we expect there to be left.
- */
- if (pageHeader->xlp_rem_len == 0 ||
- total_len != (pageHeader->xlp_rem_len + gotlen))
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid contrecord length %u in log segment %s, offset %u",
- pageHeader->xlp_rem_len,
- XLogFileNameP(curFileTLI, readSegNo),
- readOff)));
- goto next_record_is_invalid;
- }
+ private->lastSourceFailed = true;
- /* Append the continuation from this page to the buffer */
- pageHeaderSize = XLogPageHeaderSize(pageHeader);
- contrecord = (char *) readBuf + pageHeaderSize;
- len = XLOG_BLCKSZ - pageHeaderSize;
- if (pageHeader->xlp_rem_len < len)
- len = pageHeader->xlp_rem_len;
- memcpy(buffer, (char *) contrecord, len);
- buffer += len;
- gotlen += len;
-
- /* If we just reassembled the record header, validate it. */
- if (!gotheader)
+ if (private->readFile >= 0)
{
- record = (XLogRecord *) readRecordBuf;
- if (!ValidXLogRecordHeader(RecPtr, record, emode, randAccess))
- goto next_record_is_invalid;
- gotheader = true;
+ close(private->readFile);
+ private->readFile = -1;
}
- } while (pageHeader->xlp_rem_len > len);
-
- record = (XLogRecord *) readRecordBuf;
- if (!RecordIsValid(record, *RecPtr, emode))
- goto next_record_is_invalid;
- pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
- XLogSegNoOffsetToRecPtr(
- readSegNo,
- readOff + pageHeaderSize + MAXALIGN(pageHeader->xlp_rem_len),
- EndRecPtr);
- ReadRecPtr = *RecPtr;
- }
- else
- {
- /* Record does not cross a page boundary */
- if (!RecordIsValid(record, *RecPtr, emode))
- goto next_record_is_invalid;
- EndRecPtr = *RecPtr + MAXALIGN(total_len);
-
- ReadRecPtr = *RecPtr;
- memcpy(readRecordBuf, record, total_len);
- }
-
- /*
- * Special processing if it's an XLOG SWITCH record
- */
- if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
- {
- /* Pretend it extends to end of segment */
- EndRecPtr += XLogSegSize - 1;
- EndRecPtr -= EndRecPtr % XLogSegSize;
-
- /*
- * Pretend that readBuf contains the last page of the segment. This is
- * just to avoid Assert failure in StartupXLOG if XLOG ends with this
- * segment.
- */
- readOff = XLogSegSize - XLOG_BLCKSZ;
- }
- return record;
-
-next_record_is_invalid:
- lastSourceFailed = true;
-
- if (readFile >= 0)
- {
- close(readFile);
- readFile = -1;
- }
-
- /* In standby-mode, keep trying */
- if (StandbyMode)
- goto retry;
- else
- return NULL;
-}
-
-/*
- * Check whether the xlog header of a page just read in looks valid.
- *
- * This is just a convenience subroutine to avoid duplicated code in
- * ReadRecord. It's not intended for use from anywhere else.
- */
-static bool
-ValidXLogPageHeader(XLogPageHeader hdr, int emode, bool segmentonly)
-{
- XLogRecPtr recaddr;
-
- XLogSegNoOffsetToRecPtr(readSegNo, readOff, recaddr);
-
- if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
- {
- ereport(emode_for_corrupt_record(emode, recaddr),
- (errmsg("invalid magic number %04X in log segment %s, offset %u",
- hdr->xlp_magic,
- XLogFileNameP(curFileTLI, readSegNo),
- readOff)));
- return false;
- }
- if ((hdr->xlp_info & ~XLP_ALL_FLAGS) != 0)
- {
- ereport(emode_for_corrupt_record(emode, recaddr),
- (errmsg("invalid info bits %04X in log segment %s, offset %u",
- hdr->xlp_info,
- XLogFileNameP(curFileTLI, readSegNo),
- readOff)));
- return false;
- }
- if (hdr->xlp_info & XLP_LONG_HEADER)
- {
- XLogLongPageHeader longhdr = (XLogLongPageHeader) hdr;
-
- if (longhdr->xlp_sysid != ControlFile->system_identifier)
- {
- char fhdrident_str[32];
- char sysident_str[32];
-
- /*
- * Format sysids separately to keep platform-dependent format code
- * out of the translatable message string.
- */
- snprintf(fhdrident_str, sizeof(fhdrident_str), UINT64_FORMAT,
- longhdr->xlp_sysid);
- snprintf(sysident_str, sizeof(sysident_str), UINT64_FORMAT,
- ControlFile->system_identifier);
- ereport(emode_for_corrupt_record(emode, recaddr),
- (errmsg("WAL file is from different database system"),
- errdetail("WAL file database system identifier is %s, pg_control database system identifier is %s.",
- fhdrident_str, sysident_str)));
- return false;
- }
- if (longhdr->xlp_seg_size != XLogSegSize)
- {
- ereport(emode_for_corrupt_record(emode, recaddr),
- (errmsg("WAL file is from different database system"),
- errdetail("Incorrect XLOG_SEG_SIZE in page header.")));
- return false;
- }
- if (longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ)
- {
- ereport(emode_for_corrupt_record(emode, recaddr),
- (errmsg("WAL file is from different database system"),
- errdetail("Incorrect XLOG_BLCKSZ in page header.")));
- return false;
- }
- }
- else if (readOff == 0)
- {
- /* hmm, first page of file doesn't have a long header? */
- ereport(emode_for_corrupt_record(emode, recaddr),
- (errmsg("invalid info bits %04X in log segment %s, offset %u",
- hdr->xlp_info,
- XLogFileNameP(curFileTLI, readSegNo),
- readOff)));
- return false;
- }
-
- if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
- {
- ereport(emode_for_corrupt_record(emode, recaddr),
- (errmsg("unexpected pageaddr %X/%X in log segment %s, offset %u",
- (uint32) (hdr->xlp_pageaddr >> 32), (uint32) hdr->xlp_pageaddr,
- XLogFileNameP(curFileTLI, readSegNo),
- readOff)));
- return false;
- }
-
- /*
- * Check page TLI is one of the expected values.
- */
- if (!tliInHistory(hdr->xlp_tli, expectedTLEs))
- {
- ereport(emode_for_corrupt_record(emode, recaddr),
- (errmsg("unexpected timeline ID %u in log segment %s, offset %u",
- hdr->xlp_tli,
- XLogFileNameP(curFileTLI, readSegNo),
- readOff)));
- return false;
- }
-
- /*
- * Since child timelines are always assigned a TLI greater than their
- * immediate parent's TLI, we should never see TLI go backwards across
- * successive pages of a consistent WAL sequence.
- *
- * Of course this check should only be applied when advancing sequentially
- * across pages; therefore ReadRecord resets lastPageTLI and
- * lastSegmentTLI to zero when going to a random page.
- *
- * Sometimes we re-open a segment that's already been partially replayed.
- * In that case we cannot perform the normal TLI check: if there is a
- * timeline switch within the segment, the first page has a smaller TLI
- * than later pages following the timeline switch, and we might've read
- * them already. As a weaker test, we still check that it's not smaller
- * than the TLI we last saw at the beginning of a segment. Pass
- * segmentonly = true when re-validating the first page like that, and the
- * page you're actually interested in comes later.
- */
- if (hdr->xlp_tli < (segmentonly ? lastSegmentTLI : lastPageTLI))
- {
- ereport(emode_for_corrupt_record(emode, recaddr),
- (errmsg("out-of-sequence timeline ID %u (after %u) in log segment %s, offset %u",
- hdr->xlp_tli,
- segmentonly ? lastSegmentTLI : lastPageTLI,
- XLogFileNameP(curFileTLI, readSegNo),
- readOff)));
- return false;
- }
- lastPageTLI = hdr->xlp_tli;
- if (readOff == 0)
- lastSegmentTLI = hdr->xlp_tli;
-
- return true;
-}
-
-/*
- * Validate an XLOG record header.
- *
- * This is just a convenience subroutine to avoid duplicated code in
- * ReadRecord. It's not intended for use from anywhere else.
- */
-static bool
-ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record, int emode,
- bool randAccess)
-{
- /*
- * xl_len == 0 is bad data for everything except XLOG SWITCH, where it is
- * required.
- */
- if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
- {
- if (record->xl_len != 0)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid xlog switch record at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- }
- else if (record->xl_len == 0)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("record with zero length at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
- record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
- XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid record length at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- if (record->xl_rmid > RM_MAX_ID)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid resource manager ID %u at %X/%X",
- record->xl_rmid, (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- if (randAccess)
- {
- /*
- * We can't exactly verify the prev-link, but surely it should be less
- * than the record's own address.
- */
- if (!XLByteLT(record->xl_prev, *RecPtr))
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("record with incorrect prev-link %X/%X at %X/%X",
- (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
}
- }
- else
- {
- /*
- * Record's prev-link should exactly match our previous location. This
- * check guards against torn WAL pages where a stale but valid-looking
- * WAL record starts on a sector boundary.
- */
- if (!XLByteEQ(record->xl_prev, ReadRecPtr))
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("record with incorrect prev-link %X/%X at %X/%X",
- (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- }
+ } while (StandbyMode && record == NULL);
- return true;
+ return record;
}
/*
@@ -4792,7 +4223,8 @@ readRecoveryCommandFile(void)
* Exit archive-recovery state
*/
static void
-exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo)
+exitArchiveRecovery(XLogPageReadPrivate *private, TimeLineID endTLI,
+ XLogSegNo endLogSegNo)
{
char recoveryPath[MAXPGPATH];
char xlogpath[MAXPGPATH];
@@ -4811,10 +4243,10 @@ exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo)
* If the ending log segment is still open, close it (to avoid problems on
* Windows with trying to rename or delete an open file).
*/
- if (readFile >= 0)
+ if (private->readFile >= 0)
{
- close(readFile);
- readFile = -1;
+ close(private->readFile);
+ private->readFile = -1;
}
/*
@@ -5255,6 +4687,8 @@ StartupXLOG(void)
bool backupEndRequired = false;
bool backupFromStandby = false;
DBState dbstate_at_startup;
+ XLogReaderState *xlogreader;
+ XLogPageReadPrivate private;
/*
* Read control file and check XLOG status looks valid.
@@ -5416,6 +4850,18 @@ StartupXLOG(void)
if (StandbyMode)
OwnLatch(&XLogCtl->recoveryWakeupLatch);
+ /* Set up XLOG reader facility */
+ MemSet(&private, 0, sizeof(XLogPageReadPrivate));
+ private.readFile = -1;
+ xlogreader = XLogReaderAllocate(InvalidXLogRecPtr, &XLogPageRead, &private);
+ if (!xlogreader)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating an XLog reading processor")));
+ xlogreader->system_identifier = ControlFile->system_identifier;
+ xlogreader->expectedTLEs = expectedTLEs;
+
if (read_backup_label(&checkPointLoc, &backupEndRequired,
&backupFromStandby))
{
@@ -5423,7 +4869,7 @@ StartupXLOG(void)
* When a backup_label file is present, we want to roll forward from
* the checkpoint it identifies, rather than using pg_control.
*/
- record = ReadCheckpointRecord(checkPointLoc, 0);
+ record = ReadCheckpointRecord(xlogreader, checkPointLoc, 0);
if (record != NULL)
{
memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
@@ -5441,7 +4887,7 @@ StartupXLOG(void)
*/
if (XLByteLT(checkPoint.redo, checkPointLoc))
{
- if (!ReadRecord(&(checkPoint.redo), LOG, false))
+ if (!ReadRecord(xlogreader, checkPoint.redo, LOG, false))
ereport(FATAL,
(errmsg("could not find redo location referenced by checkpoint record"),
errhint("If you are not restoring from a backup, try removing the file \"%s/backup_label\".", DataDir)));
@@ -5465,7 +4911,7 @@ StartupXLOG(void)
*/
checkPointLoc = ControlFile->checkPoint;
RedoStartLSN = ControlFile->checkPointCopy.redo;
- record = ReadCheckpointRecord(checkPointLoc, 1);
+ record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1);
if (record != NULL)
{
ereport(DEBUG1,
@@ -5484,7 +4930,7 @@ StartupXLOG(void)
else
{
checkPointLoc = ControlFile->prevCheckPoint;
- record = ReadCheckpointRecord(checkPointLoc, 2);
+ record = ReadCheckpointRecord(xlogreader, checkPointLoc, 2);
if (record != NULL)
{
ereport(LOG,
@@ -5800,12 +5246,12 @@ StartupXLOG(void)
if (XLByteLT(checkPoint.redo, RecPtr))
{
/* back up to find the record */
- record = ReadRecord(&(checkPoint.redo), PANIC, false);
+ record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);
}
else
{
/* just have to read next record after CheckPoint */
- record = ReadRecord(NULL, LOG, false);
+ record = ReadRecord(xlogreader, InvalidXLogRecPtr, LOG, false);
}
if (record != NULL)
@@ -5968,7 +5414,7 @@ StartupXLOG(void)
break;
/* Else, try to fetch the next WAL record */
- record = ReadRecord(NULL, LOG, false);
+ record = ReadRecord(xlogreader, InvalidXLogRecPtr, LOG, false);
} while (record != NULL);
/*
@@ -6018,7 +5464,7 @@ StartupXLOG(void)
* Re-fetch the last valid or last applied record, so we can identify the
* exact endpoint of what we consider the valid portion of WAL.
*/
- record = ReadRecord(&LastRec, PANIC, false);
+ record = ReadRecord(xlogreader, LastRec, PANIC, false);
EndOfLog = EndRecPtr;
XLByteToPrevSeg(EndOfLog, endLogSegNo);
@@ -6122,7 +5568,7 @@ StartupXLOG(void)
* we will use that below.)
*/
if (InArchiveRecovery)
- exitArchiveRecovery(curFileTLI, endLogSegNo);
+ exitArchiveRecovery(&private, xlogreader->readPageTLI, endLogSegNo);
/*
* Prepare to write WAL starting at EndOfLog position, and init xlog
@@ -6141,8 +5587,15 @@ StartupXLOG(void)
* record spans, not the one it starts in. The last block is indeed the
* one we want to use.
*/
- Assert(readOff == (XLogCtl->xlblocks[0] - XLOG_BLCKSZ) % XLogSegSize);
- memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
+ if (EndOfLog % XLOG_BLCKSZ == 0)
+ {
+ memset(Insert->currpage, 0, XLOG_BLCKSZ);
+ }
+ else
+ {
+ Assert(private.readOff == (XLogCtl->xlblocks[0] - XLOG_BLCKSZ) % XLogSegSize);
+ memcpy((char *) Insert->currpage, xlogreader->readBuf, XLOG_BLCKSZ);
+ }
Insert->currpos = (char *) Insert->currpage +
(EndOfLog + XLOG_BLCKSZ - XLogCtl->xlblocks[0]);
@@ -6293,23 +5746,13 @@ StartupXLOG(void)
if (standbyState != STANDBY_DISABLED)
ShutdownRecoveryTransactionEnvironment();
- /* Shut down readFile facility, free space */
- if (readFile >= 0)
+ /* Shut down xlogreader */
+ if (private.readFile >= 0)
{
- close(readFile);
- readFile = -1;
- }
- if (readBuf)
- {
- free(readBuf);
- readBuf = NULL;
- }
- if (readRecordBuf)
- {
- free(readRecordBuf);
- readRecordBuf = NULL;
- readRecordBufSize = 0;
+ close(private.readFile);
+ private.readFile = -1;
}
+ XLogReaderFree(xlogreader);
/*
* If any of the critical GUCs have changed, log them before we allow
@@ -6523,7 +5966,7 @@ LocalSetXLogInsertAllowed(void)
* 1 for "primary", 2 for "secondary", 0 for "other" (backup_label)
*/
static XLogRecord *
-ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt)
+ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int whichChkpt)
{
XLogRecord *record;
@@ -6547,7 +5990,7 @@ ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt)
return NULL;
}
- record = ReadRecord(&RecPtr, LOG, true);
+ record = ReadRecord(xlogreader, RecPtr, LOG, true);
if (record == NULL)
{
@@ -9311,28 +8754,24 @@ CancelBackup(void)
* XLogPageRead() to try fetching the record from another source, or to
* sleep and retry.
*/
-static bool
-XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
- bool randAccess)
+static int
+XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
+ char *readBuf, TimeLineID *readTLI)
{
+ XLogPageReadPrivate *private =
+ (XLogPageReadPrivate *) xlogreader->private_data;
+ int emode = private->emode;
uint32 targetPageOff;
- uint32 targetRecOff;
XLogSegNo targetSegNo;
- XLByteToSeg(*RecPtr, targetSegNo);
- targetPageOff = (((*RecPtr) % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
- targetRecOff = (*RecPtr) % XLOG_BLCKSZ;
-
- /* Fast exit if we have read the record in the current buffer already */
- if (!lastSourceFailed && targetSegNo == readSegNo &&
- targetPageOff == readOff && targetRecOff < readLen)
- return true;
+ XLByteToSeg(targetPagePtr, targetSegNo);
+ targetPageOff = targetPagePtr % XLogSegSize;
/*
* See if we need to switch to a new segment because the requested record
* is not in the currently open one.
*/
- if (readFile >= 0 && !XLByteInSeg(*RecPtr, readSegNo))
+ if (private->readFile >= 0 && !XLByteInSeg(targetPagePtr, private->readSegNo))
{
/*
* Request a restartpoint if we've replayed too much xlog since the
@@ -9340,52 +8779,46 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
*/
if (StandbyMode && bgwriterLaunched)
{
- if (XLogCheckpointNeeded(readSegNo))
+ if (XLogCheckpointNeeded(private->readSegNo))
{
(void) GetRedoRecPtr();
- if (XLogCheckpointNeeded(readSegNo))
+ if (XLogCheckpointNeeded(private->readSegNo))
RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
}
}
- close(readFile);
- readFile = -1;
- readSource = 0;
+ close(private->readFile);
+ private->readFile = -1;
+ private->readSource = 0;
}
- XLByteToSeg(*RecPtr, readSegNo);
+ XLByteToSeg(targetPagePtr, private->readSegNo);
retry:
/* See if we need to retrieve more data */
- if (readFile < 0 ||
- (readSource == XLOG_FROM_STREAM && !XLByteLT(*RecPtr, receivedUpto)))
+ if (private->readFile < 0 ||
+ (private->readSource == XLOG_FROM_STREAM &&
+ !XLByteLT(targetPagePtr + reqLen, receivedUpto)))
{
if (StandbyMode)
{
- if (!WaitForWALToBecomeAvailable(*RecPtr, randAccess,
- fetching_ckpt))
+ if (!WaitForWALToBecomeAvailable(private, targetPagePtr + reqLen))
goto triggered;
}
- else
+ /* In archive or crash recovery. */
+ else if (private->readFile < 0)
{
- /* In archive or crash recovery. */
- if (readFile < 0)
- {
- int source;
-
- /* Reset curFileTLI if random fetch. */
- if (randAccess)
- curFileTLI = 0;
+ int source;
- if (InArchiveRecovery)
- source = XLOG_FROM_ANY;
- else
- source = XLOG_FROM_PG_XLOG;
+ if (InArchiveRecovery)
+ source = XLOG_FROM_ANY;
+ else
+ source = XLOG_FROM_PG_XLOG;
- readFile = XLogFileReadAnyTLI(readSegNo, emode, source);
- if (readFile < 0)
- return false;
- }
+ private->readFile =
+ XLogFileReadAnyTLI(private, private->readSegNo, emode, source);
+ if (private->readFile < 0)
+ return -1;
}
}
@@ -9393,7 +8826,7 @@ retry:
* At this point, we have the right segment open and if we're streaming we
* know the requested record is in it.
*/
- Assert(readFile != -1);
+ Assert(private->readFile != -1);
/*
* If the current segment is being streamed from master, calculate how
@@ -9401,98 +8834,72 @@ retry:
* requested record has been received, but this is for the benefit of
* future calls, to allow quick exit at the top of this function.
*/
- if (readSource == XLOG_FROM_STREAM)
+ if (private->readSource == XLOG_FROM_STREAM)
{
- if (((*RecPtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
- {
- readLen = XLOG_BLCKSZ;
- }
+ if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
+ private->readLen = XLOG_BLCKSZ;
else
- readLen = receivedUpto % XLogSegSize - targetPageOff;
+ private->readLen = receivedUpto % XLogSegSize - targetPageOff;
}
else
- readLen = XLOG_BLCKSZ;
-
- if (!readFileHeaderValidated && targetPageOff != 0)
- {
- /*
- * Whenever switching to a new WAL segment, we read the first page of
- * the file and validate its header, even if that's not where the
- * target record is. This is so that we can check the additional
- * identification info that is present in the first page's "long"
- * header.
- */
- readOff = 0;
- if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
- {
- char fname[MAXFNAMELEN];
- XLogFileName(fname, curFileTLI, readSegNo);
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errcode_for_file_access(),
- errmsg("could not read from log segment %s, offset %u: %m",
- fname, readOff)));
- goto next_record_is_invalid;
- }
- if (!ValidXLogPageHeader((XLogPageHeader) readBuf, emode, true))
- goto next_record_is_invalid;
- }
+ private->readLen = XLOG_BLCKSZ;
/* Read the requested page */
- readOff = targetPageOff;
- if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
+ private->readOff = targetPageOff;
+ if (lseek(private->readFile, (off_t) private->readOff, SEEK_SET) < 0)
{
- char fname[MAXFNAMELEN];
- XLogFileName(fname, curFileTLI, readSegNo);
- ereport(emode_for_corrupt_record(emode, *RecPtr),
+ char fname[MAXFNAMELEN];
+
+ XLogFileName(fname, curFileTLI, private->readSegNo);
+ ereport(emode_for_corrupt_record(xlogreader, targetPagePtr + reqLen),
(errcode_for_file_access(),
- errmsg("could not seek in log segment %s to offset %u: %m",
- fname, readOff)));
+ errmsg("could not seek in log segment %s to offset %u: %m",
+ fname, private->readOff)));
goto next_record_is_invalid;
}
- if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+
+ if (read(private->readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
{
- char fname[MAXFNAMELEN];
- XLogFileName(fname, curFileTLI, readSegNo);
- ereport(emode_for_corrupt_record(emode, *RecPtr),
+ char fname[MAXFNAMELEN];
+
+ XLogFileName(fname, curFileTLI, private->readSegNo);
+ ereport(emode_for_corrupt_record(xlogreader, targetPagePtr + reqLen),
(errcode_for_file_access(),
- errmsg("could not read from log segment %s, offset %u: %m",
- fname, readOff)));
+ errmsg("could not read from log segment %s, offset %u: %m",
+ fname, private->readOff)));
goto next_record_is_invalid;
}
- if (!ValidXLogPageHeader((XLogPageHeader) readBuf, emode, false))
- goto next_record_is_invalid;
- readFileHeaderValidated = true;
+ Assert(targetSegNo == private->readSegNo);
+ Assert(targetPageOff == private->readOff);
+ Assert(reqLen <= private->readLen);
- Assert(targetSegNo == readSegNo);
- Assert(targetPageOff == readOff);
- Assert(targetRecOff < readLen);
-
- return true;
+ *readTLI = curFileTLI;
+ return private->readLen;
next_record_is_invalid:
- lastSourceFailed = true;
+ private->lastSourceFailed = true;
- if (readFile >= 0)
- close(readFile);
- readFile = -1;
- readLen = 0;
- readSource = 0;
+ if (private->readFile >= 0)
+ close(private->readFile);
+ private->readFile = -1;
+ private->readLen = 0;
+ private->readSource = 0;
/* In standby-mode, keep trying */
if (StandbyMode)
goto retry;
else
- return false;
+ return -1;
triggered:
- if (readFile >= 0)
- close(readFile);
- readFile = -1;
- readLen = 0;
- readSource = 0;
+ if (private->readFile >= 0)
+ close(private->readFile);
+ private->readFile = -1;
+ private->readLen = 0;
+ private->readSource = 0;
- return false;
+ return -1;
}
/*
@@ -9507,8 +8914,7 @@ triggered:
* false.
*/
static bool
-WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
- bool fetching_ckpt)
+WaitForWALToBecomeAvailable(XLogPageReadPrivate *private, XLogRecPtr RecPtr)
{
static pg_time_t last_fail_time = 0;
pg_time_t now;
@@ -9534,12 +8940,12 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* part of advancing to the next state.
*-------
*/
- if (currentSource == 0)
- currentSource = XLOG_FROM_ARCHIVE;
+ if (private->currentSource == 0)
+ private->currentSource = XLOG_FROM_ARCHIVE;
for (;;)
{
- int oldSource = currentSource;
+ int oldSource = private->currentSource;
/*
* First check if we failed to read from the current source, and
@@ -9547,13 +8953,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* happened outside this function, e.g when a CRC check fails on a
* record, or within this loop.
*/
- if (lastSourceFailed)
+ if (private->lastSourceFailed)
{
- switch (currentSource)
+ switch (private->currentSource)
{
case XLOG_FROM_ARCHIVE:
- currentSource = XLOG_FROM_PG_XLOG;
+ private->currentSource = XLOG_FROM_PG_XLOG;
break;
case XLOG_FROM_PG_XLOG:
@@ -9578,7 +8984,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
*/
if (PrimaryConnInfo)
{
- XLogRecPtr ptr = fetching_ckpt ? RedoStartLSN : RecPtr;
+ XLogRecPtr ptr = private->fetching_ckpt ? RedoStartLSN : RecPtr;
RequestXLogStreaming(ptr, PrimaryConnInfo);
}
@@ -9587,7 +8993,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* immediate failure if we didn't launch walreceiver, and
* move on to the next state.
*/
- currentSource = XLOG_FROM_STREAM;
+ private->currentSource = XLOG_FROM_STREAM;
break;
case XLOG_FROM_STREAM:
@@ -9620,7 +9026,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
{
if (rescanLatestTimeLine())
{
- currentSource = XLOG_FROM_ARCHIVE;
+ private->currentSource = XLOG_FROM_ARCHIVE;
break;
}
}
@@ -9639,60 +9045,61 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
now = (pg_time_t) time(NULL);
}
last_fail_time = now;
- currentSource = XLOG_FROM_ARCHIVE;
+ private->currentSource = XLOG_FROM_ARCHIVE;
break;
default:
- elog(ERROR, "unexpected WAL source %d", currentSource);
+ elog(ERROR, "unexpected WAL source %d", private->currentSource);
}
}
- else if (currentSource == XLOG_FROM_PG_XLOG)
+ else if (private->currentSource == XLOG_FROM_PG_XLOG)
{
/*
* We just successfully read a file in pg_xlog. We prefer files
* in the archive over ones in pg_xlog, so try the next file
* again from the archive first.
*/
- currentSource = XLOG_FROM_ARCHIVE;
+ private->currentSource = XLOG_FROM_ARCHIVE;
}
- if (currentSource != oldSource)
+ if (private->currentSource != oldSource)
elog(DEBUG2, "switched WAL source from %s to %s after %s",
- xlogSourceNames[oldSource], xlogSourceNames[currentSource],
- lastSourceFailed ? "failure" : "success");
+ xlogSourceNames[oldSource],
+ xlogSourceNames[private->currentSource],
+ private->lastSourceFailed ? "failure" : "success");
/*
* We've now handled possible failure. Try to read from the chosen
* source.
*/
- lastSourceFailed = false;
+ private->lastSourceFailed = false;
- switch (currentSource)
+ switch (private->currentSource)
{
case XLOG_FROM_ARCHIVE:
case XLOG_FROM_PG_XLOG:
/* Close any old file we might have open. */
- if (readFile >= 0)
+ if (private->readFile >= 0)
{
- close(readFile);
- readFile = -1;
+ close(private->readFile);
+ private->readFile = -1;
}
- /* Reset curFileTLI if random fetch. */
- if (randAccess)
- curFileTLI = 0;
/*
* Try to restore the file from archive, or read an existing
* file from pg_xlog.
*/
- readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, currentSource);
- if (readFile >= 0)
+ private->readFile =
+ XLogFileReadAnyTLI(private, private->readSegNo, DEBUG2,
+ private->currentSource);
+
+ if (private->readFile >= 0)
return true; /* success! */
/*
* Nope, not found in archive or pg_xlog.
*/
- lastSourceFailed = true;
+ private->lastSourceFailed = true;
break;
case XLOG_FROM_STREAM:
@@ -9704,7 +9111,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
*/
if (!WalRcvInProgress())
{
- lastSourceFailed = true;
+ private->lastSourceFailed = true;
break;
}
@@ -9745,17 +9152,18 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* open already. Use XLOG_FROM_STREAM so that source info
* is set correctly and XLogReceiptTime isn't changed.
*/
- if (readFile < 0)
+ if (private->readFile < 0)
{
- readFile = XLogFileRead(readSegNo, PANIC,
- recoveryTargetTLI,
- XLOG_FROM_STREAM, false);
- Assert(readFile >= 0);
+ private->readFile =
+ XLogFileRead(private, private->readSegNo, PANIC,
+ recoveryTargetTLI,
+ XLOG_FROM_STREAM, false);
+ Assert(private->readFile >= 0);
}
else
{
/* just make sure source info is correct... */
- readSource = XLOG_FROM_STREAM;
+ private->readSource = XLOG_FROM_STREAM;
XLogReceiptSource = XLOG_FROM_STREAM;
return true;
}
@@ -9776,7 +9184,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* will move on to replay the streamed WAL from pg_xlog,
* and then recheck the trigger and exit replay.
*/
- lastSourceFailed = true;
+ private->lastSourceFailed = true;
break;
}
@@ -9793,7 +9201,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
}
default:
- elog(ERROR, "unexpected WAL source %d", currentSource);
+ elog(ERROR, "unexpected WAL source %d", private->currentSource);
}
/*
@@ -9825,11 +9233,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* erroneously suppressed.
*/
static int
-emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
+emode_for_corrupt_record(XLogReaderState *state, XLogRecPtr RecPtr)
{
+ XLogPageReadPrivate *private = (XLogPageReadPrivate *) state->private_data;
+ int emode = private->emode;
static XLogRecPtr lastComplaint = 0;
- if (readSource == XLOG_FROM_PG_XLOG && emode == LOG)
+ if (private->readSource == XLOG_FROM_PG_XLOG && emode == LOG)
{
if (XLByteEQ(RecPtr, lastComplaint))
emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
new file mode 100644
index 0000000..7ebb5bd
--- /dev/null
+++ b/src/backend/access/transam/xlogreader.c
@@ -0,0 +1,992 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogreader.c
+ * Generic xlog reading facility
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogreader.c
+ *
+ * NOTES
+ * Documentation about how to use this interface can be found in
+ * xlogreader.h, more specifically in the definition of the
+ * XLogReaderState struct where all parameters are documented.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/transam.h"
+#include "access/xlog_internal.h"
+#include "access/xlogreader.h"
+#include "catalog/pg_control.h"
+
+static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
+
+static bool ValidXLogPageHeader(XLogReaderState *state, XLogRecPtr recptr,
+ XLogPageHeader hdr);
+static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
+ XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
+static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
+ XLogRecPtr recptr);
+static int ReadPageInternal(struct XLogReaderState *state, XLogRecPtr pageptr,
+ int reqLen);
+static void report_invalid_record(XLogReaderState *state, const char *fmt, ...)
+/* This extension allows gcc to check the format string for consistency with
+ the supplied arguments. */
+__attribute__((format(PG_PRINTF_ATTRIBUTE, 2, 3)));
+
+/* size of the buffer allocated for error message. */
+#define MAX_ERRORMSG_LEN 1000
+
+/*
+ * Construct a string in state->errormsg_buf explaining what's wrong with
+ * the current record being read.
+ */
+static void
+report_invalid_record(XLogReaderState *state, const char *fmt, ...)
+{
+ va_list args;
+ va_start(args, fmt);
+ vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN, fmt, args);
+ va_end(args);
+}
+
+/*
+ * Allocate and initialize a new xlog reader
+ *
+ * Returns NULL if the xlogreader couldn't be allocated.
+ */
+XLogReaderState *
+XLogReaderAllocate(XLogRecPtr startpoint, XLogPageReadCB pagereadfunc,
+ void *private_data)
+{
+ XLogReaderState *state;
+
+ state = (XLogReaderState *) malloc(sizeof(XLogReaderState));
+ if (!state)
+ return NULL;
+ MemSet(state, 0, sizeof(XLogReaderState));
+
+ /*
+ * Permanently allocate readBuf. We do it this way, rather than just
+ * making a static array, for two reasons: (1) no need to waste the
+ * storage in most instantiations of the backend; (2) a static char array
+ * isn't guaranteed to have any particular alignment, whereas malloc()
+ * will provide MAXALIGN'd storage.
+ */
+ state->readBuf = (char *) malloc(XLOG_BLCKSZ);
+ if (!state->readBuf)
+ {
+ free(state);
+ return NULL;
+ }
+
+ state->read_page = pagereadfunc;
+ state->private_data = private_data;
+ state->EndRecPtr = startpoint;
+ state->readPageTLI = 0;
+ state->expectedTLEs = NIL;
+ state->system_identifier = 0;
+ state->errormsg_buf = malloc(MAX_ERRORMSG_LEN + 1);
+ if (!state->errormsg_buf)
+ {
+ free(state->readBuf);
+ free(state);
+ return NULL;
+ }
+ state->errormsg_buf[0] = '\0';
+
+ /*
+ * Allocate an initial readRecordBuf of minimal size, which can later be
+ * enlarged if necessary.
+ */
+ if (!allocate_recordbuf(state, 0))
+ {
+ free(state->errormsg_buf);
+ free(state->readBuf);
+ free(state);
+ return NULL;
+ }
+
+ return state;
+}
+
+void
+XLogReaderFree(XLogReaderState *state)
+{
+ free(state->errormsg_buf);
+ if (state->readRecordBuf)
+ free(state->readRecordBuf);
+ free(state->readBuf);
+ free(state);
+}
+
+/*
+ * Allocate readRecordBuf to fit a record of at least the given length.
+ * Returns true if successful, false if out of memory.
+ *
+ * readRecordBufSize is set to the new buffer size.
+ *
+ * To avoid useless small increases, round its size to a multiple of
+ * XLOG_BLCKSZ, and make sure it's at least 5*Max(BLCKSZ, XLOG_BLCKSZ) to start
+ * with. (That is enough for all "normal" records, but very large commit or
+ * abort records might need more space.)
+ */
+static bool
+allocate_recordbuf(XLogReaderState *state, uint32 reclength)
+{
+ uint32 newSize = reclength;
+
+ newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
+ newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));
+
+ if (state->readRecordBuf)
+ free(state->readRecordBuf);
+ state->readRecordBuf = (char *) malloc(newSize);
+ if (!state->readRecordBuf)
+ {
+ state->readRecordBufSize = 0;
+ return false;
+ }
+
+ state->readRecordBufSize = newSize;
+ return true;
+}
+
+/*
+ * Attempt to read an XLOG record.
+ *
+ * If RecPtr is not NULL, try to read a record at that position. Otherwise
+ * try to read a record just after the last one previously read.
+ *
+ * If no valid record is available, returns NULL. On NULL return, *errormsg
+ * is usually set to a string with details of the failure. One typical error
+ * where *errormsg is not set is when the read_page callback returns an error.
+ *
+ * The returned pointer (or *errormsg) points to an internal buffer that's
+ * valid until the next call to XLogReadRecord.
+ */
+XLogRecord *
+XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
+{
+ XLogRecord *record;
+ XLogRecPtr tmpRecPtr = state->EndRecPtr;
+ XLogRecPtr targetPagePtr;
+ bool randAccess = false;
+ uint32 len,
+ total_len;
+ uint32 targetRecOff;
+ uint32 pageHeaderSize;
+ bool gotheader;
+ int readOff;
+
+ *errormsg = NULL;
+ state->errormsg_buf[0] = '\0';
+
+ if (RecPtr == InvalidXLogRecPtr)
+ {
+ RecPtr = tmpRecPtr;
+
+ if (state->ReadRecPtr == InvalidXLogRecPtr)
+ randAccess = true;
+
+ /*
+ * RecPtr is pointing to end+1 of the previous WAL record. If we're
+ * at a page boundary, no more records can fit on the current page. We
+ * must skip over the page header, but we can't do that until we've
+ * read in the page, since the header size is variable.
+ */
+ }
+ else
+ {
+ /*
+ * In this case, the passed-in record pointer should already be
+ * pointing to a valid record starting position.
+ */
+ Assert(XRecOffIsValid(RecPtr));
+ randAccess = true; /* allow readPageTLI to go backwards too */
+ }
+
+ targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);
+
+ /* Read the page containing the record into state->readBuf */
+ readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogRecord);
+
+ if (readOff < 0)
+ {
+ if (state->errormsg_buf[0] != '\0')
+ *errormsg = state->errormsg_buf;
+ return NULL;
+ }
+
+ /* ReadPageInternal always returns at least the page header */
+ pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
+ targetRecOff = RecPtr % XLOG_BLCKSZ;
+ if (targetRecOff == 0)
+ {
+ /*
+ * At page start, so skip over page header.
+ */
+ RecPtr += pageHeaderSize;
+ targetRecOff = pageHeaderSize;
+ }
+ else if (targetRecOff < pageHeaderSize)
+ {
+ report_invalid_record(state, "invalid record offset at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr);
+ *errormsg = state->errormsg_buf;
+ return NULL;
+ }
+
+ if ((((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
+ targetRecOff == pageHeaderSize)
+ {
+ report_invalid_record(state, "contrecord is requested by %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr);
+ *errormsg = state->errormsg_buf;
+ return NULL;
+ }
+
+ /* ReadPageInternal has verified the page header */
+ Assert(pageHeaderSize <= readOff);
+
+ /*
+ * Ensure the whole record header or at least the part on this page is
+ * read.
+ */
+ readOff = ReadPageInternal(state,
+ targetPagePtr,
+ Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
+ if (readOff < 0)
+ {
+ if (state->errormsg_buf[0] != '\0')
+ *errormsg = state->errormsg_buf;
+ return NULL;
+ }
+
+ /*
+ * Read the record length.
+ *
+ * NB: Even though we use an XLogRecord pointer here, the whole record
+ * header might not fit on this page. xl_tot_len is the first field of the
+ * struct, so it must be on this page (the records are MAXALIGNed), but we
+ * cannot access any other fields until we've verified that we got the
+ * whole header.
+ */
+ record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+ total_len = record->xl_tot_len;
+
+ /*
+ * If the whole record header is on this page, validate it immediately.
+ * Otherwise do just a basic sanity check on xl_tot_len, and validate the
+ * rest of the header after reading it from the next page. The xl_tot_len
+ * check is necessary here to ensure that we enter the "Need to reassemble
+ * record" code path below; otherwise we might fail to apply
+ * ValidXLogRecordHeader at all.
+ */
+ if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
+ {
+ if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
+ randAccess))
+ {
+ if (state->errormsg_buf[0] != '\0')
+ *errormsg = state->errormsg_buf;
+ return NULL;
+ }
+ gotheader = true;
+ }
+ else
+ {
+ /* XXX: more validation should be done here */
+ if (total_len < SizeOfXLogRecord)
+ {
+ report_invalid_record(state, "invalid record length at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr);
+ *errormsg = state->errormsg_buf;
+ return NULL;
+ }
+ gotheader = false;
+ }
+
+ /*
+ * Enlarge readRecordBuf as needed.
+ */
+ if (total_len > state->readRecordBufSize &&
+ !allocate_recordbuf(state, total_len))
+ {
+ /* We treat this as a "bogus data" condition */
+ report_invalid_record(state, "record length %u at %X/%X too long",
+ total_len,
+ (uint32) (RecPtr >> 32), (uint32) RecPtr);
+ *errormsg = state->errormsg_buf;
+ return NULL;
+ }
+
+ len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
+ if (total_len > len)
+ {
+ /* Need to reassemble record */
+ char *contdata;
+ XLogPageHeader pageHeader;
+ char *buffer;
+ uint32 gotlen;
+
+ /* Copy the first fragment of the record from the first page. */
+ memcpy(state->readRecordBuf,
+ state->readBuf + RecPtr % XLOG_BLCKSZ, len);
+ buffer = state->readRecordBuf + len;
+ gotlen = len;
+
+ do
+ {
+ /* Calculate pointer to beginning of next page */
+ XLByteAdvance(targetPagePtr, XLOG_BLCKSZ);
+
+ /* Wait for the next page to become available */
+ readOff = ReadPageInternal(state, targetPagePtr,
+ Min(len, XLOG_BLCKSZ));
+
+ if (readOff < 0)
+ goto err;
+
+ Assert(SizeOfXLogShortPHD <= readOff);
+
+ /* Check that the continuation on next page looks valid */
+ pageHeader = (XLogPageHeader) state->readBuf;
+ if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
+ {
+ report_invalid_record(state,
+ "there is no contrecord flag at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr);
+ goto err;
+ }
+
+ /*
+ * Cross-check that xlp_rem_len agrees with how much of the record
+ * we expect there to be left.
+ */
+ if (pageHeader->xlp_rem_len == 0 ||
+ total_len != (pageHeader->xlp_rem_len + gotlen))
+ {
+ report_invalid_record(state,
+ "invalid contrecord length %u at %X/%X",
+ pageHeader->xlp_rem_len,
+ (uint32) (RecPtr >> 32), (uint32) RecPtr);
+ goto err;
+ }
+
+ /* Append the continuation from this page to the buffer */
+ pageHeaderSize = XLogPageHeaderSize(pageHeader);
+ Assert(pageHeaderSize <= readOff);
+
+ contdata = (char *) state->readBuf + pageHeaderSize;
+ len = XLOG_BLCKSZ - pageHeaderSize;
+ if (pageHeader->xlp_rem_len < len)
+ len = pageHeader->xlp_rem_len;
+
+ memcpy(buffer, (char *) contdata, len);
+ buffer += len;
+ gotlen += len;
+
+ /* If we just reassembled the record header, validate it. */
+ if (!gotheader)
+ {
+ record = (XLogRecord *) state->readRecordBuf;
+ if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr,
+ record, randAccess))
+ goto err;
+ gotheader = true;
+ }
+ } while (gotlen < total_len);
+
+ Assert(gotheader);
+
+ record = (XLogRecord *) state->readRecordBuf;
+ if (!ValidXLogRecord(state, record, RecPtr))
+ goto err;
+
+ pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
+ state->ReadRecPtr = RecPtr;
+ state->EndRecPtr = targetPagePtr + pageHeaderSize
+ + MAXALIGN(pageHeader->xlp_rem_len);
+ }
+ else
+ {
+ /* Wait for the record data to become available */
+ readOff = ReadPageInternal(state, targetPagePtr,
+ Min(targetRecOff + total_len, XLOG_BLCKSZ));
+ if (readOff < 0)
+ goto err;
+
+ /* Record does not cross a page boundary */
+ if (!ValidXLogRecord(state, record, RecPtr))
+ goto err;
+
+ state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+
+ state->ReadRecPtr = RecPtr;
+ memcpy(state->readRecordBuf, record, total_len);
+ }
+
+ /*
+ * Special processing if it's an XLOG SWITCH record
+ */
+ if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
+ {
+ /* Pretend it extends to end of segment */
+ state->EndRecPtr += XLogSegSize - 1;
+ state->EndRecPtr -= state->EndRecPtr % XLogSegSize;
+ }
+
+ return record;
+
+err:
+ /*
+ * Invalidate the xlog page we've cached. We might read from a different
+ * source after failure.
+ */
+ state->readSegNo = 0;
+ state->readOff = 0;
+ state->readLen = 0;
+
+ if (state->errormsg_buf[0] != '\0')
+ *errormsg = state->errormsg_buf;
+
+ return NULL;
+}
+
+/*
+ * Find the first record with an lsn >= RecPtr.
+ *
+ * Useful for checking whether RecPtr is a valid xlog address for reading, and to
+ * find the first valid address after some address when dumping records for
+ * debugging purposes.
+ */
+XLogRecPtr
+XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
+{
+ XLogReaderState saved_state = *state;
+ XLogRecPtr targetPagePtr;
+ XLogRecPtr tmpRecPtr;
+ int targetRecOff;
+ XLogRecPtr found = InvalidXLogRecPtr;
+ uint32 pageHeaderSize;
+ XLogPageHeader header;
+ XLogRecord *record;
+ int readLen;
+ char *errormsg;
+
+ if (RecPtr == InvalidXLogRecPtr)
+ RecPtr = state->EndRecPtr;
+
+ targetRecOff = RecPtr % XLOG_BLCKSZ;
+
+ /* scroll back to page boundary */
+ targetPagePtr = RecPtr - targetRecOff;
+
+ /* Read the page containing the record */
+ readLen = ReadPageInternal(state, targetPagePtr, targetRecOff);
+ if (readLen < 0)
+ goto err;
+
+ header = (XLogPageHeader) state->readBuf;
+
+ pageHeaderSize = XLogPageHeaderSize(header);
+
+ /* make sure we have enough data for the page header */
+ readLen = ReadPageInternal(state, targetPagePtr, pageHeaderSize);
+ if (readLen < 0)
+ goto err;
+
+ /* skip over potential continuation data */
+ if (header->xlp_info & XLP_FIRST_IS_CONTRECORD)
+ {
+ /* record headers are MAXALIGN'ed */
+ tmpRecPtr = targetPagePtr + pageHeaderSize
+ + MAXALIGN(header->xlp_rem_len);
+ }
+ else
+ {
+ tmpRecPtr = targetPagePtr + pageHeaderSize;
+ }
+
+ /*
+ * We now know that tmpRecPtr is an address pointing to a valid XLogRecord,
+ * because either we're at the first record after the beginning of a page or
+ * we just jumped over the remaining data of a continuation.
+ */
+ while ((record = XLogReadRecord(state, tmpRecPtr, &errormsg)))
+ {
+ /* continue after the record */
+ tmpRecPtr = InvalidXLogRecPtr;
+
+ /* past the record we've found, break out */
+ if (XLByteLE(RecPtr, state->ReadRecPtr))
+ {
+ found = state->ReadRecPtr;
+ goto out;
+ }
+ }
+
+err:
+out:
+ /* Restore state to what we had before finding the record */
+ saved_state.readRecordBuf = state->readRecordBuf;
+ saved_state.readRecordBufSize = state->readRecordBufSize;
+ *state = saved_state;
+ return found;
+}
+
+/*
+ * Read a single xlog page including at least [pagestart, RecPtr] of valid data
+ * via the read_page() callback.
+ *
+ * Returns -1 if the required page cannot be read for some reason.
+ *
+ * We fetch the page from a reader-local cache if we know we have the required
+ * data and if there hasn't been any error since caching the data.
+ */
+static int
+ReadPageInternal(struct XLogReaderState *state, XLogRecPtr pageptr,
+ int reqLen)
+{
+ int readLen;
+ uint32 targetPageOff;
+ XLogSegNo targetSegNo;
+ XLogPageHeader hdr;
+
+ Assert((pageptr % XLOG_BLCKSZ) == 0);
+
+ XLByteToSeg(pageptr, targetSegNo);
+ targetPageOff = (pageptr % XLogSegSize);
+
+ /* check whether we have all the requested data already */
+ if (targetSegNo == state->readSegNo && targetPageOff == state->readOff &&
+ reqLen < state->readLen)
+ return state->readLen;
+
+ /*
+ * Data is not cached.
+ *
+ * Every time we actually read the page, even if we looked at parts of it
+ * before, we need to do verification as the read_page callback might now
+ * be rereading data from a different source.
+ *
+ * Whenever switching to a new WAL segment, we read the first page of the
+ * file and validate its header, even if that's not where the target record
+ * is. This is so that we can check the additional identification info
+ * that is present in the first page's "long" header.
+ */
+ if (targetSegNo != state->readSegNo &&
+ targetPageOff != 0)
+ {
+ XLogPageHeader hdr;
+ XLogRecPtr targetSegmentPtr = pageptr - targetPageOff;
+
+ readLen = state->read_page(state, targetSegmentPtr, XLOG_BLCKSZ,
+ state->readBuf, &state->readPageTLI);
+
+ if (readLen < 0)
+ goto err;
+
+ Assert(readLen <= XLOG_BLCKSZ);
+
+ /* we can be sure to have enough WAL available, we scrolled back */
+ Assert(readLen == XLOG_BLCKSZ);
+
+ hdr = (XLogPageHeader) state->readBuf;
+
+ if (!ValidXLogPageHeader(state, targetSegmentPtr, hdr))
+ goto err;
+ }
+
+ /* now read the target data */
+ readLen = state->read_page(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
+ state->readBuf, &state->readPageTLI);
+ if (readLen < 0)
+ goto err;
+
+ Assert(readLen <= XLOG_BLCKSZ);
+
+ /* check we have enough data to determine the actual length of the page header */
+ if (readLen <= SizeOfXLogShortPHD)
+ goto err;
+
+ Assert(readLen >= reqLen);
+
+ hdr = (XLogPageHeader) state->readBuf;
+
+ /* still not enough */
+ if (readLen < XLogPageHeaderSize(hdr))
+ {
+ readLen = state->read_page(state, pageptr, XLogPageHeaderSize(hdr),
+ state->readBuf, &state->readPageTLI);
+ if (readLen < 0)
+ goto err;
+ }
+
+ if (!ValidXLogPageHeader(state, pageptr, hdr))
+ goto err;
+
+ /* update cache information */
+ state->readSegNo = targetSegNo;
+ state->readOff = targetPageOff;
+ state->readLen = readLen;
+
+ return readLen;
+err:
+ state->readSegNo = 0;
+ state->readOff = 0;
+ state->readLen = 0;
+ return -1;
+}
+
+/*
+ * Validate an XLOG record header.
+ *
+ * This is just a convenience subroutine to avoid duplicated code in
+ * XLogReadRecord. It's not intended for use from anywhere else.
+ */
+static bool
+ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
+ XLogRecPtr PrevRecPtr, XLogRecord *record,
+ bool randAccess)
+{
+ /*
+ * xl_len == 0 is bad data for everything except XLOG SWITCH, where it is
+ * required.
+ */
+ if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
+ {
+ if (record->xl_len != 0)
+ {
+ report_invalid_record(state,
+ "invalid xlog switch record at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr);
+ return false;
+ }
+ }
+ else if (record->xl_len == 0)
+ {
+ report_invalid_record(state,
+ "record with zero length at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr);
+ return false;
+ }
+ if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
+ record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
+ XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
+ {
+ report_invalid_record(state,
+ "invalid record length at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr);
+ return false;
+ }
+ if (record->xl_rmid > RM_MAX_ID)
+ {
+ report_invalid_record(state,
+ "invalid resource manager ID %u at %X/%X",
+ record->xl_rmid, (uint32) (RecPtr >> 32),
+ (uint32) RecPtr);
+ return false;
+ }
+ if (randAccess)
+ {
+ /*
+ * We can't exactly verify the prev-link, but surely it should be less
+ * than the record's own address.
+ */
+ if (!XLByteLT(record->xl_prev, RecPtr))
+ {
+ report_invalid_record(state,
+ "record with incorrect prev-link %X/%X at %X/%X",
+ (uint32) (record->xl_prev >> 32),
+ (uint32) record->xl_prev,
+ (uint32) (RecPtr >> 32), (uint32) RecPtr);
+ return false;
+ }
+ }
+ else
+ {
+ /*
+ * Record's prev-link should exactly match our previous location. This
+ * check guards against torn WAL pages where a stale but valid-looking
+ * WAL record starts on a sector boundary.
+ */
+ if (!XLByteEQ(record->xl_prev, PrevRecPtr))
+ {
+ report_invalid_record(state,
+ "record with incorrect prev-link %X/%X at %X/%X",
+ (uint32) (record->xl_prev >> 32),
+ (uint32) record->xl_prev,
+ (uint32) (RecPtr >> 32), (uint32) RecPtr);
+ return false;
+ }
+ }
+
+ return true;
+}
+
+
+/*
+ * CRC-check an XLOG record. We do not believe the contents of an XLOG
+ * record (other than to the minimal extent of computing the amount of
+ * data to read in) until we've checked the CRCs.
+ *
+ * We assume all of the record (that is, xl_tot_len bytes) has been read
+ * into memory at *record. Also, ValidXLogRecordHeader() has accepted the
+ * record's header, which means in particular that xl_tot_len is at least
+ * SizeOfXLogRecord, so it is safe to fetch xl_len.
+ */
+static bool
+ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr)
+{
+ pg_crc32 crc;
+ int i;
+ uint32 len = record->xl_len;
+ BkpBlock bkpb;
+ char *blk;
+ size_t remaining = record->xl_tot_len;
+
+ /* First the rmgr data */
+ if (remaining < SizeOfXLogRecord + len)
+ {
+ /* ValidXLogRecordHeader() should've caught this already... */
+ report_invalid_record(state, "invalid record length at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr);
+ return false;
+ }
+ remaining -= SizeOfXLogRecord + len;
+ INIT_CRC32(crc);
+ COMP_CRC32(crc, XLogRecGetData(record), len);
+
+ /* Add in the backup blocks, if any */
+ blk = (char *) XLogRecGetData(record) + len;
+ for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
+ {
+ uint32 blen;
+
+ if (!(record->xl_info & XLR_BKP_BLOCK(i)))
+ continue;
+
+ if (remaining < sizeof(BkpBlock))
+ {
+ report_invalid_record(state,
+ "invalid backup block size in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr);
+ return false;
+ }
+ memcpy(&bkpb, blk, sizeof(BkpBlock));
+
+ if (bkpb.hole_offset + bkpb.hole_length > BLCKSZ)
+ {
+ report_invalid_record(state,
+ "incorrect hole size in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr);
+ return false;
+ }
+ blen = sizeof(BkpBlock) + BLCKSZ - bkpb.hole_length;
+
+ if (remaining < blen)
+ {
+ report_invalid_record(state,
+ "invalid backup block size in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr);
+ return false;
+ }
+ remaining -= blen;
+ COMP_CRC32(crc, blk, blen);
+ blk += blen;
+ }
+
+ /* Check that xl_tot_len agrees with our calculation */
+ if (remaining != 0)
+ {
+ report_invalid_record(state,
+ "incorrect total length in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr);
+ return false;
+ }
+
+ /* Finally include the record header */
+ COMP_CRC32(crc, (char *) record, offsetof(XLogRecord, xl_crc));
+ FIN_CRC32(crc);
+
+ if (!EQ_CRC32(record->xl_crc, crc))
+ {
+ report_invalid_record(state,
+ "incorrect resource manager data checksum in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr);
+ return false;
+ }
+
+ return true;
+}
+
+static bool
+ValidXLogPageHeader(XLogReaderState *state, XLogRecPtr recptr,
+ XLogPageHeader hdr)
+{
+ XLogRecPtr recaddr;
+ XLogSegNo segno;
+ int32 offset;
+
+ Assert((recptr % XLOG_BLCKSZ) == 0);
+
+ XLByteToSeg(recptr, segno);
+ offset = recptr % XLogSegSize;
+
+ XLogSegNoOffsetToRecPtr(segno, offset, recaddr);
+
+ if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
+ {
+ char fname[MAXFNAMELEN];
+
+ XLogFileName(fname, state->readPageTLI, segno);
+
+ report_invalid_record(state,
+ "invalid magic number %04X in log segment %s, offset %u",
+ hdr->xlp_magic,
+ fname,
+ offset);
+ return false;
+ }
+
+ if ((hdr->xlp_info & ~XLP_ALL_FLAGS) != 0)
+ {
+ char fname[MAXFNAMELEN];
+
+ XLogFileName(fname, state->readPageTLI, segno);
+
+ report_invalid_record(state,
+ "invalid info bits %04X in log segment %s, offset %u",
+ hdr->xlp_info,
+ fname,
+ offset);
+ return false;
+ }
+
+ if (hdr->xlp_info & XLP_LONG_HEADER)
+ {
+ XLogLongPageHeader longhdr = (XLogLongPageHeader) hdr;
+
+ if (state->system_identifier &&
+ longhdr->xlp_sysid != state->system_identifier)
+ {
+ char fhdrident_str[32];
+ char sysident_str[32];
+
+ /*
+ * Format sysids separately to keep platform-dependent format code
+ * out of the translatable message string.
+ */
+ snprintf(fhdrident_str, sizeof(fhdrident_str), UINT64_FORMAT,
+ longhdr->xlp_sysid);
+ snprintf(sysident_str, sizeof(sysident_str), UINT64_FORMAT,
+ state->system_identifier);
+ report_invalid_record(state,
+ "WAL file is from different database system: WAL file database system identifier is %s, pg_control database system identifier is %s.",
+ fhdrident_str, sysident_str);
+ return false;
+ }
+ else if (longhdr->xlp_seg_size != XLogSegSize)
+ {
+ report_invalid_record(state,
+ "WAL file is from different database system: Incorrect XLOG_SEG_SIZE in page header.");
+ return false;
+ }
+ else if (longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ)
+ {
+ report_invalid_record(state,
+ "WAL file is from different database system: Incorrect XLOG_BLCKSZ in page header.");
+ return false;
+ }
+ }
+ else if (offset == 0)
+ {
+ char fname[MAXFNAMELEN];
+
+ XLogFileName(fname, state->readPageTLI, segno);
+
+ /* hmm, first page of file doesn't have a long header? */
+ report_invalid_record(state,
+ "invalid info bits %04X in log segment %s, offset %u",
+ hdr->xlp_info,
+ fname,
+ offset);
+ return false;
+ }
+
+ if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
+ {
+ char fname[MAXFNAMELEN];
+
+ XLogFileName(fname, state->readPageTLI, segno);
+
+ report_invalid_record(state,
+ "unexpected pageaddr %X/%X in log segment %s, offset %u",
+ (uint32) (hdr->xlp_pageaddr >> 32), (uint32) hdr->xlp_pageaddr,
+ fname,
+ offset);
+ return false;
+ }
+
+ /*
+ * Check page TLI is one of the expected values.
+ */
+ if (state->expectedTLEs != NIL &&
+ !tliInHistory(hdr->xlp_tli, state->expectedTLEs))
+ {
+ char fname[MAXFNAMELEN];
+
+ XLogFileName(fname, state->readPageTLI, segno);
+
+ report_invalid_record(state,
+ "unexpected timeline ID %u in log segment %s, offset %u",
+ hdr->xlp_tli,
+ fname,
+ offset);
+ return false;
+ }
+
+ /*
+ * Since child timelines are always assigned a TLI greater than their
+ * immediate parent's TLI, we should never see TLI go backwards across
+ * successive pages of a consistent WAL sequence.
+ *
+ * Of course this check should only be applied when advancing sequentially
+ * across pages; therefore ReadRecord resets lastPageTLI and lastSegmentTLI
+ * to zero when going to a random page. FIXME
+ *
+ * Sometimes we re-read a segment that's already been (partially) read. So
+ * we only verify TLIs for pages that are later than the last remembered
+ * LSN.
+ *
+ * XXX: This is slightly less precise than the check we did in earlier
+ * times. I don't see a problem with that though.
+ */
+ if (state->latestReadPtr < recptr)
+ {
+ if (hdr->xlp_tli < state->latestReadTLI)
+ {
+ char fname[MAXFNAMELEN];
+
+ XLogFileName(fname, state->readPageTLI, segno);
+
+ report_invalid_record(state,
+ "out-of-sequence timeline ID %u (after %u) in log segment %s, offset %u",
+ hdr->xlp_tli,
+ state->latestReadTLI,
+ fname,
+ offset);
+ return false;
+ }
+ state->latestReadPtr = recptr;
+ state->latestReadTLI = hdr->xlp_tli;
+ }
+ return true;
+}
diff --git a/src/backend/nls.mk b/src/backend/nls.mk
index 30f6a2b..0598e8f 100644
--- a/src/backend/nls.mk
+++ b/src/backend/nls.mk
@@ -4,12 +4,13 @@ AVAIL_LANGUAGES = de es fr ja pt_BR tr zh_CN zh_TW
GETTEXT_FILES = + gettext-files
GETTEXT_TRIGGERS = $(BACKEND_COMMON_GETTEXT_TRIGGERS) \
GUC_check_errmsg GUC_check_errdetail GUC_check_errhint \
- write_stderr yyerror parser_yyerror
+ write_stderr yyerror parser_yyerror report_invalid_record
GETTEXT_FLAGS = $(BACKEND_COMMON_GETTEXT_FLAGS) \
GUC_check_errmsg:1:c-format \
GUC_check_errdetail:1:c-format \
GUC_check_errhint:1:c-format \
- write_stderr:1:c-format
+ write_stderr:1:c-format \
+ report_invalid_record:2:c-format
gettext-files: distprep
find $(srcdir)/ $(srcdir)/../port/ -name '*.c' -print | LC_ALL=C sort >$@
diff --git a/src/bin/Makefile b/src/bin/Makefile
index b4dfdba..86dace0 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -14,7 +14,7 @@ top_builddir = ../..
include $(top_builddir)/src/Makefile.global
SUBDIRS = initdb pg_ctl pg_dump \
- psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup
+ psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup pg_xlogdump
ifeq ($(PORTNAME), win32)
SUBDIRS += pgevent
diff --git a/src/bin/pg_xlogdump/Makefile b/src/bin/pg_xlogdump/Makefile
new file mode 100644
index 0000000..ba126e9
--- /dev/null
+++ b/src/bin/pg_xlogdump/Makefile
@@ -0,0 +1,55 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_xlogdump
+#
+# Copyright (c) 1998-2012, PostgreSQL Global Development Group
+#
+# src/bin/pg_xlogdump/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_xlogdump"
+PGAPPICON=win32
+
+subdir = src/bin/pg_xlogdump
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -DFRONTEND $(CPPFLAGS)
+
+all: pg_xlogdump
+
+xlogreader.c: % : $(top_srcdir)/src/backend/access/transam/%
+ rm -f $@ && $(LN_S) $< .
+
+assert.c: % : $(top_srcdir)/src/backend/utils/error/%
+ rm -f $@ && $(LN_S) $< .
+
+rmgrdescfiles = clogdesc.c dbasedesc.c gindesc.c gistdesc.c hashdesc.c \
+ heapdesc.c mxactdesc.c nbtdesc.c relmapdesc.c seqdesc.c smgrdesc.c \
+ spgdesc.c standbydesc.c tblspcdesc.c xactdesc.c xlogdesc.c
+
+$(rmgrdescfiles): % : $(top_srcdir)/src/backend/access/rmgrdesc/%
+ rm -f $@ && $(LN_S) $< .
+
+OBJS = \
+ clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+ mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
+ standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o assert.o \
+ $(WIN32RES) \
+ pg_xlogdump.o stdout_strinfo.o compat.o tables.o xlogreader.o \
+
+pg_xlogdump: $(OBJS) | submake-libpgport
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) $(libpq_pgport) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_xlogdump$(X) '$(DESTDIR)$(bindir)/pg_xlogdump$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_xlogdump$(X)'
+
+clean distclean maintainer-clean:
+ rm -f $(OBJS) pg_xlogdump xlogreader.c assert.c $(rmgrdescfiles)
diff --git a/src/bin/pg_xlogdump/compat.c b/src/bin/pg_xlogdump/compat.c
new file mode 100644
index 0000000..dd36e55
--- /dev/null
+++ b/src/bin/pg_xlogdump/compat.c
@@ -0,0 +1,96 @@
+/*-------------------------------------------------------------------------
+ *
+ * compat.c
+ * Support functions for pg_xlogdump.c
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_xlogdump/compat.c
+ *
+ * This file contains client-side implementations for various backend
+ * functions that the rm_desc functions in *desc.c files rely on.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* ugly hack, same as in e.g. pg_controldata */
+#define FRONTEND 1
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "catalog/catalog.h"
+#include "datatype/timestamp.h"
+#include "storage/relfilenode.h"
+#include "utils/timestamp.h"
+
+bool assert_enabled = false;
+
+/*
+ * Returns true if 'expectedTLEs' contains a timeline with id 'tli'
+ */
+bool
+tliInHistory(TimeLineID tli, List *expectedTLEs)
+{
+ ListCell *cell;
+
+ foreach(cell, expectedTLEs)
+ {
+ if (((TimeLineHistoryEntry *) lfirst(cell))->tli == tli)
+ return true;
+ }
+
+ return false;
+}
+
+void
+pfree(void *a)
+{
+}
+
+
+const char *
+timestamptz_to_str(TimestampTz t)
+{
+ return "";
+}
+
+char *
+relpathbackend(RelFileNode rnode, BackendId backend, ForkNumber forknum)
+{
+ return NULL;
+}
+
+/*
+ * Write errors to stderr (or by equivalent means when stderr is
+ * not available).
+ */
+void
+write_stderr(const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+#if !defined(WIN32) && !defined(__CYGWIN__)
+ /* On Unix, we just fprintf to stderr */
+ vfprintf(stderr, fmt, ap);
+#else
+
+ /*
+ * On Win32, we print to stderr if running on a console, or write to
+ * eventlog if running as a service
+ */
+ if (!isatty(fileno(stderr))) /* Running as a service */
+ {
+ char errbuf[2048]; /* Arbitrary size? */
+
+ vsnprintf(errbuf, sizeof(errbuf), fmt, ap);
+
+ write_eventlog(EVENTLOG_ERROR_TYPE, errbuf);
+ }
+ else
+ /* Not running as service, write to stderr */
+ vfprintf(stderr, fmt, ap);
+#endif
+ va_end(ap);
+}
diff --git a/src/bin/pg_xlogdump/nls.mk b/src/bin/pg_xlogdump/nls.mk
new file mode 100644
index 0000000..3a981f5
--- /dev/null
+++ b/src/bin/pg_xlogdump/nls.mk
@@ -0,0 +1,4 @@
+# src/bin/pg_xlogdump/nls.mk
+CATALOG_NAME = pg_xlogdump
+AVAIL_LANGUAGES =
+GETTEXT_FILES = pg_xlogdump.c
diff --git a/src/bin/pg_xlogdump/pg_xlogdump.c b/src/bin/pg_xlogdump/pg_xlogdump.c
new file mode 100644
index 0000000..6b24da6
--- /dev/null
+++ b/src/bin/pg_xlogdump/pg_xlogdump.c
@@ -0,0 +1,455 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_xlogdump.c - decode and display WAL
+ *
+ * Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_xlogdump/pg_xlogdump.c
+ *-------------------------------------------------------------------------
+ */
+
+/* ugly hack, same as in e.g. pg_controldata */
+#define FRONTEND 1
+#include "postgres.h"
+
+#include <unistd.h>
+#include <libgen.h>
+
+#include "access/xlogreader.h"
+#include "access/rmgr.h"
+#include "catalog/catalog.h"
+#include "pg_config_manual.h"
+#include "utils/elog.h"
+
+#include "getopt_long.h"
+
+static const char *progname;
+
+typedef struct XLogDumpPrivateData
+{
+ TimeLineID timeline;
+ char *outpath;
+ char *inpath;
+ char *file;
+ XLogRecPtr startptr;
+ XLogRecPtr endptr;
+
+ bool bkp_details;
+} XLogDumpPrivateData;
+
+static void fatal_error(const char *fmt, ...)
+__attribute__((format(PG_PRINTF_ATTRIBUTE, 1, 2)));
+
+static void fatal_error(const char *fmt, ...)
+{
+ va_list args;
+ fflush(stdout);
+
+ fprintf(stderr, "fatal_error: ");
+ va_start(args, fmt);
+ vfprintf(stderr, fmt, args);
+ va_end(args);
+ fprintf(stderr, "\n");
+ exit(1);
+}
+
+static void
+XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
+ XLogRecPtr startptr, char *buf, Size count);
+
+/* this should probably be put in a general implementation */
+static void
+XLogDumpXLogRead(const char *directory, TimeLineID timeline_id,
+ XLogRecPtr startptr, char *buf, Size count)
+{
+ char *p;
+ XLogRecPtr recptr;
+ Size nbytes;
+
+ static int sendFile = -1;
+ static XLogSegNo sendSegNo = 0;
+ static uint32 sendOff = 0;
+
+ p = buf;
+ recptr = startptr;
+ nbytes = count;
+
+ while (nbytes > 0)
+ {
+ uint32 startoff;
+ int segbytes;
+ int readbytes;
+
+ startoff = recptr % XLogSegSize;
+
+ if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
+ {
+ char fname[MAXFNAMELEN];
+ char fpath[MAXPGPATH];
+
+ /* Switch to another logfile segment */
+ if (sendFile >= 0)
+ close(sendFile);
+
+ XLByteToSeg(recptr, sendSegNo);
+
+ XLogFileName(fname, timeline_id, sendSegNo);
+
+ snprintf(fpath, MAXPGPATH, "%s/%s",
+ (directory == NULL) ? XLOGDIR : directory, fname);
+
+ sendFile = open(fpath, O_RDONLY, 0);
+ if (sendFile < 0)
+ {
+ /*
+ * If the file is not found, assume the requested WAL segment
+ * has already been removed or recycled.
+ */
+ if (errno == ENOENT)
+ fatal_error("requested WAL segment %s has already been removed",
+ fname);
+ else
+ fatal_error("could not open file \"%s\": %u",
+ fpath, errno);
+ }
+ sendOff = 0;
+ }
+
+ /* Need to seek in the file? */
+ if (sendOff != startoff)
+ {
+ if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
+ {
+ char fname[MAXPGPATH];
+ XLogFileName(fname, timeline_id, sendSegNo);
+
+ fatal_error("could not seek in log segment %s to offset %u: %d",
+ fname,
+ startoff,
+ errno);
+ }
+ sendOff = startoff;
+ }
+
+ /* How many bytes are within this segment? */
+ if (nbytes > (XLogSegSize - startoff))
+ segbytes = XLogSegSize - startoff;
+ else
+ segbytes = nbytes;
+
+ readbytes = read(sendFile, p, segbytes);
+ if (readbytes <= 0)
+ {
+ char fname[MAXPGPATH];
+ XLogFileName(fname, timeline_id, sendSegNo);
+
+ fatal_error("could not read from log segment %s, offset %u, length %lu: %d",
+ fname,
+ sendOff, (unsigned long) segbytes, errno);
+ }
+
+ /* Update state for read */
+ XLByteAdvance(recptr, readbytes);
+
+ sendOff += readbytes;
+ nbytes -= readbytes;
+ p += readbytes;
+ }
+}
+
+static int
+XLogDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
+ char *readBuff, TimeLineID *curFileTLI)
+{
+ XLogDumpPrivateData *private = state->private_data;
+ int count = XLOG_BLCKSZ;
+
+ if (private->endptr != InvalidXLogRecPtr)
+ {
+ if (targetPagePtr > private->endptr)
+ return -1;
+
+ if (targetPagePtr + reqLen > private->endptr)
+ count = private->endptr - targetPagePtr;
+ }
+
+ XLogDumpXLogRead(private->inpath, private->timeline, targetPagePtr,
+ readBuff, count);
+
+ return count;
+}
+
+static void
+XLogDumpDisplayRecord(XLogReaderState *state, XLogRecord *record)
+{
+ XLogDumpPrivateData *config = (XLogDumpPrivateData *)state->private_data;
+ const RmgrData *rmgr = &RmgrTable[record->xl_rmid];
+
+ fprintf(stdout, "xlog record: rmgr: %-11s, record_len: %6u, tot_len: %6u, tx: %10u, lsn: %X/%08X, prev %X/%08X, bkp: %u%u%u%u, desc:",
+ rmgr->rm_name,
+ record->xl_len, record->xl_tot_len,
+ record->xl_xid,
+ (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr,
+ (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
+ !!(XLR_BKP_BLOCK(0) & record->xl_info),
+ !!(XLR_BKP_BLOCK(1) & record->xl_info),
+ !!(XLR_BKP_BLOCK(2) & record->xl_info),
+ !!(XLR_BKP_BLOCK(3) & record->xl_info));
+
+ /* the desc routine will printf the description directly to stdout */
+ rmgr->rm_desc(NULL, record->xl_info, XLogRecGetData(record));
+
+ fprintf(stdout, "\n");
+
+ if (config->bkp_details)
+ {
+ int off;
+ char *blk = (char *) XLogRecGetData(record) + record->xl_len;
+
+ for (off = 0; off < XLR_MAX_BKP_BLOCKS; off++)
+ {
+ BkpBlock bkpb;
+
+ if (!(XLR_BKP_BLOCK(off) & record->xl_info))
+ continue;
+
+ memcpy(&bkpb, blk, sizeof(BkpBlock));
+ blk += sizeof(BkpBlock);
+
+ fprintf(stdout, "\tbackup bkp #%u; rel %u/%u/%u; fork: %s; block: %u; hole: offset: %u, length: %u\n",
+ off, bkpb.node.spcNode, bkpb.node.dbNode, bkpb.node.relNode,
+ forkNames[bkpb.fork], bkpb.block, bkpb.hole_offset, bkpb.hole_length);
+ }
+ }
+}
+
+static void
+usage(const char *progname)
+{
+ printf(_("%s: decodes and displays postgres transaction logs for debugging.\n\n"),
+ progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -b, --bkp-details output detailed information about backup blocks\n"));
+ printf(_(" -e, --end RECPTR read wal up to RECPTR\n"));
+ printf(_(" -f, --file FILE wal file to parse, cannot be specified together with -p\n"));
+ printf(_(" -h, --help show this help, then exit\n"));
+ printf(_(" -p, --path PATH directory to read log segment files from (default: ./pg_xlog)\n"));
+ printf(_(" -s, --start RECPTR read wal in directory indicated by -p starting at RECPTR\n"));
+ printf(_(" -t, --timeline TLI timeline to read log records from (default: 1)\n"));
+ printf(_(" -v, --version output version information, then exit\n"));
+}
+
+int
+main(int argc, char **argv)
+{
+ uint32 xlogid;
+ uint32 xrecoff;
+ XLogReaderState *xlogreader_state;
+ XLogDumpPrivateData private;
+ XLogRecord *record;
+
+ static struct option long_options[] = {
+ {"bkp-details", no_argument, NULL, 'b'},
+ {"end", required_argument, NULL, 'e'},
+ {"file", required_argument, NULL, 'f'},
+ {"help", no_argument, NULL, '?'},
+ {"path", required_argument, NULL, 'p'},
+ {"start", required_argument, NULL, 's'},
+ {"timeline", required_argument, NULL, 't'},
+ {"version", no_argument, NULL, 'V'},
+ {NULL, 0, NULL, 0}
+ };
+
+ int c;
+ int option_index;
+
+
+ set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_xlogdump"));
+
+ progname = get_progname(argv[0]);
+
+
+ memset(&private, 0, sizeof(XLogDumpPrivateData));
+
+ private.timeline = 1;
+ private.bkp_details = false;
+ private.startptr = InvalidXLogRecPtr;
+ private.endptr = InvalidXLogRecPtr;
+
+ if (argc <= 1)
+ {
+ fprintf(stderr, _("%s: no arguments specified\n"), progname);
+ goto bad_argument;
+ }
+
+ while ((c = getopt_long(argc, argv, "be:f:hp:s:t:V",
+ long_options, &option_index)) != -1)
+ {
+ switch (c)
+ {
+ case 'b':
+ private.bkp_details = true;
+ break;
+ case 'e':
+ if (sscanf(optarg, "%X/%X", &xlogid, &xrecoff) != 2)
+ {
+ fprintf(stderr, _("%s: couldn't parse -e %s\n"),
+ progname, optarg);
+ goto bad_argument;
+ }
+ else
+ private.endptr = (uint64)xlogid << 32 | xrecoff;
+ break;
+ case 'f':
+ private.file = strdup(optarg);
+ break;
+ case '?':
+ usage(progname);
+ exit(0);
+ break;
+ case 'p':
+ private.inpath = strdup(optarg);
+ break;
+ case 's':
+ if (sscanf(optarg, "%X/%X", &xlogid, &xrecoff) != 2)
+ {
+ fprintf(stderr, _("%s: couldn't parse -s %s\n"),
+ progname, optarg);
+ goto bad_argument;
+ }
+ else
+ private.startptr = (uint64)xlogid << 32 | xrecoff;
+ break;
+ case 't':
+ if (sscanf(optarg, "%d", &private.timeline) != 1)
+ {
+ fprintf(stderr, _("%s: couldn't parse timeline -t %s\n"),
+ progname, optarg);
+ goto bad_argument;
+ }
+ break;
+ case 'V':
+ printf("%s (PostgreSQL) %s\n", progname, PG_VERSION);
+ exit(0);
+ break;
+ default:
+ fprintf(stderr, _("%s: unknown argument -%c passed\n"),
+ progname, c);
+ goto bad_argument;
+ break;
+ }
+ }
+
+ /* some parameter was badly specified, don't output further errors */
+ if (optind < argc)
+ {
+ fprintf(stderr,
+ _("%s: too many command-line arguments (first is \"%s\")\n"),
+ progname, argv[optind]);
+ goto bad_argument;
+ }
+ else if (private.inpath != NULL && private.file != NULL)
+ {
+ fprintf(stderr,
+ _("%s: only one of -p or -f can be specified\n"),
+ progname);
+ goto bad_argument;
+ }
+ /* no file specified, but no range of interesting data either */
+ else if (private.file == NULL && XLByteEQ(private.startptr, InvalidXLogRecPtr))
+ {
+ fprintf(stderr,
+ _("%s: no start position (-s) given in range mode\n"),
+ progname);
+ goto bad_argument;
+ }
+ /* everything ok, do some more setup */
+ else
+ {
+ /* default value */
+ if (private.file == NULL && private.inpath == NULL)
+ private.inpath = "pg_xlog";
+
+ /* XXX: validate directory */
+
+ /* default value */
+ if (private.file != NULL)
+ {
+ XLogSegNo segno;
+
+ /* FIXME: can we rely on basename? */
+ XLogFromFileName(basename(private.file), &private.timeline, &segno);
+ private.inpath = strdup(dirname(private.file));
+
+ if (XLByteEQ(private.startptr, InvalidXLogRecPtr))
+ XLogSegNoOffsetToRecPtr(segno, 0, private.startptr);
+ else if (!XLByteInSeg(private.startptr, segno))
+ {
+ fprintf(stderr,
+ _("%s: -s does not lie inside file \"%s\"\n"),
+ progname,
+ private.file);
+ goto bad_argument;
+ }
+
+ if (XLByteEQ(private.endptr, InvalidXLogRecPtr))
+ XLogSegNoOffsetToRecPtr(segno + 1, 0, private.endptr);
+ else if (!XLByteInSeg(private.endptr, segno) &&
+ private.endptr != (segno + 1) * XLogSegSize)
+ {
+ fprintf(stderr,
+ _("%s: -e does not lie inside file \"%s\"\n"),
+ progname, private.file);
+ goto bad_argument;
+ }
+ }
+ }
+
+ /* we have everything we need, continue */
+ {
+ XLogRecPtr first_record;
+ char *errormsg;
+
+ xlogreader_state = XLogReaderAllocate(private.startptr,
+ XLogDumpReadPage,
+ &private);
+
+ /* first find a valid recptr to start from */
+ first_record = XLogFindNextRecord(xlogreader_state, private.startptr);
+
+ if (first_record == InvalidXLogRecPtr)
+ fatal_error("Could not find a valid record after %X/%X",
+ (uint32) (private.startptr >> 32),
+ (uint32) private.startptr);
+
+ /*
+ * Display a message that we're skipping data if startptr wasn't a
+ * pointer to the start of a record and also wasn't a pointer to the
+ * beginning of a segment (e.g. we were used in file mode).
+ */
+ if (first_record != private.startptr && (private.startptr % XLogSegSize) != 0)
+ fprintf(stdout, "first record is after %X/%X, at %X/%X, skipping over %u bytes\n",
+ (uint32) (private.startptr >> 32), (uint32) private.startptr,
+ (uint32) (first_record >> 32), (uint32) first_record,
+ (uint32) (first_record - private.startptr));
+
+ while ((record = XLogReadRecord(xlogreader_state, first_record, &errormsg)))
+ {
+ /* continue after the last record */
+ first_record = InvalidXLogRecPtr;
+ XLogDumpDisplayRecord(xlogreader_state, record);
+ }
+ if (errormsg)
+ fprintf(stderr, "error in WAL record: %s\n", errormsg);
+
+ XLogReaderFree(xlogreader_state);
+ }
+
+ return 0;
+bad_argument:
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+ exit(1);
+}
diff --git a/src/bin/pg_xlogdump/stdout_strinfo.c b/src/bin/pg_xlogdump/stdout_strinfo.c
new file mode 100644
index 0000000..07e6d84
--- /dev/null
+++ b/src/bin/pg_xlogdump/stdout_strinfo.c
@@ -0,0 +1,23 @@
+/*
+ * A client-side StringInfo implementation that just prints everything to
+ * stdout
+ */
+#include "postgres_fe.h"
+
+#include "lib/stringinfo.h"
+
+void
+appendStringInfo(StringInfo str, const char *fmt, ...)
+{
+ va_list args;
+
+ va_start(args, fmt);
+ vprintf(fmt, args);
+ va_end(args);
+}
+
+void
+appendStringInfoString(StringInfo str, const char *string)
+{
+ appendStringInfo(str, "%s", string);
+}
diff --git a/src/bin/pg_xlogdump/tables.c b/src/bin/pg_xlogdump/tables.c
new file mode 100644
index 0000000..b3a7dca
--- /dev/null
+++ b/src/bin/pg_xlogdump/tables.c
@@ -0,0 +1,78 @@
+/*-------------------------------------------------------------------------
+ *
+ * tables.c
+ * Support data for pg_xlogdump.c
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_xlogdump/tables.c
+ *
+ * NOTES
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * rmgr.c
+ *
+ * Resource managers definition
+ *
+ * src/backend/access/transam/rmgr.c
+ */
+#include "postgres.h"
+
+#include "access/clog.h"
+#include "access/gin.h"
+#include "access/gist_private.h"
+#include "access/hash.h"
+#include "access/heapam_xlog.h"
+#include "access/multixact.h"
+#include "access/nbtree.h"
+#include "access/spgist.h"
+#include "access/xact.h"
+#include "access/xlog_internal.h"
+#include "catalog/storage_xlog.h"
+#include "commands/dbcommands.h"
+#include "commands/sequence.h"
+#include "commands/tablespace.h"
+#include "storage/standby.h"
+#include "utils/relmapper.h"
+#include "catalog/catalog.h"
+
+/*
+ * Table of fork names.
+ *
+ * needs to be synced with src/backend/catalog/catalog.c
+ */
+const char *forkNames[] = {
+ "main", /* MAIN_FORKNUM */
+ "fsm", /* FSM_FORKNUM */
+ "vm", /* VISIBILITYMAP_FORKNUM */
+ "init" /* INIT_FORKNUM */
+};
+
+/*
+ * RmgrTable linked only to functions available outside of the backend.
+ *
+ * needs to be synced with src/backend/access/transam/rmgr.c
+ */
+const RmgrData RmgrTable[RM_MAX_ID + 1] = {
+ {"XLOG", NULL, xlog_desc, NULL, NULL, NULL},
+ {"Transaction", NULL, xact_desc, NULL, NULL, NULL},
+ {"Storage", NULL, smgr_desc, NULL, NULL, NULL},
+ {"CLOG", NULL, clog_desc, NULL, NULL, NULL},
+ {"Database", NULL, dbase_desc, NULL, NULL, NULL},
+ {"Tablespace", NULL, tblspc_desc, NULL, NULL, NULL},
+ {"MultiXact", NULL, multixact_desc, NULL, NULL, NULL},
+ {"RelMap", NULL, relmap_desc, NULL, NULL, NULL},
+ {"Standby", NULL, standby_desc, NULL, NULL, NULL},
+ {"Heap2", NULL, heap2_desc, NULL, NULL, NULL},
+ {"Heap", NULL, heap_desc, NULL, NULL, NULL},
+ {"Btree", NULL, btree_desc, NULL, NULL, NULL},
+ {"Hash", NULL, hash_desc, NULL, NULL, NULL},
+ {"Gin", NULL, gin_desc, NULL, NULL, NULL},
+ {"Gist", NULL, gist_desc, NULL, NULL, NULL},
+ {"Sequence", NULL, seq_desc, NULL, NULL, NULL},
+ {"SPGist", NULL, spg_desc, NULL, NULL, NULL}
+};
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
new file mode 100644
index 0000000..6a1c060
--- /dev/null
+++ b/src/include/access/xlogreader.h
@@ -0,0 +1,136 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogreader.h
+ *
+ * Generic xlog reading facility.
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/xlogreader.h
+ *
+ * NOTES
+ * Check the definition of the XLogReaderState struct for instructions on
+ * how to use the XLogReader infrastructure.
+ *
+ * The basic idea is to allocate an XLogReaderState via
+ * XLogReaderAllocate, and call XLogReadRecord() until it returns NULL.
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGREADER_H
+#define XLOGREADER_H
+
+#include "access/xlog_internal.h"
+#include "nodes/pg_list.h"
+
+struct XLogReaderState;
+
+/*
+ * The callbacks are explained in more detail inside the XLogReaderState
+ * struct.
+ */
+
+typedef int (*XLogPageReadCB) (struct XLogReaderState *state,
+ XLogRecPtr pageptr,
+ int reqLen,
+ char *readBuf,
+ TimeLineID *pageTLI);
+
+typedef struct XLogReaderState
+{
+ /* ----------------------------------------
+ * Public parameters
+ * ----------------------------------------
+ */
+
+ /*
+ * Data input callback (mandatory).
+ *
+ * This callback shall read the xlog page (of size XLOG_BLCKSZ) in which
+ * RecPtr resides. All data <= RecPtr must be visible. The callback shall
+ * return the number of valid bytes read, or -1 upon failure.
+ *
+ * *pageTLI should be set to the TLI of the file the page was read
+ * from. It is currently used only for error reporting purposes, to
+ * reconstruct the name of the WAL file where an error occurred.
+ */
+ XLogPageReadCB read_page;
+
+ /*
+ * System identifier of the xlog files we're about to read.
+ *
+ * Set to zero (the default value) if unknown or unimportant.
+ */
+ uint64 system_identifier;
+
+ /*
+ * List of acceptable TLIs.
+ *
+ * Set to NIL (the default value) if this should not be checked.
+ */
+ List *expectedTLEs;
+
+ /*
+ * Opaque data for callbacks to use. Not used by XLogReader.
+ */
+ void *private_data;
+
+ /*
+ * From where to where are we reading
+ */
+ XLogRecPtr ReadRecPtr; /* start of last record read */
+ XLogRecPtr EndRecPtr; /* end+1 of last record read */
+
+ /* ----------------------------------------
+ * private/internal state
+ * ----------------------------------------
+ */
+
+ /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
+ char *readBuf;
+
+ /* last read segment, segment offset, read length, TLI */
+ XLogSegNo readSegNo;
+ uint32 readOff;
+ uint32 readLen;
+ TimeLineID readPageTLI;
+
+ /* Highest TLI we have read so far */
+ TimeLineID latestReadTLI;
+ XLogRecPtr latestReadPtr;
+
+ /* Buffer for current ReadRecord result (expandable) */
+ char *readRecordBuf;
+ uint32 readRecordBufSize;
+
+ /* Buffer to hold error message */
+ char *errormsg_buf;
+} XLogReaderState;
+
+/*
+ * Get a new XLogReader
+ *
+ * The read_page callback and the start point have to be supplied before
+ * the reader can be used.
+ */
+extern XLogReaderState *XLogReaderAllocate(XLogRecPtr startpoint,
+ XLogPageReadCB pagereadfunc, void *private_data);
+
+/*
+ * Free an XLogReader
+ */
+extern void XLogReaderFree(XLogReaderState *state);
+
+/*
+ * Read the next record from xlog. Returns NULL on end-of-WAL or on failure.
+ */
+extern XLogRecord *XLogReadRecord(XLogReaderState *state, XLogRecPtr ptr,
+ char **errormsg);
+
+/*
+ * Find the address of the next record with an lsn >= RecPtr.
+ */
+extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
+
+#endif /* XLOGREADER_H */
On 2012-12-11 15:55:35 +0200, Heikki Linnakangas wrote:
I've been molding this patch for a while now, here's what I have this far
(also available in my git repository).
On a very quick look this looks good. I will try to rebase the decoding stuff
and read a bit around in the course of that...
The biggest change is in the error reporting. A stand-alone program that
wants to use xlogreader.c no longer has to provide a full-blown replacement
for ereport(). The only thing that xlogreader.c used ereport() for was when it
encounters an invalid record. And even there we had the
emode_for_corrupt_record hack. I think it's a much better API that
XLogReadRecord just returns NULL on an invalid record, and an error string,
and the caller can do what it wants with that. In xlog.c, we'll pass the
error string to ereport(), with the right emode as determined by
emode_for_corrupt_record. xlog.c is no longer concerned with
emode_for_corrupt_record, or error levels in general.
We talked about this earlier, and Tom Lane argued that "it's basically
insane to imagine that you can carve out a non-trivial piece of the backend
that doesn't contain any elog calls."
(http://archives.postgresql.org/pgsql-hackers/2012-09/msg00651.php), but
having done just that, it doesn't seem insane to me. xlogreader.c really is
a pretty well contained piece of code. All the complicated stuff that
contains elog calls and pallocs and more is in the callback, which can
freely use all the normal backend infrastructure.
This is pretty good. I was a bit afraid of making this change - thus my
really ugly emode callback hack - but this is way better.
Now, here's some stuff that still need to be done:
* A stand-alone program using xlogreader.c has to provide an implementation
of tliInHistory(). Need to find a better way to do that. Perhaps "#ifndef
FRONTEND" the tliInHistory checks in xlogreader.
We could just leave it in xlogreader in the first place. Having an
#ifdef'ed out version in there seems to be schizophrenic to me, all the
maintenance overhead, none of the fun...
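The suggestion above (keep tliInHistory in xlogreader.c and compile the backend-only check out for frontend builds) can be sketched roughly like this. The function and the concrete check are hypothetical stand-ins, not the real xlogreader code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t FakeTimeLineID;

/*
 * Hypothetical sketch: frontend tools (which define FRONTEND, like
 * pg_xlogdump) have no timeline history to consult, so the check is
 * compiled out and every TLI is accepted; a backend build consults
 * the history (here replaced by a trivial stand-in range check).
 */
static bool
fake_tli_in_history(FakeTimeLineID tli)
{
#ifdef FRONTEND
	/* Frontend build: no expectedTLEs list, accept any timeline. */
	(void) tli;
	return true;
#else
	/* Backend build: would consult expectedTLEs; stand-in check. */
	return tli >= 1 && tli <= 3;
#endif
}
```

This keeps a single copy of the function in xlogreader.c, at the cost of one #ifdef, instead of duplicating it into every frontend user's compat.c.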
* In xlog.c, some of the variables that used to be statics like
readFile, readOff etc. are now in the XLogPageReadPrivate struct. But
there's still plenty of statics left in there - it would certainly not
work correctly if xlog.c tried to open two xlog files at the same
time. I think it's just confusing to have some stuff in the
XLogPageReadPrivate struct, and others as static, so I think we should
get rid of XLogPageReadPrivate struct altogether and put back the
static variables. At least it would make the diff smaller, which might
help with reviewing. xlog.c probably doesn't need to provide a
"private" struct to xlogreader.c at all, which is okay.
Fine with me. I find the grouping that the struct provides somewhat
helpful when reading the code, but its more than offset by duplicating
some of the variables.
The reasons to have it are fewer now than when you introduced the
struct - xlogreader does more now, so less needs to be handled outside.
* It's pretty ugly that to use the rm_desc functions, you have to provide
dummy implementations of a bunch of backend functions, including pfree() and
timestamptz_to_str(). Should find a better way to do that.
I think most of the cases requiring those ugly hacks can be fixed to
just use a caller-provided buffer, there's not that much left.
timestamptz_to_str() is probably the most complex case. I just noticed
there's already a second implementation in
ecpg/pgtypeslib/dt_common.c. Yuck. It seems to already have diverged in
a number of cases :(
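The caller-provided-buffer idea can be sketched as follows; the helper name and message format are hypothetical, but the shape matches what the desc routines would look like without palloc()/pfree():

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical desc-style helper: instead of returning a palloc()'d
 * string (which forces frontend code to stub out pfree()), the caller
 * supplies the storage and its size, and gets the buffer back filled.
 */
static const char *
fake_describe_block(char *buf, size_t buflen,
					unsigned relnode, unsigned blkno)
{
	snprintf(buf, buflen, "block %u of relation %u", blkno, relnode);
	return buf;
}
```

Since the caller owns the memory, the same code works in the backend (stack or StringInfo storage) and in a stand-alone tool, with no allocator shims at all.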
* It's not clear to me how we'd handle translating the strings in
xlogreader.c, when xlogreader.c is used in a stand-alone program like
pg_xlogdump. Maybe we can just punt on that...
I personally would have no problem with that. It's probably either going
to be used during in-depth-debugging or when developing pg. In both
cases English seems to be fine. But I seldom use translated programs
so maybe I am not the right person to ask.
* How about we move pg_xlogdump to contrib? It doesn't feel like the kind of
essential tool that deserves to be in src/bin.
contrib would be fine, but I think src/bin is better. There have been
quite a few bugs by now where it would have been useful to have a
reliable xlogdump in core, so it's actually installed.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2012-12-11 15:44:39 +0100, Andres Freund wrote:
On 2012-12-11 15:55:35 +0200, Heikki Linnakangas wrote:
* It's pretty ugly that to use the rm_desc functions, you have to provide
dummy implementations of a bunch of backend functions, including pfree() and
timestamptz_to_str(). Should find a better way to do that.I think most of the cases requiring those ugly hacks can be fixed to
just use a caller-provided buffer, there's not that much left.timestamptz_to_str() is probably the most complex case. I just noticed
there's already a second implementation in
ecpg/pgtypeslib/dt_common.c. Yuck. It seems to already have diverged in
a number of cases :(
The attached (and pushed) patches change relpathbackend to use a static buffer
instead. That gets rid of the pfree() requirement and looks ok otherwise as
well.
Unfortunately that still leaves us with the need to re-implement
relpathbackend() in xlogdump, but that seems somewhat ok to me.
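The static-buffer approach the patch takes can be illustrated with a hypothetical stand-in (the function name, buffer size, and path format below are illustrative, not the real relpathbackend()):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FAKE_MAXPGPATH 64

/*
 * Hypothetical stand-in for the new static-buffer relpathbackend():
 * the result points into function-local static storage, so callers
 * need no pfree() -- but every call overwrites the previous result.
 */
static const char *
fake_relpath(unsigned dbnode, unsigned relnode)
{
	static char path[FAKE_MAXPGPATH];

	snprintf(path, sizeof(path), "base/%u/%u", dbnode, relnode);
	return path;
}
```

The caveat is the one the const return type is meant to flag: a caller that needs two paths at once, or keeps a result across another call, must copy the string first (strdup() in frontend code, pstrdup() in the backend).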
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2012-12-11 16:43:12 +0100, Andres Freund wrote:
On 2012-12-11 15:44:39 +0100, Andres Freund wrote:
On 2012-12-11 15:55:35 +0200, Heikki Linnakangas wrote:
* It's pretty ugly that to use the rm_desc functions, you have to provide
dummy implementations of a bunch of backend functions, including pfree() and
timestamptz_to_str(). Should find a better way to do that.
I think most of the cases requiring those ugly hacks can be fixed to
just use a caller-provided buffer, there's not that much left.
timestamptz_to_str() is probably the most complex case. I just noticed
there's already a second implementation in
ecpg/pgtypeslib/dt_common.c. Yuck. It seems to already have diverged in
a number of cases :(
The attached (and pushed) patches change relpathbackend to use a static buffer
instead. That gets rid of the pfree() requirement and looks ok otherwise as
well.
... really attached.
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001-Make-relpathbackend-return-a-statically-result-inste.patch (text/x-patch; charset=us-ascii)
>From 1ec1700605bc8e69e2df1ff08d1359d90c9a1ca3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 11 Dec 2012 16:36:06 +0100
Subject: [PATCH 1/2] Make relpathbackend return a statically result instead
of palloc()'ing it
relpathbackend() (via some of its wrappers) is used in *_desc routines which we
want to be useable without a backend environment arround.
Change signature to return a 'const char *' to make misuse easier to
detect. That necessicates also changing the 'FileName' typedef to 'const char
*' which seems to be a good idea anyway.
---
src/backend/access/rmgrdesc/smgrdesc.c | 6 ++--
src/backend/access/rmgrdesc/xactdesc.c | 6 ++--
src/backend/access/transam/xlogutils.c | 9 ++----
src/backend/catalog/catalog.c | 48 ++++++++++----------------------
src/backend/storage/buffer/bufmgr.c | 12 +++-----
src/backend/storage/smgr/md.c | 23 +++++----------
src/backend/utils/adt/dbsize.c | 4 +--
src/include/catalog/catalog.h | 2 +-
src/include/storage/fd.h | 2 +-
9 files changed, 37 insertions(+), 75 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 40b9708..da53a48 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -26,19 +26,17 @@ smgr_desc(StringInfo buf, uint8 xl_info, char *rec)
if (info == XLOG_SMGR_CREATE)
{
xl_smgr_create *xlrec = (xl_smgr_create *) rec;
- char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+ const char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
appendStringInfo(buf, "file create: %s", path);
- pfree(path);
}
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) rec;
- char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+ const char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
appendStringInfo(buf, "file truncate: %s to %u blocks", path,
xlrec->blkno);
- pfree(path);
}
else
appendStringInfo(buf, "UNKNOWN");
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 60deddc..7cad3e5 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -35,10 +35,9 @@ xact_desc_commit(StringInfo buf, xl_xact_commit *xlrec)
appendStringInfo(buf, "; rels:");
for (i = 0; i < xlrec->nrels; i++)
{
- char *path = relpathperm(xlrec->xnodes[i], MAIN_FORKNUM);
+ const char *path = relpathperm(xlrec->xnodes[i], MAIN_FORKNUM);
appendStringInfo(buf, " %s", path);
- pfree(path);
}
}
if (xlrec->nsubxacts > 0)
@@ -105,10 +104,9 @@ xact_desc_abort(StringInfo buf, xl_xact_abort *xlrec)
appendStringInfo(buf, "; rels:");
for (i = 0; i < xlrec->nrels; i++)
{
- char *path = relpathperm(xlrec->xnodes[i], MAIN_FORKNUM);
+ const char *path = relpathperm(xlrec->xnodes[i], MAIN_FORKNUM);
appendStringInfo(buf, " %s", path);
- pfree(path);
}
}
if (xlrec->nsubxacts > 0)
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5676120..43c7254 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -57,7 +57,7 @@ static void
report_invalid_page(int elevel, RelFileNode node, ForkNumber forkno,
BlockNumber blkno, bool present)
{
- char *path = relpathperm(node, forkno);
+ const char *path = relpathperm(node, forkno);
if (present)
elog(elevel, "page %u of relation %s is uninitialized",
@@ -65,7 +65,6 @@ report_invalid_page(int elevel, RelFileNode node, ForkNumber forkno,
else
elog(elevel, "page %u of relation %s does not exist",
blkno, path);
- pfree(path);
}
/* Log a reference to an invalid page */
@@ -153,11 +152,10 @@ forget_invalid_pages(RelFileNode node, ForkNumber forkno, BlockNumber minblkno)
{
if (log_min_messages <= DEBUG2 || client_min_messages <= DEBUG2)
{
- char *path = relpathperm(hentry->key.node, forkno);
+ const char *path = relpathperm(hentry->key.node, forkno);
elog(DEBUG2, "page %u of relation %s has been dropped",
hentry->key.blkno, path);
- pfree(path);
}
if (hash_search(invalid_page_tab,
@@ -186,11 +184,10 @@ forget_invalid_pages_db(Oid dbid)
{
if (log_min_messages <= DEBUG2 || client_min_messages <= DEBUG2)
{
- char *path = relpathperm(hentry->key.node, hentry->key.forkno);
+ const char *path = relpathperm(hentry->key.node, hentry->key.forkno);
elog(DEBUG2, "page %u of relation %s has been dropped",
hentry->key.blkno, path);
- pfree(path);
}
if (hash_search(invalid_page_tab,
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 79b71b3..21c512e 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -112,54 +112,45 @@ forkname_chars(const char *str, ForkNumber *fork)
/*
* relpathbackend - construct path to a relation's file
*
- * Result is a palloc'd string.
+ * Result is a pointer to a statically allocated string.
*/
-char *
+const char *
relpathbackend(RelFileNode rnode, BackendId backend, ForkNumber forknum)
{
- int pathlen;
- char *path;
+ static char path[MAXPGPATH];
if (rnode.spcNode == GLOBALTABLESPACE_OID)
{
/* Shared system relations live in {datadir}/global */
Assert(rnode.dbNode == 0);
Assert(backend == InvalidBackendId);
- pathlen = 7 + OIDCHARS + 1 + FORKNAMECHARS + 1;
- path = (char *) palloc(pathlen);
if (forknum != MAIN_FORKNUM)
- snprintf(path, pathlen, "global/%u_%s",
+ snprintf(path, MAXPGPATH, "global/%u_%s",
rnode.relNode, forkNames[forknum]);
else
- snprintf(path, pathlen, "global/%u", rnode.relNode);
+ snprintf(path, MAXPGPATH, "global/%u", rnode.relNode);
}
else if (rnode.spcNode == DEFAULTTABLESPACE_OID)
{
/* The default tablespace is {datadir}/base */
if (backend == InvalidBackendId)
{
- pathlen = 5 + OIDCHARS + 1 + OIDCHARS + 1 + FORKNAMECHARS + 1;
- path = (char *) palloc(pathlen);
if (forknum != MAIN_FORKNUM)
- snprintf(path, pathlen, "base/%u/%u_%s",
+ snprintf(path, MAXPGPATH, "base/%u/%u_%s",
rnode.dbNode, rnode.relNode,
forkNames[forknum]);
else
- snprintf(path, pathlen, "base/%u/%u",
+ snprintf(path, MAXPGPATH, "base/%u/%u",
rnode.dbNode, rnode.relNode);
}
else
{
- /* OIDCHARS will suffice for an integer, too */
- pathlen = 5 + OIDCHARS + 2 + OIDCHARS + 1 + OIDCHARS + 1
- + FORKNAMECHARS + 1;
- path = (char *) palloc(pathlen);
if (forknum != MAIN_FORKNUM)
- snprintf(path, pathlen, "base/%u/t%d_%u_%s",
+ snprintf(path, MAXPGPATH, "base/%u/t%d_%u_%s",
rnode.dbNode, backend, rnode.relNode,
forkNames[forknum]);
else
- snprintf(path, pathlen, "base/%u/t%d_%u",
+ snprintf(path, MAXPGPATH, "base/%u/t%d_%u",
rnode.dbNode, backend, rnode.relNode);
}
}
@@ -168,38 +159,31 @@ relpathbackend(RelFileNode rnode, BackendId backend, ForkNumber forknum)
/* All other tablespaces are accessed via symlinks */
if (backend == InvalidBackendId)
{
- pathlen = 9 + 1 + OIDCHARS + 1
- + strlen(TABLESPACE_VERSION_DIRECTORY) + 1 + OIDCHARS + 1
- + OIDCHARS + 1 + FORKNAMECHARS + 1;
- path = (char *) palloc(pathlen);
if (forknum != MAIN_FORKNUM)
- snprintf(path, pathlen, "pg_tblspc/%u/%s/%u/%u_%s",
+ snprintf(path, MAXPGPATH, "pg_tblspc/%u/%s/%u/%u_%s",
rnode.spcNode, TABLESPACE_VERSION_DIRECTORY,
rnode.dbNode, rnode.relNode,
forkNames[forknum]);
else
- snprintf(path, pathlen, "pg_tblspc/%u/%s/%u/%u",
+ snprintf(path, MAXPGPATH, "pg_tblspc/%u/%s/%u/%u",
rnode.spcNode, TABLESPACE_VERSION_DIRECTORY,
rnode.dbNode, rnode.relNode);
}
else
{
/* OIDCHARS will suffice for an integer, too */
- pathlen = 9 + 1 + OIDCHARS + 1
- + strlen(TABLESPACE_VERSION_DIRECTORY) + 1 + OIDCHARS + 2
- + OIDCHARS + 1 + OIDCHARS + 1 + FORKNAMECHARS + 1;
- path = (char *) palloc(pathlen);
if (forknum != MAIN_FORKNUM)
- snprintf(path, pathlen, "pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ snprintf(path, MAXPGPATH, "pg_tblspc/%u/%s/%u/t%d_%u_%s",
rnode.spcNode, TABLESPACE_VERSION_DIRECTORY,
rnode.dbNode, backend, rnode.relNode,
forkNames[forknum]);
else
- snprintf(path, pathlen, "pg_tblspc/%u/%s/%u/t%d_%u",
+ snprintf(path, MAXPGPATH, "pg_tblspc/%u/%s/%u/t%d_%u",
rnode.spcNode, TABLESPACE_VERSION_DIRECTORY,
rnode.dbNode, backend, rnode.relNode);
}
}
+
return path;
}
@@ -534,7 +518,7 @@ Oid
GetNewRelFileNode(Oid reltablespace, Relation pg_class, char relpersistence)
{
RelFileNodeBackend rnode;
- char *rpath;
+ const char *rpath;
int fd;
bool collides;
BackendId backend;
@@ -599,8 +583,6 @@ GetNewRelFileNode(Oid reltablespace, Relation pg_class, char relpersistence)
*/
collides = false;
}
-
- pfree(rpath);
} while (collides);
return rnode.node.relNode;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dddb6c0..7be767b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1757,7 +1757,7 @@ PrintBufferLeakWarning(Buffer buffer)
{
volatile BufferDesc *buf;
int32 loccount;
- char *path;
+ const char *path;
BackendId backend;
Assert(BufferIsValid(buffer));
@@ -1782,7 +1782,6 @@ PrintBufferLeakWarning(Buffer buffer)
buffer, path,
buf->tag.blockNum, buf->flags,
buf->refcount, loccount);
- pfree(path);
}
/*
@@ -2901,7 +2900,7 @@ AbortBufferIO(void)
if (sv_flags & BM_IO_ERROR)
{
/* Buffer is pinned, so we can read tag without spinlock */
- char *path;
+ const char *path;
path = relpathperm(buf->tag.rnode, buf->tag.forkNum);
ereport(WARNING,
@@ -2909,7 +2908,6 @@ AbortBufferIO(void)
errmsg("could not write block %u of %s",
buf->tag.blockNum, path),
errdetail("Multiple failures --- write error might be permanent.")));
- pfree(path);
}
}
TerminateBufferIO(buf, false, BM_IO_ERROR);
@@ -2927,11 +2925,10 @@ shared_buffer_write_error_callback(void *arg)
/* Buffer is pinned, so we can read the tag without locking the spinlock */
if (bufHdr != NULL)
{
- char *path = relpathperm(bufHdr->tag.rnode, bufHdr->tag.forkNum);
+ const char *path = relpathperm(bufHdr->tag.rnode, bufHdr->tag.forkNum);
errcontext("writing block %u of relation %s",
bufHdr->tag.blockNum, path);
- pfree(path);
}
}
@@ -2945,11 +2942,10 @@ local_buffer_write_error_callback(void *arg)
if (bufHdr != NULL)
{
- char *path = relpathbackend(bufHdr->tag.rnode, MyBackendId,
+ const char *path = relpathbackend(bufHdr->tag.rnode, MyBackendId,
bufHdr->tag.forkNum);
errcontext("writing block %u of relation %s",
bufHdr->tag.blockNum, path);
- pfree(path);
}
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 384acae..90267fd 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -264,7 +264,7 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
void
mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
{
- char *path;
+ const char *path;
File fd;
if (isRedo && reln->md_fd[forkNum] != NULL)
@@ -298,8 +298,6 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
}
}
- pfree(path);
-
reln->md_fd[forkNum] = _fdvec_alloc();
reln->md_fd[forkNum]->mdfd_vfd = fd;
@@ -380,7 +378,7 @@ mdunlink(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
static void
mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
- char *path;
+ const char *path;
int ret;
path = relpath(rnode, forkNum);
@@ -449,8 +447,6 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
}
pfree(segpath);
}
-
- pfree(path);
}
/*
@@ -545,7 +541,7 @@ static MdfdVec *
mdopen(SMgrRelation reln, ForkNumber forknum, ExtensionBehavior behavior)
{
MdfdVec *mdfd;
- char *path;
+ const char *path;
File fd;
/* No work if already open */
@@ -571,7 +567,6 @@ mdopen(SMgrRelation reln, ForkNumber forknum, ExtensionBehavior behavior)
if (behavior == EXTENSION_RETURN_NULL &&
FILE_POSSIBLY_DELETED(errno))
{
- pfree(path);
return NULL;
}
ereport(ERROR,
@@ -580,8 +575,6 @@ mdopen(SMgrRelation reln, ForkNumber forknum, ExtensionBehavior behavior)
}
}
- pfree(path);
-
reln->md_fd[forknum] = mdfd = _fdvec_alloc();
mdfd->mdfd_vfd = fd;
@@ -1279,7 +1272,7 @@ mdpostckpt(void)
while (pendingUnlinks != NIL)
{
PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
- char *path;
+ const char *path;
/*
* New entries are appended to the end, so if the entry is new we've
@@ -1309,7 +1302,6 @@ mdpostckpt(void)
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
- pfree(path);
/* And remove the list entry */
pendingUnlinks = list_delete_first(pendingUnlinks);
@@ -1634,8 +1626,8 @@ _fdvec_alloc(void)
static char *
_mdfd_segpath(SMgrRelation reln, ForkNumber forknum, BlockNumber segno)
{
- char *path,
- *fullpath;
+ const char *path;
+ char *fullpath;
path = relpath(reln->smgr_rnode, forknum);
@@ -1644,10 +1636,9 @@ _mdfd_segpath(SMgrRelation reln, ForkNumber forknum, BlockNumber segno)
/* be sure we have enough space for the '.segno' */
fullpath = (char *) palloc(strlen(path) + 12);
sprintf(fullpath, "%s.%u", path, segno);
- pfree(path);
}
else
- fullpath = path;
+ fullpath = pstrdup(path);
return fullpath;
}
diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index cd23334..c3ee640 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -265,7 +265,7 @@ static int64
calculate_relation_size(RelFileNode *rfn, BackendId backend, ForkNumber forknum)
{
int64 totalsize = 0;
- char *relationpath;
+ const char *relationpath;
char pathname[MAXPGPATH];
unsigned int segcount = 0;
@@ -753,7 +753,7 @@ pg_relation_filepath(PG_FUNCTION_ARGS)
Form_pg_class relform;
RelFileNode rnode;
BackendId backend;
- char *path;
+ const char *path;
tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
if (!HeapTupleIsValid(tuple))
diff --git a/src/include/catalog/catalog.h b/src/include/catalog/catalog.h
index 678a945..d9036fe 100644
--- a/src/include/catalog/catalog.h
+++ b/src/include/catalog/catalog.h
@@ -31,7 +31,7 @@ extern const char *forkNames[];
extern ForkNumber forkname_to_number(char *forkName);
extern int forkname_chars(const char *str, ForkNumber *);
-extern char *relpathbackend(RelFileNode rnode, BackendId backend,
+extern const char *relpathbackend(RelFileNode rnode, BackendId backend,
ForkNumber forknum);
extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 940d9d4..8886db5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -46,7 +46,7 @@
* FileSeek uses the standard UNIX lseek(2) flags.
*/
-typedef char *FileName;
+typedef const char *FileName;
typedef int File;
--
1.7.10.4
0002-Remove-empty-pfree-definition-from-pg_xlogdump-compa.patch (text/x-patch; charset=us-ascii)
>From 9be062aa078ba02106fe4bd3f5429ca650332a36 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 11 Dec 2012 16:38:54 +0100
Subject: [PATCH 2/2] Remove empty pfree() definition from
pg_xlogdump/compat.c now its not needed anymore
---
src/bin/pg_xlogdump/compat.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/src/bin/pg_xlogdump/compat.c b/src/bin/pg_xlogdump/compat.c
index dd36e55..b9c242d 100644
--- a/src/bin/pg_xlogdump/compat.c
+++ b/src/bin/pg_xlogdump/compat.c
@@ -43,22 +43,16 @@ tliInHistory(TimeLineID tli, List *expectedTLEs)
return false;
}
-void
-pfree(void *a)
-{
-}
-
-
const char *
timestamptz_to_str(TimestampTz t)
{
return "";
}
-char *
+const char *
relpathbackend(RelFileNode rnode, BackendId backend, ForkNumber forknum)
{
- return NULL;
+ return "";
}
/*
--
1.7.10.4
On 2012-12-11 15:55:35 +0200, Heikki Linnakangas wrote:
I've been molding this patch for a while now, here's what I have this far
(also available in my git repository).
The biggest change is in the error reporting. A stand-alone program that
wants to use xlogreader.c no longer has to provide a full-blown replacement
for ereport(). The only thing that xlogreader.c used ereport() for was when it
encounters an invalid record. And even there we had the
emode_for_corrupt_record hack. I think it's a much better API that
XLogReadRecord just returns NULL on an invalid record, and an error string,
and the caller can do what it wants with that. In xlog.c, we'll pass the
error string to ereport(), with the right emode as determined by
emode_for_corrupt_record. xlog.c is no longer concerned with
emode_for_corrupt_record, or error levels in general.
We talked about this earlier, and Tom Lane argued that "it's basically
insane to imagine that you can carve out a non-trivial piece of the backend
that doesn't contain any elog calls."
(http://archives.postgresql.org/pgsql-hackers/2012-09/msg00651.php), but
having done just that, it doesn't seem insane to me. xlogreader.c really is
a pretty well contained piece of code. All the complicated stuff that
contains elog calls and pallocs and more is in the callback, which can
freely use all the normal backend infrastructure.
Now that I have read some of that code, I am unsure how the current
implementation of this can cooperate with translation, even when
used from the backend?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Dec 11, 2012 at 9:44 AM, Andres Freund <andres@2ndquadrant.com> wrote:
* How about we move pg_xlogdump to contrib? It doesn't feel like the kind of
essential tool that deserves to be in src/bin.
contrib would be fine, but I think src/bin is better. There have been
quite a few bugs by now where it would have been useful to have a
reliable xlogdump in core, so it's actually installed.
I think I'm with Heikki on this one. Dumping xlog data is useful, but
it's really for developers and troubleshooters, not something we
expect people to do on a regular basis, so contrib seems appropriate.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
I think I'm with Heikki on this one. Dumping xlog data is useful, but
it's really for developers and troubleshooters, not something we
expect people to do on a regular basis, so contrib seems appropriate.
There are two downsides to contrib rather than src/bin. First,
maintenance and user trust are more easily achieved in src/bin.
Second, a lot of users won't install contribs in their production server
and will then miss the tool when they need it most. In some places
getting new software installed on a certified production setup is not
easy.
I would agree to get that piece in contrib if we were to work again on
separating contribs into "production monitoring and diagnosis",
"production ready extra" (hstore) and the rest, basically.
Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Dimitri Fontaine <dimitri@2ndQuadrant.fr> writes:
Robert Haas <robertmhaas@gmail.com> writes:
I think I'm with Heikki on this one. Dumping xlog data is useful, but
it's really for developers and troubleshooters, not something we
expect people to do on a regular basis, so contrib seems appropriate.
There are two downsides to contrib rather than src/bin. First,
maintenance and user trust are more easily achieved in src/bin.
User trust, maybe, but the "maintenance" argument seems bogus.
We ship contrib on the same release schedule as core.
Second, a lot of users won't install contribs in their production server
and will then miss the tool when they need it most.
TBH, I don't believe that ordinary users will need this tool at all,
ever, and thus I don't want it in src/bin/. From a packaging standpoint
it will be a lot easier if it's in contrib ... otherwise I'll probably
have to invent some new sub-RPM along the lines of postgresql-extras
so as to avoid bloating the core server package.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Tom Lane <tgl@sss.pgh.pa.us> writes:
User trust, maybe, but the "maintenance" argument seems bogus.
We ship contrib on the same release schedule as core.
I meant maintenance as in updating the code when it needs it; I'm not
sure contrib systematically receives the same care as core. I have
no data to back my feeling, though.
TBH, I don't believe that ordinary users will need this tool at all,
ever, and thus I don't want it in src/bin/. From a packaging standpoint
it will be a lot easier if it's in contrib ... otherwise I'll probably
have to invent some new sub-RPM along the lines of postgresql-extras
so as to avoid bloating the core server package.
Oh. I didn't know that the server package might be considered bloated by
anyone, or that it would impact the way we ship our binaries.
What about splitting contrib *officially* then, not just in your RH
packages, and have postgresql-server-extra-diagnosis, -extra-data-types
and -contrib with the things you typically don't want in production?
Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 11 December 2012 22:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
TBH, I don't believe that ordinary users will need this tool at all,
ever, and thus I don't want it in src/bin/. From a packaging standpoint
it will be a lot easier if it's in contrib ... otherwise I'll probably
have to invent some new sub-RPM along the lines of postgresql-extras
so as to avoid bloating the core server package.
I happen to agree that pg_xlogdump belongs in contrib, but I think
that the importance of avoiding "bloat" has been overstated. Maybe
it's slightly useful to make sure that Postgres can get on the Fedora
CD, but that aside, is including pg_xlogdump here, for example, really
likely to make any appreciable difference package-wise?
pg_xlogdump is 141K on my system. I'd hate to see us embrace the exact
opposite tendency, towards including everything but the kitchen sink,
but at the same time that seems like a very insignificant size.
Perhaps people who live in countries with less bandwidth care about
these things more.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Peter Geoghegan <peter@2ndquadrant.com> writes:
Perhaps people who live in countries with less bandwidth care about
these things more.
The day they will need it is not the day the bandwidth will magically
increase, is all I'm saying. Better have that around just in case you
get WAL corruption because of a crappy RAID controller or whatnot.
Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2012-12-11 22:52:09 +0000, Peter Geoghegan wrote:
On 11 December 2012 22:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
TBH, I don't believe that ordinary users will need this tool at all,
ever, and thus I don't want it in src/bin/. From a packaging standpoint
it will be a lot easier if it's in contrib ... otherwise I'll probably
have to invent some new sub-RPM along the lines of postgresql-extras
so as to avoid bloating the core server package.
I happen to agree that pg_xlogdump belongs in contrib
Ok, I think there has been clear support for putting it into contrib, I
can comfortably live with that even though I would prefer otherwise. So
let's concentrate on other things ;)
pg_xlogdump is 141K on my system. I'd hate to see us embrace the exact
opposite tendency, towards including everything but the kitchen sink,
but at the same time that seems like a very insignificant size.
Perhaps people who live in countries with less bandwidth care about
these things more.
Optimized and stripped - which is what most distros do - it's 40k
here. Gzipped - as in packages - it's only 20k on its own. So it's even
smaller ;)
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 9 December 2012 19:14, Andres Freund <andres@2ndquadrant.com> wrote:
I pushed a new version which
- is rebased on top of master
- is based on top of the new xlogreader (biggest part)
- is based on top of the new binaryheap.h
- some fixes
- some more comments
I decided to take another look at this, following my earlier reviews
of a substantial subset of earlier versions of this patch, including
my earlier reviews of WAL decoding [1] and focused review of snapshot
building [2] (itself a subset of WAL decoding). I think it was the
right time to consolidate your multiple earlier patches, because some
of the earlier BDR patches were committed (including "Rearrange
storage of data in xl_running_xacts" [3], "Basic binary heap
implementation" [4], "Embedded list interface" [5], and, though it
isn't touched on here and is technically entirely distinct
functionality, "Background worker processes" [6]). Furthermore, now
that we've gotten past some early rounds of reviewing, it makes sense
to build a *perfectly* (rather than just approximately) atomic unit,
as we work towards something that is actually committable.
[1] Earlier WAL decoding review: http://archives.postgresql.org/message-id/CAEYLb_XZ-k_vRpBP9TW=_wufDsusOSP1yiR1XG7L_4rmG5bDRw@mail.gmail.com
[2] Earlier snapshot building doc review: http://archives.postgresql.org/message-id/CAEYLb_Xj=t-4CW6gLV5jUvdPZSsYwSTbZtUethsW2oMpd58jzA@mail.gmail.com
[3] "Rearrange storage of data in xl_running_xacts" commit: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5c11725867ac3cb06db065f7940143114280649c
[4] "Basic binary heap implementation" commit: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7a2fe9bd0371b819aacc97a007ec1d955237d207
[5] "Embedded list interface" commit: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a66ee69add6e129c7674a59f8c3ba010ed4c9386
[6] "Background worker processes" commit: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=da07a1e856511dca59cbb1357616e26baa64428e
So what's the footprint of this big, newly rebased feature branch?
Well, though some of these changes are uncommitted stuff from Heikki
(i.e. XLogReader, which you've modified), and some of this is README
documentation, the footprint is very large. I merged master with your
dev branch (last commit of yours,
743f3af081209f784a30270bdf49301e9e242b78, made on Mon 10th Dec 15:35),
and the stats are:
91 files changed, 9736 insertions(+), 956 deletions(-)
Note that there is a relatively large number of files affected in part
because the tqual interface was bashed around a bit for the benefit of
logical decoding - a lot of the changes to each of those 91 files are
completely trivial.
I'm very glad that you followed my earlier recommendation of splitting
your demo logical changeset consumer into a contrib module, in the
spirit of contrib/spi, etc. This module, "test_decoding", represents a
logical entry point, if you will, for the entire patch. As unwieldy as
it may appear to be, the patch is (or at least *should be*) ultimately
reducible to some infrastructural changes to core to facilitate this
example logical change-set consumer.
test_decoding contrib module
============================
contrib/Makefile | 1 +
contrib/test_decoding/Makefile | 16 +
contrib/test_decoding/test_decoding.c | 192 ++++
Once again, because test_decoding is a kind of "entry point", it gives
me a nice point to continually refer back to when talking about this
patch. (Incidentally, maybe test_decoding should be called
pg_decoding?).
The regression tests pass, though this isn't all that surprising,
since frankly the test coverage of this patch appears to be quite low.
I know that you're working with Abhijit on improvements to the
isolation tester to verify the correctness of the patch as it relates
to supporting actual, practical logical replication systems. I would
very much welcome any such test coverage (even if it wasn't in a
committable state), since in effect you're asking me to take a leap of
faith in respect of how well this infrastructure will support such
systems – previously, I obliged you and didn't focus on concurrency
and serializability concerns (it was sufficient to print out values/do
some decoding in a toy function), but it's time to take a closer look
at those now, I feel. test_decoding is a client of the logical
change-set producing infrastructure, and there appears to be broad
agreement that that infrastructure needs to treat such consumers in a
way that is maximally abstract. My question is, just how abstract does
this interface have to be, really? How well are you going to support
the use-case of a real logical replication system?
Now, maybe it's just that I haven't been paying attention (in
particular, to the discussion surrounding [3] – though that commit
doesn't appear to have been justified in terms of commit ordering in
BDR at all), but I would like you to be more demonstrative of certain
things, like:
1. Just what does a logical change-set consumer look like? What things
are always true of one, and never true of one?
2. Please describe in as much detail as possible the concurrency
issues with respect to logical replication systems. Please make
verifiable, testable claims as to how well these issues are considered
here, perhaps with reference to the previous remarks of subject-matter
experts like Chris Browne [7], Steve Singer [8] and Kevin Grittner [9]
following my earlier review.

[7] Chris Browne on Slony and ordering conflicts: http://archives.postgresql.org/message-id/CAFNqd5VY9aKZtPSEyzOTMsGAhfFHKaGNCgY0D0wZvqjC0Dtt1g@mail.gmail.com
[8] Steve Singer on Slony and transaction isolation level: http://archives.postgresql.org/message-id/BLU0-SMTP6402AA6F3A1F850EDFA1B2DC8D0@phx.gbl
[9] Kevin Grittner on commit ordering: http://archives.postgresql.org/message-id/20121022141701.224550@gmx.com
I'm not all that impressed with where test_decoding is at right now.
There is still essentially no documentation. I think it's notable that
you don't really touch the ReorderBufferTXN passed by the core system
in the test_decoding plugin.
test_decoding and pg_receivellog
================================
I surmised that the way that the test_decoding module is intended to
be used is as a client of receivellog.c (*not* receivelog.c – that
naming is *confusing*; perhaps call it receivelogiclog.c or something.
Better still, make receivexlog handle the logical case rather than
inventing a new tool). The reason for receivellog.c existing, as you
yourself put it, is:
+ /*
+ * We have to use postgres.h not postgres_fe.h here, because there's so much
+ * backend-only stuff in the XLOG include files we need. But we need a
+ * frontend-ish environment otherwise. Hence this ugly hack.
+ */
So receivellog.c is part of a new utility called pg_receivellog, in
much the same way as receivexlog.c is part of the existing
pg_receivexlog utility (see commit
b840640000934fca1575d29f94daad4ad85ba000 in Andres' tree). We're
talking about these changes:
src/backend/utils/misc/guc.c | 11 +
src/bin/pg_basebackup/Makefile | 7 +-
src/bin/pg_basebackup/pg_basebackup.c | 4 +-
src/bin/pg_basebackup/pg_receivellog.c | 717 ++++++++++++
src/bin/pg_basebackup/pg_receivexlog.c | 4 +-
src/bin/pg_basebackup/receivelog.c | 4 +-
src/bin/pg_basebackup/streamutil.c | 3 +-
src/bin/pg_basebackup/streamutil.h | 1 +
So far, so good. Incidentally, you forgot to do this:
install: all installdirs
$(INSTALL_PROGRAM) pg_basebackup$(X) '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
$(INSTALL_PROGRAM) pg_receivexlog$(X)
'$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
+ $(INSTALL_PROGRAM) pg_receivellog$(X)
'$(DESTDIR)$(bindir)/pg_receivellog$(X)'
So this creates a new binary executable, pg_receivellog, which is
described as “the pg_receivexlog equivalent for logical changes”. Much
like pg_receivexlog, pg_receivellog issues special new replication
protocol commands for logical replication, which account for your
changes to the replication protocol grammar and lexer (i.e.
walsender):
src/backend/replication/repl_gram.y | 32 +-
src/backend/replication/repl_scanner.l | 2 +
You say:
+ /* This is is just for demonstration, don't ever use this code for
anything real! */
uh, why not? What is the purpose of a contrib module, if not to serve
as a minimal example?
So, I went to play with pg_receivellog, and I got lots of output like this:
[peter@peterlaptop decode]$ pg_receivellog -f test.log -d postgres
WARNING: Initiating logical rep
pg_receivellog: could not init logical rep: got 0 rows and 0 fields,
expected 1 rows and 4 fields
pg_receivellog: disconnected. Waiting 5 seconds to try again.
Evidently you expected me to see this message:
+ if (!walsnd)
+ {
+ elog(ERROR, "couldn't find free logical slot. free one or increase
max_logical_slots");
+ }
If I did, that might have been okay. I didn't though, presumably
because the “walsnd” variable was wild/uninitialised.
So, I went and set max_logical_slots to something higher than 0, and
restarted. pg_receivellog behaved itself this time.
In one terminal:
[peter@peterlaptop decode]$ tty
/dev/pts/0
[peter@peterlaptop decode]$ pg_receivellog -f test.log -d postgres
WARNING: Initiating logical rep
WARNING: reached consistent point, stopping!
WARNING: Starting logical replication
In another:
[peter@peterlaptop decode]$ tty
/dev/pts/1
[peter@peterlaptop decode]$ psql
Expanded display is used automatically.
psql (9.3devel)
Type "help" for help.
postgres=# insert into b values(66,64);
INSERT 0 1
postgres=# \q
[peter@peterlaptop decode]$ cat test.log
BEGIN 1910
table "b": INSERT: i[int4]:66 j[int4]:64
COMMIT 1910
We're subscribed to logical changes, and everything looks about right.
We have a toy demo of a logical change-set subscriber.
I wondered how this had actually worked. Since test_decoding had done
nothing more than expose some functions, without registering any
callback in the conventional way (hooks, etc), how could it have
worked? That brings me to the interface used by plugins like this
test_decoding.
Plugin interface
================
So test_decoding uses various types of caches and catalogs. I'm mostly
worried about the core BDR interface that it uses, more so than this
other stuff. I'm talking about:
src/include/replication/output_plugin.h | 76 ++
One minor gripe is that output_plugin.h isn't going to pass muster
with cpluspluscheck (private is a C++ keyword). There are more serious
problems, though. In particular, I'm quite perplexed at some of the
code that “installs” the test_decoding plugin.
The test_decoding module is hard-coded within pg_receivellog thusly
(the SCONST token here could name an arbitrary module):
+ res = PQexec(conn, "INIT_LOGICAL_REPLICATION 'test_decoding'");
Furthermore, the names of particular test_decoding routines are hard
coded into core, using libdl/PG_MODULE_MAGIC introspection:
+ XLogReaderState *
+ normal_snapshot_reader(XLogRecPtr startpoint, TransactionId xmin,
+ char *plugin, XLogRecPtr valid_after)
+ {
+ /* to simplify things we reuse initial_snapshot_reader */
+ XLogReaderState *xlogreader = initial_snapshot_reader(startpoint, xmin);
*** SNIP ***
+
+ /* lookup symbols in the shared libarary */
+
+ /* optional */
+ apply_state->init_cb = (LogicalDecodeInitCB)
+ load_external_function(plugin, "pg_decode_init", false, NULL);
+
+ /* required */
+ apply_state->begin_cb = (LogicalDecodeBeginCB)
+ load_external_function(plugin, "pg_decode_begin_txn", true, NULL);
*** SNIP ***
This seems fairly wrong-headed. Comments above this function say:
+ /*
+ * Build a snapshot reader with callbacks found in the shared library "plugin"
+ * under the symbol names found in output_plugin.h.
+ * It wraps those callbacks so they send out their changes via an logical
+ * walsender.
+ */
So the idea is that the names of all functions with public linkage
within test_decoding (their symbols) have magical significance, and
that the core system resolve those magic symbols dynamically. I'm not
aware of this pattern appearing anywhere else within Postgres.
Furthermore, it seems kind of short sighted. Have we not painted
ourselves into a corner with regard to using multiple plugins at once?
This doesn't seem terribly unreasonable, if for example we wanted to
use test_decoding in production to debug a problem, while running a
proper logical replication system and some other logical change-set
consumer in tandem. Idiomatic use of “hooks” allows multiple plugins
to be called for the same call of the authoritative hook by the core
system, as for example when using auto_explain and pg_stat_statements
at the same time. Why not just use hooks? It isn't obvious that you
shouldn't be able to do this. The signature of the function
pg_decode_change (imposed by the function pointer typedef
LogicalDecodeChangeCB) assumes that everything should go through a
passed StringInfo, but I have a hard time believing that that's a good
idea.
It's like your plugin functions as a way of filtering reorder buffers.
It's not as if the core system just passes logical change-sets off, as
one might expect. It is actually the case that clients have to connect
to the server in replication mode, and get their change-sets (as
filtered by their plugin) streamed by a walsender over the wire
protocol directly. What of making changeset subscribers generic
abstractions? Again, maybe you don't have to do anything with the
StringInfo, but that is far from clear from the extant code and
documentation.
Snapshot builder
================
We've seen [1] that the snapshot builder is concerned with building
snapshots for the purposes of timetravel. This is needed to see the
contents of the catalog at a point in time when decoding (see design
documents for more).
src/backend/access/transam/xact.c | 4 +-
src/backend/replication/logical/DESIGN.txt | 603 ++++++++++
src/backend/replication/logical/Makefile | 25 +
src/backend/replication/logical/README.SNAPBUILD.txt | 298 +++++
src/backend/replication/logical/snapbuild.c | 1181 +++++++++++++++++++
src/backend/utils/time/tqual.c | 299 ++++-
src/include/access/heapam_xlog.h | 23 +
src/include/c.h | 1 +
src/include/replication/snapbuild.h | 129 +++
src/include/utils/snapshot.h | 4 +-
src/include/utils/tqual.h | 51 +-
I've lumped the tqual/snapshot visibility changes under “Snapshot
builder” too, and anything mostly to do with ComboCids. The
README.SNAPBUILD document (and the code described by it) was
previously the focus of an entire review of its own [2].
I still don't see why you're allocating snapshots within
DecodeRecordIntoReorderBuffer(). As I've said, I think that snapshots
would be better allocated alongside the ReadApplyState that is
directly concerned with snapshots, to better encapsulate the snapshot
stuff. Now, you do at least acknowledge this problem this time around:
+ /*
+ * FIXME: The existance of the snapshot builder is pretty obvious to the
+ * outside right now, that doesn't seem to be very good...
+ */
However, the fact is that this function:
+ Snapstate *
+ AllocateSnapshotBuilder(ReorderBuffer *reorder)
+ {
doesn't actually do anything with the ReorderBuffer pointer that it is
passed. So I don't see why you've put off doing this, as if it's
something that would require a non-trivial effort.
One of my major concerns during that review was the need for this “peg
xmin horizon” hack - you presented an example that required the use of
a prepared transaction to artificially peg the global xmin horizon,
and I wasn't happy about that. We were worried about catalog tables
getting vacuumed in a way that prevented us from correctly
interpreting data about types in the face of transactions that mix DML
and DDL.
If the catalog tables were vacuumed, we'd be out of luck - we needed
to do something somewhat analogous to hot_standby_feedback. At the
same time, we need to manage the risk of bloat on the primary due to
non-availability of a standby in some speculative replication system
using this infrastructure. One proposal floated around was to have a
special notion of xmin horizon - a more granular xmin horizon
applicable to only the necessary catalog tables. You didn't pursue
that idea yet, preferring to solve the simpler case. You say of xmin
horizon handling:
+ == xmin Horizon Handling ==
+
+ Reusing MVCC for timetravel access has one obvious major problem:
+ VACUUM. Obviously we cannot keep data in the catalog indefinitely. Also
+ obviously, we want autovacuum/manual vacuum to work as before.
+
+ The idea here is to reuse the infrastrcuture built for hot_standby_feedback
+ which allows us to keep the xmin horizon of a walsender backend artificially
+ low. We keep it low enough so we can restart decoding from the last location
+ the client has confirmed to be safely received. The means that we keep it low
+ enough to contain the last checkpoints oldestXid value.
+
+ That also means we need to make that value persist across
restarts/crashes in a
+ very similar manner to twophase.c's. That infrastructure actually also useful
+ to make hot_standby_feedback work properly across primary restarts.
So we jury rig the actual xmin horizon by doing this:
+ /*
+ * inrease shared memory state, so vacuum can work
+ * on tuples we prevent from being purged.
+ */
+ IncreaseLogicalXminForSlot(buf->origptr,
+ running->oldestRunningXid);
We switch the WAL Sender proc's xmin while the walsender replies to a
message, while preserving the “real” xmin horizon. Presumably this is
crash safe, since we do this as part of XLOG_RUNNING_XACTS replay (iff
we're doing “logical recovery”; that is, decoding is being performed
as we reach SNAPBUILD_CONSISTENT):
recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS, rdata);
I continue to be quite concerned about the failure modes here. I do
not accept that this is no worse than using hot_standby_feedback.
hot_standby_feedback can see a standby bloat up the master because it
has a long-running transaction - it's a process that the standby must
actively engage in. However, what you have here will bloat up the
master passively; standbys have to actively work to *prevent* that
from happening. That's a *fundamental* distinction. Maybe it's
actually reasonable to do that, at least for now, but I think that you
should at least acknowledge the distinction as an important one.
We also use this new tqual.c infrastructure to time-travel during
decoding, with the snapshot built for us by snapshot builder:
+ /*
+ * See the comments for HeapTupleSatisfiesMVCC for the semantics this function
+ * obeys.
+ *
+ * Only usable on tuples from catalog tables!
+ *
+ * We don't need to support HEAP_MOVED_(IN|OFF) for now because we
only support
+ * reading catalog pages which couldn't have been created in an older version.
+ *
+ * We don't set any hint bits in here as it seems unlikely to be beneficial as
+ * those should already be set by normal access and it seems to be too
+ * dangerous to do so as the semantics of doing so during timetravel are more
+ * complicated than when dealing "only" with the present.
+ */
+ bool
+ HeapTupleSatisfiesMVCCDuringDecoding(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer)
Are you sure that ReorderBuffer.private_data should be a void*? Maybe
we'd be better off if it was a minimal “abstract base class” pointer,
that contained a MemoryContext?
This whole area could use a lot more scrutiny. That's all I have for
now, though.
I'm happy to note that the overhead of computing the pegged
Recent(Global)Xmin is one TransactionIdIsValid, one
TransactionIdPrecedes and, potentially, one assignment.
I am also pleased to see that you're invalidating system caches in a
more granular fashion (for transactions that contain both DDL and DML,
where we cannot rely on the usual Hot Standby where sinval messages
are applied for commit records). That is a subject worthy of another
e-mail, though.
Decoding (“glue code”)
======================
We've seen [1] that decoding is concerned with decoding WAL records
from an xlogreader.h callback into a reorder buffer.
Decoding means breaking up individual XLogRecord structs, reading them
through an XLogReaderState, and storing them in a reorder buffer
(reorderbuffer.c does this, and stores them as ReorderBufferChange
records), while building a snapshot (which is needed in advance of
adding tuples from records). It can be thought of as the small piece
of glue between reorderbuffer and snapbuild that is called by
XLogReader (DecodeRecordIntoReorderBuffer() is the only public
function, which will be called by the WAL sender – previously, this
was called by plugins directly).
An example of what belongs in decode.c is the way it ignores physical
XLogRecords, because they are not of interest.
src/backend/replication/logical/decode.c | 494 ++++++++
src/backend/replication/logical/logicalfuncs.c | 224 ++++
src/backend/utils/adt/dbsize.c | 79 ++
src/include/catalog/indexing.h | 2 +
src/include/catalog/pg_proc.h | 2 +
src/include/replication/decode.h | 21 +
src/include/replication/logicalfuncs.h | 45 +
src/include/storage/itemptr.h | 3 +
src/include/utils/builtins.h | 1 +
The pg_proc accessible utility function pg_relation_by_filenode() -
which you've documented - doesn't appear to be used at present (it's
just a way of exposing the core infrastructure, as described under
“Miscellaneous thoughts”). A new index is created on pg_class
(reltablespace oid_ops, relfilenode oid_ops).
We've seen that we need a whole new infrastructure for resolving
relfilenodes to relation OIDs, because only relfilenodes are available
from the WAL stream, and in general the mapping isn't stable, as for
example when we need to do a table rewrite. We have a new syscache for
this.
We WAL-log the new XLOG_HEAP2_NEW_CID record to store both table
relfilenode and combocids. I'm still not clear on how you're managing
the corner case with the relfilenode/table oid mapping that Robert
spoke of previously [17]. Could you talk about that?

[17] Robert on relfilenodes: http://archives.postgresql.org/message-id/CA+TgmoZXkCo5FAbU=3JHuXXF0Op2SLhGJcVuFM3tkmcBnmhBMQ@mail.gmail.com
Reorder buffer
==============
Last time around [1], this was known as ApplyCache. It's still
concerned with the management of the logical replay cache - it
reassembles transactions from a stream of interspersed changes. This
is what a design doc previously talked about under “4.5 - TX
reassembly” [14].

[14] “WAL decoding, attempt #2” design documents: http://archives.postgresql.org/message-id/201209221900.53190.andres@2ndquadrant.com
src/backend/replication/logical/reorderbuffer.c | 1185 ++++++++++++++++++++
src/include/replication/reorderbuffer.h | 284 +++++
Last time around, I described spooling to disk, like a tuplestore, as
a probable prerequisite to commit - I raise that now because I thought
that this was the place where you'd most likely want to do that.
Concerns about the crash-safety of buffered change-sets were raised
too.
You say this in a README:
+ * crash safety, restartability & spilling to disk
+ * consistency with the commit status of transactions
+ * only a minimal amount of synchronous work should be done inside individual
+ transactions
+
+ In our opinion those problems are restricting progress/wider distribution of
+ these class of solutions. It is our aim though that existing solutions in this
+ space - most prominently slony and londiste - can benefit from the work we are
+ doing & planning to do by incorporating at least parts of the changeset
+ generation infrastructure.
So, have I understood correctly - are you proposing that we simply
outsource this to something else? I'm not sure how I feel about that,
but I'd like clarity on this matter.
reorderbuffer.h should have way, way more comments for each of the
structs. I want to see detailed comments, like those you see for the
structs in parsenodes.h - you shouldn't have to jump to some design
document to see how each struct fits within the overall design of
reorder buffering.
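By way of illustration, this is the level of per-field commentary I have in mind, in the parsenodes.h style (the field names below are only loose approximations of what reorderbuffer.h actually declares):

```c
/*
 * Sketch of the documentation style wanted for reorderbuffer.h: every
 * field carries a comment explaining its role in reorder buffering.
 * The struct and its fields are illustrative approximations, not the
 * patch's actual ReorderBufferChange layout.
 */
typedef struct ToyBufferedChange
{
    unsigned long lsn;      /* WAL location of this change; changes are
                             * replayed in LSN order at commit */
    unsigned int  xid;      /* toplevel transaction this change belongs
                             * to; subtransaction changes get folded in
                             * during reassembly */
    int           action;   /* INSERT/UPDATE/DELETE discriminator,
                             * selecting the per-action payload */
} ToyBufferedChange;
```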
XLog stuff (in particular, the new XLogReader)
==============================================
Andres rebased on top of Heikki's XLogReader patch for the purposes of
BDR, and privately identified this whole area to me as a particular
concern for this review. The version that I'm reviewing here is not
the version that Andres described last week, v3.0 [10], but a slight
revision thereof, v3.1 [11]. See the commit message in Andres' feature
branch for full details [12].

[10] v3.0 of the XLogReader (Andres' revision): http://archives.postgresql.org/message-id/20121204175212.GB12055@awork2.anarazel.de
[11] v3.1 of the XLogReader (Andres' slight tweak of [10]): http://archives.postgresql.org/message-id/20121209190532.GD4694@awork2.anarazel.de
[12] Andres' XLogReader commit: http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=commit;h=3ea7ec5eea2cf890c14075b559e77a25a4130efc
doc/src/sgml/ref/pg_xlogdump.sgml | 76 ++
src/backend/access/transam/Makefile | 2 +-
src/backend/access/transam/xlog.c | 1084 ++++--------------
src/backend/access/transam/xlogfuncs.c | 1 +
src/backend/access/transam/xlogreader.c | 962 ++++++++++++++++
src/bin/Makefile | 2 +-
src/bin/pg_xlogdump/Makefile | 87 ++
src/bin/pg_xlogdump/compat.c | 173 +++
src/bin/pg_xlogdump/nls.mk | 4 +
src/bin/pg_xlogdump/pg_xlogdump.c | 462 ++++++++
src/bin/pg_xlogdump/pqexpbuf_strinfo.c | 76 ++
src/bin/pg_xlogdump/tables.c | 78 ++
src/include/access/heapam_xlog.h | 23 +
src/include/access/transam.h | 5 +
src/include/access/xlog.h | 3 +-
src/include/access/xlog_fn.h | 35 +
src/include/access/xlog_internal.h | 23 -
src/include/access/xlogdefs.h | 1 +
src/include/access/xlogreader.h | 159 +++
There was some controversy over the approach to implementing a
“generic xlog reader” [13]. This revision of Andres' work presumably
resolves that controversy, since it heavily incorporates Heikki's own
work. Heikki has described the design of his original XLogReader patch
[18].

[13] Heikki objects to XLogReader approach, proposes alternative: http://archives.postgresql.org/message-id/5056D3E1.3060108@vmware.com
[18] Heikki on his XLogReader: http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php
pg_xlogdump is a hacker-orientated utility that has been around in
various forms for quite some time (i.e. at least since the 8.3 days),
concerned with reading and writing Postgres transaction logs for
debugging purposes. It has long been obvious that it would be useful
to maintain along with Postgres (there has been a tendency for
xlogdump to fall behind, and only stable releases are supported), but
the XLogReader-related refactoring makes adding an official xlogdump
tool quite compelling (we're talking about 462 lines of wrapper code
for pg_xlogdump.c, against several thousands of lines of code for the
version in common use [15], which has hard-coded per-version knowledge
of catalog oids and things like that). I think that some of the
refactoring that Simon did to xlog.c last year [16] makes things
easier here, and kind of anticipates this.

[15] xlogdump satellite project: https://github.com/snaga/xlogdump
[16] Numerous refactoring commits; the main split was commit: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9aceb6ab3c202a5bf00d5f00436bb6ad285fc0bf
Again, with pg_xlogdump you forgot to do this:
pg_xlogdump: $(OBJS) | submake-libpq submake-libpgport
$(CC) $(CFLAGS) $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS)
$(libpq_pgport) -o $@$(X)
+ install: all installdirs
+ $(INSTALL_PROGRAM) pg_xlogdump$(X) '$(DESTDIR)$(bindir)'/pg_xlogdump$(X)
+
+ installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+ uninstall:
+ rm -f $(addprefix '$(DESTDIR)$(bindir)'/, pg_xlogdump$(X))
pg_xlogdump could be considered a useful way of testing the XLogReader
and decoding functionality, independent of the test_decoding plugin.
It is something that I'll probably use to debug this patch over the
next few weeks. Example usage:
[peter@peterlaptop pg_xlog]$ pg_xlogdump -f 000000010000000000000002 | head -n 3
xlog record: rmgr: Heap2 , record_len: 34, tot_len: 66,
tx: 1902, lsn: 0/020011C8, prev 0/01FFFC48, bkp: 0000, desc:
new_cid: rel 1663/12933/12671; tid 7/44; cmin: 0, cmax: 4294967295,
combo: 4294967295
xlog record: rmgr: Heap , record_len: 175, tot_len: 207,
tx: 1902, lsn: 0/02001210, prev 0/020011C8, bkp: 0000, desc:
insert: rel 1663/12933/12671; tid 7/44
xlog record: rmgr: Btree , record_len: 34, tot_len: 66,
tx: 1902, lsn: 0/020012E0, prev 0/02001210, bkp: 0000, desc:
insert: rel 1663/12933/12673; tid 1/355
In another thread, Robert and Heikki remarked that pg_xlogdump ought
to be in contrib, and not in src/bin. As you know, I am inclined to
agree.
[peter@peterlaptop pg_xlog]$ pg_xlogdump -f 1234567
fatal_error: requested WAL segment 012345670000000000000009 has
already been removed
This error message seems a bit presumptuous to me; as it happens there
never was such a WAL segment. Saying that there was introduces the
possibility of operator error.
This appears to be superfluous:
*** a/src/backend/access/transam/xlogfuncs.c
--- b/src/backend/access/transam/xlogfuncs.c
***************
*** 18,23 ****
--- 18,24 ----
#include "access/htup_details.h"
#include "access/xlog.h"
+ #include "access/xlog_fn.h"
#include "access/xlog_internal.h"
The real heavyweight here is xlogreader.c, at 962 lines. The module
refactors xlog.c, moving ReadRecord and some supporting functions to
xlogreader.c. Those supporting functions now operate on *generic*
XLogReaderState rather than various global variables. The idea here is
that the client of the API calls ReadRecord repeatedly to get each
record.
There is a callback of type XLogPageReadCB, which is used by the
client to obtain a given page in the WAL stream. The XLogReader
facility is responsible for decoding the WAL into records, but the
client is responsible for supplying the physical bytes via the
callback within XLogReader state. There is an error-handling callback
too, added by Andres. Andres added a new function,
XLogFindNextRecord(), which is used for checking whether RecPtr is a
valid XLog address for reading and to find the first valid address
after some address when dumping records, for debugging purposes.
Why did you move the page validation handling into XLogReader?
Support was added for reading pages which are only partially valid,
which seems reasonable. The callback that acts as a replacement for
emode_for_corrupt_record might be a bit questionable.
I'd like to have more to say on this. I'll leave that for another day.
I note that there are many mallocs in this module (see note below
under “Miscellaneous thoughts”).
heapam and other executor stuff
===============================
One aspect of this patch that I feel certainly warrants another of
these subsections is the changes to heapam.c and related executor
changes. These are essentially changes to functions called by
nodeModifyTable.c frequently, including functions like
heap_hot_search_buffer, heap_insert, heap_multi_insert and
heap_delete. We now have to do extra logical logging, and we need
primary key values to be looked up.
Files changed include:
src/backend/access/heap/heapam.c | 284 ++++-
src/backend/access/heap/pruneheap.c | 16 +-
src/backend/catalog/index.c | 76 +-
src/backend/access/rmgrdesc/heapdesc.c | 9 +
src/include/access/heapam_xlog.h | 23 +
src/include/catalog/index.h | 4 +
What of this? (I'm using the dellstore sample database, as always):
postgres=# \d+ orders
Table "public.orders"
*** SNIP ***
Indexes:
"orders_pkey" PRIMARY KEY, btree (orderid)
"ix_order_custid" btree (customerid)
***SNIP ***
postgres=# delete from orders where orderid = 77;
WARNING: Could not find primary key for table with oid 16406
CONTEXT: SQL statement "DELETE FROM ONLY "public"."orderlines" WHERE
$1 OPERATOR(pg_catalog.=) "orderid""
WARNING: Could not find primary key for table with oid 16406
CONTEXT: SQL statement "DELETE FROM ONLY "public"."orderlines" WHERE
$1 OPERATOR(pg_catalog.=) "orderid""
WARNING: Could not find primary key for table with oid 16406
CONTEXT: SQL statement "DELETE FROM ONLY "public"."orderlines" WHERE
$1 OPERATOR(pg_catalog.=) "orderid""
DELETE 1
I don't have time to figure out what this issue is right now.
Hot Standby, Replication and libpq stuff
========================================
Not forgetting existing replication infrastructure and libpq stuff
affected by this patch. Files under this category that have been
modified are:
src/backend/access/rmgrdesc/xlogdesc.c | 1 +
src/backend/postmaster/bgwriter.c | 35 +
src/backend/postmaster/postmaster.c | 7 +-
src/backend/replication/libpqwalreceiver/libpqwalreceiver.c | 4 +-
src/backend/replication/Makefile | 2 +
src/backend/replication/walsender.c | 732 +++++++++++-
src/backend/storage/ipc/procarray.c | 23 +
src/backend/storage/ipc/standby.c | 8 +
src/backend/utils/init/postinit.c | 5 +
src/bin/pg_controldata/pg_controldata.c | 2 +
src/include/nodes/nodes.h | 2 +
src/include/nodes/replnodes.h | 22 +
src/include/replication/walsender.h | 1 +
src/include/replication/walsender_private.h | 43 +-
src/interfaces/libpq/exports.txt | 1 +
src/interfaces/libpq/pqexpbuffer.c | 40 +
src/interfaces/libpq/pqexpbuffer.h | 5 +
I take particular interest in bgwriter.c here. You're doing this:
+ * Log a new xl_running_xacts every now and then so replication can get
+ * into a consistent state faster and clean up resources more
+ * frequently. The costs of this are relatively low, so doing it 4
+ * times a minute seems fine.
What about the power consumption of the bgwriter? I think that the way
you try to interact with the existing loop logic is ill-considered. Just
why is the bgwriter the compelling auxiliary process in which to do
this extra work?
Quite a lot of code has been added to walsender. This is mostly down
to some new functions, responsible for initialising logical
replication:
! typedef void (*WalSndSendData)(bool *);
! static void WalSndLoop(WalSndSendData send_data) __attribute__((noreturn));
static void InitWalSenderSlot(void);
static void WalSndKill(int code, Datum arg);
! static void XLogSendPhysical(bool *caughtup);
! static void XLogSendLogical(bool *caughtup);
static void IdentifySystem(void);
static void StartReplication(StartReplicationCmd *cmd);
+ static void CheckLogicalReplicationRequirements(void);
+ static void InitLogicalReplication(InitLogicalReplicationCmd *cmd);
+ static void StartLogicalReplication(StartLogicalReplicationCmd *cmd);
+ static void ComputeLogicalXmin(void);
This is mostly infrastructure for initialising and starting logical replication.
Initialisation means finding a free “logical slot” from shared memory,
then looping until the new magic xmin horizon for logical walsenders
(stored in their “slot”) is that of the weakest link (think local
global xmin).
+ * FIXME: think about solving the race conditions in a nicer way.
+ */
+ recompute_xmin:
+ walsnd->xmin = GetOldestXmin(true, true);
+ ComputeLogicalXmin();
+ if (walsnd->xmin != GetOldestXmin(true, true))
+ goto recompute_xmin;
Apart from the race conditions that I'm not confident are addressed
here, I think that the above could easily get stuck indefinitely in
the event of contention.
Initialisation occurs due to a “INIT_LOGICAL_REPLICATION” replication
command. Initialisation also means that decoding state is allocated (a
snapshot reader is initialised), and we report back success or failure
to the client that's using the streaming replication protocol (i.e. in
our toy example, pg_receivellog).
Starting logical replication means we load the previously initialised
slot, and find a snapshot reader plugin (using the “magic symbols”
pattern described above, under “Plugin interface”).
Why do we have to “find” a logical slot twice (both during
initialisation and starting)?
Since I've already described the “peg xmin horizon” stuff under
“Snapshot builder”, I won't belabour the point. I think that I have
more to say about this, but not today.
Minor point: This is a terrible name for the variable in question:
+ LogicalWalSnd *walsnd;
Miscellaneous thoughts
======================
You're still using C stdlib functions like malloc, free, calloc quite
a bit. My concern is that this points to a lack of thought about the
memory management strategy; why are you still not using memory
contexts in some places? If it's so difficult to anticipate what
clients of, say, XLogReaderAllocate() want for the lifetime of their
memory, then likely as not those clients should be doing their own
memory allocation, and passing the allocated buffer directly. If it is
obvious that the memory ought to persist indefinitely (and I think
it's your contention that it is in the case of XLogReaderAllocate()),
I'd just allocate it in the top memory context. Now, I am aware that
there are a trivial number of backend mallocs that you can point to as
precedent here, but I'm still not satisfied with your explanation for
using malloc(). At the very least, you ought to be handling the case
where malloc returns NULL, and you're not doing so consistently.
Memory contexts are very handy for debugging. As you know, I wrote a
little Python script with GDB bindings, that walks the tree of memory
contexts and prints out statistics about them using the
aset.c/AllocSetStats() infrastructure. It isn't difficult to imagine
that something like that could be quite useful with this patch - I'd
like to be able to easily determine how many snapshot builders have
been allocated from within a given backend, for example (though I see
you refcount that anyway for reasons that are not immediately apparent
- just debugging?).
Minor gripes:
* There is no need to use a *.txt extension for README files; we don't
currently use those anywhere else.
* If you only credit the PGDG and not the Berkeley guys (as you
should, for the most part), there is no need to phrase the notice
“Portions Copyright...”. You should just say “Copyright...”.
* You're still calling function pointer typedefs things like
LogicalDecodeInitCB. As I've already pointed out, you should prefer
the existing conventions (call it something like
LogicalDecodeInit_hook_type).
Under this section are all modifications to files that are not
separately described under some dedicated section header. I'll quickly
pass remark on them.
System caches were knocked around a bit:
// LocalExecuteInvalidationMessage now exposed:
src/backend/utils/cache/inval.c | 2 +-
// relcache.c has stray whitespace:
src/backend/utils/cache/relcache.c | 1 -
// New RelationMapFilenodeToOid() function:
src/backend/utils/cache/relmapper.c | 53 +
// New RELFILENODE syscache added:
src/backend/utils/cache/syscache.c | 11 +
// Headers:
src/include/storage/sinval.h | 2 +
src/include/utils/relmapper.h | 2 +
src/include/utils/syscache.h | 1 +
These are only of tangential interest to snapshot building, and so are
not described separately. Essentially, just “add new syscache”
boilerplate. There's also a little documentation, covering only the
pg_relation_by_filenode() utility function (this exposes
RelationMapFilenodeToOid()/RELFILENODE syscache):
doc/src/sgml/func.sgml | 23 +-
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/reference.sgml | 1 +
The following files were only changed due to the change in the tqual.c
interfaces of HeapTupleSatisfies*().
contrib/pgrowlocks/pgrowlocks.c | 2 +-
src/backend/commands/analyze.c | 3 +-
src/backend/commands/cluster.c | 2 +-
src/backend/commands/vacuumlazy.c | 3 +-
src/backend/storage/lmgr/predicate.c | 2 +-
That's all the feedback that I have for now. I'd have liked to have
gone into more detail in many cases, but I cannot only do so much. I
always like to start off rounds of review with “this is the current
state of play as I see it” type e-mails. There will be more to follow,
now that I have that out of the way.
References
==========
[1]: Earlier WAL decoding review: http://archives.postgresql.org/message-id/CAEYLb_XZ-k_vRpBP9TW=_wufDsusOSP1yiR1XG7L_4rmG5bDRw@mail.gmail.com
[2]: Earlier snapshot building doc review: http://archives.postgresql.org/message-id/CAEYLb_Xj=t-4CW6gLV5jUvdPZSsYwSTbZtUethsW2oMpd58jzA@mail.gmail.com
[3]: "Rearrange storage of data in xl_running_xacts" commit: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5c11725867ac3cb06db065f7940143114280649c
[4]: "Basic binary heap implementation" commit: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7a2fe9bd0371b819aacc97a007ec1d955237d207
[5]: "Embedded list interface" commit: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a66ee69add6e129c7674a59f8c3ba010ed4c9386
[6]: "Background worker processes" commit: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=da07a1e856511dca59cbb1357616e26baa64428e
[7]: Chris Browne on Slony and ordering conflicts: http://archives.postgresql.org/message-id/CAFNqd5VY9aKZtPSEyzOTMsGAhfFHKaGNCgY0D0wZvqjC0Dtt1g@mail.gmail.com
[8]: Steve Singer on Slony and transaction isolation level: http://archives.postgresql.org/message-id/BLU0-SMTP6402AA6F3A1F850EDFA1B2DC8D0@phx.gbl
[9]: Kevin Grittner on commit ordering: http://archives.postgresql.org/message-id/20121022141701.224550@gmx.com
[10]: v3.0 of the XLogReader (Andres' revision): http://archives.postgresql.org/message-id/20121204175212.GB12055@awork2.anarazel.de
[11]: v3.1 of the XLogReader (Andres' slight tweak of [10]): http://archives.postgresql.org/message-id/20121209190532.GD4694@awork2.anarazel.de
[12]: Andres' XLogReader commit: http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=commit;h=3ea7ec5eea2cf890c14075b559e77a25a4130efc
[13]: Heikki objects to XLogReader approach, proposes alternative: http://archives.postgresql.org/message-id/5056D3E1.3060108@vmware.com
[14]: "WAL decoding, attempt #2" design documents: http://archives.postgresql.org/message-id/201209221900.53190.andres@2ndquadrant.com
[15]: xlogdump satellite project: https://github.com/snaga/xlogdump
[16]: Numerous refactoring commits. Main split was commit: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9aceb6ab3c202a5bf00d5f00436bb6ad285fc0bf
[17]: Robert on relfilenodes: http://archives.postgresql.org/message-id/CA+TgmoZXkCo5FAbU=3JHuXXF0Op2SLhGJcVuFM3tkmcBnmhBMQ@mail.gmail.com
[18]: Heikki on his XLogReader: http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 11.12.2012 21:11, Andres Freund wrote:
Now that I have read some of that code, I am currently unsure how the
current implementation of this can cooperate with translation, even when
used from the backend?
Hmm, there was a gettext() call missing from report_invalid_record.
That's where the translation needs to happen. Fixed now.
- Heikki
Heikki Linnakangas wrote:
On 11.12.2012 21:11, Andres Freund wrote:
Now that I have read some of that code, I am currently unsure how the
current implementation of this can cooperate with translation, even when
used from the backend?
Hmm, there was a gettext() call missing from report_invalid_record.
That's where the translation needs to happen. Fixed now.
You need to call gettext_noop() in the string literals as well, unless
you've added the function and argument number to the gettext trigger
list in nls.mk.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi Peter!
Thanks for the review, you raise many noteworthy points. This is going
to be a long mail...
On 2012-12-13 00:05:41 +0000, Peter Geoghegan wrote:
I'm very glad that you followed my earlier recommendation of splitting
your demo logical changeset consumer into a contrib module, in the
spirit of contrib/spi, etc. This module, "test_decoding", represents a
logical entry point, if you will, for the entire patch. As unwieldy as
it may appear to be, the patch is (or at least *should be*) ultimately
reducible to some infrastructural changes to core to facilitate this
example logical change-set consumer.
To be fair, that point was first brought up by Robert and Kevin. But
yes, it's now included. Which is totally sensible.
Once again, because test_decoding is a kind of "entry point", it gives
me a nice point to continually refer back to when talking about this
patch. (Incidentally, maybe test_decoding should be called
pg_decoding?).
I am not particularly happy with the current name, I just named it akin
to test_parser/. I don't really like pg_decoding though; ISTM the pg_
prefix doesn't serve a point there, since it's not a binary or suchlike
that will lie around in some general namespace.
Other suggestions?
The regression tests pass, though this isn't all that surprising,
since frankly the test coverage of this patch appears to be quite low.
Yes, that certainly needs to be addressed.
I obliged you and didn't focus on concurrency
and serializability concerns (it was sufficient to print out values/do
some decoding in a toy function), but it's time to take a closer look
at those now, I feel.
Agreed.
test_decoding is a client of the logical
change-set producing infrastructure, and there appears to be broad
agreement that that infrastructure needs to treat such consumers in a
way that is maximally abstract. My question is, just how abstract does
this interface have to be, really? How well are you going to support
the use-case of a real logical replication system?
Now, maybe it's just that I haven't being paying attention (in
particular, to the discussion surrounding [3] – though that commit
doesn't appear to have been justified in terms of commit ordering in
BDR at all), but I would like you to be more demonstrative of certain
things, like:
That commit was basically just about being able to discern which xids
are toplevel and which are subtransaction xids. snapbuild.c only needs
to wait for toplevel xids now and doesn't care about subtransaction
xids, which made the code significantly simpler.
1. Just what does a logical change-set consumer look like? What things
are always true of one, and never true of one?
2. Please describe in as much detail as possible the concurrency
issues with respect to logical replication systems. Please make
verifiable, testable claims as to how well these issues are considered
here, perhaps with reference to the previous remarks of subject-matter
experts like Chris Browne [7], Steve Singer [8] and Kevin Grittner [9]
following my earlier review.
Not sure what you want to hear here to be honest.
Let me try anyway:
Transactions (and the contained changes) are guaranteed to be replayed
in commit-order where the order is defined by the LSN/position in the
xlog stream of the commit record[1]Note that there are potential visibility differences between the order in which transactions are marked as visible in WAL, in the clog and in memory (procarray) since thats not done while holding a lock over the whole period. Thats an existing property with HS.. Thats the same ordering that Hot
Standby uses.
The code achieves that order by reading the xlog records sequentially
in-order and replaying the begin/changes/commmit "events" everytime it
reads a commit record and never at a different time [1]Note that there are potential visibility differences between the order in which transactions are marked as visible in WAL, in the clog and in memory (procarray) since thats not done while holding a lock over the whole period. Thats an existing property with HS..
Several people in the thread you referenced seemed to agree that
commit-ordering is a sensible choice.
[1]: Note that there are potential visibility differences between the order in which transactions are marked as visible in WAL, in the clog and in memory (procarray) since thats not done while holding a lock over the whole period. Thats an existing property with HS.
order in which transactions are marked as visible in WAL, in the clog
and in memory (procarray) since thats not done while holding a lock over
the whole period. Thats an existing property with HS.
I'm not all that impressed with where test_decoding is at right now.
There is still essentially no documentation.
I will add comments.
I think it's notable that you don't really touch the ReorderBufferTXN
passed by the core system in the test_decoding plugin.
Don't think that's saying very much, except that 1) we don't pass on
enough information about transactions yet (e.g. commit timestamp) and 2)
the output plugin is simple.
test_decoding and pg_receivellog
================================
I surmised that the way that the test_decoding module is intended to
be used is as a client of receivellog.c (*not* receivelog.c – that
naming is *confusing*, perhaps call it receivelogiclog.c or something.
I am happy to name it any way people want. Once decided I think it
should move out of bin/pg_basebackup and the code (which is mostly
copied from streamutil.c and pg_receivexlog) should be cleaned up
considerably.
Better still, make receivexlog handle the logical case rather than
inventing a new tool). The reason for receivellog.c existing, as you
yourself put it, is:
I don't think they really can be merged, the differences are notable
already and are going to get bigger.
+ /*
+  * We have to use postgres.h not postgres_fe.h here, because there's so much
+  * backend-only stuff in the XLOG include files we need. But we need a
+  * frontend-ish environment otherwise. Hence this ugly hack.
+  */
So receivellog.c is part of a new utility called pg_receivellog, in
much the same way as receivexlog.c is part of the existing
pg_receivexlog utility (see commit
b840640000934fca1575d29f94daad4ad85ba000 in Andres' tree). We're
talking about these changes:
receivelog.c is old code, I didn't change anything nontrivial there. The
one change is gone now since Heikki committed the
xlog_internal.h/xlog_fn.h change.
src/backend/utils/misc/guc.c | 11 +
src/bin/pg_basebackup/Makefile | 7 +-
src/bin/pg_basebackup/pg_basebackup.c | 4 +-
src/bin/pg_basebackup/pg_receivellog.c | 717 ++++++++++++
src/bin/pg_basebackup/pg_receivexlog.c | 4 +-
src/bin/pg_basebackup/receivelog.c | 4 +-
src/bin/pg_basebackup/streamutil.c | 3 +-
src/bin/pg_basebackup/streamutil.h | 1 +
So far, so good. Incidentally, you forgot to do this:
install: all installdirs
$(INSTALL_PROGRAM) pg_basebackup$(X) '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
$(INSTALL_PROGRAM) pg_receivexlog$(X)
'$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
+ $(INSTALL_PROGRAM) pg_receivellog$(X)
'$(DESTDIR)$(bindir)/pg_receivellog$(X)'
I actually didn't forget to do this, but I didn't want to install
binaries that probably won't survive under the current name. That seems
to have been a bad idea since Michael and you noticed it as missing ;)
So this creates a new binary executable, pg_receivellog, which is
described as “the pg_receivexlog equivalent for logical changes”. Much
like pg_receivexlog, pg_receivellog issues special new replication
protocol commands for logical replication, which account for your
changes to the replication protocol grammar and lexer (i.e.
walsender):
src/backend/replication/repl_gram.y | 32 +-
src/backend/replication/repl_scanner.l | 2 +
You say:
+ /* This is is just for demonstration, don't ever use this code for
anything real! */
uh, why not? What is the purpose of a contrib module, if not to serve
as a minimal example?
Stupid copy & paste error from the old example code. The code should
probably grow a call to some escape functionality and more comments to
serve as a good example, but otherwise it's ok.
Evidently you expected me to see this message:
+ if (!walsnd)
+ {
+     elog(ERROR, "couldn't find free logical slot. free one or increase max_logical_slots");
+ }
If I did, that might have been okay. I didn't though, presumably
because the “walsnd” variable was wild/uninitialised.
The problem was earlier, CheckLogicalReplicationRequirements() should
have checked for a reasonable max_logical_slots value but only checked
for wal_level. Fix pushed.
So, I went and set max_logical_slots to something higher than 0, and
restarted. pg_receivellog behaved itself this time.
In one terminal:
[peter@peterlaptop decode]$ tty
/dev/pts/0
[peter@peterlaptop decode]$ pg_receivellog -f test.log -d postgres
WARNING: Initiating logical rep
WARNING: reached consistent point, stopping!
WARNING: Starting logical replication
Those currently are WARNINGs to make them easier to see; they obviously
need to be demoted at some point.
One minor gripe is that output_plugin.h isn't going to pass muster
with cpluspluscheck (private is a C++ keyword).
Fix pushed.
Plugin interface
================
So test_decoding uses various types of caches and catalogs. I'm mostly
worried about the core BDR interface that it uses, more so than this
other stuff. I'm talking about:
I have asked for input on the interface in a short email
http://archives.postgresql.org/message-id/20121115014250.GA5844%40awork2.anarazel.de
but didn't get responses so far.
I am happy to change the interface; I just did the first thing that
made sense to me.
Steve Singer - who I believe played a bit with writing his own output
plugin - seemed to be ok with it.
The test_decoding module is hard-coded within pg_receivellog thusly
(the SCONST token here could name an arbitrary module):
+ res = PQexec(conn, "INIT_LOGICAL_REPLICATION 'test_decoding'");
pg_receivellog will/should grow a --output-plugin parameter at some
point.
+ /* optional */
+ apply_state->init_cb = (LogicalDecodeInitCB)
+     load_external_function(plugin, "pg_decode_init", false, NULL);
So the idea is that the names of all functions with public linkage
within test_decoding (their symbols) have magical significance, and
that the core system resolve those magic symbols dynamically.
I'm not aware of this pattern appearing anywhere else within Postgres.
There's _PG_init/fini...
Furthermore, it seems kind of short sighted. Have we not painted
ourselves into a corner with regard to using multiple plugins at once?
This doesn't seem terribly unreasonable, if for example we wanted to
use test_decoding in production to debug a problem, while running a
proper logical replication system and some other logical change-set
consumer in tandem.
How does the scheme prevent you from doing that? Simply open up another
replication connection and specify a different output plugin there?
Not sure how two output plugins in one process would make sense?
Idiomatic use of “hooks” allows multiple plugins
to be called for the same call of the authoritative hook by the core
system, as for example when using auto_explain and pg_stat_statements
at the same time. Why not just use hooks? It isn't obvious that you
shouldn't be able to do this.
I considered using hooks but it seemed not to be a good fit. Let me
describe my thought process:
1) we want different output formats to be available in the same server &
database
2) the wished-for plugin should be specified via the replication
connection
3) thus shared_preload_libraries and such aren't really helpful
4) we need to load the plugin ourselves
5) We could simply load it and let the object's _PG_init() call
something like OutputPluginInitialize(begin_callback,
change_callback, commit_callback), but then we would need to handle
the case where that wasn't called and such
6) Going the OutputPluginInitialize route didn't seem to offer any
benefits, thus the hardcoded symbol names
The signature of the function
pg_decode_change (imposed by the function pointer typedef
LogicalDecodeChangeCB) assumes that everything should go through a
passed StringInfo, but I have a hard time believing that that's a good
idea.
I don't particularly like passing a StringInfo either, but what would
you rather pass? Note that StringInfos are what's currently used in
normal fe/be communication.
Doing the sending out directly in the output plugin seems to be a bad
idea because:
1) we need to handle receiving replies from the receiving side, like
keepalives and such, also we need to terminate the connection if no
reply has come inside wal_sender_timeout.
2) the output plugins imo shouldn't know they are sending out to a
walsender, we might want to allow sending from inside a function, to
disk or anything at some point.
Does the reasoning make sense to you?
It's like your plugin functions as a way of filtering reorder buffers.
It's not as if the core system just passes logical change-sets off, as
one might expect. It is actually the case that clients have to connect
to the server in replication mode, and get their change-sets (as
filtered by their plugin) streamed by a walsender over the wire
protocol directly. What of making changeset subscribers generic
abstractions?
Sorry, I cannot follow you here. What kind of architecture are you
envisioning here?
Snapshot builder
================
I still don't see why you're allocating snapshots within
DecodeRecordIntoReorderBuffer(). As I've said, I think that snapshots
would be better allocated alongside the ReadApplyState that is
directly concerned with snapshots, to better encapsulate the snapshot
stuff. Now, you do at least acknowledge this problem this time around:
+ /*
+  * FIXME: The existance of the snapshot builder is pretty obvious to the
+  * outside right now, that doesn't seem to be very good...
+  */
I think that comment was there in the last round as well ;)
However, the fact is that this function:
+ Snapstate *
+ AllocateSnapshotBuilder(ReorderBuffer *reorder)
+ {
doesn't actually do anything with the ReorderBuffer pointer that it is
passed. So I don't see why you've put off doing this, as if it's
something that would require a non-trivial effort.
Well, there simply are a lot of things that need a little bit of
effort. In total that's still a nontrivial amount.
And I wasn't sure how much of all that needed to change due to changes
in the actual snapshot building and the xlogreader swap. Turned out not
too many...
[ The xmin handling deserves its own mail, I'll respond to that
separately]
I am also pleased to see that you're invalidating system caches in a
more granular fashion (for transactions that contain both DDL and DML,
where we cannot rely on the usual Hot Standby where sinval messages
are applied for commit records). That is a subject worthy of another
e-mail, though.
There still are two issues worth improving with this though:
1) clear the whole cache when entering/leaving timetravel
2) Don't replay normal "present day" inval messages while in timetravel
* That may actually be able to cause errors when trying to reload the
relcache...
1) seems pretty uncontroversial to me since it should happen really
infrequently and it seems to be semantically correct. I have to think
some more about 2); there are some interesting things with relmap
updates due to CLUSTER et al. on nailed tables...
Decoding (“glue code”)
======================
We've seen [1] that decoding is concerned with decoding WAL records
from an xlogreader.h callback into a reorderbuffer.
Decoding means breaking up individual XLogRecord structs, reading them
through an XlogReaderState, and storing them in an Re-Order buffer
(reorderbuffer.c does this, and stores them as ReorderBufferChange
records), while building a snapshot (which is needed in advance of
adding tuples from records). It can be thought of as the small piece
of glue between reorderbuffer and snapbuild that is called by
XLogReader (DecodeRecordIntoReorderBuffer() is the only public
function, which will be called by the WAL sender – previously, this
was called by plugins directly).
An example of what belongs in decode.c is the way it ignores physical
XLogRecords, because they are not of interest.
src/backend/replication/logical/decode.c | 494 ++++++++
src/backend/replication/logical/logicalfuncs.c | 224 ++++
src/backend/utils/adt/dbsize.c | 79 ++
src/include/catalog/indexing.h | 2 +
src/include/catalog/pg_proc.h | 2 +
src/include/replication/decode.h | 21 +
src/include/replication/logicalfuncs.h | 45 +
src/include/storage/itemptr.h | 3 +
src/include/utils/builtins.h | 1 +
The pg_proc accessible utility function pg_relation_by_filenode() -
which you've documented - doesn't appear to be used at present (it's
just a way of exposing the core infrastructure, as described under
“Miscellaneous thoughts”)
It's not required for anything (and I don't think it ever will be). It
was handy during development of this though, and I could have used it
earlier during DBA-ish work.
Hm. We could use it to add a regression test for the new syscache
though...
. A new index is created on pg_class
(reltablespace oid_ops, relfilenode oid_ops).
We've seen that we need a whole new infrastructure for resolving
relfilenodes to relation OIDs, because only relfilenodes are available
from the WAL stream, and in general the mapping isn't stable, as for
example when we need to do a table rewrite. We have a new syscache for
this.
We WAL-log the new XLOG_HEAP2_NEW_CID record to store both table
relfilenode and combocids. I'm still not clear on how you're managing
the corner case with relfilenode/table oid mapping that Robert spoke of
previously [17]. Could you talk about that?
Sure. (Found another potential bug due to this already). Robert talks
about two dangers:
1) relfilenode => oid is not unique
2) the relfilenode => oid mapping changes over time
1) is solved by only looking up relfilenodes by (reltablespace,
relfilenode) (which is why the syscache is over those two, thanks to
Robert's observations). We can recognize shared relations via spcNode ==
GLOBALTABLESPACE_OID and we can recognize nailed tables by the fact that
they cannot be looked up in pg_class (there's an InvalidOid stored in
the pg_class for them).
Shared and nailed tables are then looked up via the new
RelationMapFilenodeToOid function.
As the decoding is now per-database (we don't have the other catalogs)
we skip processing tuples when dbNode != MyDatabaseId.
So I think 1) is handled by that?
2) Is solved by the fact that the syscache now works properly
time-relativized as well. That is, if you look up the (reltablespace,
relfilenode) => oid mapping in the syscache you get the correct result
for the current moment in time (what's the correct term for current when
it's only current from the POV of timetravelling?). Due to the proper
cache invalidation handling old mappings are purged correctly as well.
Reorder buffer
==============
Last time around [1], this was known as ApplyCache. It's still
concerned with the management of logical replay cache - it reassembles
transactions from a stream of interspersed changes. This is what a
design doc previously talked about under “4.5 - TX reassembly” [14].
Happier with the new name?
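For readers new to the idea, TX reassembly can be sketched in a few lines: changes arrive interleaved across transactions in WAL order, get buffered per-xid, and are only emitted, in order, once the transaction's commit record is seen. This is a toy illustration, not the actual reorderbuffer.c API:

```c
#include <string.h>

#define MAX_TXNS 8
#define MAX_CHANGES 64

typedef struct ReorderTXN
{
    unsigned int xid;
    int          nchanges;
    const char  *changes[MAX_CHANGES];
    int          in_use;
} ReorderTXN;

static ReorderTXN txns[MAX_TXNS];

/* find or create the buffer for a transaction */
static ReorderTXN *txn_for(unsigned int xid)
{
    for (int i = 0; i < MAX_TXNS; i++)
        if (txns[i].in_use && txns[i].xid == xid)
            return &txns[i];
    for (int i = 0; i < MAX_TXNS; i++)
        if (!txns[i].in_use)
        {
            txns[i].in_use = 1;
            txns[i].xid = xid;
            txns[i].nchanges = 0;
            return &txns[i];
        }
    return NULL;    /* a real implementation would spill to disk instead */
}

/* called for every decoded change, in WAL order, possibly interleaved */
void rb_add_change(unsigned int xid, const char *change)
{
    ReorderTXN *txn = txn_for(xid);
    txn->changes[txn->nchanges++] = change;
}

/* commit: hand the transaction's changes, in order, to the caller */
int rb_commit(unsigned int xid, const char **out)
{
    ReorderTXN *txn = txn_for(xid);
    int n = txn->nchanges;
    memcpy(out, txn->changes, n * sizeof(*out));
    txn->in_use = 0;
    return n;
}

/* abort: aborted data is simply never emitted */
void rb_abort(unsigned int xid)
{
    txn_for(xid)->in_use = 0;
}
```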
src/backend/replication/logical/reorderbuffer.c | 1185 ++++++++++++++++++++
src/include/replication/reorderbuffer.h | 284 +++++
Last time around, I described spooling to disk, like a tuplestore, as
a probable prerequisite to commit - I raise that now because I thought
that this was the place where you'd most likely want to do that.
Concerns about the crash-safety of buffered change-sets were raised
too.
Yes, this certainly is a prerequisite to commit.
You say this in a README:
+ * crash safety, restartability &amp; spilling to disk
+ * consistency with the commit status of transactions
+ * only a minimal amount of synchronous work should be done inside individual
+   transactions
+
+ In our opinion those problems are restricting progress/wider distribution of
+ these class of solutions. It is our aim though that existing solutions in this
+ space - most prominently slony and londiste - can benefit from the work we are
+ doing &amp; planning to do by incorporating at least parts of the changeset
+ generation infrastructure.
So, have I understood correctly - are you proposing that we simply
outsource this to something else? I'm not sure how I feel about that,
but I'd like clarity on this matter.
No, this needs to be implemented in the reorderbuffer. That's the next
task I will work on after committing the actual snapshot export.
reorderbuffer.h should have way, way more comments for each of the
structs. I want to see detailed comments, like those you see for the
structs in parsenodes.h - you shouldn't have to jump to some design
document to see how each struct fits within the overall design of
reorder buffering.
Will go over it. I am wondering whether it makes sense to split most of
the ones in the header into a private and a public part...
XLog stuff (in particular, the new XLogReader)
==============================================
There was some controversy over the approach to implementing a
“generic xlog reader”[13]. This revision of Andres' work presumably
resolves that controversy, since it heavily incorporates Heikki's own
work. Heikki has described the design of his original XLogReader patch
[18].
I hope it's resolved. I won't believe it until some version is committed
;)
pg_xlogdump could be considered a useful way of testing the XLogReader
and decoding functionality, independent of the test_decoding plugin.
It is something that I'll probably use to debug this patch over the
next few weeks. Example usage:
[peter@peterlaptop pg_xlog]$ pg_xlogdump -f 000000010000000000000002 | head -n 3
xlog record: rmgr: Heap2 , record_len: 34, tot_len: 66,
tx: 1902, lsn: 0/020011C8, prev 0/01FFFC48, bkp: 0000, desc:
new_cid: rel 1663/12933/12671; tid 7/44; cmin: 0, cmax: 4294967295,
combo: 4294967295
xlog record: rmgr: Heap , record_len: 175, tot_len: 207,
tx: 1902, lsn: 0/02001210, prev 0/020011C8, bkp: 0000, desc:
insert: rel 1663/12933/12671; tid 7/44
xlog record: rmgr: Btree , record_len: 34, tot_len: 66,
tx: 1902, lsn: 0/020012E0, prev 0/02001210, bkp: 0000, desc:
insert: rel 1663/12933/12673; tid 1/355
In another thread, Robert and Heikki remarked that pg_xlogdump ought
to be in contrib, and not in src/bin. As you know, I am inclined to
agree.
Moved in Heikki's worktree by now.
[peter@peterlaptop pg_xlog]$ pg_xlogdump -f 1234567
fatal_error: requested WAL segment 012345670000000000000009 has
already been removed
This error message seems a bit presumptuous to me; as it happens there
never was such a WAL segment. Saying that there was introduces the
possibility of operator error.
FWIW it's the "historical" error message for that ;). I am happy to
change it to something else.
The real heavyweight here is xlogreader.c, at 962 lines. The module
refactors xlog.c, moving ReadRecord and some supporting functions to
xlogreader.c. Those supporting functions now operate on *generic*
XLogReaderState rather than various global variables. The idea here is
that the client of the API calls ReadRecord repeatedly to get each
record.
There is a callback of type XLogPageReadCB, which is used by the
client to obtain a given page in the WAL stream. The XLogReader
facility is responsible for decoding the WAL into records, but the
client is responsible for supplying the physical bytes via the
callback within XLogReader state. There is an error-handling callback
too, added by Andres.
Gone again; solved much better by Heikki.
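The shape of that callback-driven design, reduced to its essentials (toy types and names; the real XLogReaderState carries considerably more):

```c
#include <string.h>

typedef struct XLogReaderState XLogReaderState;

/* the client supplies the physical bytes for a given offset */
typedef int (*XLogPageReadCB) (XLogReaderState *state, long offset,
                               int len, char *buf);

struct XLogReaderState
{
    XLogPageReadCB read_page;   /* client-supplied callback */
    void       *private_data;   /* client context, e.g. an open segment */
    long        read_off;       /* where the next record starts */
};

/* stand-in for ReadRecord(): fetch one fixed-size "record" via the callback */
int
read_record(XLogReaderState *state, char *record, int reclen)
{
    if (state->read_page(state, state->read_off, reclen, record) < reclen)
        return 0;               /* client could not supply enough bytes */
    state->read_off += reclen;
    return 1;
}

/* example callback: the "WAL" is just an in-memory buffer here */
static int
mem_read(XLogReaderState *state, long offset, int len, char *buf)
{
    const char *wal = state->private_data;
    long        wal_len = (long) strlen(wal);

    if (offset + len > wal_len)
        return (int) (wal_len - offset);
    memcpy(buf, wal + offset, len);
    return len;
}
```

The point of the split is that the facility never does I/O itself: pg_xlogdump, walsender, and recovery can all reuse the decoding loop by supplying different callbacks.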
Andres added a new function,
XLogFindNextRecord(), which is used for checking whether RecPtr is a
valid XLog address for reading and to find the first valid address
after some address when dumping records, for debugging purposes.
And that's a very much needed safety feature for logical decoding. We
need to make sure LSNs specified by the user don't point into the middle
of a record. That would make some ugly things possible.
Why did you move the page validation handling into XLogReader?
Because it's needed from xlogdump and WAL decoding as well.
Reimplementing it there doesn't seem to be a good idea, and skimping on
the checks doesn't either.
Any arguments against?
heapam and other executor stuff
===============================
One aspect of this patch that I feel certainly warrants another of
these subsections is the changes to heapam.c and related executor
changes. These are essentially changes to functions called by
nodeModifyTable.c frequently, including functions like
heap_hot_search_buffer, heap_insert, heap_multi_insert and
heap_delete. We now have to do extra logical logging, and we need
primary key values to be looked up.
The amount of extra logging should be relatively small though - some
preliminary tests seem to confirm that for me. But it certainly needs
some more validation.
I think it would be sensible to add the primary/candidate key as a
relcache/RelationData attribute. Do others agree?
heap_hot_search_buffer was changed in the course of the *Satisfies
changes; that's not related to this part.
Files changed include:
src/backend/access/heap/heapam.c | 284 ++++-
src/backend/access/heap/pruneheap.c | 16 +-
src/backend/catalog/index.c | 76 +-
src/backend/access/rmgrdesc/heapdesc.c | 9 +
src/include/access/heapam_xlog.h | 23 +
src/include/catalog/index.h | 4 +
What of this? (I'm using the dellstore sample database, as always):
WARNING: Could not find primary key for table with oid 16406
CONTEXT: SQL statement "DELETE FROM ONLY "public"."orderlines" WHERE
$1 OPERATOR(pg_catalog.=) "orderid""
DELETE 1
I don't have time to figure out what this issue is right now.
It's just a development debugging message that should go away in the
near future. There's no primary key on orderlines, so we currently
cannot safely replicate DELETEs. It's recognizable from the record that
that's the case, so we should be able to handle it "safely" during
decoding, that is, we can print a warning there.
Hot Standby, Replication and libpq stuff
========================================
I take particular interest in bgwriter.c here. You're doing this:
+ * Log a new xl_running_xacts every now and then so replication can get
+ * into a consistent state faster and clean up resources more
+ * frequently. The costs of this are relatively low, so doing it 4
+ * times a minute seems fine.
What about the power consumption of the bgwriter?
I think we are not doing any additional wakeups due to this, the
complete sleeping logic is unaffected. The maximum sleep duration
currently is "BgWriterDelay * HIBERNATE_FACTOR" which is lower than the
interval in which we log new snapshots. So I don't think this should
make a measurable difference?
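For concreteness, and treating the exact values as assumptions (defaults from around that era): BgWriterDelay = 200 ms and HIBERNATE_FACTOR = 50 give a maximum sleep of 10 s, comfortably below a 15 s (four-per-minute) logging interval:

```c
/* assumed defaults: BgWriterDelay 200 ms, HIBERNATE_FACTOR 50 */
#define BGWRITER_DELAY_MS  200
#define HIBERNATE_FACTOR   50
#define LOG_INTERVAL_MS    (60 * 1000 / 4)  /* "4 times a minute" */

/* longest the bgwriter sleeps between wakeups, per the existing loop */
int max_sleep_ms(void)
{
    return BGWRITER_DELAY_MS * HIBERNATE_FACTOR;
}
```

So the xl_running_xacts logging piggybacks on wakeups that happen anyway; it never needs to schedule an extra one.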
I think that the way you try to interact with the existing loop logic is
ill-considered. Just why is the bgwriter the compelling auxiliary
process in which to do this extra work?
Which process would be a good idea otherwise? Bgwriter seemed best
suited to me, but I am certainly open to reconsideration. It really was
a process of elimination, and I don't really see a downside.
Moving that code somewhere else should be no problem, so I am open to
suggestions?
Quite a lot of code has been added to walsender. This is mostly down
to some new functions, responsible for initialising logical
replication:
! typedef void (*WalSndSendData)(bool *);
! static void WalSndLoop(WalSndSendData send_data) __attribute__((noreturn));
static void InitWalSenderSlot(void);
static void WalSndKill(int code, Datum arg);
! static void XLogSendPhysical(bool *caughtup);
! static void XLogSendLogical(bool *caughtup);
static void IdentifySystem(void);
static void StartReplication(StartReplicationCmd *cmd);
+ static void CheckLogicalReplicationRequirements(void);
+ static void InitLogicalReplication(InitLogicalReplicationCmd *cmd);
+ static void StartLogicalReplication(StartLogicalReplicationCmd *cmd);
+ static void ComputeLogicalXmin(void);
This is mostly infrastructure for initialising and starting logical replication.
Initialisation means finding a free “logical slot” from shared memory,
then looping until the new magic xmin horizon for logical walsenders
(stored in their “slot”) is that of the weakest link (think local
global xmin).
+ * FIXME: think about solving the race conditions in a nicer way.
+ */
+ recompute_xmin:
+ walsnd->xmin = GetOldestXmin(true, true);
+ ComputeLogicalXmin();
+ if (walsnd->xmin != GetOldestXmin(true, true))
+     goto recompute_xmin;
Apart from the race conditions that I'm not confident are addressed
here, I think that the above could easily get stuck indefinitely in
the event of contention.
I don't like that part the slightest bit, but I don't think it's
actually in danger of looping forever. In fact I think it's so broken it
won't ever loop ;). (ComputeLogicalXmin() will set the current global
minimum, which will then be returned by GetOldestXmin().)
I would like to solve this properly without copying GetOldestXmin once
more (so we can compute and set the logical xmin while holding
ProcArrayLock), but I am not yet sure what a nice way to do that would
look like.
I guess GetOldestXminNoLock? That gets called while we already hold the
procarray lock? Yuck.
If we have it we should also use it for hot standby feedback.
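A sketch of what the fixed sequence would look like: advertise our xmin and nail down the logical xmin inside a single (here merely simulated, single-threaded) ProcArrayLock hold, so no concurrent backend can slip an older value in between. Names are hypothetical, not the actual procarray.c code:

```c
typedef unsigned int TransactionId;
#define InvalidTransactionId 0
#define NBACKENDS 4

static TransactionId proc_xmin[NBACKENDS];  /* advertised per-backend xmins */
static TransactionId logical_xmin;          /* the nailed-down horizon */

/* the part of GetOldestXmin() that genuinely needs ProcArrayLock */
static TransactionId
oldest_xmin_nolock(void)
{
    TransactionId result = (TransactionId) -1;

    for (int i = 0; i < NBACKENDS; i++)
        if (proc_xmin[i] != InvalidTransactionId && proc_xmin[i] < result)
            result = proc_xmin[i];
    return result;
}

/*
 * Initialise a logical slot's xmin.  Both steps happen while the caller
 * (conceptually) holds ProcArrayLock exclusively, so the value cannot
 * move between advertising it and computing the global minimum --
 * which is exactly what the retry loop tried, and failed, to guarantee.
 */
void
init_logical_xmin(int myslot)
{
    proc_xmin[myslot] = oldest_xmin_nolock();   /* advertise ... */
    logical_xmin = oldest_xmin_nolock();        /* ... and nail it down */
}
```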
Initialisation occurs due to a “INIT_LOGICAL_REPLICATION” replication
command. Initialisation also means that decoding state is allocated (a
snapshot reader is initialised), and we report back success or failure
to the client that's using the streaming replication protocol (i.e. in
our toy example, pg_receivellog).
Starting logical replication means we load the previously initialised
slot, and find a snapshot reader plugin (using the “magic symbols”
pattern described above, under “Plugin interface”).
Why do we have to “find” a logical slot twice (both during
initialisation and starting)?
Because they can happen in totally different walsenders, even after a
restart. Finding a consistent point to start decoding from can take some
time (basically you need to wait for any old snapshots to finish), so
you don't want to do that every time you disconnect, as you would lose
updates in between.
So what you do is to do INIT_LOGICAL_REPLICATION *once* when you set up a
new replica. And then you only do START_LOGICAL_REPLICATION 'slot-id'
'position'; afterwards.
Obviously that needs some work, since we're not yet persisting enough
between restarts... As I said above, that's what I am working on next.
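In other words, the lifecycle is INIT_LOGICAL_REPLICATION once per replica, then any number of START_LOGICAL_REPLICATION 'slot' 'position' invocations across reconnects. A toy state model of a slot (hypothetical names, not the walsender code):

```c
typedef enum { SLOT_FREE, SLOT_INITIALIZED, SLOT_STREAMING } SlotState;

typedef struct LogicalSlot
{
    SlotState state;
    long      confirmed_pos;    /* restart position kept across reconnects */
} LogicalSlot;

int slot_init(LogicalSlot *slot)
{
    if (slot->state != SLOT_FREE)
        return 0;               /* INIT happens exactly once per replica */
    slot->state = SLOT_INITIALIZED;
    slot->confirmed_pos = 0;
    return 1;
}

long slot_start(LogicalSlot *slot, long from_pos)
{
    if (slot->state == SLOT_FREE)
        return -1;              /* must have been initialised first */
    if (from_pos < slot->confirmed_pos)
        from_pos = slot->confirmed_pos; /* never redo already-confirmed data */
    slot->state = SLOT_STREAMING;
    return from_pos;            /* where streaming actually resumes */
}

void slot_confirm(LogicalSlot *slot, long pos)
{
    if (pos > slot->confirmed_pos)
        slot->confirmed_pos = pos;      /* client confirmed a flush */
}

void slot_disconnect(LogicalSlot *slot)
{
    slot->state = SLOT_INITIALIZED;     /* the slot survives disconnects */
}
```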
Minor point: This is a terrible name for the variable in question:
+ LogicalWalSnd *walsnd;
Why? As long as the struct is called "LogicalWalSnd" it seems accurate
enough.
I think LogicalWalSnds should emancipate themselves and be separate from
the walsender; when that has happened it's obviously a bad name.
Miscellaneous thoughts
======================
You're still using C stdlib functions like malloc, free, calloc quite
a bit. My concern is that this points to a lack of thought about the
memory management strategy; why are you still not using memory
contexts in some places? If it's so difficult to anticipate what
clients of, say, XLogReaderAllocate() want for the lifetime of their
memory, then likely as not those clients should be doing their own
memory allocation, and passing the allocated buffer directly. If it is
obvious that the memory ought to persist indefinitely (and I think
it's your contention that it is in the case of XLogReaderAllocate()),
I'd just allocate it in the top memory context. Now, I am aware that
there are a trivial number of backend mallocs that you can point to as
precedent here, but I'm still not satisfied with your explanation for
using malloc(). At the very least, you ought to be handling the case
where malloc returns NULL, and you're not doing so consistently.
There are different categories here. XLogReader *has* to use malloc
instead of the pg infrastructure since it needs to be usable by xlogdump
which doesn't have the pg memory infrastructure.
I would like reorderbuffer.c to stay usable outside the backend as well
(primarily for a printing tool of the spooled changes).
In contrast, the use case for snapbuild.c outside the backend is pretty
slim, so it can probably grow its own memory context and use that.
Minor gripes:
* There is no need to use a *.txt extension for README files; we don't
currently use those anywhere else.
It makes it easier for me to have a generic rule to transcode them into
a different format; that's why they have it...
* If you only credit the PGDG and not the Berkeley guys (as you
should, for the most part), there is no need to phrase the notice
“Portions Copyright...”. You should just say “Copyright...”.
ok.
* You're still calling function pointer typedefs things like
LogicalDecodeInitCB. As I've already pointed out, you should prefer
the existing conventions (call it something like
LogicalDecodeInit_hook_type).
I still think CB is way better than _hook_type, because it's not a hook
here, it's a callback. A hook intercepts normal operation; that's not
what happens here. Note that we already use *CallBack in various
places. If you prefer the longer form, ok, I can do that.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2012-12-13 18:29:00 +0100, Andres Freund wrote:
On 2012-12-13 00:05:41 +0000, Peter Geoghegan wrote:
Initialisation means finding a free “logical slot” from shared memory,
then looping until the new magic xmin horizon for logical walsenders
(stored in their “slot”) is that of the weakest link (think local
global xmin).
+ * FIXME: think about solving the race conditions in a nicer way.
+ */
+ recompute_xmin:
+ walsnd->xmin = GetOldestXmin(true, true);
+ ComputeLogicalXmin();
+ if (walsnd->xmin != GetOldestXmin(true, true))
+     goto recompute_xmin;
Apart from the race conditions that I'm not confident are addressed
here, I think that the above could easily get stuck indefinitely in
the event of contention.
I don't like that part the slightest bit but I don't think its actually
in danger of looping forever. In fact I think its so broken it won't
ever loop ;). (ComputeLogicalXmin() will set the current global minimum
which will then be returned by GetOldestXmin()).
I would like to solve this properly without copying GetOldestXmin once
more (so we can compute and set the logical xmin while holding
ProcArrayLock), but I am not yet sure how a nice way to do that would
look like.
I guess GetOldestXminNoLock? That gets called while we already hold the
procarray lock? Yuck.
Does anybody have an opinion on the attached patches? Especially 0001,
which contains the procarray changes?
It moves a computation of the sort of:
result -= vacuum_defer_cleanup_age;
if (!TransactionIdIsNormal(result))
result = FirstNormalTransactionId;
inside ProcArrayLock. But I can't really imagine that to be relevant...
Another alternative to this would be to get a snapshot with
GetSnapshotData(), copy the xmin to the logical slot, then call
ProcArrayEndTransaction(). But that doesn't really seem to be nicer to
me.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
Attachment: 0001-Add-GetOldestXminNoLock-as-a-variant-and-implementat.patch (text/x-patch; charset=us-ascii)
From 660b6995c0260eb112d7b6f7158e3b4654c3d9bf Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 13 Dec 2012 20:47:57 +0100
Subject: [PATCH 1/2] Add GetOldestXminNoLock as a variant (and
implementation) of GetOldestXmin
This is useful because it allows to compute the current OldestXmin while
already holding the procarray lock which enables setting the own xmin horizon
safely.
---
src/backend/storage/ipc/procarray.c | 30 +++++++++++++++++++++---------
src/include/storage/procarray.h | 1 +
2 files changed, 22 insertions(+), 9 deletions(-)
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 985350e..a704c9c 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1104,6 +1104,27 @@ TransactionIdIsActive(TransactionId xid)
TransactionId
GetOldestXmin(bool allDbs, bool ignoreVacuum)
{
+ TransactionId res;
+
+ /* Cannot look for individual databases during recovery */
+ Assert(allDbs || !RecoveryInProgress());
+
+ LWLockAcquire(ProcArrayLock, LW_SHARED);
+ res = GetOldestXminNoLock(allDbs, ignoreVacuum);
+ LWLockRelease(ProcArrayLock);
+ return res;
+}
+
+/*
+ * GetOldestXminNoLock -- worker routine for GetOldestXmin and others
+ *
+ * Requires ProcArrayLock to be already locked!
+ *
+ * Check GetOldestXmin for the semantics of this.
+ */
+TransactionId
+GetOldestXminNoLock(bool allDbs, bool ignoreVacuum)
+{
ProcArrayStruct *arrayP = procArray;
TransactionId result;
int index;
@@ -1111,8 +1132,6 @@ GetOldestXmin(bool allDbs, bool ignoreVacuum)
/* Cannot look for individual databases during recovery */
Assert(allDbs || !RecoveryInProgress());
- LWLockAcquire(ProcArrayLock, LW_SHARED);
-
/*
* We initialize the MIN() calculation with latestCompletedXid + 1. This
* is a lower bound for the XIDs that might appear in the ProcArray later,
@@ -1174,8 +1193,6 @@ GetOldestXmin(bool allDbs, bool ignoreVacuum)
*/
TransactionId kaxmin = KnownAssignedXidsGetOldestXmin();
- LWLockRelease(ProcArrayLock);
-
if (TransactionIdIsNormal(kaxmin) &&
TransactionIdPrecedes(kaxmin, result))
result = kaxmin;
@@ -1183,11 +1200,6 @@ GetOldestXmin(bool allDbs, bool ignoreVacuum)
else
{
/*
- * No other information needed, so release the lock immediately.
- */
- LWLockRelease(ProcArrayLock);
-
- /*
* Compute the cutoff XID by subtracting vacuum_defer_cleanup_age,
* being careful not to generate a "permanent" XID.
*
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 9933dad..ce8d98b 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -50,6 +50,7 @@ extern RunningTransactions GetRunningTransactionData(void);
extern bool TransactionIdIsInProgress(TransactionId xid);
extern bool TransactionIdIsActive(TransactionId xid);
extern TransactionId GetOldestXmin(bool allDbs, bool ignoreVacuum);
+extern TransactionId GetOldestXminNoLock(bool allDbs, bool ignoreVacuum);
extern TransactionId GetOldestActiveTransactionId(void);
extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
--
1.7.12.289.g0ce9864.dirty
Attachment: 0002-wal-decoding-Use-GetOldestXminNoLock-to-compute-the-.patch (text/x-patch; charset=us-ascii)
From 3f5bf1e2fad99081edfcb31601398af3b953cf15 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 13 Dec 2012 20:49:59 +0100
Subject: [PATCH 2/2] wal decoding: Use GetOldestXminNoLock to compute the
initial logical xmin safely
---
src/backend/replication/walsender.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c1ec0a3..2204c7a 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -544,16 +544,12 @@ InitLogicalReplication(InitLogicalReplicationCmd *cmd)
* cannot go backwards anymore, as ComputeLogicalXmin() nails the value
* down.
*
- * We need to do this *after* releasing the spinlock, otherwise
- * GetOldestXmin will deadlock with ourselves.
- *
- * FIXME: think about solving the race conditions in a nicer way.
+ * FIXME: this should probably be in procarray.c?
*/
-recompute_xmin:
- walsnd->xmin = GetOldestXmin(true, true);
+ LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+ walsnd->xmin = GetOldestXminNoLock(true, true);
+ LWLockRelease(ProcArrayLock);
ComputeLogicalXmin();
- if (walsnd->xmin != GetOldestXmin(true, true))
- goto recompute_xmin;
decoding_ctx = AllocSetContextCreate(TopMemoryContext,
"decoding context",
@@ -1133,8 +1129,10 @@ ProcessStandbyReplyMessage(void)
}
/*
- * Do an unlocked check for candidate_xmin first.
+ * Advance our local xmin horizin when the client confirmed a flush.
*/
+
+ /* Do an unlocked check for candidate_xmin first.*/
if (MyLogicalWalSnd &&
TransactionIdIsValid(MyLogicalWalSnd->candidate_xmin))
{
--
1.7.12.289.g0ce9864.dirty
On Thu, Dec 13, 2012 at 3:03 PM, Andres Freund <andres@2ndquadrant.com> wrote:
It moves a computation of the sort of:
result -= vacuum_defer_cleanup_age;
if (!TransactionIdIsNormal(result))
result = FirstNormalTransactionId;
inside ProcArrayLock. But I can't really imagine that to be relevant...
I can. Go look at some of the 9.2 optimizations around
GetSnapshotData(). Those made a BIG difference under heavy
concurrency and they were definitely micro-optimization. For example,
the introduction of NormalTransactionIdPrecedes() was shockingly
effective.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2012-12-13 17:29:06 -0500, Robert Haas wrote:
On Thu, Dec 13, 2012 at 3:03 PM, Andres Freund <andres@2ndquadrant.com> wrote:
It moves a computation of the sort of:
result -= vacuum_defer_cleanup_age;
if (!TransactionIdIsNormal(result))
result = FirstNormalTransactionId;
inside ProcArrayLock. But I can't really imagine that to be relevant...
I can. Go look at some of the 9.2 optimizations around
GetSnapshotData(). Those made a BIG difference under heavy
concurrency and they were definitely micro-optimization. For example,
the introduction of NormalTransactionIdPrecedes() was shockingly
effective.
But GetOldestXmin() should be called less frequently than
GetSnapshotData() by several orders of magnitudes. I don't really see
it being used in any really hot code paths?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 13 December 2012 22:37, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-12-13 17:29:06 -0500, Robert Haas wrote:
On Thu, Dec 13, 2012 at 3:03 PM, Andres Freund <andres@2ndquadrant.com> wrote:
It moves a computation of the sort of:
result -= vacuum_defer_cleanup_age;
if (!TransactionIdIsNormal(result))
result = FirstNormalTransactionId;
inside ProcArrayLock. But I can't really imagine that to be relevant...
I can. Go look at some of the 9.2 optimizations around
GetSnapshotData(). Those made a BIG difference under heavy
concurrency and they were definitely micro-optimization. For example,
the introduction of NormalTransactionIdPrecedes() was shockingly
effective.
But GetOldestXmin() should be called less frequently than
GetSnapshotData() by several orders of magnitudes. I don't really see
it being used in any really hot code paths?
Maybe, but that calculation doesn't *need* to be inside the lock, that
is just a consequence of the current coding.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Dec 14, 2012 at 2:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Dec 13, 2012 at 3:03 PM, Andres Freund <andres@2ndquadrant.com>
wrote:
It moves a computation of the sort of:
result -= vacuum_defer_cleanup_age;
if (!TransactionIdIsNormal(result))
result = FirstNormalTransactionId;
inside ProcArrayLock. But I can't really imagine that to be relevant...
I can. Go look at some of the 9.2 optimizations around
GetSnapshotData(). Those made a BIG difference under heavy
concurrency and they were definitely micro-optimization. For example,
the introduction of NormalTransactionIdPrecedes() was shockingly
effective.
The two commits coming to my mind are:
- ed0b409 (Separate PGPROC into PGPROC and PGXACT)
- 0d76b60 (introduction of NormalTransactionIdPrecedes)
Those ones really improved concurrency performance.
--
Michael Paquier
http://michael.otacoo.com
On 2012-12-13 23:35:00 +0000, Simon Riggs wrote:
On 13 December 2012 22:37, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-12-13 17:29:06 -0500, Robert Haas wrote:
On Thu, Dec 13, 2012 at 3:03 PM, Andres Freund <andres@2ndquadrant.com> wrote:
It moves a computation of the sort of:
result -= vacuum_defer_cleanup_age;
if (!TransactionIdIsNormal(result))
result = FirstNormalTransactionId;
inside ProcArrayLock. But I can't really imagine that to be relevant...
I can. Go look at some of the 9.2 optimizations around
GetSnapshotData(). Those made a BIG difference under heavy
concurrency and they were definitely micro-optimization. For example,
the introduction of NormalTransactionIdPrecedes() was shockingly
effective.
But GetOldestXmin() should be called less frequently than
GetSnapshotData() by several orders of magnitudes. I don't really see
it being used in any really hot code paths?
Maybe, but that calculation doesn't *need* to be inside the lock, that
is just a consequence of the current coding.
I am open to suggestions for how to do that in a way where we a) can
hold the lock already (to safely nail the global xmin to the current
value) and b) don't duplicate all the code.
Just moving that tidbit inside the lock seems to be the pragmatic
choice. GetOldestXmin is called
* once per checkpoint
* once per index build
* once in analyze
* twice per vacuum
* once for HS feedback messages
Nothing of that occurs frequently enough that 5 instructions will make a
difference. I would be happy to go an alternative path, but right now I
don't see any nice one. An "already_locked" parameter to GetOldestXmin
seems to be a cure worse than the disease.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Dec 14, 2012 at 6:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Just moving that tidbit inside the lock seems to be the pragmatic
choice. GetOldestXmin is called
* once per checkpoint
* one per index build
* once in analyze
* twice per vacuum
* once for HS feedback messages
Nothing of that occurs frequently enough that 5 instructions will make a
difference. I would be happy to go an alternative path, but right now I
don't see any nice one. A "already_locked" parameter to GetOldestXmin
seems to be a cure worse than the disease.
I'm not sure that would be so bad, but I guess I question the need to
do it this way at all. Most of the time, if you need to advertise
your global xmin, you use GetSnapshotData(), not GetOldestXmin(), and
I guess I'm not seeing why that wouldn't also work here. Am I dumb?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2012-12-14 14:01:30 -0500, Robert Haas wrote:
On Fri, Dec 14, 2012 at 6:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Just moving that tidbit inside the lock seems to be the pragmatic
choice. GetOldestXmin is called
* once per checkpoint
* one per index build
* once in analyze
* twice per vacuum
* once for HS feedback messages
Nothing of that occurs frequently enough that 5 instructions will make a
difference. I would be happy to go an alternative path, but right now I
don't see any nice one. A "already_locked" parameter to GetOldestXmin
seems to be a cure worse than the disease.
I'm not sure that would be so bad, but I guess I question the need to
do it this way at all. Most of the time, if you need to advertise
your global xmin, you use GetSnapshotData(), not GetOldestXmin(), and
I guess I'm not seeing why that wouldn't also work here. Am I dumb?
I wondered upthread whether that would be better:
On 2012-12-13 21:03:44 +0100, Andres Freund wrote:
Another alternative to this would be to get a snapshot with
GetSnapshotData(), copy the xmin to the logical slot, then call
ProcArrayEndTransaction(). But that doesn't really seem to be nicer to
me.
Not sure why I considered it ugly anymore, but it actually has a
noticeable disadvantage. GetOldestXmin is nicer than GetSnapshotData,
as the latter sets a fairly new xid as xmin, whereas GetOldestXmin
returns the actual current xmin horizon. That's preferable because it
allows us to start up more quickly. snapbuild.c can only start building
a snapshot once it has seen an xl_running_xact with oldestRunningXid >=
own_xmin. Otherwise we cannot be sure that no relevant catalog tuples
have been removed.
This also made me notice that my changes to GetSnapshotData were quite
pessimal... I set the xmin of the new snapshot to the "logical xmin"
instead of doing it only for globalxmin/RecentGlobalXmin.
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-15 01:19:26 +0100, Andres Freund wrote:
On 2012-12-14 14:01:30 -0500, Robert Haas wrote:
On Fri, Dec 14, 2012 at 6:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Just moving that tidbit inside the lock seems to be the pragmatic
choice. GetOldestXmin is called
* once per checkpoint
* one per index build
* once in analyze
* twice per vacuum
* once for HS feedback messages
Nothing of that occurs frequently enough that 5 instructions will make a
difference. I would be happy to go an alternative path, but right now I
don't see any nice one. A "already_locked" parameter to GetOldestXmin
seems to be a cure worse than the disease.
I'm not sure that would be so bad, but I guess I question the need to
do it this way at all. Most of the time, if you need to advertise
your global xmin, you use GetSnapshotData(), not GetOldestXmin(), and
I guess I'm not seeing why that wouldn't also work here. Am I dumb?
I wondered upthread whether that would be better:
On 2012-12-13 21:03:44 +0100, Andres Freund wrote:
Another alternative to this would be to get a snapshot with
GetSnapshotData(), copy the xmin to the logical slot, then call
ProcArrayEndTransaction(). But that doesn't really seem to be nicer to
me.
Not sure why I considered it ugly anymore, but it actually has a
noticeable disadvantage. GetOldestXmin is nicer is than GetSnapshotData
as the latter set a fairly new xid as xmin whereas GetOldestXmin returns
the actual current xmin horizon. Thats preferrable because it allows us
to start up more quickly. snapbuild.c can only start building a snapshot
once it has seen a xl_running_xact with oldestRunningXid >=
own_xmin. Otherwise we cannot be sure that no relevant catalog tuples
have been removed.
Hm. One way that could work with fewer changes is to exploit the fact
that a) it seems to be possible to acquire a shared lwlock twice in the
same backend and b) both GetOldestXmin & GetSnapshotData acquire only a
shared lwlock.
Are we willing to guarantee that recursive acquisition of shared lwlocks
continues to work?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 13 December 2012 20:03, Andres Freund <andres@2ndquadrant.com> wrote:
Does anybody have an opinion on the attached patches? Especially 0001,
which contains the procarray changes?

It moves a computation of the sort of:
result -= vacuum_defer_cleanup_age;
if (!TransactionIdIsNormal(result))
    result = FirstNormalTransactionId;

inside ProcArrayLock. But I can't really imagine that to be relevant...
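The computation quoted above can be exercised standalone; a sketch with the typedef and macros stubbed in (these stubs mirror the definitions in transam.h, but the function name here is a hypothetical wrapper, not the server's):

```c
#include <stdint.h>

typedef uint32_t TransactionId;
#define FirstNormalTransactionId ((TransactionId) 3)
#define TransactionIdIsNormal(xid) ((xid) >= FirstNormalTransactionId)

/*
 * Move the xmin horizon back by vacuum_defer_cleanup_age, clamping so
 * it never lands on a reserved xid (Invalid/Bootstrap/Frozen, 0..2).
 */
TransactionId
defer_cleanup_horizon(TransactionId result, uint32_t vacuum_defer_cleanup_age)
{
    result -= vacuum_defer_cleanup_age;
    if (!TransactionIdIsNormal(result))
        result = FirstNormalTransactionId;
    return result;
}
```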
I don't see why this is hard.
Just make the lock acquisition/release conditional on another parameter.
That way the only thing you'll be moving inside the lock is an if test
on a constant boolean.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-16 16:44:04 +0000, Simon Riggs wrote:
On 13 December 2012 20:03, Andres Freund <andres@2ndquadrant.com> wrote:
Does anybody have an opinion on the attached patches? Especially 0001,
which contains the procarray changes?

It moves a computation of the sort of:
result -= vacuum_defer_cleanup_age;
if (!TransactionIdIsNormal(result))
    result = FirstNormalTransactionId;

inside ProcArrayLock. But I can't really imagine that to be relevant...
I don't see why this is hard.
Just make the lock acquisition/release conditional on another parameter.
That way the only thing you'll be moving inside the lock is an if test
on a constant boolean.
That's not really cheaper. Two branches + an additional parameter
passed/pushed vs one branch, one subtraction, two assignments is a
close call.
As I don't think either really matters in the GetOldestXmin case, I
would be happy with that as well. If people prefer an additional
parameter + adjusting the few call sites vs. a separate function I will go
that way.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Dec 14, 2012 at 7:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-12-14 14:01:30 -0500, Robert Haas wrote:
On Fri, Dec 14, 2012 at 6:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Just moving that tidbit inside the lock seems to be the pragmatic
choice. GetOldestXmin is called:
* once per checkpoint
* once per index build
* once in analyze
* twice per vacuum
* once for HS feedback messages

Nothing of that occurs frequently enough that 5 instructions will make a
difference. I would be happy to go an alternative path, but right now I
don't see any nice one. An "already_locked" parameter to GetOldestXmin
seems to be a cure worse than the disease.

I'm not sure that would be so bad, but I guess I question the need to
do it this way at all. Most of the time, if you need to advertise
your global xmin, you use GetSnapshotData(), not GetOldestXmin(), and
I guess I'm not seeing why that wouldn't also work here. Am I dumb?

I wondered upthread whether that would be better:
On 2012-12-13 21:03:44 +0100, Andres Freund wrote:
Another alternative to this would be to get a snapshot with
GetSnapshotData(), copy the xmin to the logical slot, then call
ProcArrayEndTransaction(). But that doesn't really seem to be nicer to
me.

Not sure why I considered it ugly anymore, but it actually has a
noticeable disadvantage. GetOldestXmin is nicer than GetSnapshotData,
as the latter sets a fairly new xid as xmin whereas GetOldestXmin returns
the actual current xmin horizon. That's preferable because it allows us
to start up more quickly. snapbuild.c can only start building a snapshot
once it has seen an xl_running_xact with oldestRunningXid >=
own_xmin. Otherwise we cannot be sure that no relevant catalog tuples
have been removed.
I'm a bit confused. Are you talking about the difference between
RecentGlobalXmin and RecentXmin? I think GetSnapshotData() updates
both.
Anyway, if there's no nicer way, I think it's probably OK to add a
parameter to GetOldestXmin(). It seems like kind of a hack, though.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
Robert Haas <robertmhaas@gmail.com> schrieb:
On Fri, Dec 14, 2012 at 7:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-12-14 14:01:30 -0500, Robert Haas wrote:
On Fri, Dec 14, 2012 at 6:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Just moving that tidbit inside the lock seems to be the pragmatic
choice. GetOldestXmin is called:
* once per checkpoint
* once per index build
* once in analyze
* twice per vacuum
* once for HS feedback messages

Nothing of that occurs frequently enough that 5 instructions will make a
difference. I would be happy to go an alternative path, but right now I
don't see any nice one. An "already_locked" parameter to GetOldestXmin
seems to be a cure worse than the disease.

I'm not sure that would be so bad, but I guess I question the need to
do it this way at all. Most of the time, if you need to advertise
your global xmin, you use GetSnapshotData(), not GetOldestXmin(), and
I guess I'm not seeing why that wouldn't also work here. Am I dumb?

I wondered upthread whether that would be better:
On 2012-12-13 21:03:44 +0100, Andres Freund wrote:
Another alternative to this would be to get a snapshot with
GetSnapshotData(), copy the xmin to the logical slot, then call
ProcArrayEndTransaction(). But that doesn't really seem to be nicer to
me.

Not sure why I considered it ugly anymore, but it actually has a
noticeable disadvantage. GetOldestXmin is nicer than GetSnapshotData,
as the latter sets a fairly new xid as xmin whereas GetOldestXmin returns
the actual current xmin horizon. That's preferable because it allows us
to start up more quickly. snapbuild.c can only start building a snapshot
once it has seen an xl_running_xact with oldestRunningXid >=
own_xmin. Otherwise we cannot be sure that no relevant catalog tuples
have been removed.

I'm a bit confused. Are you talking about the difference between
RecentGlobalXmin and RecentXmin? I think GetSnapshotData() updates
both.
The problem is that at the time GetSnapshotData returns the xmin horizon might have gone upwards and tuples required for decoding might get removed by other backends. That needs to be prevented while holding the procarray lock exclusively.
Does it make more sense now?
Andres
---
Please excuse the brevity and formatting - I am writing this on my mobile phone.
On Tue, Dec 18, 2012 at 5:25 PM, anarazel@anarazel.de
<andres@anarazel.de> wrote:
The problem is that at the time GetSnapshotData returns the xmin horizon might have gone upwards and tuples required for decoding might get removed by other backends. That needs to be prevented while holding the procarray lock exclusively.
Well, for the ordinary use of GetSnapshotData(), that doesn't matter,
because GetSnapshotData() also updates proc->xmin. If you're trying
to store a different value in that field then of course it matters.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2012-12-18 19:56:18 -0500, Robert Haas wrote:
On Tue, Dec 18, 2012 at 5:25 PM, anarazel@anarazel.de
<andres@anarazel.de> wrote:
The problem is that at the time GetSnapshotData returns the xmin horizon might have gone upwards and tuples required for decoding might get removed by other backends. That needs to be prevented while holding the procarray lock exclusively.
Well, for the ordinary use of GetSnapshotData(), that doesn't matter,
because GetSnapshotData() also updates proc->xmin. If you're trying
to store a different value in that field then of course it matters.
Absolutely right. I don't want to say there's anything wrong with it
right now. The "problem" for me is that it sets proc->xmin to the newest
value it can while I want/need the oldest valid value...
I will go with adding an already_locked parameter to GetOldestXmin then.
Thanks for the input,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Dec 18, 2012 at 7:59 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-12-18 19:56:18 -0500, Robert Haas wrote:
On Tue, Dec 18, 2012 at 5:25 PM, anarazel@anarazel.de
<andres@anarazel.de> wrote:
The problem is that at the time GetSnapshotData returns the xmin horizon might have gone upwards and tuples required for decoding might get removed by other backends. That needs to be prevented while holding the procarray lock exclusively.
Well, for the ordinary use of GetSnapshotData(), that doesn't matter,
because GetSnapshotData() also updates proc->xmin. If you're trying
to store a different value in that field then of course it matters.

Absolutely right. I don't want to say there's anything wrong with it
right now. The "problem" for me is that it sets proc->xmin to the newest
value it can while I want/need the oldest valid value...

I will go with adding an already_locked parameter to GetOldestXmin then.
Or instead of bool already_locked, maybe bool advertise_xmin? Seems
like that might be more friendly to the abstraction boundaries.
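A rough sketch of what the extra parameter buys: with the lock already held, the horizon cannot advance between computing the value and advertising it. The lock, variable, and function names below are stand-ins (the real code takes ProcArrayLock via LWLockAcquire and scans the proc array), so treat this purely as shape, not as the actual patch:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

static pthread_mutex_t proc_array_lock = PTHREAD_MUTEX_INITIALIZER;
static TransactionId oldest_xmin = 100;   /* placeholder horizon */

/*
 * With already_locked = true the caller holds the lock across the whole
 * compute-and-advertise sequence, so no other backend can raise the
 * horizon in between.
 */
TransactionId
get_oldest_xmin(bool already_locked)
{
    TransactionId result;

    if (!already_locked)
        pthread_mutex_lock(&proc_array_lock);

    result = oldest_xmin;   /* the real function scans the proc array */

    if (!already_locked)
        pthread_mutex_unlock(&proc_array_lock);

    return result;
}
```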
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[Catching up on old threads.]
On Sat, Nov 17, 2012 at 03:40:49PM +0100, Hannu Krosing wrote:
On 11/17/2012 03:00 PM, Markus Wanner wrote:
On 11/17/2012 02:30 PM, Hannu Krosing wrote:
Is it possible to replicate UPDATEs and DELETEs without a primary key in
PostgreSQL-R

No. There must be some way to logically identify the tuple.
It can be done as selecting on _all_ attributes and updating/deleting
just the first matching row

create cursor ...
select from t ... where t.* = (....)
fetch one ...
delete where current of ...

This is on distant (round 3 or 4) roadmap for this work, just was
interested
if you had found any better way of doing this :)
That only works if every attribute's type has a notion of equality ("xml" does
not). The equality operator may have a name other than "=", and an operator
named "=" may exist with semantics other than equality ("box" is affected).
Code attempting this replication strategy should select an equality operator
the way typcache.c does so.
On 01/13/2013 12:28 AM, Noah Misch wrote:
[Catching up on old threads.]
On Sat, Nov 17, 2012 at 03:40:49PM +0100, Hannu Krosing wrote:
On 11/17/2012 03:00 PM, Markus Wanner wrote:
On 11/17/2012 02:30 PM, Hannu Krosing wrote:
Is it possible to replicate UPDATEs and DELETEs without a primary key in
PostgreSQL-R

No. There must be some way to logically identify the tuple.

It can be done as selecting on _all_ attributes and updating/deleting
just the first matching row

create cursor ...
select from t ... where t.* = (....)
fetch one ...
delete where current of ...

This is on distant (round 3 or 4) roadmap for this work, just was
interested
if you had found any better way of doing this :)

That only works if every attribute's type has a notion of equality ("xml" does
not). The equality operator may have a name other than "=", and an operator
named "=" may exist with semantics other than equality ("box" is affected).
Code attempting this replication strategy should select an equality operator
the way typcache.c does so.
Does this hint that PostgreSQL also needs a sameness operator
("is" or "===" in some languages)?

Or does "IS NOT DISTINCT FROM" already work even for types without
a comparison operator?
--------------
Hannu
On 01/13/2013 10:49 AM, Hannu Krosing wrote:
On 01/13/2013 12:28 AM, Noah Misch wrote:
[Catching up on old threads.]
On Sat, Nov 17, 2012 at 03:40:49PM +0100, Hannu Krosing wrote:
On 11/17/2012 03:00 PM, Markus Wanner wrote:
On 11/17/2012 02:30 PM, Hannu Krosing wrote:
Is it possible to replicate UPDATEs and DELETEs without a primary key in
PostgreSQL-R

No. There must be some way to logically identify the tuple.

It can be done as selecting on _all_ attributes and updating/deleting
just the first matching row

create cursor ...
select from t ... where t.* = (....)
fetch one ...
delete where current of ...

This is on distant (round 3 or 4) roadmap for this work, just was
interested if you had found any better way of doing this :)

That only works if every attribute's type has a notion of equality ("xml" does
not). The equality operator may have a name other than "=", and an operator
named "=" may exist with semantics other than equality ("box" is affected).
Code attempting this replication strategy should select an equality operator
the way typcache.c does.

Does this hint that PostgreSQL also needs a sameness operator
("is" or "===" in some languages)?

Or does "IS NOT DISTINCT FROM" already work even for types without
a comparison operator?
Just checked - it does not; it still looks for an "=" operator, so it is
just equality-with-NULLs.

How do people feel about adding a real sameness operator?
Hannu
--------------
Hannu
Hannu Krosing <hannu@2ndQuadrant.com> writes:
How do people feel about adding a real sameness operator ?
Just begs the question of "what's sameness?"
In many places we consider a datatype's default btree equality operator
to define sameness, but not all types provide a btree opclass (in
particular, anything that hasn't got a sensible one-dimensional sort
order will not). And some do but it doesn't represent anything that
anyone would want to consider "sameness" --- IIRC, some of the geometric
types provide btree opclasses that sort by area. Even for apparently
simple types like float8 there are interesting questions like whether
minus zero is the same as plus zero.
The messiness here is not just due to lack of a notation.
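The minus-zero point is easy to check in plain C, where numeric equality and bitwise sameness genuinely disagree (a standalone sketch, not server code; the function names are made up):

```c
#include <stdbool.h>
#include <string.h>

/* Numeric equality: +0.0 and -0.0 compare equal under IEEE 754. */
bool float8_eq(double a, double b)
{
    return a == b;
}

/* Bitwise "sameness": +0.0 and -0.0 differ in the sign bit. */
bool float8_bitwise_same(double a, double b)
{
    return memcmp(&a, &b, sizeof(double)) == 0;
}
```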
regards, tom lane
On 01/13/2013 12:30 PM, Hannu Krosing wrote:
On 01/13/2013 10:49 AM, Hannu Krosing wrote:
Does this hint that PostgreSQL also needs a sameness operator
("is" or "===" in some languages)?

How do people feel about adding a real sameness operator?
We'd need to define what "sameness" means. If this goes toward "exact
match in binary representation", this gets a thumbs-up from me.
As a first step in that direction, I'd see adjusting send() and recv()
functions to use a portable binary format. A "sameness" operator could
then be implemented by simply comparing two values' send() outputs.
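A hedged sketch of that idea: serialize each value into a fixed, portable byte order and compare the bytes. This is a toy int32 "send" written for illustration, not the server's actual pq send functions:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy send(): write an int32 in network (big-endian) byte order. */
void int32_send(int32_t v, unsigned char out[4])
{
    uint32_t u = (uint32_t) v;
    out[0] = (unsigned char) (u >> 24);
    out[1] = (unsigned char) (u >> 16);
    out[2] = (unsigned char) (u >> 8);
    out[3] = (unsigned char) u;
}

/* "Sameness" as proposed: compare the portable serialized forms. */
bool int32_same(int32_t a, int32_t b)
{
    unsigned char sa[4], sb[4];

    int32_send(a, sa);
    int32_send(b, sb);
    return memcmp(sa, sb, sizeof sa) == 0;
}
```

Because the serialization is defined independently of host byte order, two machines of different endianness would agree on the bytes being compared.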
Regards
Markus Wanner
On 2013-01-13 12:44:44 -0500, Tom Lane wrote:
Hannu Krosing <hannu@2ndQuadrant.com> writes:
How do people feel about adding a real sameness operator ?
Just begs the question of "what's sameness?"
In many places we consider a datatype's default btree equality operator
to define sameness, but not all types provide a btree opclass (in
particular, anything that hasn't got a sensible one-dimensional sort
order will not). And some do but it doesn't represent anything that
anyone would want to consider "sameness" --- IIRC, some of the geometric
types provide btree opclasses that sort by area. Even for apparently
simple types like float8 there are interesting questions like whether
minus zero is the same as plus zero.

The messiness here is not just due to lack of a notation.
FWIW, *I* don't plan to support that case for now (others might); it
just seems to be too messy for far too little benefit.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 01/13/2013 06:02 PM, Markus Wanner wrote:
On 01/13/2013 12:30 PM, Hannu Krosing wrote:
On 01/13/2013 10:49 AM, Hannu Krosing wrote:
Does this hint that PostgreSQL also needs a sameness operator
("is" or "===" in some languages)?

How do people feel about adding a real sameness operator?

We'd need to define what "sameness" means. If this goes toward "exact
match in binary representation", this gets a thumbs-up from me.

As a first step in that direction, I'd see adjusting send() and recv()
functions to use a portable binary format. A "sameness" operator could
then be implemented by simply comparing two values' send() outputs.
This seems like a good definition of "sameness" to me - if the binary
images are bitwise the same, then the values are the same. And if
both are fields of the same type and NULLs, then these are also
"same".

And defining a cross-platform binary format is also a good direction
of movement in implementing this.
I'd just start with what send() and recv() on each type produces
now using GCC on 64bit Intel and move towards adjusting others
to match. For a period anything else would still be allowed, but
be "non-standard"
I have no strong opinion on typed NULLs, though I'd like them
to also be "the same" for the sake of simplicity.
As this would be non-standard anyway, I'd make a row of all nulls NOT
"be the same" as NULL
This would be much easier to explain than losing the "IS NULL"-ness at
nesting level 3 ;)
Hannu
Regards
Markus Wanner
Hannu Krosing <hannu@2ndQuadrant.com> writes:
Does this hint that PostgreSQL also needs a sameness operator
("is" or "===" in some languages)?

How do people feel about adding a real sameness operator?
Well. I would prefer it if we can bypass the need for it.
Then do we need the full range of eq, eql, equal and equalp predicates,
and would all of them allow overriding or just some?
http://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node74.html
Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 01/13/2013 12:28 AM, Noah Misch wrote:
[Catching up on old threads.]
On Sat, Nov 17, 2012 at 03:40:49PM +0100, Hannu Krosing wrote:
On 11/17/2012 03:00 PM, Markus Wanner wrote:
On 11/17/2012 02:30 PM, Hannu Krosing wrote:
Is it possible to replicate UPDATEs and DELETEs without a primary key in
PostgreSQL-R

No. There must be some way to logically identify the tuple.

It can be done as selecting on _all_ attributes and updating/deleting
just the first matching row

create cursor ...
select from t ... where t.* = (....)
fetch one ...
delete where current of ...

This is on distant (round 3 or 4) roadmap for this work, just was
interested
if you had found any better way of doing this :)

That only works if every attribute's type has a notion of equality ("xml" does
not). The equality operator may have a name other than "=", and an operator
named "=" may exist with semantics other than equality ("box" is affected).
Code attempting this replication strategy should select an equality operator
the way typcache.c does so.
A method for making this work as PostgreSQL works now would be to
compare "textual representations" of tuples
create cursor ...
select from t ... where t::text = '(<image of original row>)'
fetch one ...
delete where current of ...
But of course having an operator for "sameness" without needing to convert to text
would be better
----------------
Hannu
On 01/13/2013 08:06 PM, Dimitri Fontaine wrote:
Hannu Krosing <hannu@2ndQuadrant.com> writes:
Does this hint that PostgreSQL also needs a sameness operator
("is" or "===" in some languages)?

How do people feel about adding a real sameness operator?
Well. I would prefer it if we can bypass the need for it.
What is actually sufficient for the current problem is sameness
which compares the outputs of the types' output functions and also
considers NULLs to be the same.
The reason for not providing equality for xml was not that two xml
documents which compare equal as text could be considered unequal in
any sense, but that there are other textual representations
of the same xml which could also be considered equal, like
different whitespace between tag and attribute.
Then do we need the full range of eq, eql, equal and equalp predicates,
and would all of them allow overriding or just some?
I consider sameness as basic a thing as IS NULL, so the sameness
should not be overridable. Extending IS NOT DISTINCT FROM to
do this comparison instead of the current '=' seems reasonable.
That is
SELECT '<tag/>'::xml IS DISTINCT FROM '<tag />'::xml
should return TRUE as long as the internal representations of the
two differ, even after you add an equality operator to xml
which compares some canonical form of xml and thus would make
SELECT '<tag/>'::xml = '<tag />'::xml ;
be TRUE.
Regards,
Hannu
http://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node74.html
Regards,
On 01/13/2013 09:04 PM, Hannu Krosing wrote:
I'd just start with what send() and recv() on each type produces
now using GCC on 64bit Intel and move towards adjusting others
to match. For a period anything else would still be allowed, but
be "non-standard"
Intel being little endian seems like a bad choice to me, given that
send/recv kind of implies network byte ordering. I'd rather not tie this
to any particular processor architecture at all (at least not solely on
the ground that it's the most common one at the time).
I have no strong opinion on "sameness" of NULLs and could also imagine
that to throw some kind of invalid operation error. Based on the ground
that neither is a value and it's unclear what send() method to use at all.
FWIW, trying to determine the length of a sent NULL gives an interesting
result that I don't currently understand.
psql (9.2.2)
Type "help" for help.

postgres=# SELECT length(int4send(NULL));
 length
--------

(1 row)

postgres=# SELECT length(float4send(NULL));
 length
--------

(1 row)

postgres=# SELECT length(textsend(NULL));
 length
--------

(1 row)

postgres=# SELECT length(textsend(NULL) || '\000'::bytea);
 length
--------

(1 row)
Regards
Markus Wanner