[RFC][PATCH] wal decoding, attempt #2
Hi,
It took me far longer than I planned, and it's not finished, but time is running
out. I would like some feedback confirming that I am not going astray at this point...
*I* think the general approach is sound and a good way forward that provides
the basic infrastructure for many (all?) of the scenarios we talked about
before.
Anyway, here is my next attempt at $TOPIC.
Let's start with a quick demo (via psql):
/* just so we keep a sensible xmin horizon */
ROLLBACK PREPARED 'f';
BEGIN;
CREATE TABLE keepalive();
PREPARE TRANSACTION 'f';
DROP TABLE IF EXISTS replication_example;
SELECT pg_current_xlog_insert_location();
CHECKPOINT;
CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text
varchar(120));
begin;
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
INSERT INTO replication_example(somedata, text) VALUES (1, 2);
commit;
ALTER TABLE replication_example ADD COLUMN bar int;
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 1, 4);
BEGIN;
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 2, 4);
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 3, 4);
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 4, NULL);
commit;
ALTER TABLE replication_example DROP COLUMN bar;
INSERT INTO replication_example(somedata, text) VALUES (3, 1);
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (3, 2);
INSERT INTO replication_example(somedata, text) VALUES (3, 3);
commit;
ALTER TABLE replication_example RENAME COLUMN text TO somenum;
INSERT INTO replication_example(somedata, somenum) VALUES (4, 1);
ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING
(somenum::int4);
INSERT INTO replication_example(somedata, somenum) VALUES (5, 1);
SELECT pg_current_xlog_insert_location();
---- Somewhat later ----
SELECT decode_xlog('0/1893D78', '0/18BE398');
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:1 somedata[int4]:1 text[varchar]:1
WARNING: tuple is: id[int4]:2 somedata[int4]:1 text[varchar]:2
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:3 somedata[int4]:2 text[varchar]:1 bar[int4]:4
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:4 somedata[int4]:2 text[varchar]:2 bar[int4]:4
WARNING: tuple is: id[int4]:5 somedata[int4]:2 text[varchar]:3 bar[int4]:4
WARNING: tuple is: id[int4]:6 somedata[int4]:2 text[varchar]:4 bar[int4]:
(null)
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:7 somedata[int4]:3 text[varchar]:1
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:8 somedata[int4]:3 text[varchar]:2
WARNING: tuple is: id[int4]:9 somedata[int4]:3 text[varchar]:3
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:10 somedata[int4]:4 somenum[varchar]:1
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:11 somedata[int4]:5 somenum[int4]:1
WARNING: COMMIT
decode_xlog
-------------
t
(1 row)
As you can see, the patchset can decode several changes made to a table even
though we used DDL on it. Not everything is handled yet, but it's a prototype
after all ;)
The way this works is:
A new component called SnapshotBuilder analyzes the xlog and builds a special
kind of Snapshot. This works in a somewhat similar way to the
KnownAssignedXids machinery for Hot Standby.
Whenever the - mostly unchanged - ApplyCache calls an 'apply_change' callback
for a single change (INSERT|UPDATE|DELETE), it locally overrides the normal
SnapshotNow semantics used for catalog access with one of the previously built
snapshots. These should behave just the same as a normal SnapshotNow would have
behaved when the tuple change was written to the xlog.
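To make the callback shape concrete, here is a hypothetical sketch of such an
apply_change callback. The ExampleChange struct and its members are
illustrative assumptions of mine, not the patch's actual ApplyCache structs:

#include "postgres.h"

/* illustrative stand-in for the patch's real change representation */
typedef enum
{
	CHANGE_INSERT,
	CHANGE_UPDATE,
	CHANGE_DELETE
} ExampleChangeAction;

typedef struct ExampleChange
{
	ExampleChangeAction action;		/* INSERT|UPDATE|DELETE */
	Oid			relfilenode;		/* relation the change belongs to */
	/* the decoded tuple data would live here */
} ExampleChange;

/*
 * Hypothetical apply_change callback. Any catalog access performed in here
 * (e.g. looking up the relation's descriptor to interpret the tuple) sees
 * the snapshot the SnapshotBuilder installed for this change, i.e. the
 * catalog contents as of the moment the change was written to the xlog.
 */
static void
example_apply_change(ExampleChange *change)
{
	switch (change->action)
	{
		case CHANGE_INSERT:
			elog(WARNING, "INSERT into relfilenode %u", change->relfilenode);
			break;
		case CHANGE_UPDATE:
			elog(WARNING, "UPDATE of relfilenode %u", change->relfilenode);
			break;
		case CHANGE_DELETE:
			elog(WARNING, "DELETE from relfilenode %u", change->relfilenode);
			break;
	}
}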
This patch doesn't provide anything that uses the new infrastructure for
anything real, but I think that's good. Let's get this into something
committable and then add new things using it!
Small overview over the individual patches that will come as separate mails:
Old, Alvaro is doing this properly right now (separate thread):
[01]: Add embedded list interface (header only)
A new piece of infrastructure (for k-way mergesort), pretty much untested,
good idea in general I think, not very interesting:
[02]: Add minimal binary heap implementation
Boring, old:
[03]: Add support for a generic wal reading facility dubbed XLogReader
Boring, old, borked:
[04]: add simple xlogdump tool
Slightly changed to use (tablespace, relfilenode), possibly similar problems
to earlier, not interesting at this point.
[05]: Add a new syscache to fetch a pg_class entry via (reltablespace, relfilenode)
Unchanged:
[06]: Log enough data into the wal to reconstruct logical changes from it if wal_level=logical
I didn't implement proper cache handling, so I need to use the big hammer...:
[07]: Make InvalidateSystemCaches public
The major piece:
[08]: has loads of deficiencies. To cite the commit message:
The snapshot building has the most critical infrastructure but misses several
important features:
* loads of docs about the internals
* improve snapshot building/distribution
* don't build them all the time, cache them
* don't increase ->xmax so slowly, it's inefficient
* refcount
* actually free them
* proper cache handling
* we can probably reuse xl_xact_commit->nmsgs
* generate new local inval messages from catalog changes?
* handle transactions with both ddl and changes
* command_id handling
* combocid logging/handling
* Add support for declaring tables as catalog tables that are not
  pg_catalog.*
* properly distribute new SnapshotNow snapshots after a transaction
  commits
* loads of testing/edge cases
* provision of a consistent snapshot for pg_dump
* spill state to disk at checkpoints
* xmin handling
The decode_xlog() function is *purely* a debugging tool that I do not want to
keep in the long run. I introduced it so we can concentrate on the topic at
hand without involving even more moving parts (see the next paragraph)...
Some parts of this I would like to only discuss later, in separate threads, to
avoid cluttering this one more than necessary:
* how do we integrate this into walsender et al
* in which format do we transport changes
* how do we always keep enough wal
I have some work on top of this that handles ComboCids and CommandIds
correctly (and thus mixed ddl/dml transactions), but it's simply not finished
enough. I am pretty sure by now that it works even with those additional
complexities.
So, I am unfortunately too tired to write more than this... It will have to
suffice. I plan to release a newer version with more documentation soon.
Comments about the approach or even the general direction of the
implementation? Questions?
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Adds a singly and a doubly linked list which can easily be embedded into other
datastructures and can be used without any additional allocations.
Problematic: It requires USE_INLINE. It could be remade to fall back to
externally defined functions if that is not available, but that hardly seems
sensible in this day and age. Besides, the speed hit would be noticeable and
it's only used in new code, which could be disabled on machines - given they
still exist - without proper support for inline functions.
---
src/include/utils/ilist.h | 253 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 253 insertions(+)
create mode 100644 src/include/utils/ilist.h
Attachment: 0001-Add-embedded-list-interface-header-only.patch
diff --git a/src/include/utils/ilist.h b/src/include/utils/ilist.h
new file mode 100644
index 0000000..03dae63
--- /dev/null
+++ b/src/include/utils/ilist.h
@@ -0,0 +1,253 @@
+#ifndef ILIST_H
+#define ILIST_H
+
+#ifdef __GNUC__
+#define unused_attr __attribute__((unused))
+#else
+#define unused_attr
+#endif
+
+#ifndef USE_INLINE
+#error "a compiler supporting static inlines is required"
+#endif
+
+#include <assert.h>
+
+typedef struct ilist_d_node ilist_d_node;
+
+struct ilist_d_node
+{
+ ilist_d_node* prev;
+ ilist_d_node* next;
+};
+
+typedef struct
+{
+ ilist_d_node head;
+} ilist_d_head;
+
+typedef struct ilist_s_node ilist_s_node;
+
+struct ilist_s_node
+{
+ ilist_s_node* next;
+};
+
+typedef struct
+{
+ ilist_s_node head;
+} ilist_s_head;
+
+#ifdef ILIST_DEBUG
+void ilist_d_check(ilist_d_head* head);
+#else
+static inline void ilist_d_check(ilist_d_head* head)
+{
+}
+#endif
+
+static inline void ilist_d_init(ilist_d_head *head)
+{
+ head->head.next = head->head.prev = &head->head;
+ ilist_d_check(head);
+}
+
+/*
+ * adds a node at the beginning of the list
+ */
+static inline void ilist_d_push_front(ilist_d_head *head, ilist_d_node *node)
+{
+ node->next = head->head.next;
+ node->prev = &head->head;
+ node->next->prev = node;
+ head->head.next = node;
+ ilist_d_check(head);
+}
+
+
+/*
+ * adds a node at the end of the list
+ */
+static inline void ilist_d_push_back(ilist_d_head *head, ilist_d_node *node)
+{
+ node->next = &head->head;
+ node->prev = head->head.prev;
+ node->prev->next = node;
+ head->head.prev = node;
+ ilist_d_check(head);
+}
+
+
+/*
+ * adds a node after another *in the same list*
+ */
+static inline void ilist_d_add_after(unused_attr ilist_d_head *head, ilist_d_node *after, ilist_d_node *node)
+{
+ node->prev = after;
+ node->next = after->next;
+ after->next = node;
+ node->next->prev = node;
+ ilist_d_check(head);
+}
+
+/*
+ * adds a node before another *in the same list*
+ */
+static inline void ilist_d_add_before(unused_attr ilist_d_head *head, ilist_d_node *before, ilist_d_node *node)
+{
+ node->prev = before->prev;
+ node->next = before;
+ before->prev = node;
+ node->prev->next = node;
+ ilist_d_check(head);
+}
+
+
+/*
+ * removes a node from a list
+ */
+static inline void ilist_d_remove(unused_attr ilist_d_head *head, ilist_d_node *node)
+{
+ ilist_d_check(head);
+ node->prev->next = node->next;
+ node->next->prev = node->prev;
+ ilist_d_check(head);
+}
+
+/*
+ * removes the first node from a list or returns NULL
+ */
+static inline ilist_d_node* ilist_d_pop_front(ilist_d_head *head)
+{
+ ilist_d_node* ret;
+
+ if (&head->head == head->head.next)
+ return NULL;
+
+ ret = head->head.next;
+ ilist_d_remove(head, head->head.next);
+ return ret;
+}
+
+
+static inline bool ilist_d_has_next(ilist_d_head *head, ilist_d_node *node)
+{
+ return node->next != &head->head;
+}
+
+static inline bool ilist_d_has_prev(ilist_d_head *head, ilist_d_node *node)
+{
+ return node->prev != &head->head;
+}
+
+static inline bool ilist_d_is_empty(ilist_d_head *head)
+{
+ return head->head.next == &head->head;
+}
+
+#define ilist_d_front(type, membername, ptr) (&((ptr)->head) == (ptr)->head.next) ? \
+ NULL : ilist_container(type, membername, (ptr)->head.next)
+
+#define ilist_d_front_unchecked(type, membername, ptr) ilist_container(type, membername, (ptr)->head.next)
+
+#define ilist_d_back(type, membername, ptr) (&((ptr)->head) == (ptr)->head.prev) ? \
+ NULL : ilist_container(type, membername, (ptr)->head.prev)
+
+#define ilist_container(type, membername, ptr) ((type*)((char*)(ptr) - offsetof(type, membername)))
+
+#define ilist_d_foreach(name, ptr) for(name = (ptr)->head.next; \
+ name != &(ptr)->head; \
+ name = name->next)
+
+#define ilist_d_foreach_modify(name, nxt, ptr) for(name = (ptr)->head.next, \
+ nxt = name->next; \
+ name != &(ptr)->head \
+ ; \
+ name = nxt, nxt = name->next)
+
+static inline void ilist_s_init(ilist_s_head *head)
+{
+ head->head.next = NULL;
+}
+
+static inline void ilist_s_push_front(ilist_s_head *head, ilist_s_node *node)
+{
+ node->next = head->head.next;
+ head->head.next = node;
+}
+
+/*
+ * fails if the list is empty
+ */
+static inline ilist_s_node* ilist_s_pop_front(ilist_s_head *head)
+{
+ ilist_s_node* front = head->head.next;
+ head->head.next = head->head.next->next;
+ return front;
+}
+
+/*
+ * removes a node from a list
+ * Attention: O(n)
+ */
+static inline void ilist_s_remove(ilist_s_head *head,
+ ilist_s_node *node)
+{
+ ilist_s_node *last = &head->head;
+ ilist_s_node *cur;
+#ifndef NDEBUG
+ bool found = false;
+#endif
+ while ((cur = last->next))
+ {
+ if (cur == node)
+ {
+ last->next = cur->next;
+#ifndef NDEBUG
+ found = true;
+#endif
+ break;
+ }
+ last = cur;
+ }
+ assert(found);
+}
+
+
+static inline void ilist_s_add_after(unused_attr ilist_s_head *head,
+ ilist_s_node *after, ilist_s_node *node)
+{
+ node->next = after->next;
+ after->next = node;
+}
+
+
+static inline bool ilist_s_is_empty(ilist_s_head *head)
+{
+ return head->head.next == NULL;
+}
+
+static inline bool ilist_s_has_next(unused_attr ilist_s_head* head,
+ ilist_s_node *node)
+{
+ return node->next != NULL;
+}
+
+
+#define ilist_s_front(type, membername, ptr) (ilist_s_is_empty(ptr) ? \
+	NULL : ilist_container(type, membername, (ptr)->head.next))
+
+#define ilist_s_front_unchecked(type, membername, ptr) \
+ ilist_container(type, membername, (ptr)->head.next)
+
+#define ilist_s_foreach(name, ptr) for(name = (ptr)->head.next; \
+ name != NULL; \
+ name = name->next)
+
+#define ilist_s_foreach_modify(name, nxt, ptr) for(name = (ptr)->head.next, \
+ nxt = name ? name->next : NULL; \
+ name != NULL; \
+ name = nxt, nxt = name ? name->next : NULL)
+
+
+#endif
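To illustrate how the interface above is meant to be used - the node is
embedded into the struct, so no separate allocations are needed - here is a
minimal sketch. It uses only functions/macros from ilist.h; the surrounding
names (MyItem etc.) are made up for illustration:

#include "postgres.h"
#include "utils/ilist.h"

typedef struct MyItem
{
	int			value;
	ilist_d_node node;			/* embedded list membership, no extra palloc */
} MyItem;

static void
ilist_example(void)
{
	ilist_d_head head;
	ilist_d_node *cur;
	MyItem		a = {1};
	MyItem		b = {2};

	ilist_d_init(&head);
	ilist_d_push_back(&head, &a.node);
	ilist_d_push_back(&head, &b.node);

	/* iterate and get back to the containing struct via ilist_container */
	ilist_d_foreach(cur, &head)
	{
		MyItem	   *item = ilist_container(MyItem, node, cur);

		elog(LOG, "value: %d", item->value);
	}
}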
This is basically untested.
---
src/backend/lib/Makefile | 2 +-
src/backend/lib/simpleheap.c | 255 +++++++++++++++++++++++++++++++++++++++++++
src/include/lib/simpleheap.h | 91 +++++++++++++++
3 files changed, 347 insertions(+), 1 deletion(-)
create mode 100644 src/backend/lib/simpleheap.c
create mode 100644 src/include/lib/simpleheap.h
Attachment: 0002-Add-minimal-binary-heap-implementation.patch
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 2e1061e..1e1bd5c 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/lib
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = dllist.o stringinfo.o
+OBJS = dllist.o simpleheap.o stringinfo.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/simpleheap.c b/src/backend/lib/simpleheap.c
new file mode 100644
index 0000000..825d0a8
--- /dev/null
+++ b/src/backend/lib/simpleheap.c
@@ -0,0 +1,255 @@
+/*-------------------------------------------------------------------------
+ *
+ * simpleheap.c
+ * A simple binary heap implementation
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/lib/simpleheap.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <math.h>
+
+#include "lib/simpleheap.h"
+
+static inline int
+simpleheap_left_off(size_t i)
+{
+ return 2 * i + 1;
+}
+
+static inline int
+simpleheap_right_off(size_t i)
+{
+ return 2 * i + 2;
+}
+
+static inline int
+simpleheap_parent_off(size_t i)
+{
+ return floor((i - 1) / 2);
+}
+
+/* sift up */
+static void
+simpleheap_sift_up(simpleheap *heap, size_t node_off);
+
+/* sift down */
+static void
+simpleheap_sift_down(simpleheap *heap, size_t node_off);
+
+static inline void
+simpleheap_swap(simpleheap *heap, size_t a, size_t b)
+{
+ simpleheap_kv swap;
+ swap.value = heap->values[a].value;
+ swap.key = heap->values[a].key;
+
+ heap->values[a].value = heap->values[b].value;
+ heap->values[a].key = heap->values[b].key;
+
+ heap->values[b].key = swap.key;
+ heap->values[b].value = swap.value;
+}
+
+/* sift down */
+static void
+simpleheap_sift_down(simpleheap *heap, size_t node_off)
+{
+ /* manually unrolled tail recursion */
+ while (true)
+ {
+ size_t left_off = simpleheap_left_off(node_off);
+ size_t right_off = simpleheap_right_off(node_off);
+ size_t swap_off = 0;
+
+ /* only one child can violate the heap property after a change */
+
+ /* check left child */
+ if (left_off < heap->size &&
+ heap->compare(&heap->values[left_off],
+ &heap->values[node_off]) < 0)
+ {
+ /* heap condition violated */
+ swap_off = left_off;
+ }
+
+ /* check right child */
+ if (right_off < heap->size &&
+ heap->compare(&heap->values[right_off],
+ &heap->values[node_off]) < 0)
+ {
+ /* heap condition violated */
+
+ /* swap with the smaller child */
+ if (!swap_off ||
+ heap->compare(&heap->values[right_off],
+ &heap->values[left_off]) < 0)
+ {
+ swap_off = right_off;
+ }
+ }
+
+ if (!swap_off)
+ {
+ /* heap condition fulfilled, abort */
+ break;
+ }
+
+ /* swap node with the child violating the property */
+ simpleheap_swap(heap, swap_off, node_off);
+
+ /* recurse, check child subtree */
+ node_off = swap_off;
+ }
+}
+
+/* sift up */
+static void
+simpleheap_sift_up(simpleheap *heap, size_t node_off)
+{
+ /* manually unrolled tail recursion */
+ while (true)
+ {
+ size_t parent_off = simpleheap_parent_off(node_off);
+
+ if (heap->compare(&heap->values[parent_off],
+ &heap->values[node_off]) < 0)
+ {
+ /* heap property violated */
+ simpleheap_swap(heap, node_off, parent_off);
+
+ /* recurse */
+ node_off = parent_off;
+ }
+ else
+ break;
+ }
+}
+
+simpleheap*
+simpleheap_allocate(size_t allocate)
+{
+ simpleheap* heap = palloc(sizeof(simpleheap));
+ heap->values = palloc(sizeof(simpleheap_kv) * allocate);
+ heap->size = 0;
+ heap->space = allocate;
+ return heap;
+}
+
+void
+simpleheap_free(simpleheap* heap)
+{
+ pfree(heap->values);
+ pfree(heap);
+}
+
+/* initial building of a heap */
+void
+simpleheap_build(simpleheap *heap)
+{
+ int i;
+
+ for (i = simpleheap_parent_off(heap->size - 1); i >= 0; i--)
+ {
+ simpleheap_sift_down(heap, i);
+ }
+}
+
+/*
+ * Change the key of the first element and restore the heap property.
+ */
+void
+simpleheap_change_key(simpleheap *heap, void* key)
+{
+ size_t next_off = 0;
+ int ret;
+ simpleheap_kv* kv;
+
+ heap->values[0].key = key;
+
+ /* no need to do anything if there is only one element */
+ if (heap->size == 1)
+ {
+ return;
+ }
+ else if (heap->size == 2)
+ {
+ next_off = 1;
+ }
+ else
+ {
+ ret = heap->compare(
+ &heap->values[simpleheap_left_off(0)],
+ &heap->values[simpleheap_right_off(0)]);
+
+ if (ret == -1)
+ next_off = simpleheap_left_off(0);
+ else
+ next_off = simpleheap_right_off(0);
+ }
+
+ /*
+ * compare with the next key. If we're still smaller we can skip
+ * restructuring the heap
+ */
+ ret = heap->compare(
+ &heap->values[0],
+ &heap->values[next_off]);
+
+ if (ret == -1)
+ return;
+
+ kv = simpleheap_remove_first(heap);
+ simpleheap_add(heap, kv->key, kv->value);
+}
+
+void
+simpleheap_add_unordered(simpleheap* heap, void *key, void *value)
+{
+ /* the heap cannot be resized, so make sure there is enough space */
+ if (heap->size >= heap->space)
+ elog(ERROR, "cannot add element to already full heap");
+ heap->values[heap->size].key = key;
+ heap->values[heap->size++].value = value;
+}
+
+void
+simpleheap_add(simpleheap* heap, void *key, void *value)
+{
+ simpleheap_add_unordered(heap, key, value);
+ simpleheap_sift_up(heap, heap->size - 1);
+}
+
+simpleheap_kv*
+simpleheap_first(simpleheap* heap)
+{
+ if (heap->size == 0)
+ elog(ERROR, "cannot return the first element of an empty heap");
+ return &heap->values[0];
+}
+
+
+simpleheap_kv*
+simpleheap_remove_first(simpleheap* heap)
+{
+ if (heap->size == 0)
+ elog(ERROR, "cannot remove the first element of an empty heap");
+
+ if (heap->size == 1)
+ {
+ heap->size--;
+ return &heap->values[0];
+ }
+
+ simpleheap_swap(heap, 0, heap->size - 1);
+ simpleheap_sift_down(heap, 0);
+
+ heap->size--;
+ return &heap->values[heap->size];
+}
diff --git a/src/include/lib/simpleheap.h b/src/include/lib/simpleheap.h
new file mode 100644
index 0000000..ab2d2ea
--- /dev/null
+++ b/src/include/lib/simpleheap.h
@@ -0,0 +1,91 @@
+/*
+ * simpleheap.h
+ *
+ * A simple binary heap implementation
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * src/include/lib/simpleheap.h
+ */
+
+#ifndef SIMPLEHEAP_H
+#define SIMPLEHEAP_H
+
+typedef struct simpleheap_kv
+{
+ void* key;
+ void* value;
+} simpleheap_kv;
+
+typedef struct simpleheap
+{
+ size_t size;
+ size_t space;
+ /*
+ * Has to return:
+ * -1 iff a < b
+ * 0 iff a == b
+ * +1 iff a > b
+ */
+ int (*compare)(simpleheap_kv* a, simpleheap_kv* b);
+
+ simpleheap_kv *values;
+} simpleheap;
+
+simpleheap*
+simpleheap_allocate(size_t capacity);
+
+void
+simpleheap_free(simpleheap* heap);
+
+/*
+ * Add values without enforcing the heap property.
+ *
+ * simpleheap_build has to be called before relying on anything that needs a
+ * valid heap. This is mostly useful for initially filling a heap and staying
+ * in O(n) instead of O(n log n).
+ */
+void
+simpleheap_add_unordered(simpleheap* heap, void *key, void *value);
+
+/*
+ * Insert key/value pair
+ *
+ * O(log n)
+ */
+void
+simpleheap_add(simpleheap* heap, void *key, void *value);
+
+/*
+ * Returns the first element as indicated by comparisons of the ->compare()
+ * operator
+ *
+ * O(1)
+ */
+simpleheap_kv*
+simpleheap_first(simpleheap* heap);
+
+/*
+ * Returns and removes the first element as indicated by comparisons of the
+ * ->compare() operator
+ *
+ * O(log n)
+ */
+simpleheap_kv*
+simpleheap_remove_first(simpleheap* heap);
+
+void
+simpleheap_change_key(simpleheap *heap, void* newkey);
+
+
+/*
+ * make the heap fullfill the heap condition. Only needed if elements were
+ * added with simpleheap_add_unordered()
+ *
+ * O(n)
+ */
+void
+simpleheap_build(simpleheap *heap);
+
+
+#endif   /* SIMPLEHEAP_H */
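To show how the heap is meant to be driven - e.g. as the per-stream head for
the k-way mergesort mentioned above - here is a minimal sketch, as untested as
the patch itself. Keys are plain integers stuffed into the void* key for
brevity:

#include "postgres.h"
#include "lib/simpleheap.h"

/* order entries by their keys, which here are integers cast to void* */
static int
example_compare(simpleheap_kv *a, simpleheap_kv *b)
{
	intptr_t	ka = (intptr_t) a->key;
	intptr_t	kb = (intptr_t) b->key;

	if (ka < kb)
		return -1;
	if (ka > kb)
		return 1;
	return 0;
}

static void
simpleheap_example(void)
{
	simpleheap *heap = simpleheap_allocate(8);
	simpleheap_kv *kv;

	heap->compare = example_compare;

	/* fill in O(n) without maintaining the heap property ... */
	simpleheap_add_unordered(heap, (void *) (intptr_t) 3, NULL);
	simpleheap_add_unordered(heap, (void *) (intptr_t) 1, NULL);
	simpleheap_add_unordered(heap, (void *) (intptr_t) 2, NULL);

	/* ... then establish it once */
	simpleheap_build(heap);

	/* consume in ascending key order: 1, 2, 3 */
	while (heap->size > 0)
	{
		kv = simpleheap_remove_first(heap);
		elog(LOG, "key: %ld", (long) (intptr_t) kv->key);
	}

	simpleheap_free(heap);
}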
Features:
- streaming reading/writing
- filtering
- reassembly of records
Reusing the ReadRecord infrastructure in situations where the code that wants
to do so is not tightly integrated into xlog.c is rather hard, and it would
require changes to rather integral parts of the recovery code, which doesn't
seem to be a good idea.
Missing:
- "compressing" the stream when removing uninteresting records
- writing out correct CRCs
- separating reader/writer
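To make the intended calling convention concrete (it is also described in
xlogreader.h below), here is a minimal sketch of a caller. The read_page
implementation is a stand-in - a real one would copy the requested page from a
WAL segment - and none of this is part of the patch itself:

#include "postgres.h"
#include "access/xlogreader.h"

/*
 * Stand-in page-read callback: a real implementation must copy the
 * XLOG_BLCKSZ-byte page containing 'at' into cur_page, e.g. by reading it
 * from the corresponding WAL segment file.
 */
static void
example_read_page(XLogReaderState *state, char *cur_page, XLogRecPtr at)
{
	memset(cur_page, 0, XLOG_BLCKSZ);
}

/* called once per fully reassembled record */
static void
example_finished_record(XLogReaderState *state, XLogRecordBuffer *buf)
{
	elog(LOG, "record at %X/%X, rmid %u, len %u",
		 (uint32) (buf->origptr >> 32), (uint32) buf->origptr,
		 (uint32) buf->record.xl_rmid, buf->record.xl_len);
}

static void
example_iterate(XLogRecPtr from, XLogRecPtr to)
{
	XLogReaderState *state = XLogReaderAllocate();

	state->read_page = example_read_page;
	state->finished_record = example_finished_record;
	state->startptr = from;
	state->endptr = to;

	XLogReaderRead(state);

	/* the state machine tells us whether it stopped for lack of input */
	if (state->needs_input)
		elog(LOG, "incomplete record at end of WAL, wait for more and call again");

	XLogReaderFree(state);
}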
---
src/backend/access/transam/Makefile | 2 +-
src/backend/access/transam/xlogreader.c | 1032 +++++++++++++++++++++++++++++++
src/include/access/xlogreader.h | 264 ++++++++
3 files changed, 1297 insertions(+), 1 deletion(-)
create mode 100644 src/backend/access/transam/xlogreader.c
create mode 100644 src/include/access/xlogreader.h
Attachment: 0003-Add-support-for-a-generic-wal-reading-facility-dubbe.patch
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index f82f10e..660b5fc 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
- twophase.o twophase_rmgr.o xlog.o xlogfuncs.o xlogutils.o
+ twophase.o twophase_rmgr.o xlog.o xlogfuncs.o xlogreader.o xlogutils.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
new file mode 100644
index 0000000..4392b29
--- /dev/null
+++ b/src/backend/access/transam/xlogreader.c
@@ -0,0 +1,1032 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogreader.c
+ * Generic xlog reading facility
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogreader.c
+ *
+ * NOTES
+ * Documentation about how to use this interface can be found in
+ * xlogreader.h, more specifically in the definition of the
+ * XLogReaderState struct where all parameters are documented.
+ *
+ * TODO:
+ * * more extensive validation of read records
+ * * separation of reader/writer
+ * * customizable error response
+ * * usable without backend code around
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog_internal.h"
+#include "access/transam.h"
+#include "catalog/pg_control.h"
+#include "access/xlogreader.h"
+
+/* If (very) verbose debugging is needed:
+ * #define VERBOSE_DEBUG
+ */
+
+XLogReaderState*
+XLogReaderAllocate(void)
+{
+ XLogReaderState* state = (XLogReaderState*)malloc(sizeof(XLogReaderState));
+ int i;
+
+ if (!state)
+ goto oom;
+
+ memset(&state->buf.record, 0, sizeof(XLogRecord));
+ state->buf.record_data_size = XLOG_BLCKSZ*8;
+ state->buf.record_data =
+ malloc(state->buf.record_data_size);
+
+ if (!state->buf.record_data)
+ goto oom;
+
+ memset(state->buf.record_data, 0, state->buf.record_data_size);
+ state->buf.origptr = InvalidXLogRecPtr;
+
+ for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
+ {
+ state->buf.bkp_block_data[i] =
+ malloc(BLCKSZ);
+
+ if (!state->buf.bkp_block_data[i])
+ goto oom;
+ }
+
+ state->is_record_interesting = NULL;
+ state->writeout_data = NULL;
+ state->finished_record = NULL;
+ state->private_data = NULL;
+ state->output_buffer_size = 0;
+
+ XLogReaderReset(state);
+ return state;
+
+oom:
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("failed while allocating an XLogReader")));
+ return NULL;
+}
+
+void
+XLogReaderFree(XLogReaderState* state)
+{
+ int i;
+
+ for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
+ {
+ free(state->buf.bkp_block_data[i]);
+ }
+
+ free(state->buf.record_data);
+
+ free(state);
+}
+
+void
+XLogReaderReset(XLogReaderState* state)
+{
+ state->in_record = false;
+ state->in_record_header = false;
+ state->do_reassemble_record = false;
+ state->in_bkp_blocks = 0;
+ state->in_bkp_block_header = false;
+ state->in_skip = false;
+ state->remaining_size = 0;
+ state->already_written_size = 0;
+ state->incomplete = false;
+ state->initialized = false;
+ state->needs_input = false;
+ state->needs_output = false;
+ state->stop_at_record_boundary = false;
+}
+
+static inline bool
+XLogReaderHasInput(XLogReaderState* state, Size size)
+{
+ XLogRecPtr tmp = state->curptr;
+ XLByteAdvance(tmp, size);
+ if (XLByteLE(state->endptr, tmp))
+ return false;
+ return true;
+}
+
+static inline bool
+XLogReaderHasOutput(XLogReaderState* state, Size size)
+{
+ /* if we don't do output or have no limits in the output size */
+ if (state->writeout_data == NULL || state->output_buffer_size == 0)
+ return true;
+
+ if (state->already_written_size + size > state->output_buffer_size)
+ return false;
+
+ return true;
+}
+
+static inline bool
+XLogReaderHasSpace(XLogReaderState* state, Size size)
+{
+ if (!XLogReaderHasInput(state, size))
+ return false;
+
+ if (!XLogReaderHasOutput(state, size))
+ return false;
+
+ return true;
+}
+
+/* ----------------------------------------------------------------------------
+ * Write out data iff
+ * 1. we have a writeout_data callback
+ * 2. we're at or past startptr
+ *
+ * The 2nd condition requires that we will never start a write before startptr
+ * and finish after it. The code needs to guarantee this.
+ * ----------------------------------------------------------------------------
+ */
+static void
+XLogReaderInternalWrite(XLogReaderState* state, char* data, Size size)
+{
+ /* no point in doing any checks if we don't have a write callback */
+ if (!state->writeout_data)
+ return;
+
+ if (XLByteLT(state->curptr, state->startptr))
+ return;
+
+ state->writeout_data(state, data, size);
+}
+
+/*
+ * Change state so we read the next bkp block if there is one. If there is none
+ * return false so that the caller can consider the record finished.
+ */
+static bool
+XLogReaderInternalNextBkpBlock(XLogReaderState* state)
+{
+ Assert(state->in_record);
+ Assert(state->remaining_size == 0);
+
+ /*
+ * only continue with in_record=true if we have bkp block
+ */
+ while (state->in_bkp_blocks)
+ {
+ if (state->buf.record.xl_info &
+ XLR_SET_BKP_BLOCK(XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks))
+ {
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "reading bkp block %u", XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks);
+#endif
+ break;
+ }
+ state->in_bkp_blocks--;
+ }
+
+ if (!state->in_bkp_blocks)
+ return false;
+
+ /* bkp blocks are stored without regard for alignment */
+
+ state->in_bkp_block_header = true;
+ state->remaining_size = sizeof(BkpBlock);
+
+ return true;
+}
+
+void
+XLogReaderRead(XLogReaderState* state)
+{
+ state->needs_input = false;
+ state->needs_output = false;
+
+ /*
+ * Do some basic sanity checking and setup if we're starting anew.
+ */
+ if (!state->initialized)
+ {
+ if (!state->read_page)
+ elog(ERROR, "The read_page callback needs to be set");
+
+ state->initialized = true;
+ /*
+ * we need to start reading at the beginning of the page to understand
+ * what we are currently reading. We will skip over that because we
+ * check curptr < startptr later.
+ */
+ state->curptr = state->startptr;
+ state->curptr -= state->startptr % XLOG_BLCKSZ;
+
+ Assert(state->curptr % XLOG_BLCKSZ == 0);
+
+ elog(LOG, "start reading from %X/%X, scrolled back to %X/%X",
+ (uint32) (state->startptr >> 32), (uint32) state->startptr,
+ (uint32) (state->curptr >> 32), (uint32) state->curptr);
+ }
+ else
+ {
+ /*
+ * We didn't finish reading the last time round. Since then new data
+ * could have been appended to the current page. So we need to update
+ * our copy of that.
+ *
+ * XXX: We could tie that to state->needs_input but that doesn't seem
+ * worth the complication atm.
+ */
+ XLogRecPtr rereadptr = state->curptr;
+ rereadptr -= rereadptr % XLOG_BLCKSZ;
+
+ XLByteAdvance(rereadptr, SizeOfXLogShortPHD);
+
+ if(!XLByteLE(rereadptr, state->endptr))
+ goto not_enough_input;
+
+ rereadptr -= rereadptr % XLOG_BLCKSZ;
+
+ state->read_page(state, state->cur_page, rereadptr);
+
+ /*
+ * we will only rely on this data being valid if we are allowed to read
+ * that far, so it's safe to just always read the header. read_page
+ * always returns a complete page even though its contents may be
+ * invalid.
+ */
+ state->page_header = (XLogPageHeader)state->cur_page;
+ state->page_header_size = XLogPageHeaderSize(state->page_header);
+ }
+
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "starting reading for %X/%X from %X/%X",
+ (uint32)(state->startptr >> 32), (uint32) state->startptr,
+ (uint32)(state->curptr >> 32), (uint32) state->curptr);
+#endif
+ /*
+ * Iterate over the data and reassemble it until we reached the end of the
+ * data. As we advance curptr inside the loop we need to recheck whether we
+ * have space inside as well.
+ */
+ while (XLByteLT(state->curptr, state->endptr))
+ {
+ /* how much space is left in the current block */
+ uint32 len_in_block;
+
+ /*
+ * did we read a partial xlog record due to input/output constraints?
+ * If yes, we need to signal that to the caller so it can be handled
+ * sensibly there. E.g. by waiting on a latch till more xlog is
+ * available.
+ */
+ bool partial_read = false;
+ bool partial_write = false;
+
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "one loop start: record: %u header %u, skip: %u bkb_block: %d in_bkp_header: %u curptr: %X/%X remaining: %u, off: %u",
+ state->in_record, state->in_record_header, state->in_skip,
+ state->in_bkp_blocks, state->in_bkp_block_header,
+ (uint32)(state->curptr >> 32), (uint32)state->curptr,
+ state->remaining_size,
+ (uint32)(state->curptr % XLOG_BLCKSZ));
+#endif
+
+ /*
+ * at a page boundary, read the header
+ */
+ if (state->curptr % XLOG_BLCKSZ == 0)
+ {
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "reading page header, at %X/%X",
+ (uint32)(state->curptr >> 32), (uint32)state->curptr);
+#endif
+ /*
+ * check whether we can read enough to see the short header, we
+ * need to read the short header's xlp_info to know whether this is
+ * a short or a long header.
+ */
+ if (!XLogReaderHasInput(state, SizeOfXLogShortPHD))
+ goto not_enough_input;
+
+ state->read_page(state, state->cur_page, state->curptr);
+ state->page_header = (XLogPageHeader)state->cur_page;
+ state->page_header_size = XLogPageHeaderSize(state->page_header);
+
+ /* check that we have enough space to read/write the full header */
+ if (!XLogReaderHasInput(state, state->page_header_size))
+ goto not_enough_input;
+
+ if (!XLogReaderHasOutput(state, state->page_header_size))
+ goto not_enough_output;
+
+ XLogReaderInternalWrite(state, state->cur_page, state->page_header_size);
+
+ XLByteAdvance(state->curptr, state->page_header_size);
+
+ if (state->page_header->xlp_info & XLP_FIRST_IS_CONTRECORD)
+ {
+ if (!state->in_record)
+ {
+ /*
+ * we need to support this case for initializing a cluster
+ * because we need to read/writeout a full page but there
+ * may be none without records being split across.
+ *
+ * If we are before startptr there is nothing special about
+ * this case. Most pages start with a contrecord.
+ */
+ if(!XLByteLT(state->curptr, state->startptr))
+ {
+ elog(WARNING, "contrecord although we are not in a record at %X/%X, starting at %X/%X",
+ (uint32)(state->curptr >> 32), (uint32)state->curptr,
+ (uint32)(state->startptr >> 32), (uint32)state->startptr);
+ }
+ state->in_record = true;
+ state->check_crc = false;
+ state->do_reassemble_record = false;
+ state->remaining_size = state->page_header->xlp_rem_len;
+ continue;
+ }
+ else
+ {
+ if (state->page_header->xlp_rem_len < state->remaining_size)
+ elog(PANIC, "remaining length is smaller than to be read data. xlp_rem_len: %u needed: %u",
+ state->page_header->xlp_rem_len, state->remaining_size
+ );
+ }
+ }
+ else if (state->in_record)
+ {
+ elog(PANIC, "no contrecord although were in a record that continued onto the next page. info %hhu at page %X/%X",
+ state->page_header->xlp_info,
+ (uint32)(state->page_header->xlp_pageaddr >> 32),
+ (uint32)state->page_header->xlp_pageaddr);
+ }
+ }
+
+ /*
+ * If a record will start next, skip over alignment padding.
+ */
+ if (!state->in_record)
+ {
+ /*
+ * a record must be stored aligned. So skip as far as we need to
+ * comply with that.
+ */
+ Size skiplen;
+ skiplen = MAXALIGN(state->curptr) - state->curptr;
+
+ if (skiplen)
+ {
+ if (!XLogReaderHasSpace(state, skiplen))
+ {
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "not aligning bc of space");
+#endif
+ /*
+ * We don't have enough space to read/write the alignment
+ * bytes, so fake up a skip-state
+ */
+ state->in_record = true;
+ state->check_crc = false;
+ state->in_skip = true;
+ state->remaining_size = skiplen;
+
+ if (!XLogReaderHasInput(state, skiplen))
+ goto not_enough_input;
+ goto not_enough_output;
+ }
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "aligning from %X/%X to %X/%X, skips %lu",
+ (uint32)(state->curptr >> 32), (uint32)state->curptr,
+ (uint32)((state->curptr + skiplen) >> 32),
+ (uint32)(state->curptr + skiplen),
+ skiplen
+ );
+#endif
+ XLogReaderInternalWrite(state, NULL, skiplen);
+
+ XLByteAdvance(state->curptr, skiplen);
+
+ /*
+ * full pages are not treated as continuations, so restart on
+ * the beginning of the new page.
+ */
+ if ((state->curptr % XLOG_BLCKSZ) == 0)
+ continue;
+ }
+ }
+
+ /*
+ * --------------------------------------------------------------------
+ * Start to read a record
+ * --------------------------------------------------------------------
+ */
+ if (!state->in_record)
+ {
+ state->in_record = true;
+ state->in_record_header = true;
+ state->check_crc = true;
+
+ /*
+ * If the record starts before startptr we're not interested in its
+ * contents. There is also no point in reassembling if we're not
+ * analyzing the contents.
+ *
+ * If every record needs to be processed by finished_record,
+ * restarts need to begin after the end of the last record.
+ *
+ * See state->restart_ptr for that point.
+ */
+ if ((state->finished_record == NULL &&
+ !state->stop_at_record_boundary) ||
+ XLByteLT(state->curptr, state->startptr)){
+ state->do_reassemble_record = false;
+ }
+ else
+ state->do_reassemble_record = true;
+
+ state->remaining_size = SizeOfXLogRecord;
+
+ /*
+ * we quickly lose the original address of a record as we can skip
+ * records and such, so keep the original addresses.
+ */
+ state->buf.origptr = state->curptr;
+
+ INIT_CRC32(state->next_crc);
+ }
+
+ Assert(state->in_record);
+
+ /*
+ * Compute how much space on the current page is left and how much of
+ * that we actually are interested in.
+ */
+
+ /* amount of space on page */
+ if (state->curptr % XLOG_BLCKSZ == 0)
+ len_in_block = 0;
+ else
+ len_in_block = XLOG_BLCKSZ - (state->curptr % XLOG_BLCKSZ);
+
+ /* we have more data available than we need, so read only as much as needed */
+ if (len_in_block > state->remaining_size)
+ len_in_block = state->remaining_size;
+
+ /*
+ * Handle constraints set by startptr, endptr and the size of the
+ * output buffer.
+ *
+ * Normally we use XLogReaderHasSpace for that, but that's not
+ * convenient here because we want to read data in parts. It also
+ * doesn't handle splitting around startptr. So, open-code the logic
+ * for that.
+ */
+
+ /* to make sure we always writeout in the same chunks, split at startptr */
+ if (XLByteLT(state->curptr, state->startptr) &&
+ (state->curptr + len_in_block) > state->startptr )
+ {
+#ifdef VERBOSE_DEBUG
+ Size cur_len = len_in_block;
+#endif
+ len_in_block = state->startptr - state->curptr;
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "truncating len_in_block due to startptr from %lu to %u",
+ cur_len, len_in_block);
+#endif
+ }
+
+ /* do we have enough valid data to read the current block? */
+ if (state->curptr + len_in_block > state->endptr)
+ {
+#ifdef VERBOSE_DEBUG
+ Size cur_len = len_in_block;
+#endif
+ len_in_block = state->endptr - state->curptr;
+ partial_read = true;
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "truncating len_in_block due to endptr %X/%X %lu to %i at %X/%X",
+ (uint32)(state->startptr >> 32), (uint32)state->startptr,
+ cur_len, len_in_block,
+ (uint32)(state->curptr >> 32), (uint32)state->curptr);
+#endif
+ }
+
+ /* can we write what we read? */
+ if (state->writeout_data != NULL && state->output_buffer_size != 0
+ && len_in_block > (state->output_buffer_size - state->already_written_size))
+ {
+#ifdef VERBOSE_DEBUG
+ Size cur_len = len_in_block;
+#endif
+ len_in_block = state->output_buffer_size - state->already_written_size;
+ partial_write = true;
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "truncating len_in_block due to output_buffer_size %lu to %i",
+ cur_len, len_in_block);
+#endif
+ }
+
+ /* --------------------------------------------------------------------
+ * copy data of the size determined above to whatever we are currently
+ * reading.
+ * --------------------------------------------------------------------
+ */
+
+ /* nothing to do if were skipping */
+ if (state->in_skip)
+ {
+ /* writeout zero data, original content is boring */
+ XLogReaderInternalWrite(state, NULL, len_in_block);
+
+ /*
+ * we may not need this here because we're skipping over something
+ * really uninteresting but keeping track of that would be
+ * unnecessarily complicated.
+ */
+ COMP_CRC32(state->next_crc,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+ }
+ /* reassemble the XLogRecord struct, quite likely in one-go */
+ else if (state->in_record_header)
+ {
+ /*
+ * Need to clamp to sizeof(XLogRecord), we don't have the padding
+ * in buf.record...
+ */
+ Size already_written = SizeOfXLogRecord - state->remaining_size;
+ Size padding_size = SizeOfXLogRecord - sizeof(XLogRecord);
+ Size copysize = len_in_block;
+
+ if (state->remaining_size - len_in_block < padding_size)
+ copysize = Max(0, state->remaining_size - (int)padding_size);
+
+ memcpy((char*)&state->buf.record + already_written,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ copysize);
+
+ XLogReaderInternalWrite(state,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "copied part of the record. len_in_block %u, remaining: %u",
+ len_in_block, state->remaining_size);
+#endif
+ }
+ /*
+ * copy data into the current backup block header so we have enough
+ * knowledge to read the actual backup block afterwards
+ */
+ else if (state->in_bkp_block_header)
+ {
+ int blockno = XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks;
+ BkpBlock* bkpb = &state->buf.bkp_block[blockno];
+
+ Assert(state->in_bkp_blocks);
+
+ memcpy((char*)bkpb + sizeof(BkpBlock) - state->remaining_size,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+
+ XLogReaderInternalWrite(state,
+ state->cur_page + ((uint32)state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+
+ COMP_CRC32(state->next_crc,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "copying bkp header for block %d, %u bytes, complete %lu at %X/%X rem %u",
+ blockno, len_in_block, sizeof(BkpBlock),
+ (uint32)(state->curptr >> 32), (uint32)state->curptr,
+ state->remaining_size);
+
+ if (state->remaining_size == len_in_block)
+ {
+ elog(LOG, "block off %u len %u", bkpb->hole_offset, bkpb->hole_length);
+ }
+#endif
+ }
+ /*
+ * Reassemble the current backup block, those usually are the biggest
+ * parts of individual XLogRecords so this might take several rounds.
+ */
+ else if (state->in_bkp_blocks)
+ {
+ int blockno = XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks;
+ BkpBlock* bkpb = &state->buf.bkp_block[blockno];
+ char* data = state->buf.bkp_block_data[blockno];
+
+ if (state->do_reassemble_record)
+ {
+ memcpy(data + BLCKSZ - bkpb->hole_length - state->remaining_size,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+ }
+
+ XLogReaderInternalWrite(state,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+
+ COMP_CRC32(state->next_crc,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "copying %u bytes of data for bkp block %d, complete %u",
+ len_in_block, blockno, state->remaining_size);
+#endif
+ }
+ /*
+ * read the (rest) of the XLogRecord's data. Note that this is not the
+ * XLogRecord struct itself!
+ */
+ else if (state->in_record)
+ {
+ if (state->do_reassemble_record)
+ {
+ if(state->buf.record_data_size < state->buf.record.xl_len){
+ state->buf.record_data_size = state->buf.record.xl_len;
+ state->buf.record_data =
+ realloc(state->buf.record_data,
+ state->buf.record_data_size);
+ if(!state->buf.record_data)
+ elog(ERROR, "could not allocate memory for contents of an xlog record");
+ }
+
+ memcpy(state->buf.record_data
+ + state->buf.record.xl_len
+ - state->remaining_size,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+ }
+ XLogReaderInternalWrite(state,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+
+
+ COMP_CRC32(state->next_crc,
+ state->cur_page + (state->curptr % XLOG_BLCKSZ),
+ len_in_block);
+
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "copying %u bytes into a record at off %u",
+ len_in_block, (uint32)(state->curptr % XLOG_BLCKSZ));
+#endif
+ }
+
+ /* should handle wrapping around to next page */
+ XLByteAdvance(state->curptr, len_in_block);
+
+ /* do the math of how much we need to read next round */
+ state->remaining_size -= len_in_block;
+
+ /*
+ * --------------------------------------------------------------------
+ * we completed whatever we were reading. So, handle going to the next
+ * state.
+ * --------------------------------------------------------------------
+ */
+ if (state->remaining_size == 0)
+ {
+ /* completed reading - and potentially reassembling - the record */
+ if (state->in_record_header)
+ {
+ state->in_record_header = false;
+
+ /* ------------------------------------------------------------
+ * normally we don't look at the content of xlog records here,
+ * XLOG_SWITCH is a special case though, as everything left in
+ * that segment won't be sensible content.
+ * So skip to the next segment.
+ * ------------------------------------------------------------
+ */
+ if (state->buf.record.xl_rmid == RM_XLOG_ID
+ && (state->buf.record.xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
+ {
+ /*
+ * Pretend the current data extends to end of segment
+ */
+ elog(LOG, "XLOG_SWITCH");
+ state->curptr += XLogSegSize - 1;
+ state->curptr -= state->curptr % XLogSegSize;
+
+ state->in_record = false;
+ Assert(!state->in_bkp_blocks);
+ Assert(!state->in_skip);
+ continue;
+ }
+ else if (state->is_record_interesting == NULL ||
+ state->is_record_interesting(state, &state->buf.record))
+ {
+ state->remaining_size = state->buf.record.xl_len;
+ Assert(state->in_bkp_blocks == 0);
+ Assert(!state->in_bkp_block_header);
+ Assert(!state->in_skip);
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "found interesting record at %X/%X, prev: %X/%X, rmid %hhu, tx %u, len %u tot %u",
+ (uint32)(state->buf.origptr >> 32), (uint32)state->buf.origptr,
+ (uint32)(state->buf.record.xl_prev >> 32), (uint32)(state->buf.record.xl_prev),
+ state->buf.record.xl_rmid, state->buf.record.xl_xid,
+ state->buf.record.xl_len, state->buf.record.xl_tot_len);
+#endif
+
+ }
+ /* ------------------------------------------------------------
+ * ok, everybody agrees, the contents of the current record are
+ * just plain boring. So fake-up a record that replaces it with
+ * a NOOP record.
+ *
+ * FIXME: we should allow "compressing" the output here. That
+ * is write something that shows how long the record should be
+ * if everything is decompressed again. This can radically
+ * reduce space-usage over the wire.
+ * It could also be very useful for traditional SR by removing
+ * unneded BKP blocks from being transferred. For that we
+ * would need to recompute CRCs though, which we currently
+ * don't support.
+ * ------------------------------------------------------------
+ */
+ else
+ {
+ /*
+ * we need to fix up a fake record with correct length that
+ * can be written out.
+ */
+ XLogRecord spacer;
+
+ elog(LOG, "found boring record at %X/%X, rmid %hhu, tx %u, len %u tot %u",
+ (uint32)(state->buf.origptr >> 32), (uint32)state->buf.origptr,
+ state->buf.record.xl_rmid, state->buf.record.xl_xid,
+ state->buf.record.xl_len, state->buf.record.xl_tot_len);
+
+ /*
+ * xl_tot_len contains the size of the XLogRecord itself,
+ * we read that already though.
+ */
+ state->remaining_size = state->buf.record.xl_tot_len
+ - SizeOfXLogRecord;
+
+ state->in_record = true;
+ state->check_crc = true;
+ state->in_bkp_blocks = 0;
+ state->in_skip = true;
+
+ spacer.xl_prev = state->buf.origptr;
+ spacer.xl_xid = InvalidTransactionId;
+ spacer.xl_tot_len = state->buf.record.xl_tot_len;
+ spacer.xl_len = state->buf.record.xl_tot_len - SizeOfXLogRecord;
+ spacer.xl_rmid = RM_XLOG_ID;
+ spacer.xl_info = XLOG_NOOP;
+
+ XLogReaderInternalWrite(state, (char*)&spacer,
+ sizeof(XLogRecord));
+
+ /*
+ * write out the padding in a separate write, otherwise we
+ * would overrun the stack
+ */
+ XLogReaderInternalWrite(state, NULL,
+ SizeOfXLogRecord - sizeof(XLogRecord));
+
+ }
+ }
+ /*
+ * in the in_skip case we already read backup blocks because we
+ * likely read record->xl_tot_len, so everything is finished.
+ */
+ else if (state->in_skip)
+ {
+ state->in_record = false;
+ state->in_bkp_blocks = 0;
+ state->in_skip = false;
+ /* alignment is handled when starting to read a record */
+ }
+ /*
+ * We read the header of the current block. Start reading the
+ * content of that now.
+ */
+ else if (state->in_bkp_block_header)
+ {
+ BkpBlock* bkpb;
+ int blockno = XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks;
+
+ Assert(state->in_bkp_blocks);
+
+ bkpb = &state->buf.bkp_block[blockno];
+
+ if(bkpb->hole_length >= BLCKSZ)
+ {
+ elog(ERROR, "hole_length of block %u is %u but maximum is %u",
+ blockno, bkpb->hole_length, BLCKSZ);
+ }
+
+ if(bkpb->hole_offset >= BLCKSZ)
+ {
+ elog(ERROR, "hole_offset of block %u is %u but maximum is %u",
+ blockno, bkpb->hole_offset, BLCKSZ);
+ }
+
+ state->remaining_size = BLCKSZ - bkpb->hole_length;
+ state->in_bkp_block_header = false;
+
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "completed reading of header for %d, reading data now %u hole %u, off %u",
+ blockno, state->remaining_size, bkpb->hole_length,
+ bkpb->hole_offset);
+#endif
+ }
+ /*
+ * The current backup block is finished, more could be following
+ */
+ else if (state->in_bkp_blocks)
+ {
+ int blockno = XLR_MAX_BKP_BLOCKS - state->in_bkp_blocks;
+ BkpBlock* bkpb;
+ char* bkpb_data;
+
+ Assert(!state->in_bkp_block_header);
+
+ bkpb = &state->buf.bkp_block[blockno];
+ bkpb_data = state->buf.bkp_block_data[blockno];
+
+ /*
+ * reassemble block to its entirety by removing the bkp_hole
+ * "compression"
+ */
+ if(bkpb->hole_length){
+ memmove(bkpb_data + bkpb->hole_offset,
+ bkpb_data + bkpb->hole_offset + bkpb->hole_length,
+ BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
+ memset(bkpb_data + bkpb->hole_offset,
+ 0,
+ bkpb->hole_length);
+ }
+
+ state->in_bkp_blocks--;
+
+ state->in_skip = false;
+
+ if(!XLogReaderInternalNextBkpBlock(state))
+ goto all_bkp_finished;
+
+ }
+ /*
+ * read a non-skipped record, start reading bkp blocks afterwards
+ */
+ else if (state->in_record)
+ {
+ Assert(!state->in_skip);
+
+ state->in_bkp_blocks = XLR_MAX_BKP_BLOCKS;
+
+ if(!XLogReaderInternalNextBkpBlock(state))
+ goto all_bkp_finished;
+ }
+ }
+ /*
+ * Something could only be partially read inside a single block because
+ * of input or output space constraints..
+ */
+ else if (partial_read)
+ {
+ partial_read = false;
+ goto not_enough_input;
+ }
+ else if (partial_write)
+ {
+ partial_write = false;
+ goto not_enough_output;
+ }
+ /*
+ * Data continues into the next block.
+ */
+ else
+ {
+ }
+
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "one loop end: record: %u header: %u, skip: %u bkb_block: %d in_bkp_header: %u curpos: %X/%X remaining: %u, off: %u",
+ state->in_record, state->in_record_header, state->in_skip,
+ state->in_bkp_blocks, state->in_bkp_block_header,
+ (uint32)(state->curptr >> 32), (uint32)state->curptr,
+ state->remaining_size,
+ (uint32)(state->curptr % XLOG_BLCKSZ));
+#endif
+ continue;
+
+ /*
+ * we fully read a record. Process its contents if needed and start
+ * reading the next record afterwards
+ */
+ all_bkp_finished:
+ {
+ Assert(state->in_record);
+ Assert(!state->in_skip);
+ Assert(!state->in_bkp_block_header);
+ Assert(!state->in_bkp_blocks);
+
+ state->in_record = false;
+
+ /* compute and verify crc */
+ COMP_CRC32(state->next_crc,
+ &state->buf.record,
+ offsetof(XLogRecord, xl_crc));
+
+ FIN_CRC32(state->next_crc);
+
+ if (state->check_crc &&
+ state->next_crc != state->buf.record.xl_crc) {
+ elog(ERROR, "crc mismatch: newly computed : %x, existing is %x",
+ state->next_crc, state->buf.record.xl_crc);
+ }
+
+ /*
+ * if we haven't reassembled the record there is no point in
+ * calling the finished callback because we do not have any
+ * interesting data. do_reassemble_record is false if we don't have
+ * a finished_record callback.
+ */
+ if (state->do_reassemble_record)
+ {
+ /* in stop_at_record_boundary that's a valid case */
+ if (state->finished_record)
+ {
+ state->finished_record(state, &state->buf);
+ }
+
+ if (state->stop_at_record_boundary)
+ goto out;
+ }
+
+ /* alignment is handled when starting to read a record */
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "finished record at %X/%X to %X/%X, already_written_size: %lu, reas = %d",
+ (uint32)(state->curptr >> 32), (uint32)state->curptr,
+ (uint32)(state->endptr >> 32), (uint32)state->endptr,
+ state->already_written_size, state->do_reassemble_record);
+#endif
+
+ }
+ }
+out:
+ /*
+ * we are finished, check whether we finished everything, this may be
+ * useful for the caller.
+ */
+ if (state->in_skip)
+ {
+ state->incomplete = true;
+ }
+ else if (state->in_record)
+ {
+ state->incomplete = true;
+ }
+ else
+ {
+ state->incomplete = false;
+ }
+ return;
+
+not_enough_input:
+ /* signal we need more xlog and finish */
+ state->needs_input = true;
+ goto out;
+
+not_enough_output:
+ /* signal we need more space to write output to */
+ state->needs_output = true;
+ goto out;
+}
+
+XLogRecordBuffer*
+XLogReaderReadOne(XLogReaderState* state)
+{
+ bool was_set_to_stop = state->stop_at_record_boundary;
+ XLogRecPtr last_record = state->buf.origptr;
+
+ if (!was_set_to_stop)
+ state->stop_at_record_boundary = true;
+
+ XLogReaderRead(state);
+
+ if (!was_set_to_stop)
+ state->stop_at_record_boundary = false;
+
+ /* check that we fully read it and that its not the same as the last one */
+ if (state->incomplete ||
+ XLByteEQ(last_record, state->buf.origptr))
+ return NULL;
+
+ return &state->buf;
+}
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
new file mode 100644
index 0000000..f45c90b
--- /dev/null
+++ b/src/include/access/xlogreader.h
@@ -0,0 +1,264 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogreader.h
+ *
+ * Generic xlog reading facility.
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/xlogreader.h
+ *
+ * NOTES
+ * Check the definition of the XLogReaderState struct for instructions on
+ * how to use the XLogReader infrastructure.
+ *
+ * The basic idea is to allocate an XLogReaderState via
+ * XLogReaderAllocate, fill out the wanted callbacks, set startptr/endptr
+ * and call XLogReaderRead(state). That will iterate over the records as
+ * long as it has enough input to reassemble them, calling
+ * is_record_interesting/finished_record for every record found.
+ *-------------------------------------------------------------------------
+ */
+#ifndef READXLOG_H
+#define READXLOG_H
+
+#include "access/xlog_internal.h"
+
+/*
+ * Used to store a reassembled record.
+ */
+typedef struct XLogRecordBuffer
+{
+ /* the record itself */
+ XLogRecord record;
+
+ /* the LSN at which the record was found */
+ XLogRecPtr origptr;
+
+ /* the data for xlog record */
+ char* record_data;
+ uint32 record_data_size;
+
+ BkpBlock bkp_block[XLR_MAX_BKP_BLOCKS];
+ char* bkp_block_data[XLR_MAX_BKP_BLOCKS];
+} XLogRecordBuffer;
+
+
+struct XLogReaderState;
+
+/*
+ * The callbacks are explained in more detail inside the XLogReaderState
+ * struct.
+ */
+typedef bool (*XLogReaderStateInterestingCB)(struct XLogReaderState* state,
+ XLogRecord* r);
+typedef void (*XLogReaderStateWriteoutCB)(struct XLogReaderState* state,
+ char* data, Size length);
+typedef void (*XLogReaderStateFinishedRecordCB)(struct XLogReaderState* state,
+ XLogRecordBuffer* buf);
+typedef void (*XLogReaderStateReadPageCB)(struct XLogReaderState* state,
+ char* cur_page, XLogRecPtr at);
+
+typedef struct XLogReaderState
+{
+ /* ----------------------------------------
+ * Public parameters
+ * ----------------------------------------
+ */
+
+ /* callbacks */
+
+ /*
+ * Called to decide whether a xlog record is interesting and should be
+ * assembled, analyzed (finished_record) and written out or skipped.
+ *
+ * Gets passed the current state as the first parameter and the record
+ * *header* to decide over as the second.
+ *
+ * Return false to skip the record - and output a NOOP record instead - and
+ * true to reassemble it fully.
+ *
+ * If set to NULL every record is considered to be interesting.
+ */
+ XLogReaderStateInterestingCB is_record_interesting;
+
+ /*
+ * Writeout xlog data.
+ *
+ * The 'state' parameter is passed as the first parameter and a pointer to
+ * the 'data' and its 'length' as second and third paramter. If the 'data'
+ * is NULL zeroes need to be written out.
+ */
+ XLogReaderStateWriteoutCB writeout_data;
+
+ /*
+ * If set to anything but NULL this callback gets called after a record,
+ * including the backup blocks, has been fully reassembled.
+ *
+ * The first parameter is the current 'state'. 'buf', an XLogRecordBuffer,
+ * gets passed as the second parameter and contains the record header, its
+	 * data, original position/lsn and backup blocks.
+ */
+ XLogReaderStateFinishedRecordCB finished_record;
+
+ /*
+ * Data input function.
+ *
+ * This callback *has* to be implemented.
+ *
+	 * Has to read XLOG_BLCKSZ bytes starting at the location 'at' into the
+	 * memory pointed to by cur_page; everything past endptr does not have
+	 * to be valid.
+ */
+ XLogReaderStateReadPageCB read_page;
+
+ /*
+	 * This can be used by the caller to pass state to the callbacks without
+	 * using global variables or such ugliness. It will neither be read nor
+	 * modified by anything but your code.
+ */
+ void* private_data;
+
+
+ /* from where to where are we reading */
+
+ /* so we know where interesting data starts after scrolling back to the beginning of a page */
+ XLogRecPtr startptr;
+
+ /* continue up to here in this run */
+ XLogRecPtr endptr;
+
+ /*
+	 * Size of the output buffer. If set to zero (the default), there is no
+	 * limit on the output buffer size.
+ */
+ Size output_buffer_size;
+
+ /*
+ * Stop reading and return after every completed record.
+ */
+ bool stop_at_record_boundary;
+
+ /* ----------------------------------------
+ * output parameters
+ * ----------------------------------------
+ */
+
+ /* we need new input data - a later endptr - to continue reading */
+ bool needs_input;
+
+ /* we need new output space to continue reading */
+ bool needs_output;
+
+ /* track our progress */
+ XLogRecPtr curptr;
+
+ /*
+	 * Are we in the middle of reading a record? Useful for the outside to
+	 * know whether to start reading anew.
+ */
+ bool incomplete;
+
+ /* ----------------------------------------
+ * private/internal state
+ * ----------------------------------------
+ */
+
+ char cur_page[XLOG_BLCKSZ];
+ XLogPageHeader page_header;
+ uint32 page_header_size;
+ XLogRecordBuffer buf;
+ pg_crc32 next_crc;
+
+ /* ----------------------------------------
+ * state machine variables
+ * ----------------------------------------
+ */
+
+ bool initialized;
+
+ /* are we currently reading a record? */
+ bool in_record;
+
+ /* are we currently reading a record header? */
+ bool in_record_header;
+
+ /* do we want to reassemble the record or just read/write it? */
+ bool do_reassemble_record;
+
+ /* how many bkp blocks remain to be read? */
+ int in_bkp_blocks;
+
+ /*
+ * the header of a bkp block can be split across pages, so we need to
+ * support reading that incrementally
+ */
+ bool in_bkp_block_header;
+
+ /*
+	 * We are not interested in the contents of the next `remaining_size`
+	 * bytes. Don't read their contents and write out zeroes instead.
+ */
+ bool in_skip;
+
+ /*
+ * Should we check the crc of the currently read record? In some situations
+ * - e.g. if we just skip till the start of a record - this doesn't make
+ * sense.
+ *
+	 * This needs to be separate from in_skip because we want to be able to
+	 * skip writing out records while still verifying them, e.g. records
+	 * that are "not interesting".
+ */
+ bool check_crc;
+
+ /* how much more to read in the current state */
+ uint32 remaining_size;
+
+ /* size of already written data */
+ Size already_written_size;
+
+} XLogReaderState;
+
+/*
+ * Get a new XLogReader
+ *
+ * At least the read_page callback, startptr and endptr have to be set before
+ * the reader can be used.
+ */
+extern XLogReaderState* XLogReaderAllocate(void);
+
+/*
+ * Free an XLogReader
+ */
+extern void XLogReaderFree(XLogReaderState*);
+
+/*
+ * Reset internal state so it can be used without continuing from the last
+ * state.
+ *
+ * The callbacks and private_data won't be reset.
+ */
+extern void XLogReaderReset(XLogReaderState* state);
+
+/*
+ * Read the xlog and call the appropriate callbacks as far as possible within
+ * the constraints of input data (startptr, endptr) and output space.
+ */
+extern void XLogReaderRead(XLogReaderState* state);
+
+/*
+ * Read the next xlog record if enough input/output is available.
+ *
+ * This is a bit less efficient than XLogReaderRead.
+ *
+ * Returns NULL if the next record couldn't be read for some reason. Check
+ * state->incomplete, ->needs_input, ->needs_output.
+ *
+ * Be careful to check that there is anything further to read when using
+ * ->endptr, otherwise it's easy to get into an endless loop.
+ */
+extern XLogRecordBuffer* XLogReaderReadOne(XLogReaderState* state);
+
+#endif /* XLOGREADER_H */
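To make the intended call sequence concrete, here is a minimal sketch (not
part of the patch) of driving the reader record-by-record via
XLogReaderReadOne. read_a_page stands in for a real read_page callback, like
the one the xlogdump tool below implements, and the no-op writeout mirrors
what the decoding code further down uses:

#include "postgres.h"
#include "access/xlogreader.h"

static void
noop_writeout(XLogReaderState *state, char *data, Size len)
{
	/* we only want the reassembled records, not a copy of the stream */
}

static void
read_records(XLogRecPtr startptr, XLogRecPtr endptr,
			 XLogReaderStateReadPageCB read_a_page)
{
	XLogReaderState *state = XLogReaderAllocate();
	XLogRecordBuffer *buf;

	state->read_page = read_a_page;
	state->writeout_data = noop_writeout;
	state->startptr = startptr;
	state->endptr = endptr;

	/*
	 * Stop as soon as ReadOne returns NULL - check ->incomplete,
	 * ->needs_input and ->needs_output - instead of spinning on the same
	 * position, per the warning in the header above.
	 */
	while ((buf = XLogReaderReadOne(state)) != NULL)
	{
		/* buf->record, buf->record_data etc. are valid here */
	}

	XLogReaderFree(state);
}

XLogReaderRead() is the more efficient batch interface; ReadOne exists for
consumers that want to pull one record at a time.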
---
src/bin/Makefile | 2 +-
src/bin/xlogdump/Makefile | 25 ++++
src/bin/xlogdump/xlogdump.c | 334 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 360 insertions(+), 1 deletion(-)
create mode 100644 src/bin/xlogdump/Makefile
create mode 100644 src/bin/xlogdump/xlogdump.c
Attachment: 0004-add-simple-xlogdump-tool.patch
diff --git a/src/bin/Makefile b/src/bin/Makefile
index b4dfdba..9992f7a 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -14,7 +14,7 @@ top_builddir = ../..
include $(top_builddir)/src/Makefile.global
SUBDIRS = initdb pg_ctl pg_dump \
- psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup
+ psql scripts pg_config pg_controldata pg_resetxlog pg_basebackup xlogdump
ifeq ($(PORTNAME), win32)
SUBDIRS += pgevent
diff --git a/src/bin/xlogdump/Makefile b/src/bin/xlogdump/Makefile
new file mode 100644
index 0000000..d54640a
--- /dev/null
+++ b/src/bin/xlogdump/Makefile
@@ -0,0 +1,25 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/xlogdump
+#
+# Copyright (c) 1998-2012, PostgreSQL Global Development Group
+#
# src/bin/xlogdump/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "xlogdump"
+PGAPPICON=win32
+
+subdir = src/bin/xlogdump
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS= xlogdump.o \
+ $(WIN32RES)
+
+all: xlogdump
+
+
+xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name objfiles.txt|xargs cat|tr -s " " "\012"|grep -v /main.o|sed 's/^/..\/..\/..\//')
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
diff --git a/src/bin/xlogdump/xlogdump.c b/src/bin/xlogdump/xlogdump.c
new file mode 100644
index 0000000..8e13193
--- /dev/null
+++ b/src/bin/xlogdump/xlogdump.c
@@ -0,0 +1,334 @@
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xlogreader.h"
+#include "access/rmgr.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+
+/*
+ * needs to be defined here because it's normally defined in main.c, which
+ * we cannot link from here.
+ */
+const char *progname = "xlogdump";
+
+static void
+XLogDumpXLogRead(char *buf, TimeLineID timeline_id, XLogRecPtr startptr, Size count);
+
+static void
+XLogDumpXLogWrite(const char *directory, TimeLineID timeline_id, XLogRecPtr startptr,
+ char *buf, Size count);
+
+#define XLogFilePathWrite(path, base, tli, logSegNo) \
+ snprintf(path, MAXPGPATH, "%s/%08X%08X%08X", base, tli, \
+ (uint32) ((logSegNo) / XLogSegmentsPerXLogId), \
+ (uint32) ((logSegNo) % XLogSegmentsPerXLogId))
+
+static void
+XLogDumpXLogWrite(const char *directory, TimeLineID timeline_id, XLogRecPtr startptr,
+ char *buf, Size count)
+{
+ char *p;
+ XLogRecPtr recptr;
+ Size nbytes;
+
+ static int sendFile = -1;
+ static XLogSegNo sendSegNo = 0;
+ static uint32 sendOff = 0;
+
+ p = buf;
+ recptr = startptr;
+ nbytes = count;
+
+ while (nbytes > 0)
+ {
+ uint32 startoff;
+ int segbytes;
+ int writebytes;
+
+ startoff = recptr % XLogSegSize;
+
+ if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
+ {
+ char path[MAXPGPATH];
+
+ /* Switch to another logfile segment */
+ if (sendFile >= 0)
+ close(sendFile);
+
+ XLByteToSeg(recptr, sendSegNo);
+ XLogFilePathWrite(path, directory, timeline_id, sendSegNo);
+
+ sendFile = open(path, O_WRONLY|O_CREAT, S_IRUSR | S_IWUSR);
+ if (sendFile < 0)
+ {
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m",
+ path)));
+ }
+ sendOff = 0;
+ }
+
+ /* Need to seek in the file? */
+ if (sendOff != startoff)
+ {
+ if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0){
+ char fname[MAXPGPATH];
+ XLogFileName(fname, timeline_id, sendSegNo);
+
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not seek in log segment %s to offset %u: %m",
+ fname,
+ startoff)));
+ }
+ sendOff = startoff;
+ }
+
+ /* How many bytes are within this segment? */
+ if (nbytes > (XLogSegSize - startoff))
+ segbytes = XLogSegSize - startoff;
+ else
+ segbytes = nbytes;
+
+ writebytes = write(sendFile, p, segbytes);
+ if (writebytes <= 0)
+ {
+ char fname[MAXPGPATH];
+ XLogFileName(fname, timeline_id, sendSegNo);
+
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write to log segment %s, offset %u, length %lu: %m",
+ fname,
+ sendOff, (unsigned long) segbytes)));
+ }
+
+ /* Update state for read */
+ XLByteAdvance(recptr, writebytes);
+
+ sendOff += writebytes;
+ nbytes -= writebytes;
+ p += writebytes;
+ }
+}
+
+/* this should probably be put in a general implementation */
+static void
+XLogDumpXLogRead(char *buf, TimeLineID timeline_id, XLogRecPtr startptr, Size count)
+{
+ char *p;
+ XLogRecPtr recptr;
+ Size nbytes;
+
+ static int sendFile = -1;
+ static XLogSegNo sendSegNo = 0;
+ static uint32 sendOff = 0;
+
+ p = buf;
+ recptr = startptr;
+ nbytes = count;
+
+ while (nbytes > 0)
+ {
+ uint32 startoff;
+ int segbytes;
+ int readbytes;
+
+ startoff = recptr % XLogSegSize;
+
+ if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
+ {
+ char path[MAXPGPATH];
+
+ /* Switch to another logfile segment */
+ if (sendFile >= 0)
+ close(sendFile);
+
+ XLByteToSeg(recptr, sendSegNo);
+ XLogFilePath(path, timeline_id, sendSegNo);
+
+ sendFile = open(path, O_RDONLY, 0);
+ if (sendFile < 0)
+ {
+ char fname[MAXPGPATH];
+ XLogFileName(fname, timeline_id, sendSegNo);
+ /*
+				 * If the file is not found, assume the requested WAL
+				 * segment is too old and has already been removed or
+				 * recycled.
+ */
+ if (errno == ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("requested WAL segment %s has already been removed",
+ fname)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m",
+ path)));
+ }
+ sendOff = 0;
+ }
+
+ /* Need to seek in the file? */
+ if (sendOff != startoff)
+ {
+ if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0){
+ char fname[MAXPGPATH];
+ XLogFileName(fname, timeline_id, sendSegNo);
+
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not seek in log segment %s to offset %u: %m",
+ fname,
+ startoff)));
+ }
+ sendOff = startoff;
+ }
+
+ /* How many bytes are within this segment? */
+ if (nbytes > (XLogSegSize - startoff))
+ segbytes = XLogSegSize - startoff;
+ else
+ segbytes = nbytes;
+
+ readbytes = read(sendFile, p, segbytes);
+ if (readbytes <= 0)
+ {
+ char fname[MAXPGPATH];
+ XLogFileName(fname, timeline_id, sendSegNo);
+
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from log segment %s, offset %u, length %lu: %m",
+ fname,
+ sendOff, (unsigned long) segbytes)));
+ }
+
+ /* Update state for read */
+ XLByteAdvance(recptr, readbytes);
+
+ sendOff += readbytes;
+ nbytes -= readbytes;
+ p += readbytes;
+ }
+}
+
+static void
+XLogDumpReadPage(XLogReaderState* state, char* cur_page, XLogRecPtr startptr)
+{
+ XLogPageHeader page_header;
+ Assert((startptr % XLOG_BLCKSZ) == 0);
+
+ /* FIXME: more sensible/efficient implementation */
+ XLogDumpXLogRead(cur_page, 1, startptr, XLOG_BLCKSZ);
+
+ page_header = (XLogPageHeader)cur_page;
+
+ if (page_header->xlp_magic != XLOG_PAGE_MAGIC)
+ {
+ elog(FATAL, "page header magic %x, should be %x at %X/%X", page_header->xlp_magic,
+			 XLOG_PAGE_MAGIC, (uint32)(startptr >> 32), (uint32)startptr);
+ }
+}
+
+static void
+XLogDumpWrite(XLogReaderState* state, char* data, Size len)
+{
+ static char zero[XLOG_BLCKSZ];
+ if(data == NULL)
+ data = zero;
+
+ XLogDumpXLogWrite("/tmp/xlog", 1 /* FIXME */, state->curptr,
+ data, len);
+}
+
+static void
+XLogDumpFinishedRecord(XLogReaderState* state, XLogRecordBuffer* buf)
+{
+ XLogRecord *record = &buf->record;
+ const RmgrData *rmgr = &RmgrTable[record->xl_rmid];
+
+ StringInfo str = makeStringInfo();
+ initStringInfo(str);
+
+ rmgr->rm_desc(str, state->buf.record.xl_info, buf->record_data);
+
+ fprintf(stderr, "xlog record: rmgr: %-11s, record_len: %6u, tot_len: %6u, tx: %10u, lsn: %X/%-8X, prev %X/%-8X, bkp: %u%u%u%u, desc: %s\n",
+ rmgr->rm_name,
+ record->xl_len, record->xl_tot_len,
+ record->xl_xid,
+ (uint32)(buf->origptr >> 32), (uint32)buf->origptr,
+ (uint32)(record->xl_prev >> 32), (uint32)record->xl_prev,
+ !!(XLR_BKP_BLOCK_1 & buf->record.xl_info),
+ !!(XLR_BKP_BLOCK_2 & buf->record.xl_info),
+ !!(XLR_BKP_BLOCK_3 & buf->record.xl_info),
+ !!(XLR_BKP_BLOCK_4 & buf->record.xl_info),
+ str->data);
+
+}
+
+
+static void init(void)
+{
+ MemoryContextInit();
+ IsPostmasterEnvironment = false;
+ log_min_messages = DEBUG1;
+ Log_error_verbosity = PGERROR_TERSE;
+ pg_timezone_initialize();
+}
+
+int main(int argc, char **argv)
+{
+ uint32 xlogid;
+ uint32 xrecoff;
+ XLogReaderState *xlogreader_state;
+ XLogRecPtr from, to;
+
+ init();
+
+ /* FIXME: should use getopt */
+ if (argc < 4)
+ elog(ERROR, "xlogdump timeline_id start finish");
+
+ if (sscanf(argv[2], "%X/%X", &xlogid, &xrecoff) != 2)
+ elog(ERROR, "couldn't parse argv[2]");
+
+ from = (((uint64)xlogid) << 32) | xrecoff;
+
+ if (sscanf(argv[3], "%X/%X", &xlogid, &xrecoff) != 2)
+ elog(ERROR, "couldn't parse argv[2]");
+
+ to = (uint64)xlogid << 32 | xrecoff;
+
+ xlogreader_state = XLogReaderAllocate();
+
+ /*
+ * not set because we want all records, perhaps we want filtering later?
+ * xlogreader_state->is_record_interesting =
+ */
+ xlogreader_state->finished_record = XLogDumpFinishedRecord;
+
+	/*
+	 * write the reassembled stream out to /tmp/xlog, to exercise the
+	 * writeout path
+	 */
+ xlogreader_state->writeout_data = XLogDumpWrite;
+
+ xlogreader_state->read_page = XLogDumpReadPage;
+
+ xlogreader_state->private_data = NULL;
+
+ xlogreader_state->startptr = from;
+ xlogreader_state->endptr = to;
+
+ XLogReaderRead(xlogreader_state);
+ XLogReaderFree(xlogreader_state);
+ return 0;
+}
This patch is problematic because, formally, indexes used by syscaches need to
be unique; this one is not, because of 0/InvalidOid relfilenode entries for
nailed/shared catalog entries. Those values cannot be sensibly queried from
the catalog anyway (the relmapper infrastructure needs to be used).
It might be nicer to add infrastructure to do this properly, I just don't have
a clue what the best way for this would be.
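For illustration, a lookup through the new syscache would look about like this
(a sketch only, the helper name is made up; note that, per the caveat above,
mapped relations with a 0/InvalidOid relfilenode cannot be found this way and
need the relmapper):

#include "postgres.h"
#include "utils/syscache.h"

/*
 * Sketch: fetch the pg_class tuple for a (tablespace, relfilenode) pair via
 * the new RELFILENODE syscache. The result has to be ReleaseSysCache()d by
 * the caller.
 */
static HeapTuple
fetch_class_by_relfilenode(Oid reltablespace, Oid relfilenode)
{
	return SearchSysCache2(RELFILENODE,
						   ObjectIdGetDatum(reltablespace),
						   ObjectIdGetDatum(relfilenode));
}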
---
src/backend/utils/cache/syscache.c | 11 +++++++++++
src/include/catalog/indexing.h | 2 ++
src/include/catalog/pg_proc.h | 1 +
src/include/utils/syscache.h | 1 +
4 files changed, 15 insertions(+)
Attachment: 0005-Add-a-new-syscache-to-fetch-a-pg_class-entry-via-rel.patch
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index ca22efd..9d2f6b7 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -613,6 +613,17 @@ static const struct cachedesc cacheinfo[] = {
},
1024
},
+ {RelationRelationId, /* RELFILENODE */
+ ClassTblspcRelfilenodeIndexId,
+ 2,
+ {
+ Anum_pg_class_reltablespace,
+ Anum_pg_class_relfilenode,
+ 0,
+ 0
+ },
+ 1024
+ },
{RewriteRelationId, /* RULERELNAME */
RewriteRelRulenameIndexId,
2,
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index 238fe58..c0a9339 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -106,6 +106,8 @@ DECLARE_UNIQUE_INDEX(pg_class_oid_index, 2662, on pg_class using btree(oid oid_o
#define ClassOidIndexId 2662
DECLARE_UNIQUE_INDEX(pg_class_relname_nsp_index, 2663, on pg_class using btree(relname name_ops, relnamespace oid_ops));
#define ClassNameNspIndexId 2663
+DECLARE_INDEX(pg_class_tblspc_relfilenode_index, 2844, on pg_class using btree(reltablespace oid_ops, relfilenode oid_ops));
+#define ClassTblspcRelfilenodeIndexId 2844
DECLARE_UNIQUE_INDEX(pg_collation_name_enc_nsp_index, 3164, on pg_collation using btree(collname name_ops, collencoding int4_ops, collnamespace oid_ops));
#define CollationNameEncNspIndexId 3164
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 77a3b41..d88248a 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4667,6 +4667,7 @@ DATA(insert OID = 3473 ( spg_range_quad_leaf_consistent PGNSP PGUID 12 1 0 0 0
DESCR("SP-GiST support for quad tree over range");
+
/*
* Symbolic values for provolatile column: these indicate whether the result
* of a function is dependent *only* on the values of its explicit arguments,
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index d1a9855..9a39077 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -77,6 +77,7 @@ enum SysCacheIdentifier
RANGETYPE,
RELNAMENSP,
RELOID,
+ RELFILENODE,
RULERELNAME,
STATRELATTINH,
TABLESPACEOID,
This adds a new wal_level value, 'logical'; the logging mechanism is sketched
after the list below.
Missing cases:
- heap_multi_insert
- primary key changes for updates
- no primary key
- LOG_NEWPAGE
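Condensed, the mechanism the heapam hunks below implement looks like this for
the insert path (a paraphrase, not the literal patch): the rdata entries
carrying the tuple are detached from the buffer, so the tuple data is not
elided in favor of a full-page image, and an extra contentless entry keeps the
buffer reference so full-page writes still happen:

	bool		need_tuple = wal_level == WAL_LEVEL_LOGICAL;

	/* tuple header and data have to survive even if the buffer gets a FPW */
	rdata[1].buffer = need_tuple ? InvalidBuffer : buffer;
	rdata[2].buffer = need_tuple ? InvalidBuffer : buffer;

	if (need_tuple)
	{
		/* contentless entry referencing the buffer, removed if a FPW is done */
		rdata[2].next = &(rdata[3]);
		rdata[3].data = NULL;
		rdata[3].len = 0;
		rdata[3].buffer = buffer;
		rdata[3].buffer_std = true;
		rdata[3].next = NULL;
	}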
---
src/backend/access/heap/heapam.c | 135 +++++++++++++++++++++++++++++---
src/backend/access/transam/xlog.c | 1 +
src/backend/catalog/index.c | 74 +++++++++++++++++
src/bin/pg_controldata/pg_controldata.c | 2 +
src/include/access/xlog.h | 3 +-
src/include/catalog/index.h | 4 +
6 files changed, 207 insertions(+), 12 deletions(-)
Attachment: 0006-Log-enough-data-into-the-wal-to-reconstruct-logical-.patch
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f56b577..190ae03 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -53,6 +53,7 @@
#include "access/xact.h"
#include "access/xlogutils.h"
#include "catalog/catalog.h"
+#include "catalog/index.h"
#include "catalog/namespace.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -1938,10 +1939,19 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
xl_heap_insert xlrec;
xl_heap_header xlhdr;
XLogRecPtr recptr;
- XLogRecData rdata[3];
+ XLogRecData rdata[4];
Page page = BufferGetPage(buffer);
uint8 info = XLOG_HEAP_INSERT;
+ /*
+	 * For the logical replication case we need the tuple even if we're
+	 * doing a full-page write. We could alternatively store a pointer into
+	 * the FPW though. For that to work we add another rdata entry for the
+	 * buffer in that case.
+ */
+ bool need_tuple = wal_level == WAL_LEVEL_LOGICAL;
+
xlrec.all_visible_cleared = all_visible_cleared;
xlrec.target.node = relation->rd_node;
xlrec.target.tid = heaptup->t_self;
@@ -1961,18 +1971,32 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
*/
rdata[1].data = (char *) &xlhdr;
rdata[1].len = SizeOfHeapHeader;
- rdata[1].buffer = buffer;
+ rdata[1].buffer = need_tuple ? InvalidBuffer : buffer;
rdata[1].buffer_std = true;
rdata[1].next = &(rdata[2]);
/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
rdata[2].data = (char *) heaptup->t_data + offsetof(HeapTupleHeaderData, t_bits);
rdata[2].len = heaptup->t_len - offsetof(HeapTupleHeaderData, t_bits);
- rdata[2].buffer = buffer;
+ rdata[2].buffer = need_tuple ? InvalidBuffer : buffer;
rdata[2].buffer_std = true;
rdata[2].next = NULL;
/*
+	 * add an rdata entry for the buffer without actual content; it is
+	 * removed if a full-page write is done for that buffer
+ */
+ if(need_tuple){
+ rdata[2].next = &(rdata[3]);
+
+ rdata[3].data = NULL;
+ rdata[3].len = 0;
+ rdata[3].buffer = buffer;
+ rdata[3].buffer_std = true;
+ rdata[3].next = NULL;
+ }
+
+ /*
* If this is the single and first tuple on page, we can reinit the
* page instead of restoring the whole thing. Set flag, and hide
* buffer references from XLogInsert.
@@ -1981,7 +2005,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
{
info |= XLOG_HEAP_INIT_PAGE;
- rdata[1].buffer = rdata[2].buffer = InvalidBuffer;
+ rdata[1].buffer = rdata[2].buffer = rdata[3].buffer = InvalidBuffer;
}
recptr = XLogInsert(RM_HEAP_ID, info, rdata);
@@ -2569,7 +2593,9 @@ l1:
{
xl_heap_delete xlrec;
XLogRecPtr recptr;
- XLogRecData rdata[2];
+ XLogRecData rdata[4];
+
+ bool need_tuple = wal_level == WAL_LEVEL_LOGICAL && relation->rd_id >= FirstNormalObjectId;
xlrec.all_visible_cleared = all_visible_cleared;
xlrec.target.node = relation->rd_node;
@@ -2585,6 +2611,73 @@ l1:
rdata[1].buffer_std = true;
rdata[1].next = NULL;
+ /*
+ * XXX: We could decide not to log changes when the origin is not the
+	 * local node; that should reduce redundant logging.
+ */
+ if(need_tuple){
+ xl_heap_header xlhdr;
+
+ Oid indexoid = InvalidOid;
+ int16 pknratts;
+ int16 pkattnum[INDEX_MAX_KEYS];
+ Oid pktypoid[INDEX_MAX_KEYS];
+ Oid pkopclass[INDEX_MAX_KEYS];
+ TupleDesc desc = RelationGetDescr(relation);
+ Relation index_rel;
+ TupleDesc indexdesc;
+ int natt;
+
+ Datum idxvals[INDEX_MAX_KEYS];
+ bool idxisnull[INDEX_MAX_KEYS];
+ HeapTuple idxtuple;
+
+ MemSet(pkattnum, 0, sizeof(pkattnum));
+ MemSet(pktypoid, 0, sizeof(pktypoid));
+ MemSet(pkopclass, 0, sizeof(pkopclass));
+ MemSet(idxvals, 0, sizeof(idxvals));
+ MemSet(idxisnull, 0, sizeof(idxisnull));
+ relationFindPrimaryKey(relation, &indexoid, &pknratts, pkattnum, pktypoid, pkopclass);
+
+ if(!indexoid){
+ elog(WARNING, "Could not find primary key for table with oid %u",
+ relation->rd_id);
+ goto no_index_found;
+ }
+
+ index_rel = index_open(indexoid, AccessShareLock);
+
+ indexdesc = RelationGetDescr(index_rel);
+
+ for(natt = 0; natt < indexdesc->natts; natt++){
+ idxvals[natt] =
+ fastgetattr(&tp, pkattnum[natt], desc, &idxisnull[natt]);
+ Assert(!idxisnull[natt]);
+ }
+
+ idxtuple = heap_form_tuple(indexdesc, idxvals, idxisnull);
+
+ xlhdr.t_infomask2 = idxtuple->t_data->t_infomask2;
+ xlhdr.t_infomask = idxtuple->t_data->t_infomask;
+ xlhdr.t_hoff = idxtuple->t_data->t_hoff;
+
+ rdata[1].next = &(rdata[2]);
+ rdata[2].data = (char*)&xlhdr;
+ rdata[2].len = SizeOfHeapHeader;
+ rdata[2].buffer = InvalidBuffer;
+ rdata[2].next = NULL;
+
+ rdata[2].next = &(rdata[3]);
+ rdata[3].data = (char *) idxtuple->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ rdata[3].len = idxtuple->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ rdata[3].buffer = InvalidBuffer;
+ rdata[3].next = NULL;
+
+ heap_close(index_rel, NoLock);
+ no_index_found:
+ ;
+ }
+
recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE, rdata);
PageSetLSN(page, recptr);
@@ -4414,9 +4507,14 @@ log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
xl_heap_header xlhdr;
uint8 info;
XLogRecPtr recptr;
- XLogRecData rdata[4];
+ XLogRecData rdata[5];
Page page = BufferGetPage(newbuf);
+ /*
+	 * Just as for XLOG_HEAP_INSERT we need to make sure the tuple data is
+	 * logged even if a full-page write is done.
+ */
+ bool need_tuple = wal_level == WAL_LEVEL_LOGICAL;
+
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -4447,28 +4545,43 @@ log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
xlhdr.t_hoff = newtup->t_data->t_hoff;
/*
- * As with insert records, we need not store the rdata[2] segment if we
- * decide to store the whole buffer instead.
+	 * As with insert's logging, we need not store the Datum containing the
+	 * tuple separately from the buffer, unless we are doing logical
+	 * replication, that is...
*/
rdata[2].data = (char *) &xlhdr;
rdata[2].len = SizeOfHeapHeader;
- rdata[2].buffer = newbuf;
+ rdata[2].buffer = need_tuple ? InvalidBuffer : newbuf;
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].buffer = newbuf;
+ rdata[3].buffer = need_tuple ? InvalidBuffer : newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
+ /*
+ * separate storage for the buffer reference of the new page in the
+ * wal_level=logical case
+ */
+ if(need_tuple){
+ rdata[3].next = &(rdata[4]);
+
+		rdata[4].data = NULL;
+ rdata[4].len = 0;
+ rdata[4].buffer = newbuf;
+ rdata[4].buffer_std = true;
+ rdata[4].next = NULL;
+ }
+
/* If new tuple is the single and first tuple on page... */
if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
{
info |= XLOG_HEAP_INIT_PAGE;
- rdata[2].buffer = rdata[3].buffer = InvalidBuffer;
+ rdata[2].buffer = rdata[3].buffer = rdata[4].buffer = InvalidBuffer;
}
recptr = XLogInsert(RM_HEAP_ID, info, rdata);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ff56c26..53a0bc8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -107,6 +107,7 @@ const struct config_enum_entry wal_level_options[] = {
{"minimal", WAL_LEVEL_MINIMAL, false},
{"archive", WAL_LEVEL_ARCHIVE, false},
{"hot_standby", WAL_LEVEL_HOT_STANDBY, false},
+ {"logical", WAL_LEVEL_LOGICAL, false},
{NULL, 0, false}
};
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 464950b..8145997 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -49,6 +49,7 @@
#include "nodes/nodeFuncs.h"
#include "optimizer/clauses.h"
#include "parser/parser.h"
+#include "parser/parse_relation.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -3322,3 +3323,76 @@ ResetReindexPending(void)
{
pendingReindexedIndexes = NIL;
}
+
+/*
+ * relationFindPrimaryKey
+ * Find primary key for a relation if it exists.
+ *
+ * If no primary key is found *indexOid is set to InvalidOid
+ *
+ * This is quite similar to tablecmds.c's transformFkeyGetPrimaryKey.
+ *
+ * XXX: It might be a good idea to change pg_class.relhaspkey into an oid to
+ * make this more efficient.
+ */
+void
+relationFindPrimaryKey(Relation pkrel, Oid *indexOid,
+ int16 *nratts, int16 *attnums, Oid *atttypids,
+		 Oid *opclasses)
+{
+ List *indexoidlist;
+ ListCell *indexoidscan;
+ HeapTuple indexTuple = NULL;
+ Datum indclassDatum;
+ bool isnull;
+ oidvector *indclass;
+ int i;
+ Form_pg_index indexStruct = NULL;
+
+ *indexOid = InvalidOid;
+
+ indexoidlist = RelationGetIndexList(pkrel);
+
+ foreach(indexoidscan, indexoidlist)
+ {
+ Oid indexoid = lfirst_oid(indexoidscan);
+
+ indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(indexoid));
+ if(!HeapTupleIsValid(indexTuple))
+ elog(ERROR, "cache lookup failed for index %u", indexoid);
+
+ indexStruct = (Form_pg_index) GETSTRUCT(indexTuple);
+ if(indexStruct->indisprimary && indexStruct->indimmediate)
+ {
+ *indexOid = indexoid;
+ break;
+ }
+ ReleaseSysCache(indexTuple);
+
+ }
+ list_free(indexoidlist);
+
+ if (!OidIsValid(*indexOid))
+ return;
+
+ /* Must get indclass the hard way */
+ indclassDatum = SysCacheGetAttr(INDEXRELID, indexTuple,
+ Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ indclass = (oidvector *) DatumGetPointer(indclassDatum);
+
+ *nratts = indexStruct->indnatts;
+ /*
+ * Now build the list of PK attributes from the indkey definition (we
+ * assume a primary key cannot have expressional elements)
+ */
+ for (i = 0; i < indexStruct->indnatts; i++)
+ {
+ int pkattno = indexStruct->indkey.values[i];
+
+ attnums[i] = pkattno;
+ atttypids[i] = attnumTypeId(pkrel, pkattno);
+ opclasses[i] = indclass->values[i];
+ }
+
+ ReleaseSysCache(indexTuple);
+}
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 129c4d0..10080d0 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -77,6 +77,8 @@ wal_level_str(WalLevel wal_level)
return "archive";
case WAL_LEVEL_HOT_STANDBY:
return "hot_standby";
+ case WAL_LEVEL_LOGICAL:
+ return "logical";
}
return _("unrecognized wal_level");
}
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 2893f3b..7d90416 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -200,7 +200,8 @@ typedef enum WalLevel
{
WAL_LEVEL_MINIMAL = 0,
WAL_LEVEL_ARCHIVE,
- WAL_LEVEL_HOT_STANDBY
+ WAL_LEVEL_HOT_STANDBY,
+ WAL_LEVEL_LOGICAL
} WalLevel;
extern int wal_level;
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index eb417ce..3de0a29 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -102,4 +102,8 @@ extern bool ReindexIsProcessingHeap(Oid heapOid);
extern bool ReindexIsProcessingIndex(Oid indexOid);
extern Oid IndexGetRelation(Oid indexId, bool missing_ok);
+extern void relationFindPrimaryKey(Relation pkrel, Oid *indexOid,
+ int16 *nratts, int16 *attnums, Oid *atttypids,
+ Oid *opclasses);
+
#endif /* INDEX_H */
Pieces of this are in commit: make relfilenode lookup (tablespace, relfilenode
---
src/backend/utils/cache/inval.c | 2 +-
src/include/utils/inval.h | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)
Attachment: 0007-Make-InvalidateSystemCaches-public.patch
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index e26bf0b..c75c032 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -547,7 +547,7 @@ LocalExecuteInvalidationMessage(SharedInvalidationMessage *msg)
* since that tells us we've lost some shared-inval messages and hence
* don't know what needs to be invalidated.
*/
-static void
+void
InvalidateSystemCaches(void)
{
int i;
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index c5549a6..648bfdc 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -67,4 +67,6 @@ extern void CallSyscacheCallbacks(int cacheid, uint32 hashvalue);
extern void inval_twophase_postcommit(TransactionId xid, uint16 info,
void *recdata, uint32 len);
+extern void InvalidateSystemCaches(void);
+
#endif /* INVAL_H */
This introduces several things:
* applycache module which reassembles transactions from a stream of interspersed changes
* snapbuilder which builds catalog snapshots so that tuples from wal can be understood
* wal decoding into an applycache
* decode_xlog(lsn, lsn) debugging function
The applycache provides three major callbacks (a minimal consumer sketch
follows this list):
* apply_begin
* apply_change
* apply_commit
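A minimal consumer only has to fill in these callbacks (a sketch modeled on
the debugging consumer in logicalfuncs.c further down; the my_* functions are
placeholders, and the struct members are named begin, apply_change and commit
in applycache.h):

#include "postgres.h"
#include "replication/applycache.h"

static void
my_begin(ApplyCache *cache, ApplyCacheTXN *txn)
{
	elog(LOG, "begin of xid %u", txn->xid);
}

static void
my_change(ApplyCache *cache, ApplyCacheTXN *txn,
		  ApplyCacheTXN *subtxn, ApplyCacheChange *change)
{
	/*
	 * change->action tells INSERT/UPDATE/DELETE apart; the tuples are in
	 * change->newtuple/change->oldtuple
	 */
	elog(LOG, "change in xid %u", txn->xid);
}

static void
my_commit(ApplyCache *cache, ApplyCacheTXN *txn)
{
	elog(LOG, "commit of xid %u", txn->xid);
}

static ApplyCache *
setup_consumer(void)
{
	ApplyCache *cache = ApplyCacheAllocate();

	cache->begin = my_begin;
	cache->apply_change = my_change;
	cache->commit = my_commit;

	return cache;
}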
It is missing several parts:
- spill-to-disk
- resource usage controls
- command id handling
- passing of the correct mvcc snapshot (it is already built, just not passed on)
The snapshot building has the most critical infrastructure but misses several
important features:
* loads of docs about the internals
* improve snapshot building/distributions
* don't build them all the time, cache them
* don't increase ->xmax so slowly, it's inefficient
* refcount
* actually free them
* proper cache handling
* we can probably reuse xl_xact_commit->nmsgs
* generate new local inval messages from catalog changes?
* handle transactions with both ddl, and changes
* command_id handling
* combocid logging/handling
* Add support for declaring tables as catalog tables that are not pg_catalog.*
* properly distribute new SnapshotNow snapshots after a transaction commits
* loads of testing/edge cases
* provision of a consistent snapshot for pg_dump
* spill state to disk at checkpoints
* xmin handling
The xlog decoding also misses several parts:
- HEAP_NEWPAGE support
- HEAP2_MULTI_INSERT support
- handling of table rewrites
---
src/backend/replication/Makefile | 2 +
src/backend/replication/logical/Makefile | 19 +
src/backend/replication/logical/applycache.c | 574 +++++++++++++
src/backend/replication/logical/decode.c | 366 +++++++++
src/backend/replication/logical/logicalfuncs.c | 237 ++++++
src/backend/replication/logical/snapbuild.c | 1045 ++++++++++++++++++++++++
src/backend/utils/time/tqual.c | 161 ++++
src/include/access/transam.h | 5 +
src/include/catalog/pg_proc.h | 3 +
src/include/replication/applycache.h | 239 ++++++
src/include/replication/decode.h | 26 +
src/include/replication/snapbuild.h | 119 +++
src/include/utils/tqual.h | 21 +-
13 files changed, 2816 insertions(+), 1 deletion(-)
create mode 100644 src/backend/replication/logical/Makefile
create mode 100644 src/backend/replication/logical/applycache.c
create mode 100644 src/backend/replication/logical/decode.c
create mode 100644 src/backend/replication/logical/logicalfuncs.c
create mode 100644 src/backend/replication/logical/snapbuild.c
create mode 100644 src/include/replication/applycache.h
create mode 100644 src/include/replication/decode.h
create mode 100644 src/include/replication/snapbuild.h
Attachment: 0008-Introduce-wal-decoding-via-catalog-timetravel.patch
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index 9d9ec87..ae7f6b1 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -17,6 +17,8 @@ override CPPFLAGS := -I$(srcdir) $(CPPFLAGS)
OBJS = walsender.o walreceiverfuncs.o walreceiver.o basebackup.o \
repl_gram.o syncrep.o
+SUBDIRS = logical
+
include $(top_srcdir)/src/backend/common.mk
# repl_scanner is compiled as part of repl_gram
diff --git a/src/backend/replication/logical/Makefile b/src/backend/replication/logical/Makefile
new file mode 100644
index 0000000..4e56769
--- /dev/null
+++ b/src/backend/replication/logical/Makefile
@@ -0,0 +1,19 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for src/backend/replication/logical
+#
+# IDENTIFICATION
+# src/backend/replication/logical/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/logical
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(srcdir) $(CPPFLAGS)
+
+OBJS = applycache.o decode.o snapbuild.o logicalfuncs.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/replication/logical/applycache.c b/src/backend/replication/logical/applycache.c
new file mode 100644
index 0000000..1e08371
--- /dev/null
+++ b/src/backend/replication/logical/applycache.c
@@ -0,0 +1,574 @@
+/*-------------------------------------------------------------------------
+ *
+ * applycache.c
+ *
+ * PostgreSQL logical replay "cache" management
+ *
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/replication/logical/applycache.c
+ *
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/xact.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_control.h"
+#include "replication/applycache.h"
+
+#include "lib/simpleheap.h"
+
+#include "utils/ilist.h"
+#include "utils/memutils.h"
+#include "utils/relcache.h"
+#include "utils/tqual.h"
+#include "utils/syscache.h"
+
+
+const Size max_memtries = 1<<16;
+
+const size_t max_cached_changes = 1024;
+const size_t max_cached_tuplebufs = 1024; /* ~8MB */
+const size_t max_cached_transactions = 512;
+
+typedef struct ApplyCacheTXNByIdEnt
+{
+ TransactionId xid;
+ ApplyCacheTXN* txn;
+} ApplyCacheTXNByIdEnt;
+
+static ApplyCacheTXN* ApplyCacheGetTXN(ApplyCache *cache);
+static void ApplyCacheReturnTXN(ApplyCache *cache, ApplyCacheTXN* txn);
+
+static ApplyCacheTXN* ApplyCacheTXNByXid(ApplyCache*, TransactionId xid,
+ bool create, bool* is_new);
+
+
+ApplyCache*
+ApplyCacheAllocate(void)
+{
+ ApplyCache* cache = (ApplyCache*)malloc(sizeof(ApplyCache));
+ HASHCTL hash_ctl;
+
+ if (!cache)
+ elog(ERROR, "Could not allocate the ApplyCache");
+
+ cache->build_snapshots = true;
+
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+
+ cache->context = AllocSetContextCreate(TopMemoryContext,
+ "ApplyCache",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+
+ hash_ctl.keysize = sizeof(TransactionId);
+ hash_ctl.entrysize = sizeof(ApplyCacheTXNByIdEnt);
+ hash_ctl.hash = tag_hash;
+ hash_ctl.hcxt = cache->context;
+
+ cache->by_txn = hash_create("ApplyCacheByXid", 1000, &hash_ctl,
+ HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);
+
+ cache->nr_cached_transactions = 0;
+ cache->nr_cached_changes = 0;
+ cache->nr_cached_tuplebufs = 0;
+
+ ilist_d_init(&cache->cached_transactions);
+ ilist_d_init(&cache->cached_changes);
+ ilist_s_init(&cache->cached_tuplebufs);
+
+ return cache;
+}
+
+void ApplyCacheFree(ApplyCache* cache)
+{
+ /* FIXME: check for in-progress transactions */
+ /* FIXME: clean up cached transaction */
+ /* FIXME: clean up cached changes */
+ /* FIXME: clean up cached tuplebufs */
+ hash_destroy(cache->by_txn);
+ free(cache);
+}
+
+static ApplyCacheTXN* ApplyCacheGetTXN(ApplyCache *cache)
+{
+ ApplyCacheTXN* txn;
+
+ if (cache->nr_cached_transactions)
+ {
+ cache->nr_cached_transactions--;
+ txn = ilist_container(ApplyCacheTXN, node,
+ ilist_d_pop_front(&cache->cached_transactions));
+ }
+ else
+ {
+ txn = (ApplyCacheTXN*)
+ malloc(sizeof(ApplyCacheTXN));
+
+ if (!txn)
+ elog(ERROR, "Could not allocate a ApplyCacheTXN struct");
+ }
+
+ memset(txn, 0, sizeof(ApplyCacheTXN));
+ ilist_d_init(&txn->changes);
+ ilist_d_init(&txn->subtxns);
+ ilist_d_init(&txn->snapshots);
+ ilist_d_init(&txn->commandids);
+
+ return txn;
+}
+
+static void
+ApplyCacheReturnTXN(ApplyCache *cache, ApplyCacheTXN* txn)
+{
+ if(cache->nr_cached_transactions < max_cached_transactions){
+ cache->nr_cached_transactions++;
+ ilist_d_push_front(&cache->cached_transactions, &txn->node);
+ }
+ else{
+ free(txn);
+ }
+}
+
+ApplyCacheChange*
+ApplyCacheGetChange(ApplyCache* cache)
+{
+ ApplyCacheChange* change;
+
+ if (cache->nr_cached_changes)
+ {
+ cache->nr_cached_changes--;
+ change = ilist_container(ApplyCacheChange, node,
+ ilist_d_pop_front(&cache->cached_changes));
+ }
+ else
+ {
+ change = (ApplyCacheChange*)malloc(sizeof(ApplyCacheChange));
+
+ if (!change)
+ elog(ERROR, "Could not allocate a ApplyCacheChange struct");
+ }
+
+
+ memset(change, 0, sizeof(ApplyCacheChange));
+ return change;
+}
+
+void
+ApplyCacheReturnChange(ApplyCache* cache, ApplyCacheChange* change)
+{
+ switch(change->action){
+ case APPLY_CACHE_CHANGE_INSERT:
+ case APPLY_CACHE_CHANGE_UPDATE:
+ case APPLY_CACHE_CHANGE_DELETE:
+ if (change->newtuple)
+ {
+ ApplyCacheReturnTupleBuf(cache, change->newtuple);
+ change->newtuple = NULL;
+ }
+
+ if (change->oldtuple)
+ {
+ ApplyCacheReturnTupleBuf(cache, change->oldtuple);
+ change->oldtuple = NULL;
+ }
+
+ if (change->table)
+ {
+ heap_freetuple(change->table);
+ change->table = NULL;
+ }
+ break;
+ case APPLY_CACHE_CHANGE_SNAPSHOT:
+ if (change->snapshot)
+ {
+ /* FIXME: free snapshot */
+ change->snapshot = NULL;
+		}
+		break;
+	case APPLY_CACHE_CHANGE_COMMAND_ID:
+ break;
+ }
+
+ if(cache->nr_cached_changes < max_cached_changes){
+ cache->nr_cached_changes++;
+ ilist_d_push_front(&cache->cached_changes, &change->node);
+ }
+ else{
+ free(change);
+ }
+}
+
+ApplyCacheTupleBuf*
+ApplyCacheGetTupleBuf(ApplyCache* cache)
+{
+ ApplyCacheTupleBuf* tuple;
+
+ if (cache->nr_cached_tuplebufs)
+ {
+ cache->nr_cached_tuplebufs--;
+ tuple = ilist_container(ApplyCacheTupleBuf, node,
+ ilist_s_pop_front(&cache->cached_tuplebufs));
+ }
+ else
+ {
+ tuple =
+ (ApplyCacheTupleBuf*)malloc(sizeof(ApplyCacheTupleBuf));
+
+ if (!tuple)
+ elog(ERROR, "Could not allocate a ApplyCacheTupleBuf struct");
+ }
+
+ return tuple;
+}
+
+void
+ApplyCacheReturnTupleBuf(ApplyCache* cache, ApplyCacheTupleBuf* tuple)
+{
+ if(cache->nr_cached_tuplebufs < max_cached_tuplebufs){
+ cache->nr_cached_tuplebufs++;
+ ilist_s_push_front(&cache->cached_tuplebufs, &tuple->node);
+ }
+ else{
+ free(tuple);
+ }
+}
+
+
+static
+ApplyCacheTXN*
+ApplyCacheTXNByXid(ApplyCache* cache, TransactionId xid, bool create, bool* is_new)
+{
+ ApplyCacheTXNByIdEnt* ent;
+ bool found;
+
+ /* FIXME: add one entry fast-path cache */
+
+ ent = (ApplyCacheTXNByIdEnt*)
+ hash_search(cache->by_txn,
+ (void *)&xid,
+ (create ? HASH_ENTER : HASH_FIND),
+ &found);
+
+ if (found)
+ {
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "found cache entry for %u at %p", xid, ent);
+#endif
+ }
+ else
+ {
+#ifdef VERBOSE_DEBUG
+ elog(LOG, "didn't find cache entry for %u in %p at %p, creating %u",
+ xid, cache, ent, create);
+#endif
+ }
+
+ if (!found && !create)
+ return NULL;
+
+ if (!found)
+ {
+ ent->txn = ApplyCacheGetTXN(cache);
+ ent->txn->xid = xid;
+ }
+
+ if (is_new)
+ *is_new = !found;
+
+ return ent->txn;
+}
+
+void
+ApplyCacheAddChange(ApplyCache* cache, TransactionId xid, XLogRecPtr lsn,
+ ApplyCacheChange* change)
+{
+ ApplyCacheTXN* txn = ApplyCacheTXNByXid(cache, xid, true, NULL);
+ txn->lsn = lsn;
+ ilist_d_push_back(&txn->changes, &change->node);
+}
+
+
+void
+ApplyCacheCommitChild(ApplyCache* cache, TransactionId xid,
+ TransactionId subxid, XLogRecPtr lsn)
+{
+ ApplyCacheTXN* txn;
+ ApplyCacheTXN* subtxn;
+
+ subtxn = ApplyCacheTXNByXid(cache, subxid, false, NULL);
+
+ /*
+ * No need to do anything if that subtxn didn't contain any changes
+ */
+ if (!subtxn)
+ return;
+
+ subtxn->lsn = lsn;
+
+ txn = ApplyCacheTXNByXid(cache, xid, true, NULL);
+
+ ilist_d_push_back(&txn->subtxns, &subtxn->node);
+}
+
+typedef struct ApplyCacheIterTXNState
+{
+ simpleheap *heap;
+} ApplyCacheIterTXNState;
+
+static int
+ApplyCacheIterCompare(simpleheap_kv* a, simpleheap_kv* b)
+{
+ ApplyCacheChange *change_a = ilist_container(ApplyCacheChange, node, a->key);
+ ApplyCacheChange *change_b = ilist_container(ApplyCacheChange, node, b->key);
+
+ if (change_a->lsn < change_b->lsn)
+ return -1;
+
+ else if (change_a->lsn == change_b->lsn)
+ return 0;
+
+ return 1;
+}
+
+static ApplyCacheIterTXNState*
+ApplyCacheIterTXNInit(ApplyCache* cache, ApplyCacheTXN* txn);
+
+static ApplyCacheChange*
+ApplyCacheIterTXNNext(ApplyCache* cache, ApplyCacheIterTXNState* state);
+
+static void
+ApplyCacheIterTXNFinish(ApplyCache* cache, ApplyCacheIterTXNState* state);
+
+
+
+static ApplyCacheIterTXNState*
+ApplyCacheIterTXNInit(ApplyCache* cache, ApplyCacheTXN* txn)
+{
+ size_t nr_txns = 0; /* main txn */
+ ApplyCacheIterTXNState *state;
+ ilist_d_node* cur_txn_i;
+ ApplyCacheTXN *cur_txn;
+ ApplyCacheChange *cur_change;
+
+ if (!ilist_d_is_empty(&txn->changes))
+ nr_txns++;
+
+ /* count how large our heap must be */
+ ilist_d_foreach(cur_txn_i, &txn->subtxns)
+ {
+ cur_txn = ilist_container(ApplyCacheTXN, node, cur_txn_i);
+
+ if (!ilist_d_is_empty(&cur_txn->changes))
+ nr_txns++;
+ }
+
+ /* allocate array for our heap */
+ state = palloc0(sizeof(ApplyCacheIterTXNState));
+
+ state->heap = simpleheap_allocate(nr_txns);
+ state->heap->compare = ApplyCacheIterCompare;
+
+	/* fill array with elements, heap condition not yet fulfilled */
+ if (!ilist_d_is_empty(&txn->changes))
+ {
+ cur_change = ilist_d_front_unchecked(ApplyCacheChange, node, &txn->changes);
+
+ simpleheap_add_unordered(state->heap, &cur_change->node, txn);
+ }
+
+ ilist_d_foreach(cur_txn_i, &txn->subtxns)
+ {
+ cur_txn = ilist_container(ApplyCacheTXN, node, cur_txn_i);
+
+ if (!ilist_d_is_empty(&cur_txn->changes))
+ {
+ cur_change = ilist_d_front_unchecked(ApplyCacheChange, node, &cur_txn->changes);
+
+			simpleheap_add_unordered(state->heap, &cur_change->node, cur_txn);
+ }
+ }
+
+	/* make the array fulfill the heap property */
+ simpleheap_build(state->heap);
+ return state;
+}
+
+static ApplyCacheChange*
+ApplyCacheIterTXNNext(ApplyCache* cache, ApplyCacheIterTXNState* state)
+{
+ ApplyCacheTXN *txn = NULL;
+ ApplyCacheChange *change;
+ simpleheap_kv *kv;
+
+ /*
+	 * Do a k-way merge between transactions/subtransactions to extract the
+	 * changes ordered by the lsn of each change. For that we model the
+	 * current heads of the different transactions as a binary heap so we
+	 * easily know which (sub-)transaction has the change with the smallest
+	 * lsn next.
+ */
+
+ /* nothing there anymore */
+ if (state->heap->size == 0)
+ return NULL;
+
+ kv = simpleheap_first(state->heap);
+
+ change = ilist_container(ApplyCacheChange, node, kv->key);
+
+ txn = (ApplyCacheTXN*)kv->value;
+
+ if (!ilist_d_has_next(&txn->changes, &change->node))
+ {
+ simpleheap_remove_first(state->heap);
+ }
+ else
+ {
+ simpleheap_change_key(state->heap, change->node.next);
+ }
+ return change;
+}
+
+static void
+ApplyCacheIterTXNFinish(ApplyCache* cache, ApplyCacheIterTXNState* state)
+{
+ simpleheap_free(state->heap);
+ pfree(state);
+}
+
+
+static void
+ApplyCacheCleanupTXN(ApplyCache* cache, ApplyCacheTXN* txn)
+{
+ bool found;
+ ilist_d_node* cur_change, *next_change;
+ ilist_d_node* cur_txn, *next_txn;
+
+ /* cleanup transactions & changes */
+ ilist_d_foreach_modify (cur_txn, next_txn, &txn->subtxns)
+ {
+ ApplyCacheTXN* subtxn = ilist_container(ApplyCacheTXN, node, cur_txn);
+
+ ilist_d_foreach_modify (cur_change, next_change, &subtxn->changes)
+ {
+ ApplyCacheChange* change =
+ ilist_container(ApplyCacheChange, node, cur_change);
+
+ ApplyCacheReturnChange(cache, change);
+ }
+ ApplyCacheReturnTXN(cache, subtxn);
+ }
+
+ ilist_d_foreach_modify (cur_change, next_change, &txn->changes)
+ {
+ ApplyCacheChange* change =
+ ilist_container(ApplyCacheChange, node, cur_change);
+
+ ApplyCacheReturnChange(cache, change);
+ }
+
+ /* now remove reference from cache */
+ hash_search(cache->by_txn,
+ (void *)&txn->xid,
+ HASH_REMOVE,
+ &found);
+ Assert(found);
+
+ ApplyCacheReturnTXN(cache, txn);
+}
+
+void
+ApplyCacheCommit(ApplyCache* cache, TransactionId xid, XLogRecPtr lsn)
+{
+ ApplyCacheTXN* txn = ApplyCacheTXNByXid(cache, xid, false, NULL);
+ ApplyCacheIterTXNState* iterstate;
+ ApplyCacheChange* change;
+ CommandId command_id;
+ Snapshot snapshot_mvcc = NULL;
+
+ if (!txn)
+ return;
+
+ txn->lsn = lsn;
+
+ cache->begin(cache, txn);
+
+ PG_TRY();
+ {
+ iterstate = ApplyCacheIterTXNInit(cache, txn);
+ while((change = ApplyCacheIterTXNNext(cache, iterstate)))
+ {
+ switch(change->action){
+ case APPLY_CACHE_CHANGE_INSERT:
+ case APPLY_CACHE_CHANGE_UPDATE:
+ case APPLY_CACHE_CHANGE_DELETE:
+ Assert(snapshot_mvcc != NULL);
+ cache->apply_change(cache, txn, txn /*FIXME*/, change);
+ break;
+ case APPLY_CACHE_CHANGE_SNAPSHOT:
+ /*
+ * the first snapshot seen in a transaction is its mvcc
+ * snapshot
+ */
+ if (!snapshot_mvcc)
+ snapshot_mvcc = change->snapshot;
+ SetupDecodingSnapshots(change->snapshot);
+ break;
+ case APPLY_CACHE_CHANGE_COMMAND_ID:
+ /* FIXME */
+ command_id = change->command_id;
+ break;
+ }
+ }
+
+ ApplyCacheIterTXNFinish(cache, iterstate);
+
+ cache->commit(cache, txn);
+
+ ApplyCacheCleanupTXN(cache, txn);
+ RevertFromDecodingSnapshots();
+ }
+ PG_CATCH();
+ {
+ RevertFromDecodingSnapshots();
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+}
+
+void
+ApplyCacheAbort(ApplyCache* cache, TransactionId xid, XLogRecPtr lsn)
+{
+ ApplyCacheTXN* txn = ApplyCacheTXNByXid(cache, xid, false, NULL);
+
+	/* no changes in this aborted transaction */
+ if (!txn)
+ return;
+
+ ApplyCacheCleanupTXN(cache, txn);
+}
+
+bool
+ApplyCacheIsXidKnown(ApplyCache* cache, TransactionId xid)
+{
+ bool is_new;
+ /* FIXME: for efficiency reasons we create the xid here, that doesn't seem
+ * like a good idea though */
+ ApplyCacheTXNByXid(cache, xid, true, &is_new);
+
+	/* the xid was known iff we didn't just create the entry */
+ return !is_new;
+}
+
+void
+ApplyCacheAddBaseSnapshot(ApplyCache* cache, TransactionId xid, XLogRecPtr lsn, Snapshot snap)
+{
+ ApplyCacheChange *change = ApplyCacheGetChange(cache);
+ change->snapshot = snap;
+ change->action = APPLY_CACHE_CHANGE_SNAPSHOT;
+
+ ApplyCacheAddChange(cache, xid, lsn, change);
+}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
new file mode 100644
index 0000000..244dd7b
--- /dev/null
+++ b/src/backend/replication/logical/decode.c
@@ -0,0 +1,366 @@
+/*-------------------------------------------------------------------------
+ *
+ * decode.c
+ *
+ * Decodes wal records from an xlogreader.h callback into an applycache.
+ *
+ * Portions Copyright (c) 2010-2012, PostgreSQL Global Development Group
+ *
+ * NOTE:
+ *
+ * It's possible that the separation between decode.c and snapbuild.c is a
+ * bit too strict; in the end they have just about the same struct.
+ *
+ * IDENTIFICATION
+ * src/backend/replication/logical/decode.c
+ *
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/transam.h"
+#include "access/xlog_internal.h"
+#include "access/xact.h"
+#include "access/heapam_xlog.h"
+
+#include "catalog/pg_control.h"
+
+#include "replication/applycache.h"
+#include "replication/decode.h"
+#include "replication/snapbuild.h"
+
+#include "utils/memutils.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+
+static void DecodeXLogTuple(char* data, Size len, ApplyCacheTupleBuf* tuple);
+
+static void DecodeInsert(ApplyCache *cache, XLogRecordBuffer* buf);
+
+static void DecodeUpdate(ApplyCache *cache, XLogRecordBuffer* buf);
+
+static void DecodeDelete(ApplyCache *cache, XLogRecordBuffer* buf);
+
+static void DecodeNewpage(ApplyCache *cache, XLogRecordBuffer* buf);
+static void DecodeMultiInsert(ApplyCache *cache, XLogRecordBuffer* buf);
+
+static void DecodeCommit(ApplyCache* cache, XLogRecordBuffer* buf, TransactionId xid,
+ TransactionId *sub_xids, int nsubxacts);
+
+
+void DecodeRecordIntoApplyCache(ReaderApplyState *state, XLogRecordBuffer* buf)
+{
+ XLogRecord* r = &buf->record;
+ uint8 info = r->xl_info & ~XLR_INFO_MASK;
+ ApplyCache *cache = state->apply_cache;
+ SnapBuildAction action;
+
+ /*
+	 * FIXME: The existence of the snapshot builder is pretty visible to the
+	 * outside right now; that doesn't seem to be very good...
+ */
+ if(!state->snapstate)
+ {
+ state->snapstate = AllocateSnapshotBuilder(cache);
+ }
+
+ /*
+ * Call the snapshot builder. It needs to be called before we analyze
+ * tuples for two reasons:
+ *
+	 * * Only in the snapshot building logic do we know whether we have enough
+ * information to decode a particular tuple
+ *
+ * * The Snapshot/CommandIds computed by the SnapshotBuilder need to be
+ * added to the ApplyCache before we add tuples using them
+ */
+ action = SnapBuildCallback(cache, state->snapstate, buf);
+
+ if (action == SNAPBUILD_SKIP)
+ return;
+
+ switch (r->xl_rmid)
+ {
+ case RM_HEAP_ID:
+ {
+ info &= XLOG_HEAP_OPMASK;
+ switch (info)
+ {
+ case XLOG_HEAP_INSERT:
+ DecodeInsert(cache, buf);
+ break;
+
+				/* no guarantee that we get a HOT update again, so handle it as a normal update */
+ case XLOG_HEAP_HOT_UPDATE:
+ case XLOG_HEAP_UPDATE:
+ DecodeUpdate(cache, buf);
+ break;
+
+ case XLOG_HEAP_NEWPAGE:
+ DecodeNewpage(cache, buf);
+ break;
+
+ case XLOG_HEAP_DELETE:
+ DecodeDelete(cache, buf);
+ break;
+ default:
+ break;
+ }
+ break;
+ }
+ case RM_HEAP2_ID:
+ {
+ info &= XLOG_HEAP_OPMASK;
+ switch (info)
+ {
+ case XLOG_HEAP2_MULTI_INSERT:
+ DecodeMultiInsert(cache, buf);
+ break;
+ default:
+					/* everything else here is just physical stuff we're not interested in */
+ break;
+ }
+ break;
+ }
+
+ case RM_XACT_ID:
+ {
+ switch (info)
+ {
+ case XLOG_XACT_COMMIT:
+ {
+ TransactionId *sub_xids;
+ xl_xact_commit *xlrec = (xl_xact_commit*)buf->record_data;
+
+					/* FIXME: this is not really allowed if there are no subtransactions */
+ sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
+ DecodeCommit(cache, buf, r->xl_xid, sub_xids, xlrec->nsubxacts);
+
+ break;
+ }
+ case XLOG_XACT_COMMIT_PREPARED:
+ {
+ TransactionId *sub_xids;
+ xl_xact_commit_prepared *xlrec = (xl_xact_commit_prepared*)buf->record_data;
+
+ sub_xids = (TransactionId *) &(xlrec->crec.xnodes[xlrec->crec.nrels]);
+
+ DecodeCommit(cache, buf, r->xl_xid, sub_xids,
+ xlrec->crec.nsubxacts);
+
+ break;
+ }
+ case XLOG_XACT_COMMIT_COMPACT:
+ {
+ xl_xact_commit_compact *xlrec = (xl_xact_commit_compact*)buf->record_data;
+ DecodeCommit(cache, buf, r->xl_xid, xlrec->subxacts,
+ xlrec->nsubxacts);
+ break;
+ }
+ case XLOG_XACT_ABORT:
+ case XLOG_XACT_ABORT_PREPARED:
+ {
+ TransactionId *sub_xids;
+ xl_xact_abort *xlrec = (xl_xact_abort*)buf->record_data;
+ int i;
+
+					/* FIXME: this is not really allowed if there are no subtransactions */
+ sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
+
+ for(i = 0; i < xlrec->nsubxacts; i++)
+ {
+ ApplyCacheAbort(cache, *sub_xids, buf->origptr);
+ sub_xids += 1;
+ }
+
+ /* TODO: check that this also contains not-yet-aborted subtxns */
+ ApplyCacheAbort(cache, r->xl_xid, buf->origptr);
+
+ elog(WARNING, "ABORT %u", r->xl_xid);
+ break;
+ }
+ case XLOG_XACT_ASSIGNMENT:
+ /*
+ * XXX: We could reassign transactions to the parent here
+ * to save space and effort when merging transactions at
+ * commit.
+ */
+ break;
+ case XLOG_XACT_PREPARE:
+ /*
+					 * FIXME: we should replay the transaction and prepare it
+ * as well.
+ */
+ break;
+				default:
+					break;
+ }
+ break;
+ }
+ case RM_XLOG_ID:
+ {
+ switch (info)
+ {
+ /* this is also used in END_OF_RECOVERY checkpoints */
+ case XLOG_CHECKPOINT_SHUTDOWN:
+ /*
+					 * Abort all transactions that are still in progress;
+					 * they aren't in progress anymore. Do not abort
+					 * prepared transactions that have been prepared for
+					 * commit.
+					 * FIXME: implement.
+ */
+ break;
+			}
+			break;
+		}
+ default:
+ break;
+ }
+}
+
+static void
+DecodeCommit(ApplyCache* cache, XLogRecordBuffer* buf, TransactionId xid,
+ TransactionId *sub_xids, int nsubxacts)
+{
+ int i;
+
+ for (i = 0; i < nsubxacts; i++)
+ {
+ ApplyCacheCommitChild(cache, xid, *sub_xids, buf->origptr);
+ sub_xids++;
+ }
+
+ /* replay actions of all transaction + subtransactions in order */
+ ApplyCacheCommit(cache, xid, buf->origptr);
+}
+
+static void DecodeInsert(ApplyCache *cache, XLogRecordBuffer* buf)
+{
+ XLogRecord* r = &buf->record;
+ xl_heap_insert *xlrec = (xl_heap_insert *) buf->record_data;
+
+ ApplyCacheChange* change;
+
+	if ((r->xl_info & XLR_BKP_BLOCK_1) &&
+		r->xl_len < (SizeOfHeapInsert + SizeOfHeapHeader))
+ {
+ elog(FATAL, "huh, no tuple data on wal_level = logical?");
+ }
+
+ change = ApplyCacheGetChange(cache);
+ change->action = APPLY_CACHE_CHANGE_INSERT;
+
+ memcpy(&change->relnode, &xlrec->target.node, sizeof(RelFileNode));
+
+ change->newtuple = ApplyCacheGetTupleBuf(cache);
+
+ DecodeXLogTuple((char*)xlrec + SizeOfHeapInsert,
+ r->xl_len - SizeOfHeapInsert,
+ change->newtuple);
+
+ ApplyCacheAddChange(cache, r->xl_xid, buf->origptr, change);
+}
+
+static void
+DecodeUpdate(ApplyCache *cache, XLogRecordBuffer* buf)
+{
+ XLogRecord* r = &buf->record;
+ xl_heap_update *xlrec = (xl_heap_update *) buf->record_data;
+
+
+ ApplyCacheChange* change;
+
+ if ((r->xl_info & XLR_BKP_BLOCK_1 || r->xl_info & XLR_BKP_BLOCK_2) &&
+ (r->xl_len < (SizeOfHeapUpdate + SizeOfHeapHeader)))
+ {
+ elog(FATAL, "huh, no tuple data on wal_level = logical?");
+ }
+
+ change = ApplyCacheGetChange(cache);
+ change->action = APPLY_CACHE_CHANGE_UPDATE;
+
+ memcpy(&change->relnode, &xlrec->target.node, sizeof(RelFileNode));
+
+ /* FIXME: need to save the old tuple as well if we want primary key changes to work. */
+ change->newtuple = ApplyCacheGetTupleBuf(cache);
+
+ DecodeXLogTuple((char*)xlrec + SizeOfHeapUpdate,
+ r->xl_len - SizeOfHeapUpdate,
+ change->newtuple);
+
+ ApplyCacheAddChange(cache, r->xl_xid, buf->origptr, change);
+}
+
+static void DecodeDelete(ApplyCache *cache, XLogRecordBuffer* buf)
+{
+ XLogRecord* r = &buf->record;
+
+ xl_heap_delete *xlrec = (xl_heap_delete *) buf->record_data;
+
+ ApplyCacheChange* change;
+
+ change = ApplyCacheGetChange(cache);
+ change->action = APPLY_CACHE_CHANGE_DELETE;
+
+ memcpy(&change->relnode, &xlrec->target.node, sizeof(RelFileNode));
+
+ if (r->xl_len <= (SizeOfHeapDelete + SizeOfHeapHeader))
+ {
+ elog(FATAL, "huh, no primary key for a delete on wal_level = logical?");
+ }
+
+ change->oldtuple = ApplyCacheGetTupleBuf(cache);
+
+ DecodeXLogTuple((char*)xlrec + SizeOfHeapDelete,
+ r->xl_len - SizeOfHeapDelete,
+ change->oldtuple);
+
+ ApplyCacheAddChange(cache, r->xl_xid, buf->origptr, change);
+}
+
+
+static void
+DecodeNewpage(ApplyCache *cache, XLogRecordBuffer* buf)
+{
+ elog(WARNING, "skipping XLOG_HEAP_NEWPAGE record because we are too dumb");
+}
+
+static void
+DecodeMultiInsert(ApplyCache *cache, XLogRecordBuffer* buf)
+{
+ elog(WARNING, "skipping XLOG_HEAP2_MULTI_INSERT record because we are too dumb");
+}
+
+
+static void DecodeXLogTuple(char* data, Size len, ApplyCacheTupleBuf* tuple)
+{
+ xl_heap_header xlhdr;
+ int datalen = len - SizeOfHeapHeader;
+
+ Assert(datalen >= 0);
+ Assert(datalen <= MaxHeapTupleSize);
+
+ tuple->tuple.t_len = datalen + offsetof(HeapTupleHeaderData, t_bits);
+
+ /* not a disk based tuple */
+ ItemPointerSetInvalid(&tuple->tuple.t_self);
+
+ tuple->tuple.t_tableOid = InvalidOid;
+ tuple->tuple.t_data = &tuple->header;
+
+ /* data is not stored aligned */
+ memcpy((char *) &xlhdr,
+ data,
+ SizeOfHeapHeader);
+
+ memset(&tuple->header, 0, sizeof(HeapTupleHeaderData));
+
+ memcpy((char *) &tuple->header + offsetof(HeapTupleHeaderData, t_bits),
+ data + SizeOfHeapHeader,
+ datalen);
+
+ tuple->header.t_infomask = xlhdr.t_infomask;
+ tuple->header.t_infomask2 = xlhdr.t_infomask2;
+ tuple->header.t_hoff = xlhdr.t_hoff;
+}
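
(Side note for reviewers, not part of the patch: once DecodeXLogTuple() has
reassembled header and data as above, the result is an ordinary in-memory
HeapTuple. Assuming you have a TupleDesc matching the table as it looked when
the WAL record was written - which is exactly what the snapshot builder
further down exists to guarantee - getting the individual Datums back out is
just the usual dance; a minimal sketch:

static void
example_deform_decoded_tuple(ApplyCacheTupleBuf *tuplebuf, TupleDesc tupdesc)
{
	Datum *values = palloc(sizeof(Datum) * tupdesc->natts);
	bool *isnull = palloc(sizeof(bool) * tupdesc->natts);

	/* ->t_data already points at tuplebuf->header, c.f. DecodeXLogTuple() */
	heap_deform_tuple(&tuplebuf->tuple, tupdesc, values, isnull);

	/* values[i]/isnull[i] now hold attribute i + 1, ready for output funcs */
	pfree(values);
	pfree(isnull);
}

decode_change() in logicalfuncs.c below does the equivalent attribute by
attribute via heap_getattr().)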
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
new file mode 100644
index 0000000..035c48a
--- /dev/null
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -0,0 +1,237 @@
+/*-------------------------------------------------------------------------
+ *
+ * logicalfuncs.c
+ *
+ * Support functions for using xlog decoding
+ *
+ * NOTE:
+ * Nothing in here should be used for anything but debugging!
+ *
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/replication/logical/logicalfuncs.c
+ *
+ */
+
+#include "postgres.h"
+
+#include "access/xlogreader.h"
+
+#include "catalog/pg_class.h"
+#include "catalog/pg_type.h"
+
+#include "replication/applycache.h"
+#include "replication/decode.h"
+#include "replication/walreceiver.h"
+/*FIXME: XLogRead*/
+#include "replication/walsender_private.h"
+
+#include "utils/inval.h"
+#include "utils/lsyscache.h"
+#include "utils/syscache.h"
+#include "utils/typcache.h"
+
+
+
+/* We don't need no header */
+extern Datum
+decode_xlog(PG_FUNCTION_ARGS);
+
+
+static bool
+replay_record_is_interesting(XLogReaderState* state, XLogRecord* r)
+{
+ return true;
+}
+
+static void
+replay_writeout_data(XLogReaderState* state, char* data, Size len)
+{
+ return;
+}
+
+static void
+replay_finished_record(XLogReaderState* state, XLogRecordBuffer* buf)
+{
+ ReaderApplyState* apply_state = state->private_data;
+ DecodeRecordIntoApplyCache(apply_state, buf);
+}
+
+static void
+replay_read_page(XLogReaderState* state, char* cur_page, XLogRecPtr startptr)
+{
+ XLogPageHeader page_header;
+
+ Assert((startptr % XLOG_BLCKSZ) == 0);
+
+ /* FIXME: more sensible/efficient implementation */
+ XLogRead(cur_page, startptr, XLOG_BLCKSZ);
+
+ page_header = (XLogPageHeader)cur_page;
+
+ if (page_header->xlp_magic != XLOG_PAGE_MAGIC)
+ {
+ elog(FATAL, "page header magic %x, should be %x at %X/%X", page_header->xlp_magic,
+ XLOG_PAGE_MAGIC, (uint32)(startptr >> 32), (uint32)startptr);
+ }
+}
+
+static
+void decode_begin_txn(ApplyCache* cache, ApplyCacheTXN* txn)
+{
+ elog(WARNING, "BEGIN");
+}
+
+static void
+decode_commit_txn(ApplyCache* cache, ApplyCacheTXN* txn)
+{
+ elog(WARNING, "COMMIT");
+}
+
+/* don't want to include that header */
+extern HeapTuple
+LookupTableByRelFileNode(RelFileNode* r);
+
+
+/* This is just for demonstration, don't ever use this code for anything real! */
+static void
+decode_change(ApplyCache* cache, ApplyCacheTXN* txn, ApplyCacheTXN* subtxn, ApplyCacheChange* change)
+{
+ InvalidateSystemCaches();
+
+ if (change->action == APPLY_CACHE_CHANGE_INSERT)
+ {
+ StringInfoData s;
+ HeapTuple table = LookupTableByRelFileNode(&change->relnode);
+ Form_pg_class class_form;
+ HeapTuple typeTuple;
+ Form_pg_type pt;
+ TupleDesc tupdesc;
+ int i;
+
+ if (!table)
+ {
+ elog(LOG, "couldn't lookup %u", change->relnode.relNode);
+ return;
+ }
+
+ class_form = (Form_pg_class) GETSTRUCT(table);
+
+ initStringInfo(&s);
+
+ tupdesc = lookup_rowtype_tupdesc(class_form->reltype, -1);
+
+ for (i = 0; i < tupdesc->natts; i++)
+ {
+ Oid typid, typoutput;
+ bool typisvarlena;
+ Datum origval, val;
+ char *outputstr;
+ bool isnull;
+ if (tupdesc->attrs[i]->attisdropped)
+ continue;
+ if (tupdesc->attrs[i]->attnum < 0)
+ continue;
+
+ typid = tupdesc->attrs[i]->atttypid;
+
+ typeTuple = SearchSysCache1(TYPEOID, ObjectIdGetDatum(typid));
+ if (!HeapTupleIsValid(typeTuple))
+ elog(ERROR, "cache lookup failed for type %u", typid);
+ pt = (Form_pg_type) GETSTRUCT(typeTuple);
+
+ appendStringInfo(&s, " %s[%s]",
+ NameStr(tupdesc->attrs[i]->attname),
+ NameStr(pt->typname));
+
+ getTypeOutputInfo(typid,
+ &typoutput, &typisvarlena);
+
+ ReleaseSysCache(typeTuple);
+
+ origval = heap_getattr(&change->newtuple->tuple, i + 1, tupdesc, &isnull);
+
+ if (typisvarlena && !isnull)
+ val = PointerGetDatum(PG_DETOAST_DATUM(origval));
+ else
+ val = origval;
+
+ outputstr = OidOutputFunctionCall(typoutput, val);
+
+ appendStringInfo(&s, ":%s", isnull ? "(null)" : outputstr);
+ }
+ ReleaseTupleDesc(tupdesc);
+
+ elog(WARNING, "tuple is:%s", s.data);
+ }
+}
+
+/* test the xlog decoding infrastructure from lsn, to lsn */
+Datum
+decode_xlog(PG_FUNCTION_ARGS)
+{
+ char* start = PG_GETARG_CSTRING(0);
+ char* end = PG_GETARG_CSTRING(1);
+
+ ApplyCache *apply_cache;
+ XLogReaderState *xlogreader_state = XLogReaderAllocate();
+ ReaderApplyState *apply_state;
+
+ XLogRecPtr startpoint;
+ XLogRecPtr endpoint;
+
+ uint32 hi,
+ lo;
+
+ if (sscanf(start, "%X/%X",
+ &hi, &lo) != 2)
+ elog(ERROR, "unparseable xlog pos");
+ startpoint = ((uint64) hi) << 32 | lo;
+
+ elog(LOG, "starting to parse at %X/%X", hi, lo);
+
+ if (sscanf(end, "%X/%X",
+ &hi, &lo) != 2)
+ elog(ERROR, "unparseable xlog pos");
+ endpoint = ((uint64) hi) << 32 | lo;
+
+ elog(LOG, "end parse at %X/%X", hi, lo);
+
+ xlogreader_state->is_record_interesting = replay_record_is_interesting;
+ xlogreader_state->finished_record = replay_finished_record;
+ xlogreader_state->writeout_data = replay_writeout_data;
+ xlogreader_state->read_page = replay_read_page;
+ xlogreader_state->private_data = calloc(1, sizeof(ReaderApplyState));
+
+
+ if (!xlogreader_state->private_data)
+ elog(ERROR, "Could not allocate the ReaderApplyState struct");
+
+ xlogreader_state->startptr = startpoint;
+ xlogreader_state->curptr = startpoint;
+ xlogreader_state->endptr = endpoint;
+
+ apply_state = (ReaderApplyState*)xlogreader_state->private_data;
+
+ /*
+ * allocate an ApplyCache that will apply data using lowlevel calls
+ * without type conversion et al. This requires binary compatibility
+ * between both systems.
+ * XXX: This would be the place to hook different apply methods, like
+ * producing sql and applying it.
+ */
+ apply_cache = ApplyCacheAllocate();
+ apply_cache->begin = decode_begin_txn;
+ apply_cache->apply_change = decode_change;
+ apply_cache->commit = decode_commit_txn;
+
+ apply_state->apply_cache = apply_cache;
+
+ XLogReaderRead(xlogreader_state);
+
+ PG_RETURN_BOOL(true);
+}
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
new file mode 100644
index 0000000..05b176d
--- /dev/null
+++ b/src/backend/replication/logical/snapbuild.c
@@ -0,0 +1,1045 @@
+/*-------------------------------------------------------------------------
+ *
+ * snapbuild.c
+ *
+ * Support for building timetravel snapshots based on the contents of the
+ * wal
+ *
+ * NOTE:
+ * This is complex, in-progress and underdocumented.
+ *
+ * We build snapshots which can *only* be used to read catalog contents by
+ * reading the wal stream. The aim is to provide mvcc and SnapshotNow
+ * snapshots that behave the same as their respective counterparts would
+ * have at the time the XLogRecord was generated. This is done to provide a
+ * reliable environment for decoding those records into every format that
+ * pleases the user of an ApplyCache.
+ *
+ * The percentage of transactions modifying the catalog should be fairly
+ * small, so instead of keeping track of all running transactions and
+ * treating everything inside (xmin, xmax) that's not running as committed,
+ * we do the contrary. That, and other implementation details, necessitates
+ * using our own ->satisfies visibility routine.
+ * In contrast to a classic SnapshotNow, which doesn't need any data, this
+ * module provides something that *behaves* like a SnapshotNow would have
+ * back then (minus some races). Apart from some minor things a SnapshotNow
+ * behaves like an MVCC snapshot taken exactly at the moment the SnapshotNow
+ * was used. Because of that we simply model our timetravel SnapshotNows
+ * as MVCC snapshots.
+ *
+ * To replace the normal handling of SnapshotNow snapshots use the
+ * SetupDecodingSnapshots/RevertFromDecodingSnapshots functions. Be careful
+ * to handle errors properly, otherwise the rest of the session will have
+ * very strange behaviour.
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/replication/logical/snapbuild.c
+ *
+ */
+
+#include "postgres.h"
+
+#include "access/heapam_xlog.h"
+#include "access/rmgr.h"
+#include "access/transam.h"
+#include "access/xlogreader.h"
+#include "access/xact.h"
+
+#include "catalog/catalog.h"
+#include "catalog/pg_control.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_tablespace.h"
+
+#include "miscadmin.h"
+
+#include "replication/applycache.h"
+#include "replication/snapbuild.h"
+
+#include "utils/builtins.h"
+#include "utils/catcache.h"
+#include "utils/inval.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/snapshot.h"
+#include "utils/syscache.h"
+#include "utils/tqual.h"
+
+#include "storage/standby.h"
+
+typedef struct SnapstateTxnEnt
+{
+ TransactionId xid;
+ bool does_timetravel;
+} SnapstateTxnEnt;
+
+
+static bool
+SnapBuildHasCatalogChanges(Snapstate* snapstate, TransactionId xid, RelFileNode* relfilenode);
+
+/* transaction state manipulation functions */
+static void
+SnapBuildEndTxn(Snapstate* snapstate, TransactionId xid);
+
+static void
+SnapBuildAbortTxn(Snapstate* state, TransactionId xid, int nsubxacts,
+ TransactionId* subxacts);
+
+static void
+SnapBuildCommitTxn(Snapstate* snapstate, TransactionId xid, int nsubxacts,
+ TransactionId* subxacts);
+
+/* ->running manipulation */
+static bool
+SnapBuildTxnRunning(Snapstate* snapstate, TransactionId xid);
+
+static void
+SnapBuildReserveRunning(Snapstate *snapstate, Size count);
+
+static void
+SnapBuildSortRunning(Snapstate *snapstate);
+
+static void
+SnapBuildAddRunningTxn(Snapstate *snapstate, TransactionId xid);
+
+
+/* ->committed manipulation */
+static void
+SnapBuildPurgeCommittedTxn(Snapstate* snapstate);
+
+static void
+SnapBuildCommitTxn(Snapstate* snapstate, TransactionId xid, int nsubxacts,
+ TransactionId* subxacts);
+
+
+/* snapshot building/manipulation/distribution functions */
+static void
+SnapBuildDistributeSnapshotNow(Snapstate* snapstate, TransactionId xid);
+
+static Snapshot
+SnapBuildBuildSnapshot(Snapstate *snapstate, TransactionId xid);
+
+
+HeapTuple
+LookupTableByRelFileNode(RelFileNode* relfilenode)
+{
+ Oid spc;
+
+ InvalidateSystemCaches();
+
+ /*
+ * relations in the default tablespace are stored with a reltablespace = 0
+ * for some reason.
+ */
+ spc = relfilenode->spcNode == DEFAULTTABLESPACE_OID ?
+ 0 : relfilenode->spcNode;
+
+ return SearchSysCacheCopy2(RELFILENODE,
+ spc,
+ relfilenode->relNode);
+}
+
+Snapstate*
+AllocateSnapshotBuilder(ApplyCache *applycache)
+{
+ Snapstate *snapstate = malloc(sizeof(Snapstate));
+ HASHCTL hash_ctl;
+
+ snapstate->state = SNAPBUILD_START;
+ snapstate->valid_after = InvalidTransactionId;
+
+ snapstate->nrrunning = 0;
+ snapstate->nrrunning_initial = 0;
+ snapstate->nrrunning_space = 0;
+ snapstate->running = NULL;
+
+ snapstate->nrcommitted = 0;
+ snapstate->nrcommitted_space = 128;
+ snapstate->committed = malloc(snapstate->nrcommitted_space * sizeof(TransactionId));
+ if (!snapstate->committed)
+ elog(ERROR, "could not allocate memory for snapstate->committed");
+
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(TransactionId);
+ hash_ctl.entrysize = sizeof(SnapstateTxnEnt);
+ hash_ctl.hash = tag_hash;
+ hash_ctl.hcxt = TopMemoryContext;
+
+ snapstate->by_txn = hash_create("SnapstateByXid", 1000, &hash_ctl,
+ HASH_ELEM | HASH_FUNCTION);
+
+ elog(LOG, "allocating snapshotbuilder");
+ return snapstate;
+}
+
+void
+FreeSnapshotBuilder(Snapstate* snapstate)
+{
+ hash_destroy(snapstate->by_txn);
+ free(snapstate);
+}
+
+SnapBuildAction
+SnapBuildCallback(ApplyCache *applycache, Snapstate* snapstate, XLogRecordBuffer* buf)
+{
+ XLogRecord* r = &buf->record;
+ uint8 info = r->xl_info & ~XLR_INFO_MASK;
+ TransactionId xid = buf->record.xl_xid;
+
+ /* relfilenode with the table changes have happened in */
+ bool found_changes = false;
+
+ RelFileNode *relfilenode;
+ SnapBuildAction ret = SNAPBUILD_SKIP;
+
+ {
+ StringInfoData s;
+
+ initStringInfo(&s);
+ RmgrTable[r->xl_rmid].rm_desc(&s,
+ r->xl_info,
+ buf->record_data);
+
+ /* don't bother emitting empty description */
+ if (s.len > 0)
+ elog(LOG,"xlog redo %u: %s", xid, s.data);
+ }
+
+ if (snapstate->state <= SNAPBUILD_FULL_SNAPSHOT)
+ {
+ if (r->xl_rmid == RM_STANDBY_ID &&
+ info == XLOG_RUNNING_XACTS)
+ {
+ xl_running_xacts *running = (xl_running_xacts*)buf->record_data;
+
+ if (!running->subxid_overflow)
+ {
+ snapstate->state = SNAPBUILD_FULL_SNAPSHOT;
+
+
+ snapstate->xmin = running->oldestRunningXid;
+ TransactionIdRetreat(snapstate->xmin);
+ snapstate->xmax = running->latestCompletedXid;
+
+ snapstate->nrrunning = running->xcnt;
+ snapstate->nrrunning_initial = running->xcnt;
+ snapstate->nrrunning_space = running->xcnt;
+
+ SnapBuildReserveRunning(snapstate, snapstate->nrrunning_space);
+
+ memcpy(snapstate->running, running->xids,
+ snapstate->nrrunning_initial * sizeof(TransactionId));
+
+ /* sort so we can do a binary search */
+ SnapBuildSortRunning(snapstate);
+
+ if (running->xcnt)
+ {
+ snapstate->xmin_running = snapstate->running[0];
+ snapstate->xmax_running = snapstate->running[running->xcnt - 1];
+ }
+ else
+ {
+ snapstate->xmin_running = InvalidTransactionId;
+ snapstate->xmax_running = InvalidTransactionId;
+ /* FIXME: abort everything considered running */
+ snapstate->state = SNAPBUILD_CONSISTENT;
+ }
+ elog(LOG, "built initial snapshot (via running xacts). Done: %i",
+ snapstate->state == SNAPBUILD_CONSISTENT);
+ }
+ else if (TransactionIdIsValid(snapstate->valid_after))
+ {
+ if (NormalTransactionIdPrecedes(snapstate->valid_after, running->oldestRunningXid))
+ {
+ snapstate->state = SNAPBUILD_FULL_SNAPSHOT;
+ snapstate->xmin_running = InvalidTransactionId;
+ snapstate->xmax_running = InvalidTransactionId;
+ /* FIXME: copy all transactions we have seen starting to ->running */
+ }
+ }
+ else
+ {
+ snapstate->state = SNAPBUILD_INITIAL_POINT;
+
+ snapstate->valid_after = running->nextXid;
+ elog(INFO, "starting to build snapshot, valid_after xid: %u",
+ snapstate->valid_after);
+ }
+ }
+ /* we know nothing has been in progress at this point... */
+ else if (r->xl_rmid == RM_XLOG_ID &&
+ info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ CheckPoint* checkpoint = (CheckPoint*)buf->record_data;
+
+ snapstate->xmin = checkpoint->nextXid;
+ snapstate->xmax = checkpoint->nextXid;
+
+ snapstate->nrrunning = 0;
+ snapstate->nrrunning_initial = 0;
+ snapstate->nrrunning_space = 0;
+ free(snapstate->running);
+ snapstate->running = NULL;
+
+ snapstate->state = SNAPBUILD_CONSISTENT;
+
+ elog(LOG, "built initial snapshot (via shutdown)!!!!");
+ /*FIXME: cleanup state */
+ }
+ else if(r->xl_rmid == RM_XLOG_ID &&
+ info == XLOG_CHECKPOINT_ONLINE)
+ {
+ /* FIXME: Check whether there is a valid state dumped to disk */
+ }
+ }
+
+ if (snapstate->state == SNAPBUILD_START)
+ return SNAPBUILD_SKIP;
+
+ switch (r->xl_rmid)
+ {
+ case RM_XLOG_ID:
+ {
+ switch (info)
+ {
+ case XLOG_CHECKPOINT_SHUTDOWN:
+ {
+ CheckPoint* checkpoint = (CheckPoint*)buf->record_data;
+
+ /*
+ * we know nothing can be running anymore, normal
+ * transaction state is sufficient
+ */
+
+ /* no need to have any transaction state anymore */
+#ifdef NOT_YES
+ for (/*FIXME*/)
+ {
+ SnapBuildAbortTxn(snapstate, xid);
+ }
+#endif
+ snapstate->xmin = checkpoint->nextXid;
+ TransactionIdRetreat(snapstate->xmin);
+ snapstate->xmax = checkpoint->nextXid;
+
+ free(snapstate->running);
+ snapstate->running = NULL;
+ snapstate->nrrunning = 0;
+ snapstate->nrrunning_initial = 0;
+ snapstate->nrrunning_space = 0;
+
+ /*FIXME: cleanup state */
+
+
+ ret = SNAPBUILD_DECODE;
+
+ break;
+ }
+ case XLOG_CHECKPOINT_ONLINE:
+ {
+ /* FIXME: dump state to disk so we can restart from here later */
+ break;
+ }
+ }
+ break;
+ }
+ case RM_STANDBY_ID:
+ {
+ switch (info)
+ {
+ case XLOG_RUNNING_XACTS:
+ {
+ xl_running_xacts *running = (xl_running_xacts*)buf->record_data;
+ snapstate->xmin = running->oldestRunningXid;
+ TransactionIdRetreat(snapstate->xmin);
+ snapstate->xmax = running->latestCompletedXid;
+ TransactionIdAdvance(snapstate->xmax);
+
+ SnapBuildPurgeCommittedTxn(snapstate);
+
+ break;
+ }
+ case XLOG_STANDBY_LOCK:
+ break;
+ }
+ break;
+ }
+ case RM_XACT_ID:
+ {
+ switch (info)
+ {
+ case XLOG_XACT_COMMIT:
+ {
+ xl_xact_commit* xlrec =
+ (xl_xact_commit*)buf->record_data;
+
+ SnapBuildCommitTxn(snapstate, xid, xlrec->nsubxacts,
+ (TransactionId*)&xlrec->xnodes);
+ ret = SNAPBUILD_DECODE;
+
+ break;
+ }
+ case XLOG_XACT_COMMIT_COMPACT:
+ {
+ xl_xact_commit_compact* xlrec =
+ (xl_xact_commit_compact*)buf->record_data;
+
+ SnapBuildCommitTxn(snapstate, xid, xlrec->nsubxacts,
+ xlrec->subxacts);
+ ret = SNAPBUILD_DECODE;
+ break;
+ }
+ case XLOG_XACT_COMMIT_PREPARED:
+ {
+ xl_xact_commit_prepared* xlrec =
+ (xl_xact_commit_prepared*)buf->record_data;
+
+ SnapBuildCommitTxn(snapstate, xid, xlrec->crec.nsubxacts,
+ (TransactionId*)&xlrec->crec.xnodes);
+ ret = SNAPBUILD_DECODE;
+ break;
+ }
+ case XLOG_XACT_ABORT:
+ {
+ xl_xact_abort* xlrec =
+ (xl_xact_abort*)buf->record_data;
+
+ SnapBuildAbortTxn(snapstate, xid, xlrec->nsubxacts,
+ (TransactionId*)&xlrec->xnodes);
+ ret = SNAPBUILD_DECODE;
+ break;
+ }
+ case XLOG_XACT_ABORT_PREPARED:
+ {
+ xl_xact_abort_prepared* xlrec =
+ (xl_xact_abort_prepared*)buf->record_data;
+
+ SnapBuildAbortTxn(snapstate, xid, xlrec->arec.nsubxacts,
+ (TransactionId*)&xlrec->arec.xnodes);
+ ret = SNAPBUILD_DECODE;
+ break;
+ }
+ case XLOG_XACT_ASSIGNMENT:
+ case XLOG_XACT_PREPARE: /* boring? */
+ default:
+ break;
+ }
+ break;
+ }
+ case RM_HEAP_ID:
+ {
+ switch (info & XLOG_HEAP_OPMASK)
+ {
+ /* XXX: this only happens for "irrelevant" changes? Ignore for now */
+ case XLOG_HEAP_INPLACE:
+ {
+ xl_heap_inplace *xlrec = (xl_heap_inplace*)buf->record_data;
+ relfilenode = &xlrec->target.node;
+ found_changes = false; /* <----- LOOK */
+ break;
+ }
+ /*
+ * we only ever read changes, so row level locks aren't
+ * interesting
+ */
+ case XLOG_HEAP_LOCK:
+ break;
+
+ case XLOG_HEAP_INSERT:
+ {
+ xl_heap_insert *xlrec = (xl_heap_insert*)buf->record_data;
+ relfilenode = &xlrec->target.node;
+ found_changes = true;
+ break;
+ }
+ case XLOG_HEAP_UPDATE:
+ case XLOG_HEAP_HOT_UPDATE:
+ {
+ xl_heap_update *xlrec = (xl_heap_update*)buf->record_data;
+ relfilenode = &xlrec->target.node;
+ found_changes = true;
+ break;
+ }
+ case XLOG_HEAP_DELETE:
+ {
+ xl_heap_delete *xlrec = (xl_heap_delete*)buf->record_data;
+ relfilenode = &xlrec->target.node;
+ found_changes = true;
+ break;
+ }
+ default:
+ ;
+ }
+ break;
+ }
+ case RM_HEAP2_ID:
+ {
+ /* some HEAP2 things don't necessarily happen in a transaction? */
+ if (!TransactionIdIsValid(xid))
+ break;
+
+ switch (info)
+ {
+ case XLOG_HEAP2_MULTI_INSERT:
+ {
+ xl_heap_multi_insert *xlrec =
+ (xl_heap_multi_insert*)buf->record_data;
+
+ relfilenode = &xlrec->node;
+
+ found_changes = true;
+
+ /*
+ * we only decode the first tuple as all the following ones
+ * will have the same cmin (and no cmax)
+ */
+ break;
+ }
+ default:
+ ;
+ }
+ }
+ break;
+ }
+
+
+
+ if (found_changes)
+ {
+ /*
+ * we unfortunately cannot access the catalog of other databases, so
+ * don't think about changes in them
+ */
+ if (relfilenode->dbNode != MyDatabaseId)
+ ;
+ /*
+ * we need to keep track of new transactions while we didn't yet know
+ * which ones were already running. Only actual data changes are
+ * relevant, so it's fine to track them here.
+ */
+ else if (snapstate->state < SNAPBUILD_FULL_SNAPSHOT)
+ SnapBuildAddRunningTxn(snapstate, xid);
+ /*
+ * No point in keeping track of changes in transactions that we don't
+ * have enough information about to decode.
+ */
+ else if (snapstate->state < SNAPBUILD_CONSISTENT &&
+ SnapBuildTxnRunning(snapstate, xid))
+ ;
+ else
+ {
+ bool does_timetravel;
+ bool old_tx = ApplyCacheIsXidKnown(applycache, xid);
+ bool found;
+ SnapstateTxnEnt *ent;
+
+ Assert(TransactionIdIsNormal(xid));
+ Assert(!SnapBuildTxnRunning(snapstate, xid));
+
+
+
+ ent = hash_search(snapstate->by_txn,
+ (void *)&xid,
+ HASH_FIND,
+ &found);
+
+ /* FIXME: For now skip transactions with catalog changes entirely */
+ if (ent && ent->does_timetravel)
+ does_timetravel = true;
+ else
+ does_timetravel = SnapBuildHasCatalogChanges(snapstate, xid, relfilenode);
+
+ /*
+ * we don't add catalog changes to the applycache, we could use
+ * them to queue local cache inval messages for catalog tables if
+ * the relmapper would map from relfilenode to relid with correct
+ * visibility rules.
+ */
+ if (!does_timetravel)
+ ret = SNAPBUILD_DECODE;
+
+ elog(LOG, "found changes in xid %u (known: %u), timetravel: %i",
+ xid, old_tx, does_timetravel);
+
+ /*
+ * FIXME: At this point we might have a problem if somebody were to
+ * CLUSTER, REINDEX or similar a system table inside a transaction
+ * that *also* does other catalog modifications, because we can only
+ * build proper snapshots to look at the catalog after we have
+ * reached the commit record - only then do we know the subxids of a
+ * toplevel txid. Because we wouldn't notice the changed system
+ * table relfilenodes we wouldn't see any of those catalog changes.
+ *
+ * So we need to forbid that.
+ */
+
+ if (!old_tx)
+ {
+ /* update global snapshot information */
+ if (does_timetravel)
+ {
+ ent = hash_search(snapstate->by_txn,
+ (void *)&xid,
+ HASH_ENTER,
+ &found);
+
+ elog(LOG, "found catalog change in tx %u without changes, did we know it: %u",
+ xid, found);
+
+ ent->does_timetravel = true;
+
+ }
+ else
+ {
+ elog(LOG, "adding initial snapshot to xid %u", xid);
+ }
+
+ /* add initial snapshot*/
+ {
+ Snapshot snap = SnapBuildBuildSnapshot(snapstate, xid);
+
+ elog(LOG, "adding base snap");
+ ApplyCacheAddBaseSnapshot(applycache, xid,
+ InvalidXLogRecPtr,
+ snap);
+ }
+
+ }
+ /* update already distributed snapshots */
+ else if (does_timetravel && old_tx)
+ {
+ /*
+ * check whether we already know the xid as a catalog modifying
+ * one
+ */
+ SnapstateTxnEnt *ent =
+ hash_search(snapstate->by_txn,
+ (void *)&xid,
+ HASH_ENTER,
+ &found);
+
+ elog(LOG, "found catalog change in tx %u with changes, did we know it: %u",
+ xid, found);
+
+ ent->does_timetravel = true;
+
+ /* FIXME: add a new CommandId to the applycache's ->changes queue */
+ }
+ }
+ }
+
+ return ret;
+}
+
+
+/* Does this relation carry catalog information */
+static bool
+SnapBuildHasCatalogChanges(Snapstate* snapstate, TransactionId xid, RelFileNode* relfilenode)
+{
+ /* FIXME: build snapshot for transaction */
+ HeapTuple table;
+ Form_pg_class class_form;
+
+ Snapshot snap = SnapBuildBuildSnapshot(snapstate, xid);
+
+ if (relfilenode->spcNode == GLOBALTABLESPACE_OID)
+ return true;
+
+ SetupDecodingSnapshots(snap);
+
+ InvalidateSystemCaches();
+
+ table = LookupTableByRelFileNode(relfilenode);
+
+ RevertFromDecodingSnapshots();
+ InvalidateSystemCaches();
+
+ /*
+ * tables in the default tablespace are stored in pg_class with 0 as their
+ * reltablespace
+ */
+ if (!HeapTupleIsValid(table))
+ {
+ if (relfilenode->relNode >= FirstNormalObjectId)
+ {
+ elog(WARNING, "failed pg_class lookup for %u:%u with a oid in >= FirstNormalObjectId",
+ relfilenode->spcNode, relfilenode->relNode);
+ }
+ return true;
+ }
+
+ class_form = (Form_pg_class) GETSTRUCT(table);
+
+ return IsSystemClass(class_form);
+}
+
+/* build a new snapshot, based on currently committed transactions */
+static Snapshot
+SnapBuildBuildSnapshot(Snapstate *snapstate, TransactionId xid)
+{
+ Snapshot snapshot = malloc(sizeof(SnapshotData) +
+ sizeof(TransactionId) * snapstate->nrcommitted +
+ sizeof(TransactionId) * 1 /* toplevel xid */);
+
+ snapshot->satisfies = HeapTupleSatisfiesMVCCDuringDecoding;
+ /*
+ * Contrary to normal snapshots, ->xip gets all transactions we know to
+ * have committed - which thus need to be considered visible under
+ * SnapshotNow semantics - while ->subxip gets the toplevel xid of the
+ * transaction we are decoding, so its own changes are visible.
+ * XXX: Do we want extra fields for those two instead?
+ */
+ snapshot->xmin = snapstate->xmin;
+ snapshot->xmax = snapstate->xmax;
+
+ /* store all transactions to be treated as committed */
+ snapshot->xip = (TransactionId*)((char*)snapshot + sizeof(SnapshotData));
+
+ snapshot->xcnt = snapstate->nrcommitted;
+ memcpy(snapshot->xip, snapstate->committed,
+ snapstate->nrcommitted * sizeof(TransactionId));
+
+ /* sort so we can bsearch() */
+ qsort(snapshot->xip, snapshot->xcnt, sizeof(TransactionId), xidComparator);
+
+ /* store toplevel xid */
+ /*
+ * FIXME: subtransaction handling currently needs to be done in
+ * applycache. Yuck.
+ */
+ snapshot->subxip = (TransactionId*)(
+ (char*)snapshot
+ + sizeof(SnapshotData) /* offset to ->xip's data */
+ + sizeof(TransactionId) * snapstate->nrcommitted /* data */
+ );
+
+ snapshot->subxcnt = 1;
+ snapshot->subxip[0] = xid;
+
+ snapshot->suboverflowed = false;
+ snapshot->takenDuringRecovery = false;
+ snapshot->copied = false;
+ snapshot->curcid = 0;
+ snapshot->active_count = 0;
+ snapshot->regd_count = 0;
+
+ return snapshot;
+}
+
+/* check whether `xid` is currently running */
+static bool
+SnapBuildTxnRunning(Snapstate* snapstate, TransactionId xid)
+{
+ if (snapstate->nrrunning &&
+ TransactionIdFollowsOrEquals(xid, snapstate->xmin_running) &&
+ TransactionIdPrecedesOrEquals(xid, snapstate->xmax_running))
+ {
+ TransactionId* found =
+ bsearch(&xid, snapstate->running, snapstate->nrrunning_initial,
+ sizeof(TransactionId), xidComparator);
+
+ if (found != NULL)
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * add a new SnapshotNow to all transactions we're decoding that are currently
+ * in-progress so they can see new catalog contents.
+ */
+static void
+SnapBuildDistributeSnapshotNow(Snapstate* snapstate, TransactionId xid)
+{
+ /* FIXME: implement */
+}
+
+/*
+ * Keep track of a new catalog changing transaction that has committed
+ */
+static void
+SnapBuildAddCommittedTxn(Snapstate* snapstate, TransactionId xid)
+{
+ if (snapstate->nrcommitted == snapstate->nrcommitted_space)
+ {
+ elog(WARNING, "increasing space for committed transactions");
+
+ snapstate->nrcommitted_space *= 2;
+ snapstate->committed = realloc(snapstate->committed,
+ snapstate->nrcommitted_space * sizeof(TransactionId));
+ if (!snapstate->committed)
+ elog(ERROR, "couldn't enlarge space for committed transactions");
+ }
+ snapstate->committed[snapstate->nrcommitted++] = xid;
+}
+
+/*
+ * Remove all transactions we treat as committed that are smaller than
+ * ->xmin. Those won't ever get checked via the ->committed array anyway.
+ */
+static void
+SnapBuildPurgeCommittedTxn(Snapstate* snapstate)
+{
+ int off;
+ TransactionId *workspace;
+ int surviving_xids = 0;
+
+ /* FIXME: Neater algorithm? */
+ workspace = malloc(snapstate->nrcommitted * sizeof(TransactionId));
+
+ if (!workspace)
+ elog(ERROR, "could not allocate memory for workspace during xmin purging");
+
+ for (off = 0; off < snapstate->nrcommitted; off++)
+ {
+ if (NormalTransactionIdFollows(snapstate->committed[off], snapstate->xmin))
+ workspace[surviving_xids++] = snapstate->committed[off];
+ }
+
+ memcpy(snapstate->committed, workspace,
+ surviving_xids * sizeof(TransactionId));
+
+ snapstate->nrcommitted = surviving_xids;
+ free(workspace);
+}
+
+/*
+ * makes sure we have enough space for at least `count` additional txn's,
+ * reallocates if necessary
+ */
+static void
+SnapBuildReserveRunning(Snapstate *snapstate, Size count)
+{
+ const Size reserve = 100;
+
+ if (snapstate->nrrunning_initial + count < snapstate->nrrunning_space)
+ return;
+
+ if (snapstate->running)
+ {
+ snapstate->nrrunning_space += count + reserve;
+ snapstate->running =
+ realloc(snapstate->running,
+ snapstate->nrrunning_space *
+ sizeof(TransactionId));
+ if (!snapstate->running)
+ elog(ERROR, "could not reallocate ->running");
+ }
+ else
+ {
+ snapstate->nrrunning_space = count + reserve;
+ snapstate->running = malloc(snapstate->nrrunning_space
+ * sizeof(TransactionId));
+ }
+}
+
+/*
+ * To allow binary search in the set of running transactions, sort them with
+ * xidComparator.
+ */
+static void
+SnapBuildSortRunning(Snapstate *snapstate)
+{
+ qsort(snapstate->running, snapstate->nrrunning_initial,
+ sizeof(TransactionId), xidComparator);
+}
+
+/*
+ * Add a transaction to the set of currently running transactions.
+ */
+static void
+SnapBuildAddRunningTxn(Snapstate *snapstate, TransactionId xid)
+{
+ Assert(snapstate->state == SNAPBUILD_INITIAL_POINT &&
+ TransactionIdIsValid(snapstate->valid_after));
+
+ /*
+ * we only need those running txn's if we're switching state due to reaching
+ * the xmin horizon. Transactions from before we reached that are not
+ * interesting.
+ */
+ if (NormalTransactionIdPrecedes(xid, snapstate->valid_after) )
+ return;
+
+ if (SnapBuildTxnRunning(snapstate, xid))
+ return;
+
+ Assert(!TransactionIdPrecedesOrEquals(xid, snapstate->xmin_running));
+
+ if (TransactionIdFollowsOrEquals(xid, snapstate->xmax_running))
+ snapstate->xmax_running = xid;
+
+ SnapBuildReserveRunning(snapstate, 1);
+
+ /* FIXME: inefficient insertion logic, should at least be insertion sort */
+ snapstate->running[snapstate->nrrunning_initial++] = xid;
+ snapstate->nrrunning++;
+ SnapBuildSortRunning(snapstate);
+}
+
+/*
+ * Common logic for SnapBuildAbortTxn and SnapBuildCommitTxn dealing with
+ * keeping track of the amount of running transactions.
+ */
+static void
+SnapBuildEndTxn(Snapstate* snapstate, TransactionId xid)
+{
+ if (snapstate->state == SNAPBUILD_CONSISTENT)
+ return;
+
+ if (SnapBuildTxnRunning(snapstate, xid))
+ {
+ if (!--snapstate->nrrunning)
+ {
+ /*
+ * none of the originally running transactions is running
+ * anymore, so our incrementally built snapshot now is
+ * complete.
+ */
+ elog(LOG, "found consistent point due to SnapBuildEndTxn + running: %u", xid);
+ snapstate->state = SNAPBUILD_CONSISTENT;
+ }
+ }
+}
+
+/* Abort a transaction, throw away all state we kept */
+static void
+SnapBuildAbortTxn(Snapstate* snapstate, TransactionId xid, int nsubxacts, TransactionId* subxacts)
+{
+ bool found;
+ int i;
+
+ for(i = 0; i < nsubxacts; i++)
+ {
+ TransactionId subxid = subxacts[i];
+ SnapBuildEndTxn(snapstate, subxid);
+
+ hash_search(snapstate->by_txn,
+ (void *)&subxid,
+ HASH_REMOVE,
+ &found);
+
+ }
+
+ SnapBuildEndTxn(snapstate, xid);
+
+ hash_search(snapstate->by_txn,
+ (void *)&xid,
+ HASH_REMOVE,
+ &found);
+}
+
+/* Handle everything that needs to be done when a transaction commits */
+static void
+SnapBuildCommitTxn(Snapstate* snapstate, TransactionId xid, int nsubxacts,
+ TransactionId* subxacts)
+{
+ int off;
+ bool found;
+ bool forced_timetravel = false;
+ bool sub_does_timetravel = false;
+ SnapstateTxnEnt *ent;
+
+ /*
+ * If we couldn't observe every change of a transaction because it was
+ * already running at the point we started to observe we have to assume it
+ * made catalog changes.
+ */
+ if (snapstate->state < SNAPBUILD_CONSISTENT && SnapBuildTxnRunning(snapstate, xid))
+ {
+ elog(LOG, "forced to assume catalog changes for xid %u because it was running to early", xid);
+ SnapBuildAddCommittedTxn(snapstate, xid);
+ forced_timetravel = true;
+ }
+
+ for(off = 0; off < nsubxacts; off++)
+ {
+ TransactionId subxid = subxacts[off];
+
+ ent = hash_search(snapstate->by_txn,
+ (void *)&subxid,
+ HASH_FIND,
+ &found);
+
+ if (forced_timetravel)
+ {
+ SnapBuildAddCommittedTxn(snapstate, subxid);
+ }
+ /* add subtransaction to base snapshot, we don't distinguish after that */
+ else if (found && ent->does_timetravel)
+ {
+ sub_does_timetravel = true;
+
+ elog(WARNING, "found subtransaction %u:%u with catalog changes",
+ xid, subxid);
+
+ SnapBuildAddCommittedTxn(snapstate, subxid);
+ }
+
+ /* make sure it's not tracked as running anymore; possibly switch state */
+ SnapBuildEndTxn(snapstate, subxid);
+
+ if (found)
+ {
+ hash_search(snapstate->by_txn,
+ (void *)&xid,
+ HASH_REMOVE,
+ &found);
+ Assert(found);
+ }
+
+ if (NormalTransactionIdFollows(subxid, snapstate->xmax))
+ {
+ snapstate->xmax = subxid;
+ TransactionIdAdvance(snapstate->xmax);
+ }
+ }
+
+ /* make sure it's not tracked as running anymore; possibly switch state */
+ SnapBuildEndTxn(snapstate, xid);
+
+ ent =
+ hash_search(snapstate->by_txn,
+ (void *)&xid,
+ HASH_FIND,
+ &found);
+
+ /* add toplevel transaction to base snapshot */
+ if (found && ent->does_timetravel)
+ {
+ elog(DEBUG1, "found top level transaction %u, with catalog changes !!!!", xid);
+ SnapBuildAddCommittedTxn(snapstate, xid);
+ }
+
+ if ((found && ent->does_timetravel) || sub_does_timetravel || forced_timetravel)
+ {
+ elog(DEBUG1, "found transaction %u, with catalog changes !!!!", xid);
+
+ /* add a new SnapshotNow to all currently running transactions */
+ SnapBuildDistributeSnapshotNow(snapstate, xid);
+ }
+
+ if (found)
+ {
+ /* now we don't need the contents anymore, remove */
+ hash_search(snapstate->by_txn,
+ (void *)&xid,
+ HASH_REMOVE,
+ &found);
+ Assert(found);
+ }
+
+ if (NormalTransactionIdFollows(xid, snapstate->xmax))
+ {
+ snapstate->xmax = xid;
+ TransactionIdAdvance(snapstate->xmax);
+ }
+}
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index b531db5..25af26a 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -65,6 +65,7 @@
#include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "utils/tqual.h"
+#include "utils/builtins.h"
/* Static variables representing various special snapshot semantics */
@@ -73,6 +74,8 @@ SnapshotData SnapshotSelfData = {HeapTupleSatisfiesSelf};
SnapshotData SnapshotAnyData = {HeapTupleSatisfiesAny};
SnapshotData SnapshotToastData = {HeapTupleSatisfiesToast};
+static Snapshot SnapshotNowDecoding;
+
/* local functions */
static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
@@ -1375,3 +1378,161 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
return false;
}
+
+static bool
+TransactionIdInArray(TransactionId xid, TransactionId *xip, Size num)
+{
+ return bsearch(&xid, xip, num,
+ sizeof(TransactionId), xidComparator) != NULL;
+}
+
+
+/*
+ * See the comments for HeapTupleSatisfiesMVCC for the semantics this function
+ * obeys.
+ *
+ * Only usable on tuples from catalog tables!
+ *
+ * We don't need to support HEAP_MOVED_(IN|OFF) for now because we only support
+ * reading catalog pages which couldn't have been created in an older version.
+ *
+ * Basically we record all transactions that are known to have committed at
+ * the time the snapshot was built: everything below xmin is committed,
+ * everything above xmax is in-progress, and everything in between that's
+ * not in our array of committed transactions is treated as in-progress as
+ * well.
+ */
+bool
+HeapTupleSatisfiesMVCCDuringDecoding(HeapTupleHeader tuple, Snapshot snapshot,
+ Buffer buffer)
+{
+ TransactionId xmin = HeapTupleHeaderGetXmin(tuple);
+ TransactionId xmax = HeapTupleHeaderGetXmax(tuple);
+
+ /*
+ * FIXME: The not yet existing decoding infrastructure will need to force
+ * the xmin to stay lower than what they are currently decoding.
+ */
+ bool fixme_xmin_horizon = false;
+
+ if (fixme_xmin_horizon && tuple->t_infomask & HEAP_XMIN_INVALID)
+ {
+ return false;
+ }
+ /* normal transaction state counts */
+ else if (TransactionIdPrecedes(xmin, snapshot->xmin))
+ {
+ if (!TransactionIdDidCommit(xmin))
+ return false;
+ }
+ /* beyond our xmax horizon, i.e. invisible */
+ else if (TransactionIdFollows(xmin, snapshot->xmax))
+ {
+ return false;
+ }
+ /* check if its one of our txids, toplevel is also in there */
+ else if (TransactionIdInArray(xmin, snapshot->subxip, snapshot->subxcnt))
+ {
+ CommandId cmin = HeapTupleHeaderGetRawCommandId(tuple);
+ /* no support for that yet */
+ if (tuple->t_infomask & HEAP_COMBOCID){
+ elog(WARNING, "combocids not yet supported");
+ return false;
+ }
+ if (cmin >= snapshot->curcid)
+ return false; /* inserted after scan started */
+ }
+ /* check if we know the transaction has committed */
+ else if(TransactionIdInArray(xmin, snapshot->xip, snapshot->xcnt))
+ {
+ }
+ else
+ {
+ return false;
+ }
+
+ /* at this point we know xmin is visible */
+
+ /* why should those be in catalog tables? */
+ Assert(!(tuple->t_infomask & HEAP_XMAX_IS_MULTI));
+
+ if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid or aborted */
+ return true;
+
+ if (tuple->t_infomask & HEAP_IS_LOCKED)
+ return true;
+
+ /* we cannot possibly see the deleting transaction */
+ if (TransactionIdFollows(xmax, snapshot->xmax))
+ {
+ return true;
+ }
+ /* normal transaction state is valid */
+ else if (TransactionIdPrecedes(xmax, snapshot->xmin))
+ {
+ return !TransactionIdDidCommit(xmax);
+ }
+ /* check if its one of our txids, toplevel is also in there */
+ else if (TransactionIdInArray(xmax, snapshot->subxip, snapshot->subxcnt))
+ {
+ CommandId cmax = HeapTupleHeaderGetRawCommandId(tuple);
+ /* no support for that yet */
+ if (tuple->t_infomask & HEAP_COMBOCID){
+ elog(WARNING, "combocids not yet supported");
+ return true;
+ }
+
+ if (cmax >= snapshot->curcid)
+ return true; /* deleted after scan started */
+ else
+ return false; /* deleted before scan started */
+ }
+ /* do we know that the deleting txn is valid? */
+ else if (TransactionIdInArray(xmax, snapshot->xip, snapshot->xcnt))
+ {
+ return false;
+ }
+ else
+ {
+ return true;
+ }
+}
+
+static bool
+FailsSatisfies(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
+{
+ elog(ERROR, "should not be called after SetupDecodingSnapshots!");
+ return false;
+}
+
+static bool
+RedirectSatisfiesNow(HeapTupleHeader tuple, Snapshot snapshot, Buffer buffer)
+{
+ Assert(SnapshotNowDecoding != NULL);
+ return HeapTupleSatisfiesMVCCDuringDecoding(tuple, SnapshotNowDecoding,
+ buffer);
+}
+
+void
+SetupDecodingSnapshots(Snapshot snapshot_now)
+{
+ SnapshotNowData.satisfies = RedirectSatisfiesNow;
+ SnapshotSelfData.satisfies = FailsSatisfies;
+ SnapshotAnyData.satisfies = FailsSatisfies;
+ SnapshotToastData.satisfies = FailsSatisfies;
+
+ SnapshotNowDecoding = snapshot_now;
+}
+
+
+void
+RevertFromDecodingSnapshots(void)
+{
+ SnapshotNowDecoding = NULL;
+
+ SnapshotNowData.satisfies = HeapTupleSatisfiesNow;
+ SnapshotSelfData.satisfies = HeapTupleSatisfiesSelf;
+ SnapshotAnyData.satisfies = HeapTupleSatisfiesAny;
+ SnapshotToastData.satisfies = HeapTupleSatisfiesToast;
+
+}
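
(Not part of the patch - to make the error handling caveat from the
snapbuild.c header concrete, the usage pattern I have in mind looks roughly
like this; `snap' has to come from SnapBuildBuildSnapshot():

	SetupDecodingSnapshots(snap);
	InvalidateSystemCaches();

	PG_TRY();
	{
		/* syscache/catalog lookups here see the timetravel state */
	}
	PG_CATCH();
	{
		/* otherwise the rest of the session keeps timetravelling */
		RevertFromDecodingSnapshots();
		InvalidateSystemCaches();
		PG_RE_THROW();
	}
	PG_END_TRY();

	RevertFromDecodingSnapshots();
	InvalidateSystemCaches();

SnapBuildHasCatalogChanges() above currently skips the PG_TRY dance; that's
one of the error handling bits that still needs work.)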
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 228f6a1..915b2cd 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -63,6 +63,11 @@
(AssertMacro(TransactionIdIsNormal(id1) && TransactionIdIsNormal(id2)), \
(int32) ((id1) - (id2)) < 0)
+/* compare two XIDs already known to be normal; this is a macro for speed */
+#define NormalTransactionIdFollows(id1, id2) \
+ (AssertMacro(TransactionIdIsNormal(id1) && TransactionIdIsNormal(id2)), \
+ (int32) ((id1) - (id2)) > 0)
+
/* ----------
* Object ID (OID) zero is InvalidOid.
*
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index d88248a..b5b886b 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4655,6 +4655,9 @@ DESCR("SP-GiST support for suffix tree over text");
DATA(insert OID = 4031 ( spg_text_leaf_consistent PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "2281 2281" _null_ _null_ _null_ _null_ spg_text_leaf_consistent _null_ _null_ _null_ ));
DESCR("SP-GiST support for suffix tree over text");
+DATA(insert OID = 4033 ( decode_xlog PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "2275 2275" _null_ _null_ _null_ _null_ decode_xlog _null_ _null_ _null_ ));
+DESCR("decode xlog");
+
DATA(insert OID = 3469 ( spg_range_quad_config PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2278 "2281 2281" _null_ _null_ _null_ _null_ spg_range_quad_config _null_ _null_ _null_ ));
DESCR("SP-GiST support for quad tree over range");
DATA(insert OID = 3470 ( spg_range_quad_choose PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2278 "2281 2281" _null_ _null_ _null_ _null_ spg_range_quad_choose _null_ _null_ _null_ ));
diff --git a/src/include/replication/applycache.h b/src/include/replication/applycache.h
new file mode 100644
index 0000000..f101eeb
--- /dev/null
+++ b/src/include/replication/applycache.h
@@ -0,0 +1,239 @@
+/*
+ * applycache.h
+ *
+ * PostgreSQL logical replay "cache" management
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/replication/applycache.h
+ */
+#ifndef APPLYCACHE_H
+#define APPLYCACHE_H
+
+#include "access/htup_details.h"
+#include "utils/hsearch.h"
+#include "utils/ilist.h"
+#include "utils/snapshot.h"
+
+typedef struct ApplyCache ApplyCache;
+
+enum ApplyCacheChangeType
+{
+ APPLY_CACHE_CHANGE_INSERT,
+ APPLY_CACHE_CHANGE_UPDATE,
+ APPLY_CACHE_CHANGE_DELETE,
+ /*
+ * for efficiency and simplicity reasons we keep those in the same list,
+ * that's somewhat annoying because switch()es warn if those aren't
+ * handled... Make those private values?
+ */
+ APPLY_CACHE_CHANGE_SNAPSHOT,
+ APPLY_CACHE_CHANGE_COMMAND_ID
+};
+
+typedef struct ApplyCacheTupleBuf
+{
+ /* position in preallocated list */
+ ilist_s_node node;
+
+ HeapTupleData tuple;
+ HeapTupleHeaderData header;
+ char data[MaxHeapTupleSize];
+} ApplyCacheTupleBuf;
+
+typedef struct ApplyCacheChange
+{
+ XLogRecPtr lsn;
+ enum ApplyCacheChangeType action;
+
+ RelFileNode relnode;
+
+ union {
+ ApplyCacheTupleBuf* newtuple;
+ Snapshot snapshot;
+ CommandId command_id;
+ };
+ ApplyCacheTupleBuf* oldtuple;
+
+
+ HeapTuple table;
+
+ /*
+ * While in use this is how a change is linked into a transaction's list
+ * of changes; otherwise it's the position in the preallocated list.
+ */
+ ilist_d_node node;
+} ApplyCacheChange;
+
+typedef struct ApplyCacheTXN
+{
+ TransactionId xid;
+
+ XLogRecPtr lsn;
+
+ /*
+ * How many ApplyCacheChange's do we have in this txn.
+ *
+ * Subtransactions are *not* included.
+ */
+ Size nentries;
+
+ /*
+ * How many of the above entries are stored in memory in contrast to being
+ * spilled to disk.
+ */
+ Size nentries_mem;
+
+ /*
+ * List of actual changes
+ */
+ ilist_d_head changes;
+
+ /*
+ * non-hierarchical list of subtransactions that are *not* aborted
+ */
+ ilist_d_head subtxns;
+
+ /*
+ * our position in a list of subtransactions while the TXN is in
+ * use. Otherwise it's the position in the list of preallocated
+ * transactions.
+ */
+ ilist_d_node node;
+
+ /*
+ * List of (lsn, command_id).
+ *
+ * Every time a catalog change happens this list gets appended with the
+ * current CommandId. This is used to construct proper Snapshots for
+ * decoding.
+ */
+ ilist_d_head commandids;
+
+ /*
+ * List of (lsn, Snapshot) pairs.
+ *
+ * The first record always is the (InvalidXLogRecPtr, SnapshotAtStart)
+ * pair. Every time *another* transaction commits this gets appended with a
+ * new Snapshot that has enough information to make SnapshotNow lookups.
+ */
+ ilist_d_head snapshots;
+} ApplyCacheTXN;
+
+
+/* XXX: we're currently passing the originating subtxn. Not sure that's necessary */
+typedef void (*ApplyCacheApplyChangeCB)(ApplyCache* cache, ApplyCacheTXN* txn, ApplyCacheTXN* subtxn, ApplyCacheChange* change);
+typedef void (*ApplyCacheBeginCB)(ApplyCache* cache, ApplyCacheTXN* txn);
+typedef void (*ApplyCacheCommitCB)(ApplyCache* cache, ApplyCacheTXN* txn);
+
+/*
+ * The max number of concurrent top-level transactions, or transactions where
+ * we don't know whether they are top-level, can be calculated as:
+ * (max_connections + max_prepared_xacts + ?) * PGPROC_MAX_CACHED_SUBXIDS
+ */
+struct ApplyCache
+{
+ /*
+ * Should snapshots for decoding be collected? If many catalog changes
+ * happen this can become quite expensive.
+ */
+ bool build_snapshots;
+
+ TransactionId last_txn;
+ ApplyCacheTXN *last_txn_cache;
+ HTAB *by_txn;
+
+ ApplyCacheBeginCB begin;
+ ApplyCacheApplyChangeCB apply_change;
+ ApplyCacheCommitCB commit;
+
+ void* private_data;
+
+ MemoryContext context;
+
+ /*
+ * we don't want to repeatedly (de-)allocate those structs, so cache them
+ * for reuse.
+ */
+ ilist_d_head cached_transactions;
+ size_t nr_cached_transactions;
+
+ ilist_d_head cached_changes;
+ size_t nr_cached_changes;
+
+ ilist_s_head cached_tuplebufs;
+ size_t nr_cached_tuplebufs;
+};
+
+
+ApplyCache*
+ApplyCacheAllocate(void);
+
+void
+ApplyCacheFree(ApplyCache*);
+
+ApplyCacheTupleBuf*
+ApplyCacheGetTupleBuf(ApplyCache*);
+
+void
+ApplyCacheReturnTupleBuf(ApplyCache* cache, ApplyCacheTupleBuf* tuple);
+
+/*
+ * Returns a (potentially preallocated) change struct. Its lifetime is managed
+ * by the applycache module.
+ *
+ * If not added to a transaction with ApplyCacheAddChange it needs to be
+ * returned via ApplyCacheReturnChange
+ *
+ * FIXME: better name
+ */
+ApplyCacheChange*
+ApplyCacheGetChange(ApplyCache*);
+
+/*
+ * Return an unused ApplyCacheChange struct
+ */
+void
+ApplyCacheReturnChange(ApplyCache*, ApplyCacheChange*);
+
+
+/*
+ * record the transaction as in-progress if not already done, add the current
+ * change.
+ *
+ * We have a one-entry cache for looking up the current ApplyCacheTXN so we
+ * don't need to do a full hash-lookup if the same xid is used
+ * sequentially, which is a rather frequent access pattern.
+ */
+void
+ApplyCacheAddChange(ApplyCache*, TransactionId, XLogRecPtr lsn, ApplyCacheChange*);
+
+/*
+ * Replay a transaction's accumulated changes in commit order, invoking the
+ * begin, apply_change and commit callbacks as appropriate.
+ */
+void
+ApplyCacheCommit(ApplyCache*, TransactionId, XLogRecPtr lsn);
+
+void
+ApplyCacheCommitChild(ApplyCache*, TransactionId, TransactionId, XLogRecPtr lsn);
+
+void
+ApplyCacheAbort(ApplyCache*, TransactionId, XLogRecPtr lsn);
+
+
+/*
+ * if lsn == InvalidXLogRecPtr this is the first snap for the transaction
+ */
+void
+ApplyCacheAddBaseSnapshot(ApplyCache*, TransactionId, XLogRecPtr lsn, Snapshot snap);
+
+/*
+ * Will only be called for command ids > 1
+ */
+void
+ApplyCacheAddNewCommandId(ApplyCache*, TransactionId, XLogRecPtr lsn, CommandId cid);
+
+bool
+ApplyCacheIsXidKnown(ApplyCache* cache, TransactionId xid);
+#endif
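
(Illustration only, not in the patch: stripped to its core, a consumer of
this API just fills in the three callbacks and hands the cache to the
decoding machinery, like logicalfuncs.c does above. A hypothetical minimal
consumer:

static void
demo_begin(ApplyCache *cache, ApplyCacheTXN *txn)
{
	elog(LOG, "replay of xid %u starting", txn->xid);
}

static void
demo_change(ApplyCache *cache, ApplyCacheTXN *txn,
			ApplyCacheTXN *subtxn, ApplyCacheChange *change)
{
	if (change->action == APPLY_CACHE_CHANGE_INSERT)
		elog(LOG, "insert into relfilenode %u", change->relnode.relNode);
}

static void
demo_commit(ApplyCache *cache, ApplyCacheTXN *txn)
{
	elog(LOG, "replay of xid %u done", txn->xid);
}

static ApplyCache *
demo_setup(void)
{
	ApplyCache *cache = ApplyCacheAllocate();

	cache->begin = demo_begin;
	cache->apply_change = demo_change;
	cache->commit = demo_commit;
	return cache;
}

demo_setup() is of course made up - decode_xlog() above shows the real
wiring.)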
diff --git a/src/include/replication/decode.h b/src/include/replication/decode.h
new file mode 100644
index 0000000..86312d1
--- /dev/null
+++ b/src/include/replication/decode.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ * decode.h
+ * PostgreSQL WAL to logical transformation
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef DECODE_H
+#define DECODE_H
+
+#include "access/xlogreader.h"
+#include "replication/applycache.h"
+
+struct Snapstate;
+
+typedef struct ReaderApplyState
+{
+ ApplyCache *apply_cache;
+ struct Snapstate *snapstate;
+} ReaderApplyState;
+
+void DecodeRecordIntoApplyCache(ReaderApplyState* state, XLogRecordBuffer* buf);
+
+#endif
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
new file mode 100644
index 0000000..ed92e75
--- /dev/null
+++ b/src/include/replication/snapbuild.h
@@ -0,0 +1,119 @@
+/*-------------------------------------------------------------------------
+ *
+ * snapbuild.h
+ * Exports from replication/logical/snapbuild.c.
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * src/include/replication/snapbuild.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SNAPBUILD_H
+#define SNAPBUILD_H
+
+#include "replication/applycache.h"
+
+#include "utils/hsearch.h"
+#include "utils/snapshot.h"
+#include "access/htup.h"
+
+typedef enum
+{
+ SNAPBUILD_START,
+ /*
+ * found initial visibility information.
+ *
+ * That's either: XLOG_RUNNING_XACTS or XLOG_CHECKPOINT_SHUTDOWN
+ */
+ SNAPBUILD_INITIAL_POINT,
+ /*
+ * We have collected enough information to decode tuples in transactions
+ * that started after this.
+ *
+ * Once we reach this we start to collect changes. We cannot apply them
+ * yet because they might be based on transactions that were still running
+ * when we reached this point.
+ */
+ SNAPBUILD_FULL_SNAPSHOT,
+ /*
+ * Found a point after reaching SNAPBUILD_FULL_SNAPSHOT where all
+ * transactions that were running at that point have finished. Until we
+ * reach that we hold off calling any commit callbacks.
+ */
+ SNAPBUILD_CONSISTENT
+} SnapBuildState;
+
+typedef enum
+{
+ SNAPBUILD_SKIP,
+ SNAPBUILD_DECODE
+} SnapBuildAction;
+
+typedef struct Snapstate
+{
+ SnapBuildState state;
+
+ /* all transactions smaller than this have committed/aborted */
+ TransactionId xmin;
+
+ /* all transactions bigger than this are uncommitted */
+ TransactionId xmax;
+
+ /*
+ * All transactions in this window have to be checked via the running
+ * array. This will only be used initially till we are past xmax_running.
+ *
+ * Note that we initially treat already running transactions as having
+ * catalog modifications because we don't have enough information about
+ * them to properly judge that.
+ */
+ TransactionId xmin_running;
+ TransactionId xmax_running;
+
+ /* sorted array of running transactions, can be searched with bsearch() */
+ TransactionId* running;
+ /* how many running transactions remain */
+ size_t nrrunning;
+ /* how much free space do we have to add more running txn's */
+ size_t nrrunning_space;
+ /*
+ * we need to keep track of the number of tracked transactions separately
+ * from nrrunning_space, as nrrunning_initial gives the range of valid xids
+ * in the array so bsearch() can work.
+ */
+ size_t nrrunning_initial;
+
+ TransactionId valid_after;
+
+ /*
+ * Running (sub-)transactions with catalog changes. This will be used to
+ * fill the committed array with a transaction's xid and all its subxids
+ * at commit.
+ */
+ HTAB *by_txn;
+
+ /*
+ * Transactions which could have catalog changes that committed between
+ * xmin and xmax; contains all catalog modifying txns.
+ */
+ size_t nrcommitted;
+ size_t nrcommitted_space;
+ TransactionId* committed;
+} Snapstate;
+
+extern Snapstate*
+AllocateSnapshotBuilder(ApplyCache *cache);
+
+extern void
+FreeSnapshotBuilder(Snapstate *cache);
+
+extern SnapBuildAction
+SnapBuildCallback(ApplyCache *cache, Snapstate* snapstate, XLogRecordBuffer* buf);
+
+extern HeapTuple
+LookupTableByRelFileNode(RelFileNode* r);
+
+#endif /* SNAPBUILD_H */
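
(Not part of the patch - purely to summarize the state machine above, a
hypothetical helper mapping each state to what it means for decoding:

static const char *
SnapBuildStateDesc(SnapBuildState state)
{
	switch (state)
	{
		case SNAPBUILD_START:
			return "no visibility information found yet";
		case SNAPBUILD_INITIAL_POINT:
			return "tracking new xacts, cannot decode anything yet";
		case SNAPBUILD_FULL_SNAPSHOT:
			return "can decode xacts that started after this point";
		case SNAPBUILD_CONSISTENT:
			return "all initially running xacts finished, decoding everything";
	}
	return "unknown";
}

SnapBuildCallback() in snapbuild.c is what drives these transitions.)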
diff --git a/src/include/utils/tqual.h b/src/include/utils/tqual.h
index ff74f86..6c9b261 100644
--- a/src/include/utils/tqual.h
+++ b/src/include/utils/tqual.h
@@ -39,7 +39,8 @@ extern PGDLLIMPORT SnapshotData SnapshotToastData;
/* This macro encodes the knowledge of which snapshots are MVCC-safe */
#define IsMVCCSnapshot(snapshot) \
- ((snapshot)->satisfies == HeapTupleSatisfiesMVCC)
+ ((snapshot)->satisfies == HeapTupleSatisfiesMVCC || \
+ (snapshot)->satisfies == HeapTupleSatisfiesMVCCDuringDecoding)
/*
* HeapTupleSatisfiesVisibility
@@ -89,4 +90,22 @@ extern bool HeapTupleIsSurelyDead(HeapTupleHeader tuple,
extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid);
+/*
+ * Special "satisfies" routines used during decoding xlog from a different
+ * point of lsn. Also used for timetravel SnapshotNow's.
+ */
+extern bool HeapTupleSatisfiesMVCCDuringDecoding(HeapTupleHeader tuple,
+ Snapshot snapshot, Buffer buffer);
+
+/*
+ * install the 'snapshot_now' snapshot as a timetravelling snapshot replacing
+ * the normal SnapshotNow behaviour. This snapshot needs to have been created
+ * by snapbuild.c otherwise you will see crashes!
+ *
+ * FIXME: We need something resembling the real SnapshotNow to handle things
+ * like enum lookups from indices correctly.
+ */
+extern void SetupDecodingSnapshots(Snapshot snapshot_now);
+extern void RevertFromDecodingSnapshots(void);
+
#endif /* TQUAL_H */
Hi,
A last note:
A git tree of this is at
git://git.postgresql.org/git/users/andresfreund/postgres.git branch xlog-
decoding-rebasing-cf2
checkout with:
git clone --branch xlog-decoding-rebasing-cf2
git://git.postgresql.org/git/users/andresfreund/postgres.git
Webview:
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-
decoding-rebasing-cf2
That branch will be regularly rebased onto a new master, with fixes/new
features, and a pgindent run over the new files...
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Saturday, September 15, 2012 03:14:32 AM Andres Freund wrote:
That branch will be regularly rebased onto a new master, with fixes/new
features, and a pgindent run over the new files...
I fixed up the formatting of the new stuff (xlogreader, ilist are submitted
separately, no point in doing anything there).
Pushed to the repo mentioned upthread.
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Now that I proposed a new syscache upthread it's easily possible to provide
pg_relation_by_filenode, which I have wished for multiple times in the past
when looking at filesystem activity and wondering which table does what. You
can sort of get the same result via
SELECT oid FROM (
SELECT oid, pg_relation_filenode(oid::regclass) filenode
FROM pg_class WHERE relkind != 'v'
) map
WHERE map.filenode = ...;
but that's neither efficient nor obvious.
So, two patches to do this:
Did others need this in the past? I can live with the 2nd patch living in a
private extension somewhere. The first one would also be useful for some
error/debug messages during decoding...
Greetings,
Andres
---
src/backend/utils/cache/relmapper.c | 53 +++++++++++++++++++++++++++++++++++++
src/include/utils/relmapper.h | 2 ++
2 files changed, 55 insertions(+)
Attachments:
0001-Add-a-new-relmapper.c-function-RelationMapFilenodeTo.patch
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 6f21495..771f34d 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -180,6 +180,59 @@ RelationMapOidToFilenode(Oid relationId, bool shared)
return InvalidOid;
}
+/* RelationMapFilenodeToOid
+ *
+ * Do the reverse of the normal direction of mapping done in
+ * RelationMapOidToFilenode.
+ *
+ * This is not supposed to be used during normal running but rather for
+ * information purposes when looking at the filesystem or the xlog.
+ *
+ * Returns InvalidOid if the OID is not known, which can easily happen if the
+ * filenode is not of a relation that is nailed or shared, or if it simply
+ * doesn't exist anywhere.
+ */
+Oid
+RelationMapFilenodeToOid(Oid filenode, bool shared)
+{
+ const RelMapFile *map;
+ int32 i;
+
+ /* If there are active updates, believe those over the main maps */
+ if (shared)
+ {
+ map = &active_shared_updates;
+ for (i = 0; i < map->num_mappings; i++)
+ {
+ if (filenode == map->mappings[i].mapfilenode)
+ return map->mappings[i].mapoid;
+ }
+ map = &shared_map;
+ for (i = 0; i < map->num_mappings; i++)
+ {
+ if (filenode == map->mappings[i].mapfilenode)
+ return map->mappings[i].mapoid;
+ }
+ }
+ else
+ {
+ map = &active_local_updates;
+ for (i = 0; i < map->num_mappings; i++)
+ {
+ if (filenode == map->mappings[i].mapfilenode)
+ return map->mappings[i].mapoid;
+ }
+ map = &local_map;
+ for (i = 0; i < map->num_mappings; i++)
+ {
+ if (filenode == map->mappings[i].mapfilenode)
+ return map->mappings[i].mapoid;
+ }
+ }
+
+ return InvalidOid;
+}
+
/*
* RelationMapUpdateMap
*
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 111a05c..4e56508 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -36,6 +36,8 @@ typedef struct xl_relmap_update
extern Oid RelationMapOidToFilenode(Oid relationId, bool shared);
+extern Oid RelationMapFilenodeToOid(Oid filenode, bool shared);
+
extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared,
bool immediate);
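
(Sketch only, not part of the patch: the intended call pattern for the new
function, roughly mirroring what pg_relation_by_filenode() in the second
patch does once the pg_class lookup came up empty. GLOBALTABLESPACE_OID is
from catalog/pg_tablespace.h:

static Oid
example_resolve_mapped_filenode(Oid tablespace, Oid filenode)
{
	if (tablespace == GLOBALTABLESPACE_OID)
		return RelationMapFilenodeToOid(filenode, true);	/* shared map */
	else
		return RelationMapFilenodeToOid(filenode, false);	/* local map */
}

Returns InvalidOid if the filenode doesn't belong to a mapped relation.)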
This requires the previously added RELFILENODE syscache.
---
doc/src/sgml/func.sgml | 23 ++++++++++++-
src/backend/utils/adt/dbsize.c | 78 ++++++++++++++++++++++++++++++++++++++++++
src/include/catalog/pg_proc.h | 2 ++
src/include/utils/builtins.h | 1 +
4 files changed, 103 insertions(+), 1 deletion(-)
Attachments:
0002-Add-a-new-function-pg_relation_by_filenode-to-lookup.patch
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f8f63d8..708da35 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15170,7 +15170,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
<para>
The functions shown in <xref linkend="functions-admin-dblocation"> assist
- in identifying the specific disk files associated with database objects.
+ in identifying the specific disk files associated with database objects, or the reverse.
</para>
<indexterm>
@@ -15179,6 +15179,9 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
<indexterm>
<primary>pg_relation_filepath</primary>
</indexterm>
+ <indexterm>
+ <primary>pg_relation_by_filenode</primary>
+ </indexterm>
<table id="functions-admin-dblocation">
<title>Database Object Location Functions</title>
@@ -15207,6 +15210,15 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
File path name of the specified relation
</entry>
</row>
+ <row>
+ <entry>
+ <literal><function>pg_relation_by_filenode(<parameter>tablespace</parameter> <type>oid</type>, <parameter>filenode</parameter> <type>oid</type>)</function></literal>
+ </entry>
+ <entry><type>regclass</type></entry>
+ <entry>
+ Find the relation associated with a filenode
+ </entry>
+ </row>
</tbody>
</tgroup>
</table>
@@ -15230,6 +15242,15 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
the relation.
</para>
+ <para>
+ <function>pg_relation_by_filenode</> is the reverse of
+ <function>pg_relation_filenode</>. Given a <quote>tablespace</> OID and
+ a <quote>filenode</>, it returns the associated relation. For the
+ database's default tablespace, 0 can be passed instead. See the
+ documentation of <function>pg_relation_filenode</> for an explanation of
+ why this cannot always easily be answered by querying <structname>pg_class</>.
+ </para>
+
</sect2>
<sect2 id="functions-admin-genfile">
diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index cd23334..841a445 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -741,6 +741,84 @@ pg_relation_filenode(PG_FUNCTION_ARGS)
}
/*
+ * Get the relation via (reltablespace, relfilenode)
+ *
+ * This is expected to be used when somebody wants to match an individual file
+ * on the filesystem back to its table. That's not trivially possible via
+ * pg_class because that doesn't contain the relfilenodes of shared and nailed
+ * tables.
+ *
+ * We don't fail but return NULL if we cannot find a mapping.
+ *
+ * Callers that don't know DEFAULTTABLESPACE_OID can pass 0 instead.
+ */
+Datum
+pg_relation_by_filenode(PG_FUNCTION_ARGS)
+{
+ Oid reltablespace = PG_GETARG_OID(0);
+ Oid relfilenode = PG_GETARG_OID(1);
+ Oid lookup_tablespace = reltablespace;
+ Oid result = InvalidOid;
+ HeapTuple tuple;
+
+ if (reltablespace == 0)
+ reltablespace = DEFAULTTABLESPACE_OID;
+
+ /* pg_class stores 0 instead of DEFAULTTABLESPACE_OID */
+ if (reltablespace == DEFAULTTABLESPACE_OID)
+ lookup_tablespace = 0;
+
+ tuple = SearchSysCache2(RELFILENODE,
+ lookup_tablespace,
+ relfilenode);
+
+ /* found it in the system catalog, so it cannot be a shared/nailed table */
+ if (HeapTupleIsValid(tuple))
+ {
+ result = HeapTupleHeaderGetOid(tuple->t_data);
+ ReleaseSysCache(tuple);
+ }
+ else
+ {
+ if (reltablespace == GLOBALTABLESPACE_OID)
+ {
+ result = RelationMapFilenodeToOid(relfilenode, true);
+ }
+ else
+ {
+ Form_pg_class relform;
+
+ result = RelationMapFilenodeToOid(relfilenode, false);
+
+ if (result != InvalidOid)
+ {
+ /* check that we found the correct relation */
+ tuple = SearchSysCache1(RELOID,
+ result);
+
+ if (!HeapTupleIsValid(tuple))
+ {
+ elog(ERROR, "could not refind previously looked up relation with oid %u",
+ result);
+ }
+
+ relform = (Form_pg_class) GETSTRUCT(tuple);
+
+ if (relform->reltablespace != reltablespace)
+ result = InvalidOid;
+
+ ReleaseSysCache(tuple);
+ }
+ }
+ }
+
+ if (!OidIsValid(result))
+ PG_RETURN_NULL();
+ else
+ PG_RETURN_OID(result);
+}
+
+/*
* Get the pathname (relative to $PGDATA) of a relation
*
* See comments for pg_relation_filenode.
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index b5b886b..c8233cd 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3430,6 +3430,8 @@ DATA(insert OID = 2998 ( pg_indexes_size PGNSP PGUID 12 1 0 0 0 f f f f t f v 1
DESCR("disk space usage for all indexes attached to the specified table");
DATA(insert OID = 2999 ( pg_relation_filenode PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 26 "2205" _null_ _null_ _null_ _null_ pg_relation_filenode _null_ _null_ _null_ ));
DESCR("filenode identifier of relation");
+DATA(insert OID = 3170 ( pg_relation_by_filenode PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 2205 "26 26" _null_ _null_ _null_ _null_ pg_relation_by_filenode _null_ _null_ _null_ ));
+DESCR("relation of a filenode identifier");
DATA(insert OID = 3034 ( pg_relation_filepath PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 25 "2205" _null_ _null_ _null_ _null_ pg_relation_filepath _null_ _null_ _null_ ));
DESCR("file path of relation");
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index c9c665d..8ee4c3c 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -458,6 +458,7 @@ extern Datum pg_table_size(PG_FUNCTION_ARGS);
extern Datum pg_indexes_size(PG_FUNCTION_ARGS);
extern Datum pg_relation_filenode(PG_FUNCTION_ARGS);
extern Datum pg_relation_filepath(PG_FUNCTION_ARGS);
+extern Datum pg_relation_by_filenode(PG_FUNCTION_ARGS);
/* genfile.c */
extern bytea *read_binary_file(const char *filename,
Andres Freund <andres@2ndquadrant.com> writes:
This requires the previously added RELFILENODE syscache.
[ raised eyebrow... ] There's a RELFILENODE syscache? I don't see one,
and I doubt it would work given that the contents of
pg_class.relfilenode aren't unique (the zero entries are the problem).
regards, tom lane
On Monday, September 17, 2012 12:35:32 AM Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
This requires the previously added RELFILENODE syscache.
[ raised eyebrow... ] There's a RELFILENODE syscache? I don't see one,
and I doubt it would work given that the contents of
pg_class.relfilenode aren't unique (the zero entries are the problem).
Well, one patch upthread ;). It mentions the problem of it not being unique, due
to the relfilenode in (reltablespace, relfilenode) being 0 for shared/nailed
catalogs.
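For anyone following along, the duplicated entries are easy to see; a mapped
relation's real filenode lives in the relation mapper, not in pg_class:

-- shared and nailed catalogs all store 0 here
SELECT relname, relisshared, relfilenode
FROM pg_class
WHERE relfilenode = 0
ORDER BY relname;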
I am not really sure yet how big a problem it is for the caching infrastructure
that values which shouldn't ever get queried (because the relfilenode is
actually different) are duplicated. I am reading code about all that at the moment.
Robert suggested writing a specialized cache akin to what's done in
attoptcache.c or such.
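For concreteness, here is a minimal sketch of what such a specialized cache
might look like, modeled on attoptcache.c (all names are hypothetical;
invalidation handling and the pg_class scan on a cache miss are omitted):

#include "postgres.h"

#include "utils/hsearch.h"

typedef struct RelfilenodeCacheKey
{
	Oid			reltablespace;
	Oid			relfilenode;
} RelfilenodeCacheKey;

typedef struct RelfilenodeCacheEntry
{
	RelfilenodeCacheKey key;	/* hash key, must be first */
	Oid			reloid;			/* InvalidOid for a negative entry */
} RelfilenodeCacheEntry;

static HTAB *RelfilenodeCacheHash = NULL;

static void
InitializeRelfilenodeCache(void)
{
	HASHCTL		ctl;

	MemSet(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(RelfilenodeCacheKey);
	ctl.entrysize = sizeof(RelfilenodeCacheEntry);
	ctl.hash = tag_hash;
	RelfilenodeCacheHash = hash_create("Relfilenode cache", 64, &ctl,
									   HASH_ELEM | HASH_FUNCTION);

	/*
	 * A real version would also register an invalidation callback here,
	 * like attoptcache.c's InitializeAttoptCache() does.
	 */
}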
I haven't formed an opinion on what's the way forward on that topic. But anyway,
I don't see how the wal decoding stuff can progress without some variant of
that mapping, so I sure hope I/we can build something. Changing that aspect of
the patch should be trivial...
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 15.09.2012 03:39, Andres Freund wrote:
Features:
- streaming reading/writing
- filtering
- reassembly of records
Reusing the ReadRecord infrastructure in situations where the code that wants
to do so is not tightly integrated into xlog.c is rather hard and would require
changes to rather integral parts of the recovery code, which doesn't seem to be
a good idea.
My previous objections to this approach still apply:
1. I don't want to maintain a second copy of the code to read xlog.
2. We should focus on reading WAL; I don't see the point of mixing WAL writing into this.
3. I don't like the callback-style API.
I came up with the attached. I moved ReadRecord and some supporting
functions from xlog.c to xlogreader.c, and made it operate on
XLogReaderState instead of global variables. As discussed before,
I didn't like the callback-style API, I think the consumer of the API
should rather just call ReadRecord repeatedly to get each record. So
that's what I did.
There is still one callback, XLogPageRead(), to obtain a given page in
WAL. The XLogReader facility is responsible for decoding the WAL into
records, but the user of the facility is responsible for supplying the
physical bytes, via the callback.
So the usage is like this:
/*
* Callback to read the page starting at 'RecPtr' into *readBuf. It's
* up to you to do this any way you like. Typically you'd read from a
* file. The WAL recovery implementation of this in xlog.c is more
* complicated. It checks the archive, waits for streaming replication
* etc.
*/
static bool
MyXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
bool randAccess, char *readBuf, void *private_data)
{
...
}

state = XLogReaderAllocate(startpoint, &MyXLogPageRead, NULL);
while ((record = XLogReadRecord(state, ...)))
{
/* do something with the record */
}
XLogReaderFree(state);
- Heikki
Attachments:
xlogreader-heikki-1.patch (text/x-diff)
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index f82f10e..660b5fc 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
- twophase.o twophase_rmgr.o xlog.o xlogfuncs.o xlogutils.o
+ twophase.o twophase_rmgr.o xlog.o xlogfuncs.o xlogreader.o xlogutils.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ff56c26..769ddea 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
#include "access/twophase.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
+#include "access/xlogreader.h"
#include "access/xlogutils.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -541,6 +542,8 @@ static uint32 readOff = 0;
static uint32 readLen = 0;
static int readSource = 0; /* XLOG_FROM_* code */
+static bool fetching_ckpt_global;
+
/*
* Keeps track of which sources we've tried to read the current WAL
* record from and failed.
@@ -556,13 +559,6 @@ static int failedSources = 0; /* OR of XLOG_FROM_* codes */
static TimestampTz XLogReceiptTime = 0;
static int XLogReceiptSource = 0; /* XLOG_FROM_* code */
-/* Buffer for currently read page (XLOG_BLCKSZ bytes) */
-static char *readBuf = NULL;
-
-/* Buffer for current ReadRecord result (expandable) */
-static char *readRecordBuf = NULL;
-static uint32 readRecordBufSize = 0;
-
/* State information for XLOG reading */
static XLogRecPtr ReadRecPtr; /* start of last record read */
static XLogRecPtr EndRecPtr; /* end+1 of last record read */
@@ -632,9 +628,8 @@ static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
int source, bool notexistOk);
static int XLogFileReadAnyTLI(XLogSegNo segno, int emode, int sources);
-static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
- bool randAccess);
-static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
+static bool XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
+ int emode, bool randAccess, char *readBuf, void *private_data);
static void XLogFileClose(void);
static bool RestoreArchivedFile(char *path, const char *xlogfname,
const char *recovername, off_t expectedSize);
@@ -646,12 +641,10 @@ static void UpdateLastRemovedPtr(char *filename);
static void ValidateXLOGDirectoryStructure(void);
static void CleanupBackupHistory(void);
static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
-static XLogRecord *ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt);
-static void CheckRecoveryConsistency(void);
+static XLogRecord *ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode, bool fetching_ckpt);
+static void CheckRecoveryConsistency(XLogRecPtr EndRecPtr);
static bool ValidXLogPageHeader(XLogPageHeader hdr, int emode);
-static bool ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record,
- int emode, bool randAccess);
-static XLogRecord *ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt);
+static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int whichChkpt);
static List *readTimeLineHistory(TimeLineID targetTLI);
static bool existsTimeLineHistory(TimeLineID probeTLI);
static bool rescanLatestTimeLine(void);
@@ -3703,102 +3696,6 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
}
/*
- * CRC-check an XLOG record. We do not believe the contents of an XLOG
- * record (other than to the minimal extent of computing the amount of
- * data to read in) until we've checked the CRCs.
- *
- * We assume all of the record (that is, xl_tot_len bytes) has been read
- * into memory at *record. Also, ValidXLogRecordHeader() has accepted the
- * record's header, which means in particular that xl_tot_len is at least
- * SizeOfXlogRecord, so it is safe to fetch xl_len.
- */
-static bool
-RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
-{
- pg_crc32 crc;
- int i;
- uint32 len = record->xl_len;
- BkpBlock bkpb;
- char *blk;
- size_t remaining = record->xl_tot_len;
-
- /* First the rmgr data */
- if (remaining < SizeOfXLogRecord + len)
- {
- /* ValidXLogRecordHeader() should've caught this already... */
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("invalid record length at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- remaining -= SizeOfXLogRecord + len;
- INIT_CRC32(crc);
- COMP_CRC32(crc, XLogRecGetData(record), len);
-
- /* Add in the backup blocks, if any */
- blk = (char *) XLogRecGetData(record) + len;
- for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
- {
- uint32 blen;
-
- if (!(record->xl_info & XLR_SET_BKP_BLOCK(i)))
- continue;
-
- if (remaining < sizeof(BkpBlock))
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("invalid backup block size in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- memcpy(&bkpb, blk, sizeof(BkpBlock));
-
- if (bkpb.hole_offset + bkpb.hole_length > BLCKSZ)
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("incorrect hole size in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- blen = sizeof(BkpBlock) + BLCKSZ - bkpb.hole_length;
-
- if (remaining < blen)
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("invalid backup block size in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- remaining -= blen;
- COMP_CRC32(crc, blk, blen);
- blk += blen;
- }
-
- /* Check that xl_tot_len agrees with our calculation */
- if (remaining != 0)
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("incorrect total length in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
-
- /* Finally include the record header */
- COMP_CRC32(crc, (char *) record, offsetof(XLogRecord, xl_crc));
- FIN_CRC32(crc);
-
- if (!EQ_CRC32(record->xl_crc, crc))
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("incorrect resource manager data checksum in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
-
- return true;
-}
-
-/*
* Attempt to read an XLOG record.
*
* If RecPtr is not NULL, try to read a record at that position. Otherwise
@@ -3811,290 +3708,35 @@ RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
* the returned record pointer always points there.
*/
static XLogRecord *
-ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
+ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode, bool fetching_ckpt)
{
XLogRecord *record;
- XLogRecPtr tmpRecPtr = EndRecPtr;
- bool randAccess = false;
- uint32 len,
- total_len;
- uint32 targetRecOff;
- uint32 pageHeaderSize;
- bool gotheader;
-
- if (readBuf == NULL)
- {
- /*
- * First time through, permanently allocate readBuf. We do it this
- * way, rather than just making a static array, for two reasons: (1)
- * no need to waste the storage in most instantiations of the backend;
- * (2) a static char array isn't guaranteed to have any particular
- * alignment, whereas malloc() will provide MAXALIGN'd storage.
- */
- readBuf = (char *) malloc(XLOG_BLCKSZ);
- Assert(readBuf != NULL);
- }
-
- if (RecPtr == NULL)
- {
- RecPtr = &tmpRecPtr;
- /*
- * RecPtr is pointing to end+1 of the previous WAL record. If
- * we're at a page boundary, no more records can fit on the current
- * page. We must skip over the page header, but we can't do that
- * until we've read in the page, since the header size is variable.
- */
- }
- else
- {
- /*
- * In this case, the passed-in record pointer should already be
- * pointing to a valid record starting position.
- */
- if (!XRecOffIsValid(*RecPtr))
- ereport(PANIC,
- (errmsg("invalid record offset at %X/%X",
- (uint32) (*RecPtr >> 32), (uint32) *RecPtr)));
-
- /*
- * Since we are going to a random position in WAL, forget any prior
- * state about what timeline we were in, and allow it to be any
- * timeline in expectedTLIs. We also set a flag to allow curFileTLI
- * to go backwards (but we can't reset that variable right here, since
- * we might not change files at all).
- */
+ if (!XLogRecPtrIsInvalid(RecPtr))
lastPageTLI = 0; /* see comment in ValidXLogPageHeader */
- randAccess = true; /* allow curFileTLI to go backwards too */
- }
+
+ fetching_ckpt_global = fetching_ckpt;
/* This is the first try to read this page. */
failedSources = 0;
-retry:
- /* Read the page containing the record */
- if (!XLogPageRead(RecPtr, emode, fetching_ckpt, randAccess))
- return NULL;
-
- pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
- targetRecOff = (*RecPtr) % XLOG_BLCKSZ;
- if (targetRecOff == 0)
+ do
{
- /*
- * At page start, so skip over page header. The Assert checks that
- * we're not scribbling on caller's record pointer; it's OK because we
- * can only get here in the continuing-from-prev-record case, since
- * XRecOffIsValid rejected the zero-page-offset case otherwise.
- */
- Assert(RecPtr == &tmpRecPtr);
- (*RecPtr) += pageHeaderSize;
- targetRecOff = pageHeaderSize;
- }
- else if (targetRecOff < pageHeaderSize)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid record offset at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- goto next_record_is_invalid;
- }
- if ((((XLogPageHeader) readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
- targetRecOff == pageHeaderSize)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("contrecord is requested by %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- goto next_record_is_invalid;
- }
-
- /*
- * Read the record length.
- *
- * NB: Even though we use an XLogRecord pointer here, the whole record
- * header might not fit on this page. xl_tot_len is the first field of
- * the struct, so it must be on this page (the records are MAXALIGNed),
- * but we cannot access any other fields until we've verified that we
- * got the whole header.
- */
- record = (XLogRecord *) (readBuf + (*RecPtr) % XLOG_BLCKSZ);
- total_len = record->xl_tot_len;
-
- /*
- * If the whole record header is on this page, validate it immediately.
- * Otherwise do just a basic sanity check on xl_tot_len, and validate the
- * rest of the header after reading it from the next page. The xl_tot_len
- * check is necessary here to ensure that we enter the "Need to reassemble
- * record" code path below; otherwise we might fail to apply
- * ValidXLogRecordHeader at all.
- */
- if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
- {
- if (!ValidXLogRecordHeader(RecPtr, record, emode, randAccess))
- goto next_record_is_invalid;
- gotheader = true;
- }
- else
- {
- if (total_len < SizeOfXLogRecord)
+ record = XLogReadRecord(xlogreader, RecPtr, emode);
+ ReadRecPtr = xlogreader->ReadRecPtr;
+ EndRecPtr = xlogreader->EndRecPtr;
+ if (record == NULL)
{
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid record length at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- goto next_record_is_invalid;
- }
- gotheader = false;
- }
-
- /*
- * Allocate or enlarge readRecordBuf as needed. To avoid useless small
- * increases, round its size to a multiple of XLOG_BLCKSZ, and make sure
- * it's at least 4*Max(BLCKSZ, XLOG_BLCKSZ) to start with. (That is
- * enough for all "normal" records, but very large commit or abort records
- * might need more space.)
- */
- if (total_len > readRecordBufSize)
- {
- uint32 newSize = total_len;
+ failedSources |= readSource;
- newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
- newSize = Max(newSize, 4 * Max(BLCKSZ, XLOG_BLCKSZ));
- if (readRecordBuf)
- free(readRecordBuf);
- readRecordBuf = (char *) malloc(newSize);
- if (!readRecordBuf)
- {
- readRecordBufSize = 0;
- /* We treat this as a "bogus data" condition */
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("record length %u at %X/%X too long",
- total_len, (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- goto next_record_is_invalid;
- }
- readRecordBufSize = newSize;
- }
-
- len = XLOG_BLCKSZ - (*RecPtr) % XLOG_BLCKSZ;
- if (total_len > len)
- {
- /* Need to reassemble record */
- char *contrecord;
- XLogPageHeader pageHeader;
- XLogRecPtr pagelsn;
- char *buffer;
- uint32 gotlen;
-
- /* Initialize pagelsn to the beginning of the page this record is on */
- pagelsn = ((*RecPtr) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
-
- /* Copy the first fragment of the record from the first page. */
- memcpy(readRecordBuf, readBuf + (*RecPtr) % XLOG_BLCKSZ, len);
- buffer = readRecordBuf + len;
- gotlen = len;
-
- do
- {
- /* Calculate pointer to beginning of next page */
- XLByteAdvance(pagelsn, XLOG_BLCKSZ);
- /* Wait for the next page to become available */
- if (!XLogPageRead(&pagelsn, emode, false, false))
- return NULL;
-
- /* Check that the continuation on next page looks valid */
- pageHeader = (XLogPageHeader) readBuf;
- if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("there is no contrecord flag in log segment %s, offset %u",
- XLogFileNameP(curFileTLI, readSegNo),
- readOff)));
- goto next_record_is_invalid;
- }
- /*
- * Cross-check that xlp_rem_len agrees with how much of the record
- * we expect there to be left.
- */
- if (pageHeader->xlp_rem_len == 0 ||
- total_len != (pageHeader->xlp_rem_len + gotlen))
+ if (readFile >= 0)
{
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid contrecord length %u in log segment %s, offset %u",
- pageHeader->xlp_rem_len,
- XLogFileNameP(curFileTLI, readSegNo),
- readOff)));
- goto next_record_is_invalid;
+ close(readFile);
+ readFile = -1;
}
+ }
+ } while(StandbyMode && record == NULL);
- /* Append the continuation from this page to the buffer */
- pageHeaderSize = XLogPageHeaderSize(pageHeader);
- contrecord = (char *) readBuf + pageHeaderSize;
- len = XLOG_BLCKSZ - pageHeaderSize;
- if (pageHeader->xlp_rem_len < len)
- len = pageHeader->xlp_rem_len;
- memcpy(buffer, (char *) contrecord, len);
- buffer += len;
- gotlen += len;
-
- /* If we just reassembled the record header, validate it. */
- if (!gotheader)
- {
- record = (XLogRecord *) readRecordBuf;
- if (!ValidXLogRecordHeader(RecPtr, record, emode, randAccess))
- goto next_record_is_invalid;
- gotheader = true;
- }
- } while (pageHeader->xlp_rem_len > len);
-
- record = (XLogRecord *) readRecordBuf;
- if (!RecordIsValid(record, *RecPtr, emode))
- goto next_record_is_invalid;
- pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
- XLogSegNoOffsetToRecPtr(
- readSegNo,
- readOff + pageHeaderSize + MAXALIGN(pageHeader->xlp_rem_len),
- EndRecPtr);
- ReadRecPtr = *RecPtr;
- }
- else
- {
- /* Record does not cross a page boundary */
- if (!RecordIsValid(record, *RecPtr, emode))
- goto next_record_is_invalid;
- EndRecPtr = *RecPtr + MAXALIGN(total_len);
-
- ReadRecPtr = *RecPtr;
- memcpy(readRecordBuf, record, total_len);
- }
-
- /*
- * Special processing if it's an XLOG SWITCH record
- */
- if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
- {
- /* Pretend it extends to end of segment */
- EndRecPtr += XLogSegSize - 1;
- EndRecPtr -= EndRecPtr % XLogSegSize;
-
- /*
- * Pretend that readBuf contains the last page of the segment. This is
- * just to avoid Assert failure in StartupXLOG if XLOG ends with this
- * segment.
- */
- readOff = XLogSegSize - XLOG_BLCKSZ;
- }
return record;
-
-next_record_is_invalid:
- failedSources |= readSource;
-
- if (readFile >= 0)
- {
- close(readFile);
- readFile = -1;
- }
-
- /* In standby-mode, keep trying */
- if (StandbyMode)
- goto retry;
- else
- return NULL;
}
/*
@@ -4223,88 +3865,6 @@ ValidXLogPageHeader(XLogPageHeader hdr, int emode)
}
/*
- * Validate an XLOG record header.
- *
- * This is just a convenience subroutine to avoid duplicated code in
- * ReadRecord. It's not intended for use from anywhere else.
- */
-static bool
-ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record, int emode,
- bool randAccess)
-{
- /*
- * xl_len == 0 is bad data for everything except XLOG SWITCH, where it is
- * required.
- */
- if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
- {
- if (record->xl_len != 0)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid xlog switch record at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- }
- else if (record->xl_len == 0)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("record with zero length at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
- record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
- XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid record length at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- if (record->xl_rmid > RM_MAX_ID)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid resource manager ID %u at %X/%X",
- record->xl_rmid, (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- if (randAccess)
- {
- /*
- * We can't exactly verify the prev-link, but surely it should be less
- * than the record's own address.
- */
- if (!XLByteLT(record->xl_prev, *RecPtr))
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("record with incorrect prev-link %X/%X at %X/%X",
- (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- }
- else
- {
- /*
- * Record's prev-link should exactly match our previous location. This
- * check guards against torn WAL pages where a stale but valid-looking
- * WAL record starts on a sector boundary.
- */
- if (!XLByteEQ(record->xl_prev, ReadRecPtr))
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("record with incorrect prev-link %X/%X at %X/%X",
- (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- }
-
- return true;
-}
-
-/*
* Try to read a timeline's history file.
*
* If successful, return the list of component TLIs (the given TLI followed by
@@ -6089,6 +5649,7 @@ StartupXLOG(void)
bool backupEndRequired = false;
bool backupFromStandby = false;
DBState dbstate_at_startup;
+ XLogReaderState *xlogreader;
/*
* Read control file and check XLOG status looks valid.
@@ -6222,6 +5783,8 @@ StartupXLOG(void)
if (StandbyMode)
OwnLatch(&XLogCtl->recoveryWakeupLatch);
+ xlogreader = XLogReaderAllocate(InvalidXLogRecPtr, &XLogPageRead, NULL);
+
if (read_backup_label(&checkPointLoc, &backupEndRequired,
&backupFromStandby))
{
@@ -6229,7 +5792,7 @@ StartupXLOG(void)
* When a backup_label file is present, we want to roll forward from
* the checkpoint it identifies, rather than using pg_control.
*/
- record = ReadCheckpointRecord(checkPointLoc, 0);
+ record = ReadCheckpointRecord(xlogreader, checkPointLoc, 0);
if (record != NULL)
{
memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
@@ -6247,7 +5810,7 @@ StartupXLOG(void)
*/
if (XLByteLT(checkPoint.redo, checkPointLoc))
{
- if (!ReadRecord(&(checkPoint.redo), LOG, false))
+ if (!ReadRecord(xlogreader, checkPoint.redo, LOG, false))
ereport(FATAL,
(errmsg("could not find redo location referenced by checkpoint record"),
errhint("If you are not restoring from a backup, try removing the file \"%s/backup_label\".", DataDir)));
@@ -6271,7 +5834,7 @@ StartupXLOG(void)
*/
checkPointLoc = ControlFile->checkPoint;
RedoStartLSN = ControlFile->checkPointCopy.redo;
- record = ReadCheckpointRecord(checkPointLoc, 1);
+ record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1);
if (record != NULL)
{
ereport(DEBUG1,
@@ -6290,7 +5853,7 @@ StartupXLOG(void)
else
{
checkPointLoc = ControlFile->prevCheckPoint;
- record = ReadCheckpointRecord(checkPointLoc, 2);
+ record = ReadCheckpointRecord(xlogreader, checkPointLoc, 2);
if (record != NULL)
{
ereport(LOG,
@@ -6591,7 +6154,7 @@ StartupXLOG(void)
* Allow read-only connections immediately if we're consistent
* already.
*/
- CheckRecoveryConsistency();
+ CheckRecoveryConsistency(EndRecPtr);
/*
* Find the first record that logically follows the checkpoint --- it
@@ -6600,12 +6163,12 @@ StartupXLOG(void)
if (XLByteLT(checkPoint.redo, RecPtr))
{
/* back up to find the record */
- record = ReadRecord(&(checkPoint.redo), PANIC, false);
+ record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);
}
else
{
/* just have to read next record after CheckPoint */
- record = ReadRecord(NULL, LOG, false);
+ record = ReadRecord(xlogreader, InvalidXLogRecPtr, LOG, false);
}
if (record != NULL)
@@ -6652,7 +6215,7 @@ StartupXLOG(void)
HandleStartupProcInterrupts();
/* Allow read-only connections if we're consistent now */
- CheckRecoveryConsistency();
+ CheckRecoveryConsistency(EndRecPtr);
/*
* Have we reached our recovery target?
@@ -6756,7 +6319,7 @@ StartupXLOG(void)
LastRec = ReadRecPtr;
- record = ReadRecord(NULL, LOG, false);
+ record = ReadRecord(xlogreader, InvalidXLogRecPtr, LOG, false);
} while (record != NULL && recoveryContinue);
/*
@@ -6806,7 +6369,7 @@ StartupXLOG(void)
* Re-fetch the last valid or last applied record, so we can identify the
* exact endpoint of what we consider the valid portion of WAL.
*/
- record = ReadRecord(&LastRec, PANIC, false);
+ record = ReadRecord(xlogreader, LastRec, PANIC, false);
EndOfLog = EndRecPtr;
XLByteToPrevSeg(EndOfLog, endLogSegNo);
@@ -6905,8 +6468,15 @@ StartupXLOG(void)
* record spans, not the one it starts in. The last block is indeed the
* one we want to use.
*/
- Assert(readOff == (XLogCtl->xlblocks[0] - XLOG_BLCKSZ) % XLogSegSize);
- memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
+ if (EndOfLog % XLOG_BLCKSZ == 0)
+ {
+ memset(Insert->currpage, 0, XLOG_BLCKSZ);
+ }
+ else
+ {
+ Assert(readOff == (XLogCtl->xlblocks[0] - XLOG_BLCKSZ) % XLogSegSize);
+ memcpy((char *) Insert->currpage, xlogreader->readBuf, XLOG_BLCKSZ);
+ }
Insert->currpos = (char *) Insert->currpage +
(EndOfLog + XLOG_BLCKSZ - XLogCtl->xlblocks[0]);
@@ -7063,17 +6633,7 @@ StartupXLOG(void)
close(readFile);
readFile = -1;
}
- if (readBuf)
- {
- free(readBuf);
- readBuf = NULL;
- }
- if (readRecordBuf)
- {
- free(readRecordBuf);
- readRecordBuf = NULL;
- readRecordBufSize = 0;
- }
+ XLogReaderFree(xlogreader);
/*
* If any of the critical GUCs have changed, log them before we allow
@@ -7104,7 +6664,7 @@ StartupXLOG(void)
* that it can start accepting read-only connections.
*/
static void
-CheckRecoveryConsistency(void)
+CheckRecoveryConsistency(XLogRecPtr EndRecPtr)
{
/*
* During crash recovery, we don't reach a consistent state until we've
@@ -7284,7 +6844,7 @@ LocalSetXLogInsertAllowed(void)
* 1 for "primary", 2 for "secondary", 0 for "other" (backup_label)
*/
static XLogRecord *
-ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt)
+ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int whichChkpt)
{
XLogRecord *record;
@@ -7308,7 +6868,7 @@ ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt)
return NULL;
}
- record = ReadRecord(&RecPtr, LOG, true);
+ record = ReadRecord(xlogreader, RecPtr, LOG, true);
if (record == NULL)
{
@@ -10100,19 +9660,21 @@ CancelBackup(void)
* sleep and retry.
*/
static bool
-XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
- bool randAccess)
+XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
+ bool randAccess, char *readBuf, void *private_data)
{
+ /* TODO: these, and fetching_ckpt, would be better in private_data */
static XLogRecPtr receivedUpto = 0;
+ static pg_time_t last_fail_time = 0;
+ bool fetching_ckpt = fetching_ckpt_global;
bool switched_segment = false;
uint32 targetPageOff;
uint32 targetRecOff;
XLogSegNo targetSegNo;
- static pg_time_t last_fail_time = 0;
- XLByteToSeg(*RecPtr, targetSegNo);
- targetPageOff = (((*RecPtr) % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
- targetRecOff = (*RecPtr) % XLOG_BLCKSZ;
+ XLByteToSeg(RecPtr, targetSegNo);
+ targetPageOff = ((RecPtr % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
+ targetRecOff = RecPtr % XLOG_BLCKSZ;
/* Fast exit if we have read the record in the current buffer already */
if (failedSources == 0 && targetSegNo == readSegNo &&
@@ -10123,7 +9685,7 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
* See if we need to switch to a new segment because the requested record
* is not in the currently open one.
*/
- if (readFile >= 0 && !XLByteInSeg(*RecPtr, readSegNo))
+ if (readFile >= 0 && !XLByteInSeg(RecPtr, readSegNo))
{
/*
* Request a restartpoint if we've replayed too much xlog since the
@@ -10144,12 +9706,12 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
readSource = 0;
}
- XLByteToSeg(*RecPtr, readSegNo);
+ XLByteToSeg(RecPtr, readSegNo);
retry:
/* See if we need to retrieve more data */
if (readFile < 0 ||
- (readSource == XLOG_FROM_STREAM && !XLByteLT(*RecPtr, receivedUpto)))
+ (readSource == XLOG_FROM_STREAM && !XLByteLT(RecPtr, receivedUpto)))
{
if (StandbyMode)
{
@@ -10192,17 +9754,17 @@ retry:
* XLogReceiptTime will not advance, so the grace time
* alloted to conflicting queries will decrease.
*/
- if (XLByteLT(*RecPtr, receivedUpto))
+ if (XLByteLT(RecPtr, receivedUpto))
havedata = true;
else
{
XLogRecPtr latestChunkStart;
receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart);
- if (XLByteLT(*RecPtr, receivedUpto))
+ if (XLByteLT(RecPtr, receivedUpto))
{
havedata = true;
- if (!XLByteLT(*RecPtr, latestChunkStart))
+ if (!XLByteLT(RecPtr, latestChunkStart))
{
XLogReceiptTime = GetCurrentTimestamp();
SetCurrentChunkStartTime(XLogReceiptTime);
@@ -10321,7 +9883,7 @@ retry:
if (PrimaryConnInfo)
{
RequestXLogStreaming(
- fetching_ckpt ? RedoStartLSN : *RecPtr,
+ fetching_ckpt ? RedoStartLSN : RecPtr,
PrimaryConnInfo);
continue;
}
@@ -10393,7 +9955,7 @@ retry:
*/
if (readSource == XLOG_FROM_STREAM)
{
- if (((*RecPtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
+ if (((RecPtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
{
readLen = XLOG_BLCKSZ;
}
@@ -10417,7 +9979,7 @@ retry:
{
char fname[MAXFNAMELEN];
XLogFileName(fname, curFileTLI, readSegNo);
- ereport(emode_for_corrupt_record(emode, *RecPtr),
+ ereport(emode_for_corrupt_record(emode, RecPtr),
(errcode_for_file_access(),
errmsg("could not read from log segment %s, offset %u: %m",
fname, readOff)));
@@ -10433,7 +9995,7 @@ retry:
{
char fname[MAXFNAMELEN];
XLogFileName(fname, curFileTLI, readSegNo);
- ereport(emode_for_corrupt_record(emode, *RecPtr),
+ ereport(emode_for_corrupt_record(emode, RecPtr),
(errcode_for_file_access(),
errmsg("could not seek in log segment %s to offset %u: %m",
fname, readOff)));
@@ -10443,7 +10005,7 @@ retry:
{
char fname[MAXFNAMELEN];
XLogFileName(fname, curFileTLI, readSegNo);
- ereport(emode_for_corrupt_record(emode, *RecPtr),
+ ereport(emode_for_corrupt_record(emode, RecPtr),
(errcode_for_file_access(),
errmsg("could not read from log segment %s, offset %u: %m",
fname, readOff)));
@@ -10501,7 +10063,7 @@ triggered:
* you are about to ereport(), or you might cause a later message to be
* erroneously suppressed.
*/
-static int
+int
emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
{
static XLogRecPtr lastComplaint = 0;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
new file mode 100644
index 0000000..8ba05b1
--- /dev/null
+++ b/src/backend/access/transam/xlogreader.c
@@ -0,0 +1,496 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogreader.c
+ * Generic xlog reading facility
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogreader.c
+ *
+ * NOTES
+ * Documentation about how to use this interface can be found in
+ * xlogreader.h, more specifically in the definition of the
+ * XLogReaderState struct where all parameters are documented.
+ *
+ * TODO:
+ * * usable without backend code around
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/transam.h"
+#include "access/xlog_internal.h"
+#include "access/xlogreader.h"
+#include "catalog/pg_control.h"
+
+static bool ValidXLogRecordHeader(XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr,
+ XLogRecord *record, int emode, bool randAccess);
+static bool RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode);
+
+/*
+ * Initialize a new xlog reader
+ */
+XLogReaderState *
+XLogReaderAllocate(XLogRecPtr startpoint,
+ XLogPageReadCB pagereadfunc, void *private_data)
+{
+ XLogReaderState *state;
+
+ state = (XLogReaderState *) palloc0(sizeof(XLogReaderState));
+
+ /*
+ * First time through, permanently allocate readBuf. We do it this
+ * way, rather than just making a static array, for two reasons: (1)
+ * no need to waste the storage in most instantiations of the backend;
+ * (2) a static char array isn't guaranteed to have any particular
+ * alignment, whereas malloc() will provide MAXALIGN'd storage.
+ */
+ state->readBuf = (char *) palloc(XLOG_BLCKSZ);
+
+ state->read_page = pagereadfunc;
+ state->private_data = private_data;
+ state->EndRecPtr = startpoint;
+
+ return state;
+}
+
+void
+XLogReaderFree(XLogReaderState *state)
+{
+ if (state->readRecordBuf)
+ pfree(state->readRecordBuf);
+ pfree(state->readBuf);
+ pfree(state);
+}
+
+/*
+ * Attempt to read an XLOG record.
+ *
+ * If RecPtr is not NULL, try to read a record at that position. Otherwise
+ * try to read a record just after the last one previously read.
+ *
+ * If no valid record is available, returns NULL, or fails if emode is PANIC.
+ * (emode must be either PANIC or LOG)
+ *
+ * The record is copied into readRecordBuf, so that on successful return,
+ * the returned record pointer always points there.
+ */
+XLogRecord *
+XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, int emode)
+{
+ XLogRecord *record;
+ XLogRecPtr tmpRecPtr = state->EndRecPtr;
+ bool randAccess = false;
+ uint32 len,
+ total_len;
+ uint32 targetRecOff;
+ uint32 pageHeaderSize;
+ bool gotheader;
+
+ if (RecPtr == InvalidXLogRecPtr)
+ {
+ RecPtr = tmpRecPtr;
+
+ /*
+ * RecPtr is pointing to end+1 of the previous WAL record. If
+ * we're at a page boundary, no more records can fit on the current
+ * page. We must skip over the page header, but we can't do that
+ * until we've read in the page, since the header size is variable.
+ */
+ }
+ else
+ {
+ /*
+ * In this case, the passed-in record pointer should already be
+ * pointing to a valid record starting position.
+ */
+ if (!XRecOffIsValid(RecPtr))
+ ereport(PANIC,
+ (errmsg("invalid record offset at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ randAccess = true; /* allow curFileTLI to go backwards too */
+ }
+
+ /* Read the page containing the record */
+ if (!state->read_page(state, RecPtr, emode, randAccess, state->readBuf, state->private_data))
+ return NULL;
+
+ pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
+ targetRecOff = RecPtr % XLOG_BLCKSZ;
+ if (targetRecOff == 0)
+ {
+ /*
+ * At page start, so skip over page header. The Assert checks that
+ * we're not scribbling on caller's record pointer; it's OK because we
+ * can only get here in the continuing-from-prev-record case, since
+ * XRecOffIsValid rejected the zero-page-offset case otherwise.
+ * XXX: does this assert make sense now that RecPtr is not a pointer?
+ */
+ Assert(RecPtr == tmpRecPtr);
+ RecPtr += pageHeaderSize;
+ targetRecOff = pageHeaderSize;
+ }
+ else if (targetRecOff < pageHeaderSize)
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("invalid record offset at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ goto next_record_is_invalid;
+ }
+ if ((((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
+ targetRecOff == pageHeaderSize)
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("contrecord is requested by %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ goto next_record_is_invalid;
+ }
+
+ /*
+ * Read the record length.
+ *
+ * NB: Even though we use an XLogRecord pointer here, the whole record
+ * header might not fit on this page. xl_tot_len is the first field of
+ * the struct, so it must be on this page (the records are MAXALIGNed),
+ * but we cannot access any other fields until we've verified that we
+ * got the whole header.
+ */
+ record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+ total_len = record->xl_tot_len;
+
+ /*
+ * If the whole record header is on this page, validate it immediately.
+ * Otherwise do just a basic sanity check on xl_tot_len, and validate the
+ * rest of the header after reading it from the next page. The xl_tot_len
+ * check is necessary here to ensure that we enter the "Need to reassemble
+ * record" code path below; otherwise we might fail to apply
+ * ValidXLogRecordHeader at all.
+ */
+ if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
+ {
+ if (!ValidXLogRecordHeader(RecPtr, state->ReadRecPtr, record, emode, randAccess))
+ goto next_record_is_invalid;
+ gotheader = true;
+ }
+ else
+ {
+ if (total_len < SizeOfXLogRecord)
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("invalid record length at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ goto next_record_is_invalid;
+ }
+ gotheader = false;
+ }
+
+ /*
+ * Allocate or enlarge readRecordBuf as needed. To avoid useless small
+ * increases, round its size to a multiple of XLOG_BLCKSZ, and make sure
+ * it's at least 4*Max(BLCKSZ, XLOG_BLCKSZ) to start with. (That is
+ * enough for all "normal" records, but very large commit or abort records
+ * might need more space.)
+ */
+ if (total_len > state->readRecordBufSize)
+ {
+ uint32 newSize = total_len;
+
+ newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
+ newSize = Max(newSize, 4 * Max(BLCKSZ, XLOG_BLCKSZ));
+ if (state->readRecordBuf)
+ pfree(state->readRecordBuf);
+ state->readRecordBuf = (char *) palloc(newSize);
+ if (!state->readRecordBuf)
+ {
+ state->readRecordBufSize = 0;
+ /* We treat this as a "bogus data" condition */
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("record length %u at %X/%X too long",
+ total_len, (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ goto next_record_is_invalid;
+ }
+ state->readRecordBufSize = newSize;
+ }
+
+ len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
+ if (total_len > len)
+ {
+ /* Need to reassemble record */
+ char *contrecord;
+ XLogPageHeader pageHeader;
+ XLogRecPtr pagelsn;
+ char *buffer;
+ uint32 gotlen;
+
+ /* Initialize pagelsn to the beginning of the page this record is on */
+ pagelsn = (RecPtr / XLOG_BLCKSZ) * XLOG_BLCKSZ;
+
+ /* Copy the first fragment of the record from the first page. */
+ memcpy(state->readRecordBuf, state->readBuf + RecPtr % XLOG_BLCKSZ, len);
+ buffer = state->readRecordBuf + len;
+ gotlen = len;
+
+ do
+ {
+ /* Calculate pointer to beginning of next page */
+ XLByteAdvance(pagelsn, XLOG_BLCKSZ);
+ /* Wait for the next page to become available */
+ if (!state->read_page(state, pagelsn, emode, false, state->readBuf, NULL))
+ return NULL;
+
+ /* Check that the continuation on next page looks valid */
+ pageHeader = (XLogPageHeader) state->readBuf;
+ if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("there is no contrecord flag at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ goto next_record_is_invalid;
+ }
+ /*
+ * Cross-check that xlp_rem_len agrees with how much of the record
+ * we expect there to be left.
+ */
+ if (pageHeader->xlp_rem_len == 0 ||
+ total_len != (pageHeader->xlp_rem_len + gotlen))
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("invalid contrecord length %u at %X/%X",
+ pageHeader->xlp_rem_len,
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ goto next_record_is_invalid;
+ }
+
+ /* Append the continuation from this page to the buffer */
+ pageHeaderSize = XLogPageHeaderSize(pageHeader);
+ contrecord = (char *) state->readBuf + pageHeaderSize;
+ len = XLOG_BLCKSZ - pageHeaderSize;
+ if (pageHeader->xlp_rem_len < len)
+ len = pageHeader->xlp_rem_len;
+ memcpy(buffer, (char *) contrecord, len);
+ buffer += len;
+ gotlen += len;
+
+ /* If we just reassembled the record header, validate it. */
+ if (!gotheader)
+ {
+ record = (XLogRecord *) state->readRecordBuf;
+ if (!ValidXLogRecordHeader(RecPtr, state->ReadRecPtr, record, emode, randAccess))
+ goto next_record_is_invalid;
+ gotheader = true;
+ }
+ } while (pageHeader->xlp_rem_len > len);
+
+ record = (XLogRecord *) state->readRecordBuf;
+ if (!RecordIsValid(record, RecPtr, emode))
+ goto next_record_is_invalid;
+ pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
+ state->ReadRecPtr = RecPtr;
+ state->EndRecPtr = pagelsn + pageHeaderSize + MAXALIGN(pageHeader->xlp_rem_len);
+ }
+ else
+ {
+ /* Record does not cross a page boundary */
+ if (!RecordIsValid(record, RecPtr, emode))
+ goto next_record_is_invalid;
+ state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+
+ state->ReadRecPtr = RecPtr;
+ memcpy(state->readRecordBuf, record, total_len);
+ }
+
+ /*
+ * Special processing if it's an XLOG SWITCH record
+ */
+ if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
+ {
+ /* Pretend it extends to end of segment */
+ state->EndRecPtr += XLogSegSize - 1;
+ state->EndRecPtr -= state->EndRecPtr % XLogSegSize;
+ }
+ return record;
+
+next_record_is_invalid:
+ return NULL;
+}
+
+/*
+ * Validate an XLOG record header.
+ *
+ * This is just a convenience subroutine to avoid duplicated code in
+ * ReadRecord. It's not intended for use from anywhere else.
+ */
+static bool
+ValidXLogRecordHeader(XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, int emode,
+ bool randAccess)
+{
+ /*
+ * xl_len == 0 is bad data for everything except XLOG SWITCH, where it is
+ * required.
+ */
+ if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
+ {
+ if (record->xl_len != 0)
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("invalid xlog switch record at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ }
+ else if (record->xl_len == 0)
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("record with zero length at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
+ record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
+ XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("invalid record length at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ if (record->xl_rmid > RM_MAX_ID)
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("invalid resource manager ID %u at %X/%X",
+ record->xl_rmid, (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ if (randAccess)
+ {
+ /*
+ * We can't exactly verify the prev-link, but surely it should be less
+ * than the record's own address.
+ */
+ if (!XLByteLT(record->xl_prev, RecPtr))
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("record with incorrect prev-link %X/%X at %X/%X",
+ (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ }
+ else
+ {
+ /*
+ * Record's prev-link should exactly match our previous location. This
+ * check guards against torn WAL pages where a stale but valid-looking
+ * WAL record starts on a sector boundary.
+ */
+ if (!XLByteEQ(record->xl_prev, PrevRecPtr))
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("record with incorrect prev-link %X/%X at %X/%X",
+ (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ }
+
+ return true;
+}
+
+
+/*
+ * CRC-check an XLOG record. We do not believe the contents of an XLOG
+ * record (other than to the minimal extent of computing the amount of
+ * data to read in) until we've checked the CRCs.
+ *
+ * We assume all of the record (that is, xl_tot_len bytes) has been read
+ * into memory at *record. Also, ValidXLogRecordHeader() has accepted the
+ * record's header, which means in particular that xl_tot_len is at least
+ * SizeOfXlogRecord, so it is safe to fetch xl_len.
+ */
+static bool
+RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
+{
+ pg_crc32 crc;
+ int i;
+ uint32 len = record->xl_len;
+ BkpBlock bkpb;
+ char *blk;
+ size_t remaining = record->xl_tot_len;
+
+ /* First the rmgr data */
+ if (remaining < SizeOfXLogRecord + len)
+ {
+ /* ValidXLogRecordHeader() should've caught this already... */
+ ereport(emode_for_corrupt_record(emode, recptr),
+ (errmsg("invalid record length at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+ remaining -= SizeOfXLogRecord + len;
+ INIT_CRC32(crc);
+ COMP_CRC32(crc, XLogRecGetData(record), len);
+
+ /* Add in the backup blocks, if any */
+ blk = (char *) XLogRecGetData(record) + len;
+ for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
+ {
+ uint32 blen;
+
+ if (!(record->xl_info & XLR_SET_BKP_BLOCK(i)))
+ continue;
+
+ if (remaining < sizeof(BkpBlock))
+ {
+ ereport(emode_for_corrupt_record(emode, recptr),
+ (errmsg("invalid backup block size in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+ memcpy(&bkpb, blk, sizeof(BkpBlock));
+
+ if (bkpb.hole_offset + bkpb.hole_length > BLCKSZ)
+ {
+ ereport(emode_for_corrupt_record(emode, recptr),
+ (errmsg("incorrect hole size in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+ blen = sizeof(BkpBlock) + BLCKSZ - bkpb.hole_length;
+
+ if (remaining < blen)
+ {
+ ereport(emode_for_corrupt_record(emode, recptr),
+ (errmsg("invalid backup block size in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+ remaining -= blen;
+ COMP_CRC32(crc, blk, blen);
+ blk += blen;
+ }
+
+ /* Check that xl_tot_len agrees with our calculation */
+ if (remaining != 0)
+ {
+ ereport(emode_for_corrupt_record(emode, recptr),
+ (errmsg("incorrect total length in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+
+ /* Finally include the record header */
+ COMP_CRC32(crc, (char *) record, offsetof(XLogRecord, xl_crc));
+ FIN_CRC32(crc);
+
+ if (!EQ_CRC32(record->xl_crc, crc))
+ {
+ ereport(emode_for_corrupt_record(emode, recptr),
+ (errmsg("incorrect resource manager data checksum in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+
+ return true;
+}
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index b5bfb7b..1ada664 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -229,6 +229,14 @@ extern const RmgrData RmgrTable[];
extern pg_time_t GetLastSegSwitchTime(void);
extern XLogRecPtr RequestXLogSwitch(void);
+
+/*
+ * Exported so that xlogreader.c can call this. TODO: Should be refactored
+ * into a callback, or just have xlogreader return the error string and have
+ * the caller of XLogReadRecord() do the ereport() call.
+ */
+extern int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
+
/*
* These aren't in xlog.h because I'd rather not include fmgr.h there.
*/
Hi Heikki,
On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
On 15.09.2012 03:39, Andres Freund wrote:
Features:
- streaming reading/writing
- filtering
- reassembly of records
Reusing the ReadRecord infrastructure in situations where the code that
wants to do so is not tightly integrated into xlog.c is rather hard and
would require changes to rather integral parts of the recovery code,
which doesn't seem to be a good idea.
My previous objections to this approach still apply. 1. I don't want to
maintain a second copy of the code to read xlog.
maintain a second copy of the code to read xlog.
Yes, I agree. And I am willing to provide an implementation of this, should
my xlogreader variant get a bit more buy-in.
2. We should focus on reading WAL, I don't see the point of mixing WAL
writing into this.
If you write something that filters/analyzes and then forwards WAL, and you want
to do that without a big overhead (i.e. completely reassembling everything and
then disassembling it again for writeout), it's hard to do that without
integrating both sides.
Also, I want to read records incrementally/partially just as the data comes in,
which again is hard to combine with writing the data out again.
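To make that concrete, the rough shape I have in mind looks something like the
following (a sketch with hypothetical names, not the actual interface from the
patchset): raw WAL is pushed into the reader as it arrives, and results are
pushed out through callbacks, so a filter can forward most bytes untouched and
only reassemble the records it actually selected:

/* hypothetical push-style interface, for illustration only */
typedef struct XLogFilterState XLogFilterState;

typedef struct XLogFilterCallbacks
{
	/* decide, from the header alone, whether to reassemble this record */
	bool		(*is_interesting) (XLogFilterState *state, XLogRecord *header);

	/* called once a complete, CRC-checked record has been reassembled */
	void		(*record) (XLogFilterState *state, XLogRecord *record);

	/* forward a chunk of raw WAL, which may start or end mid-record */
	void		(*write_out) (XLogFilterState *state, char *data, Size len);
} XLogFilterCallbacks;

/* feed bytes to the reader as they arrive, e.g. from a libpq connection */
extern void XLogFilterFeed(XLogFilterState *state, char *data, Size len);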
3. I don't like the callback-style API.
I tried to accommodate that by providing:
extern XLogRecordBuffer* XLogReaderReadOne(XLogReaderState* state);
which does exactly that.
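Used in a loop, that gives essentially the pull-style consumption you describe
(a sketch, assuming the XLogRecordBuffer type from my patchset):

XLogRecordBuffer *buf;

while ((buf = XLogReaderReadOne(state)) != NULL)
{
	/* process one fully reassembled record */
}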
I came up with the attached. I moved ReadRecord and some supporting
functions from xlog.c to xlogreader.c, and made it operate on
XLogReaderState instead of global variables. As discussed before,
I didn't like the callback-style API, I think the consumer of the API
should rather just call ReadRecord repeatedly to get each record. So
that's what I did.
The problem with that kind of API is that, at least as far as I can see, it
can never operate on incomplete/partial input. You need to buffer larger
amounts of xlog somewhere, and you need to be aware of record boundaries.
Both are things I dislike in a more generic user than xlog.c.
There is still one callback, XLogPageRead(), to obtain a given page in
WAL. The XLogReader facility is responsible for decoding the WAL into
records, but the user of the facility is responsible for supplying the
physical bytes, via the callback.
Makes sense.
So the usage is like this:
/*
* Callback to read the page starting at 'RecPtr' into *readBuf. It's
* up to you to do this any way you like. Typically you'd read from a
* file. The WAL recovery implementation of this in xlog.c is more
* complicated. It checks the archive, waits for streaming replication
* etc.
*/
static bool
MyXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
bool randAccess, char *readBuf, void *private_data)
{
...
}

state = XLogReaderAllocate(startpoint, &MyXLogPageRead, NULL);
while ((record = XLogReadRecord(state, ...)))
{
/* do something with the record */
}
If you don't want the capability to forward/filter the data and read partial
data without regard for record constraints/buffering, your patch seems to be
quite a good start. It is missing xlogreader.h though...
Do my aims make any sense to you?
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 17.09.2012 11:12, Andres Freund wrote:
On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
On 15.09.2012 03:39, Andres Freund wrote:
2. We should focus on reading WAL; I don't see the point of mixing WAL writing into this.
If you write something that filters/analyzes and then forwards WAL, and you want
to do that without a big overhead (i.e. completely reassembling everything and
then disassembling it again for writeout), it's hard to do that without
integrating both sides.
It seems really complicated to filter/analyze WAL records without
reassembling them, anyway. The user of the facility is in charge of
reading the physical data, so you can still access the raw data, for
forwarding purposes, in addition to the reassembled records.
Or what exactly do you mean by "completely disassembling"? I read that to
mean dealing with page boundaries, i.e. if a record is split across
pages, copy parts into a contiguous temporary buffer.
Also, I want to read records incrementally/partially just as the data comes in,
which again is hard to combine with writing the data out again.
You mean, you want to start reading the first half of a record, before
the 2nd half is available? That seems complicated. I'd suggest keeping
it simple for now, and optimize later if necessary. Note that before you
have the whole WAL record, you cannot CRC check it, so you don't know if
it's in fact a valid WAL record.
I came up with the attached. I moved ReadRecord and some supporting
functions from xlog.c to xlogreader.c, and made it operate on
XLogReaderState instead of global variables. As discussed before,
I didn't like the callback-style API, I think the consumer of the API
should rather just call ReadRecord repeatedly to get each record. So
that's what I did.

The problem with that kind of API is that, at least as far as I can see, it
can never operate on incomplete/partial input. You need to buffer larger
amounts of xlog somewhere, and you need to be aware of record boundaries.
Both are things I dislike in a more generic user than xlog.c.
I don't understand that argument. A typical large WAL record is split
across 1-2 pages, maybe 3-4 at most, for an index page split record.
That doesn't feel like much to me. In extreme cases, a WAL record can be
much larger (e.g. a commit record of a transaction with a huge number of
subtransactions), but that should be rare in practice.
The user of the facility doesn't need to be aware of record boundaries,
that's the responsibility of the facility. I thought that's exactly the
point of generalizing this thing, to make it unnecessary for the code
that uses it to be aware of such things.
If you don't want the capability to forward/filter the data and read partial
data without regard for record constraints/buffering, your patch seems to be
quite a good start. It misses xlogreader.h though...
Ah sorry, patch with xlogreader.h attached.
- Heikki
Attachments:
xlogreader-heikki-2.patch (text/x-diff)
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index f82f10e..660b5fc 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = clog.o transam.o varsup.o xact.o rmgr.o slru.o subtrans.o multixact.o \
- twophase.o twophase_rmgr.o xlog.o xlogfuncs.o xlogutils.o
+ twophase.o twophase_rmgr.o xlog.o xlogfuncs.o xlogreader.o xlogutils.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ff56c26..769ddea 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
#include "access/twophase.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
+#include "access/xlogreader.h"
#include "access/xlogutils.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -541,6 +542,8 @@ static uint32 readOff = 0;
static uint32 readLen = 0;
static int readSource = 0; /* XLOG_FROM_* code */
+static bool fetching_ckpt_global;
+
/*
* Keeps track of which sources we've tried to read the current WAL
* record from and failed.
@@ -556,13 +559,6 @@ static int failedSources = 0; /* OR of XLOG_FROM_* codes */
static TimestampTz XLogReceiptTime = 0;
static int XLogReceiptSource = 0; /* XLOG_FROM_* code */
-/* Buffer for currently read page (XLOG_BLCKSZ bytes) */
-static char *readBuf = NULL;
-
-/* Buffer for current ReadRecord result (expandable) */
-static char *readRecordBuf = NULL;
-static uint32 readRecordBufSize = 0;
-
/* State information for XLOG reading */
static XLogRecPtr ReadRecPtr; /* start of last record read */
static XLogRecPtr EndRecPtr; /* end+1 of last record read */
@@ -632,9 +628,8 @@ static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
int source, bool notexistOk);
static int XLogFileReadAnyTLI(XLogSegNo segno, int emode, int sources);
-static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
- bool randAccess);
-static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
+static bool XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
+ int emode, bool randAccess, char *readBuf, void *private_data);
static void XLogFileClose(void);
static bool RestoreArchivedFile(char *path, const char *xlogfname,
const char *recovername, off_t expectedSize);
@@ -646,12 +641,10 @@ static void UpdateLastRemovedPtr(char *filename);
static void ValidateXLOGDirectoryStructure(void);
static void CleanupBackupHistory(void);
static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
-static XLogRecord *ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt);
-static void CheckRecoveryConsistency(void);
+static XLogRecord *ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode, bool fetching_ckpt);
+static void CheckRecoveryConsistency(XLogRecPtr EndRecPtr);
static bool ValidXLogPageHeader(XLogPageHeader hdr, int emode);
-static bool ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record,
- int emode, bool randAccess);
-static XLogRecord *ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt);
+static XLogRecord *ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int whichChkpt);
static List *readTimeLineHistory(TimeLineID targetTLI);
static bool existsTimeLineHistory(TimeLineID probeTLI);
static bool rescanLatestTimeLine(void);
@@ -3703,102 +3696,6 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
}
/*
- * CRC-check an XLOG record. We do not believe the contents of an XLOG
- * record (other than to the minimal extent of computing the amount of
- * data to read in) until we've checked the CRCs.
- *
- * We assume all of the record (that is, xl_tot_len bytes) has been read
- * into memory at *record. Also, ValidXLogRecordHeader() has accepted the
- * record's header, which means in particular that xl_tot_len is at least
- * SizeOfXlogRecord, so it is safe to fetch xl_len.
- */
-static bool
-RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
-{
- pg_crc32 crc;
- int i;
- uint32 len = record->xl_len;
- BkpBlock bkpb;
- char *blk;
- size_t remaining = record->xl_tot_len;
-
- /* First the rmgr data */
- if (remaining < SizeOfXLogRecord + len)
- {
- /* ValidXLogRecordHeader() should've caught this already... */
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("invalid record length at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- remaining -= SizeOfXLogRecord + len;
- INIT_CRC32(crc);
- COMP_CRC32(crc, XLogRecGetData(record), len);
-
- /* Add in the backup blocks, if any */
- blk = (char *) XLogRecGetData(record) + len;
- for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
- {
- uint32 blen;
-
- if (!(record->xl_info & XLR_SET_BKP_BLOCK(i)))
- continue;
-
- if (remaining < sizeof(BkpBlock))
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("invalid backup block size in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- memcpy(&bkpb, blk, sizeof(BkpBlock));
-
- if (bkpb.hole_offset + bkpb.hole_length > BLCKSZ)
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("incorrect hole size in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- blen = sizeof(BkpBlock) + BLCKSZ - bkpb.hole_length;
-
- if (remaining < blen)
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("invalid backup block size in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
- remaining -= blen;
- COMP_CRC32(crc, blk, blen);
- blk += blen;
- }
-
- /* Check that xl_tot_len agrees with our calculation */
- if (remaining != 0)
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("incorrect total length in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
-
- /* Finally include the record header */
- COMP_CRC32(crc, (char *) record, offsetof(XLogRecord, xl_crc));
- FIN_CRC32(crc);
-
- if (!EQ_CRC32(record->xl_crc, crc))
- {
- ereport(emode_for_corrupt_record(emode, recptr),
- (errmsg("incorrect resource manager data checksum in record at %X/%X",
- (uint32) (recptr >> 32), (uint32) recptr)));
- return false;
- }
-
- return true;
-}
-
-/*
* Attempt to read an XLOG record.
*
* If RecPtr is not NULL, try to read a record at that position. Otherwise
@@ -3811,290 +3708,35 @@ RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
* the returned record pointer always points there.
*/
static XLogRecord *
-ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
+ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode, bool fetching_ckpt)
{
XLogRecord *record;
- XLogRecPtr tmpRecPtr = EndRecPtr;
- bool randAccess = false;
- uint32 len,
- total_len;
- uint32 targetRecOff;
- uint32 pageHeaderSize;
- bool gotheader;
-
- if (readBuf == NULL)
- {
- /*
- * First time through, permanently allocate readBuf. We do it this
- * way, rather than just making a static array, for two reasons: (1)
- * no need to waste the storage in most instantiations of the backend;
- * (2) a static char array isn't guaranteed to have any particular
- * alignment, whereas malloc() will provide MAXALIGN'd storage.
- */
- readBuf = (char *) malloc(XLOG_BLCKSZ);
- Assert(readBuf != NULL);
- }
-
- if (RecPtr == NULL)
- {
- RecPtr = &tmpRecPtr;
- /*
- * RecPtr is pointing to end+1 of the previous WAL record. If
- * we're at a page boundary, no more records can fit on the current
- * page. We must skip over the page header, but we can't do that
- * until we've read in the page, since the header size is variable.
- */
- }
- else
- {
- /*
- * In this case, the passed-in record pointer should already be
- * pointing to a valid record starting position.
- */
- if (!XRecOffIsValid(*RecPtr))
- ereport(PANIC,
- (errmsg("invalid record offset at %X/%X",
- (uint32) (*RecPtr >> 32), (uint32) *RecPtr)));
-
- /*
- * Since we are going to a random position in WAL, forget any prior
- * state about what timeline we were in, and allow it to be any
- * timeline in expectedTLIs. We also set a flag to allow curFileTLI
- * to go backwards (but we can't reset that variable right here, since
- * we might not change files at all).
- */
+ if (!XLogRecPtrIsInvalid(RecPtr))
lastPageTLI = 0; /* see comment in ValidXLogPageHeader */
- randAccess = true; /* allow curFileTLI to go backwards too */
- }
+
+ fetching_ckpt_global = fetching_ckpt;
/* This is the first try to read this page. */
failedSources = 0;
-retry:
- /* Read the page containing the record */
- if (!XLogPageRead(RecPtr, emode, fetching_ckpt, randAccess))
- return NULL;
-
- pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
- targetRecOff = (*RecPtr) % XLOG_BLCKSZ;
- if (targetRecOff == 0)
+ do
{
- /*
- * At page start, so skip over page header. The Assert checks that
- * we're not scribbling on caller's record pointer; it's OK because we
- * can only get here in the continuing-from-prev-record case, since
- * XRecOffIsValid rejected the zero-page-offset case otherwise.
- */
- Assert(RecPtr == &tmpRecPtr);
- (*RecPtr) += pageHeaderSize;
- targetRecOff = pageHeaderSize;
- }
- else if (targetRecOff < pageHeaderSize)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid record offset at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- goto next_record_is_invalid;
- }
- if ((((XLogPageHeader) readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
- targetRecOff == pageHeaderSize)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("contrecord is requested by %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- goto next_record_is_invalid;
- }
-
- /*
- * Read the record length.
- *
- * NB: Even though we use an XLogRecord pointer here, the whole record
- * header might not fit on this page. xl_tot_len is the first field of
- * the struct, so it must be on this page (the records are MAXALIGNed),
- * but we cannot access any other fields until we've verified that we
- * got the whole header.
- */
- record = (XLogRecord *) (readBuf + (*RecPtr) % XLOG_BLCKSZ);
- total_len = record->xl_tot_len;
-
- /*
- * If the whole record header is on this page, validate it immediately.
- * Otherwise do just a basic sanity check on xl_tot_len, and validate the
- * rest of the header after reading it from the next page. The xl_tot_len
- * check is necessary here to ensure that we enter the "Need to reassemble
- * record" code path below; otherwise we might fail to apply
- * ValidXLogRecordHeader at all.
- */
- if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
- {
- if (!ValidXLogRecordHeader(RecPtr, record, emode, randAccess))
- goto next_record_is_invalid;
- gotheader = true;
- }
- else
- {
- if (total_len < SizeOfXLogRecord)
+ record = XLogReadRecord(xlogreader, RecPtr, emode);
+ ReadRecPtr = xlogreader->ReadRecPtr;
+ EndRecPtr = xlogreader->EndRecPtr;
+ if (record == NULL)
{
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid record length at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- goto next_record_is_invalid;
- }
- gotheader = false;
- }
-
- /*
- * Allocate or enlarge readRecordBuf as needed. To avoid useless small
- * increases, round its size to a multiple of XLOG_BLCKSZ, and make sure
- * it's at least 4*Max(BLCKSZ, XLOG_BLCKSZ) to start with. (That is
- * enough for all "normal" records, but very large commit or abort records
- * might need more space.)
- */
- if (total_len > readRecordBufSize)
- {
- uint32 newSize = total_len;
+ failedSources |= readSource;
- newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
- newSize = Max(newSize, 4 * Max(BLCKSZ, XLOG_BLCKSZ));
- if (readRecordBuf)
- free(readRecordBuf);
- readRecordBuf = (char *) malloc(newSize);
- if (!readRecordBuf)
- {
- readRecordBufSize = 0;
- /* We treat this as a "bogus data" condition */
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("record length %u at %X/%X too long",
- total_len, (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- goto next_record_is_invalid;
- }
- readRecordBufSize = newSize;
- }
-
- len = XLOG_BLCKSZ - (*RecPtr) % XLOG_BLCKSZ;
- if (total_len > len)
- {
- /* Need to reassemble record */
- char *contrecord;
- XLogPageHeader pageHeader;
- XLogRecPtr pagelsn;
- char *buffer;
- uint32 gotlen;
-
- /* Initialize pagelsn to the beginning of the page this record is on */
- pagelsn = ((*RecPtr) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
-
- /* Copy the first fragment of the record from the first page. */
- memcpy(readRecordBuf, readBuf + (*RecPtr) % XLOG_BLCKSZ, len);
- buffer = readRecordBuf + len;
- gotlen = len;
-
- do
- {
- /* Calculate pointer to beginning of next page */
- XLByteAdvance(pagelsn, XLOG_BLCKSZ);
- /* Wait for the next page to become available */
- if (!XLogPageRead(&pagelsn, emode, false, false))
- return NULL;
-
- /* Check that the continuation on next page looks valid */
- pageHeader = (XLogPageHeader) readBuf;
- if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("there is no contrecord flag in log segment %s, offset %u",
- XLogFileNameP(curFileTLI, readSegNo),
- readOff)));
- goto next_record_is_invalid;
- }
- /*
- * Cross-check that xlp_rem_len agrees with how much of the record
- * we expect there to be left.
- */
- if (pageHeader->xlp_rem_len == 0 ||
- total_len != (pageHeader->xlp_rem_len + gotlen))
+ if (readFile >= 0)
{
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid contrecord length %u in log segment %s, offset %u",
- pageHeader->xlp_rem_len,
- XLogFileNameP(curFileTLI, readSegNo),
- readOff)));
- goto next_record_is_invalid;
+ close(readFile);
+ readFile = -1;
}
+ }
+ } while(StandbyMode && record == NULL);
- /* Append the continuation from this page to the buffer */
- pageHeaderSize = XLogPageHeaderSize(pageHeader);
- contrecord = (char *) readBuf + pageHeaderSize;
- len = XLOG_BLCKSZ - pageHeaderSize;
- if (pageHeader->xlp_rem_len < len)
- len = pageHeader->xlp_rem_len;
- memcpy(buffer, (char *) contrecord, len);
- buffer += len;
- gotlen += len;
-
- /* If we just reassembled the record header, validate it. */
- if (!gotheader)
- {
- record = (XLogRecord *) readRecordBuf;
- if (!ValidXLogRecordHeader(RecPtr, record, emode, randAccess))
- goto next_record_is_invalid;
- gotheader = true;
- }
- } while (pageHeader->xlp_rem_len > len);
-
- record = (XLogRecord *) readRecordBuf;
- if (!RecordIsValid(record, *RecPtr, emode))
- goto next_record_is_invalid;
- pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
- XLogSegNoOffsetToRecPtr(
- readSegNo,
- readOff + pageHeaderSize + MAXALIGN(pageHeader->xlp_rem_len),
- EndRecPtr);
- ReadRecPtr = *RecPtr;
- }
- else
- {
- /* Record does not cross a page boundary */
- if (!RecordIsValid(record, *RecPtr, emode))
- goto next_record_is_invalid;
- EndRecPtr = *RecPtr + MAXALIGN(total_len);
-
- ReadRecPtr = *RecPtr;
- memcpy(readRecordBuf, record, total_len);
- }
-
- /*
- * Special processing if it's an XLOG SWITCH record
- */
- if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
- {
- /* Pretend it extends to end of segment */
- EndRecPtr += XLogSegSize - 1;
- EndRecPtr -= EndRecPtr % XLogSegSize;
-
- /*
- * Pretend that readBuf contains the last page of the segment. This is
- * just to avoid Assert failure in StartupXLOG if XLOG ends with this
- * segment.
- */
- readOff = XLogSegSize - XLOG_BLCKSZ;
- }
return record;
-
-next_record_is_invalid:
- failedSources |= readSource;
-
- if (readFile >= 0)
- {
- close(readFile);
- readFile = -1;
- }
-
- /* In standby-mode, keep trying */
- if (StandbyMode)
- goto retry;
- else
- return NULL;
}
/*
@@ -4223,88 +3865,6 @@ ValidXLogPageHeader(XLogPageHeader hdr, int emode)
}
/*
- * Validate an XLOG record header.
- *
- * This is just a convenience subroutine to avoid duplicated code in
- * ReadRecord. It's not intended for use from anywhere else.
- */
-static bool
-ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record, int emode,
- bool randAccess)
-{
- /*
- * xl_len == 0 is bad data for everything except XLOG SWITCH, where it is
- * required.
- */
- if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
- {
- if (record->xl_len != 0)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid xlog switch record at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- }
- else if (record->xl_len == 0)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("record with zero length at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
- record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
- XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid record length at %X/%X",
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- if (record->xl_rmid > RM_MAX_ID)
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("invalid resource manager ID %u at %X/%X",
- record->xl_rmid, (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- if (randAccess)
- {
- /*
- * We can't exactly verify the prev-link, but surely it should be less
- * than the record's own address.
- */
- if (!XLByteLT(record->xl_prev, *RecPtr))
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("record with incorrect prev-link %X/%X at %X/%X",
- (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- }
- else
- {
- /*
- * Record's prev-link should exactly match our previous location. This
- * check guards against torn WAL pages where a stale but valid-looking
- * WAL record starts on a sector boundary.
- */
- if (!XLByteEQ(record->xl_prev, ReadRecPtr))
- {
- ereport(emode_for_corrupt_record(emode, *RecPtr),
- (errmsg("record with incorrect prev-link %X/%X at %X/%X",
- (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
- (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
- return false;
- }
- }
-
- return true;
-}
-
-/*
* Try to read a timeline's history file.
*
* If successful, return the list of component TLIs (the given TLI followed by
@@ -6089,6 +5649,7 @@ StartupXLOG(void)
bool backupEndRequired = false;
bool backupFromStandby = false;
DBState dbstate_at_startup;
+ XLogReaderState *xlogreader;
/*
* Read control file and check XLOG status looks valid.
@@ -6222,6 +5783,8 @@ StartupXLOG(void)
if (StandbyMode)
OwnLatch(&XLogCtl->recoveryWakeupLatch);
+ xlogreader = XLogReaderAllocate(InvalidXLogRecPtr, &XLogPageRead, NULL);
+
if (read_backup_label(&checkPointLoc, &backupEndRequired,
&backupFromStandby))
{
@@ -6229,7 +5792,7 @@ StartupXLOG(void)
* When a backup_label file is present, we want to roll forward from
* the checkpoint it identifies, rather than using pg_control.
*/
- record = ReadCheckpointRecord(checkPointLoc, 0);
+ record = ReadCheckpointRecord(xlogreader, checkPointLoc, 0);
if (record != NULL)
{
memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
@@ -6247,7 +5810,7 @@ StartupXLOG(void)
*/
if (XLByteLT(checkPoint.redo, checkPointLoc))
{
- if (!ReadRecord(&(checkPoint.redo), LOG, false))
+ if (!ReadRecord(xlogreader, checkPoint.redo, LOG, false))
ereport(FATAL,
(errmsg("could not find redo location referenced by checkpoint record"),
errhint("If you are not restoring from a backup, try removing the file \"%s/backup_label\".", DataDir)));
@@ -6271,7 +5834,7 @@ StartupXLOG(void)
*/
checkPointLoc = ControlFile->checkPoint;
RedoStartLSN = ControlFile->checkPointCopy.redo;
- record = ReadCheckpointRecord(checkPointLoc, 1);
+ record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1);
if (record != NULL)
{
ereport(DEBUG1,
@@ -6290,7 +5853,7 @@ StartupXLOG(void)
else
{
checkPointLoc = ControlFile->prevCheckPoint;
- record = ReadCheckpointRecord(checkPointLoc, 2);
+ record = ReadCheckpointRecord(xlogreader, checkPointLoc, 2);
if (record != NULL)
{
ereport(LOG,
@@ -6591,7 +6154,7 @@ StartupXLOG(void)
* Allow read-only connections immediately if we're consistent
* already.
*/
- CheckRecoveryConsistency();
+ CheckRecoveryConsistency(EndRecPtr);
/*
* Find the first record that logically follows the checkpoint --- it
@@ -6600,12 +6163,12 @@ StartupXLOG(void)
if (XLByteLT(checkPoint.redo, RecPtr))
{
/* back up to find the record */
- record = ReadRecord(&(checkPoint.redo), PANIC, false);
+ record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);
}
else
{
/* just have to read next record after CheckPoint */
- record = ReadRecord(NULL, LOG, false);
+ record = ReadRecord(xlogreader, InvalidXLogRecPtr, LOG, false);
}
if (record != NULL)
@@ -6652,7 +6215,7 @@ StartupXLOG(void)
HandleStartupProcInterrupts();
/* Allow read-only connections if we're consistent now */
- CheckRecoveryConsistency();
+ CheckRecoveryConsistency(EndRecPtr);
/*
* Have we reached our recovery target?
@@ -6756,7 +6319,7 @@ StartupXLOG(void)
LastRec = ReadRecPtr;
- record = ReadRecord(NULL, LOG, false);
+ record = ReadRecord(xlogreader, InvalidXLogRecPtr, LOG, false);
} while (record != NULL && recoveryContinue);
/*
@@ -6806,7 +6369,7 @@ StartupXLOG(void)
* Re-fetch the last valid or last applied record, so we can identify the
* exact endpoint of what we consider the valid portion of WAL.
*/
- record = ReadRecord(&LastRec, PANIC, false);
+ record = ReadRecord(xlogreader, LastRec, PANIC, false);
EndOfLog = EndRecPtr;
XLByteToPrevSeg(EndOfLog, endLogSegNo);
@@ -6905,8 +6468,15 @@ StartupXLOG(void)
* record spans, not the one it starts in. The last block is indeed the
* one we want to use.
*/
- Assert(readOff == (XLogCtl->xlblocks[0] - XLOG_BLCKSZ) % XLogSegSize);
- memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
+ if (EndOfLog % XLOG_BLCKSZ == 0)
+ {
+ memset(Insert->currpage, 0, XLOG_BLCKSZ);
+ }
+ else
+ {
+ Assert(readOff == (XLogCtl->xlblocks[0] - XLOG_BLCKSZ) % XLogSegSize);
+ memcpy((char *) Insert->currpage, xlogreader->readBuf, XLOG_BLCKSZ);
+ }
Insert->currpos = (char *) Insert->currpage +
(EndOfLog + XLOG_BLCKSZ - XLogCtl->xlblocks[0]);
@@ -7063,17 +6633,7 @@ StartupXLOG(void)
close(readFile);
readFile = -1;
}
- if (readBuf)
- {
- free(readBuf);
- readBuf = NULL;
- }
- if (readRecordBuf)
- {
- free(readRecordBuf);
- readRecordBuf = NULL;
- readRecordBufSize = 0;
- }
+ XLogReaderFree(xlogreader);
/*
* If any of the critical GUCs have changed, log them before we allow
@@ -7104,7 +6664,7 @@ StartupXLOG(void)
* that it can start accepting read-only connections.
*/
static void
-CheckRecoveryConsistency(void)
+CheckRecoveryConsistency(XLogRecPtr EndRecPtr)
{
/*
* During crash recovery, we don't reach a consistent state until we've
@@ -7284,7 +6844,7 @@ LocalSetXLogInsertAllowed(void)
* 1 for "primary", 2 for "secondary", 0 for "other" (backup_label)
*/
static XLogRecord *
-ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt)
+ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int whichChkpt)
{
XLogRecord *record;
@@ -7308,7 +6868,7 @@ ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt)
return NULL;
}
- record = ReadRecord(&RecPtr, LOG, true);
+ record = ReadRecord(xlogreader, RecPtr, LOG, true);
if (record == NULL)
{
@@ -10100,19 +9660,21 @@ CancelBackup(void)
* sleep and retry.
*/
static bool
-XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
- bool randAccess)
+XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
+ bool randAccess, char *readBuf, void *private_data)
{
+ /* TODO: these, and fetching_ckpt, would be better in private_data */
static XLogRecPtr receivedUpto = 0;
+ static pg_time_t last_fail_time = 0;
+ bool fetching_ckpt = fetching_ckpt_global;
bool switched_segment = false;
uint32 targetPageOff;
uint32 targetRecOff;
XLogSegNo targetSegNo;
- static pg_time_t last_fail_time = 0;
- XLByteToSeg(*RecPtr, targetSegNo);
- targetPageOff = (((*RecPtr) % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
- targetRecOff = (*RecPtr) % XLOG_BLCKSZ;
+ XLByteToSeg(RecPtr, targetSegNo);
+ targetPageOff = ((RecPtr % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
+ targetRecOff = RecPtr % XLOG_BLCKSZ;
/* Fast exit if we have read the record in the current buffer already */
if (failedSources == 0 && targetSegNo == readSegNo &&
@@ -10123,7 +9685,7 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
* See if we need to switch to a new segment because the requested record
* is not in the currently open one.
*/
- if (readFile >= 0 && !XLByteInSeg(*RecPtr, readSegNo))
+ if (readFile >= 0 && !XLByteInSeg(RecPtr, readSegNo))
{
/*
* Request a restartpoint if we've replayed too much xlog since the
@@ -10144,12 +9706,12 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
readSource = 0;
}
- XLByteToSeg(*RecPtr, readSegNo);
+ XLByteToSeg(RecPtr, readSegNo);
retry:
/* See if we need to retrieve more data */
if (readFile < 0 ||
- (readSource == XLOG_FROM_STREAM && !XLByteLT(*RecPtr, receivedUpto)))
+ (readSource == XLOG_FROM_STREAM && !XLByteLT(RecPtr, receivedUpto)))
{
if (StandbyMode)
{
@@ -10192,17 +9754,17 @@ retry:
* XLogReceiptTime will not advance, so the grace time
* alloted to conflicting queries will decrease.
*/
- if (XLByteLT(*RecPtr, receivedUpto))
+ if (XLByteLT(RecPtr, receivedUpto))
havedata = true;
else
{
XLogRecPtr latestChunkStart;
receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart);
- if (XLByteLT(*RecPtr, receivedUpto))
+ if (XLByteLT(RecPtr, receivedUpto))
{
havedata = true;
- if (!XLByteLT(*RecPtr, latestChunkStart))
+ if (!XLByteLT(RecPtr, latestChunkStart))
{
XLogReceiptTime = GetCurrentTimestamp();
SetCurrentChunkStartTime(XLogReceiptTime);
@@ -10321,7 +9883,7 @@ retry:
if (PrimaryConnInfo)
{
RequestXLogStreaming(
- fetching_ckpt ? RedoStartLSN : *RecPtr,
+ fetching_ckpt ? RedoStartLSN : RecPtr,
PrimaryConnInfo);
continue;
}
@@ -10393,7 +9955,7 @@ retry:
*/
if (readSource == XLOG_FROM_STREAM)
{
- if (((*RecPtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
+ if (((RecPtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
{
readLen = XLOG_BLCKSZ;
}
@@ -10417,7 +9979,7 @@ retry:
{
char fname[MAXFNAMELEN];
XLogFileName(fname, curFileTLI, readSegNo);
- ereport(emode_for_corrupt_record(emode, *RecPtr),
+ ereport(emode_for_corrupt_record(emode, RecPtr),
(errcode_for_file_access(),
errmsg("could not read from log segment %s, offset %u: %m",
fname, readOff)));
@@ -10433,7 +9995,7 @@ retry:
{
char fname[MAXFNAMELEN];
XLogFileName(fname, curFileTLI, readSegNo);
- ereport(emode_for_corrupt_record(emode, *RecPtr),
+ ereport(emode_for_corrupt_record(emode, RecPtr),
(errcode_for_file_access(),
errmsg("could not seek in log segment %s to offset %u: %m",
fname, readOff)));
@@ -10443,7 +10005,7 @@ retry:
{
char fname[MAXFNAMELEN];
XLogFileName(fname, curFileTLI, readSegNo);
- ereport(emode_for_corrupt_record(emode, *RecPtr),
+ ereport(emode_for_corrupt_record(emode, RecPtr),
(errcode_for_file_access(),
errmsg("could not read from log segment %s, offset %u: %m",
fname, readOff)));
@@ -10501,7 +10063,7 @@ triggered:
* you are about to ereport(), or you might cause a later message to be
* erroneously suppressed.
*/
-static int
+int
emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
{
static XLogRecPtr lastComplaint = 0;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
new file mode 100644
index 0000000..8ba05b1
--- /dev/null
+++ b/src/backend/access/transam/xlogreader.c
@@ -0,0 +1,496 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogreader.c
+ * Generic xlog reading facility
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogreader.c
+ *
+ * NOTES
+ * Documentation about how to use this interface can be found in
+ * xlogreader.h, more specifically in the definition of the
+ * XLogReaderState struct where all parameters are documented.
+ *
+ * TODO:
+ * * usable without backend code around
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/transam.h"
+#include "access/xlog_internal.h"
+#include "access/xlogreader.h"
+#include "catalog/pg_control.h"
+
+static bool ValidXLogRecordHeader(XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr,
+ XLogRecord *record, int emode, bool randAccess);
+static bool RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode);
+
+/*
+ * Initialize a new xlog reader
+ */
+XLogReaderState *
+XLogReaderAllocate(XLogRecPtr startpoint,
+ XLogPageReadCB pagereadfunc, void *private_data)
+{
+ XLogReaderState *state;
+
+ state = (XLogReaderState *) palloc0(sizeof(XLogReaderState));
+
+ /*
+ * First time through, permanently allocate readBuf. We do it this
+ * way, rather than just making a static array, for two reasons: (1)
+ * no need to waste the storage in most instantiations of the backend;
+ * (2) a static char array isn't guaranteed to have any particular
+ * alignment, whereas malloc() will provide MAXALIGN'd storage.
+ */
+ state->readBuf = (char *) palloc(XLOG_BLCKSZ);
+
+ state->read_page = pagereadfunc;
+ state->private_data = private_data;
+ state->EndRecPtr = startpoint;
+
+ return state;
+}
+
+void
+XLogReaderFree(XLogReaderState *state)
+{
+ if (state->readRecordBuf)
+ pfree(state->readRecordBuf);
+ pfree(state->readBuf);
+ pfree(state);
+}
+
+/*
+ * Attempt to read an XLOG record.
+ *
+ * If RecPtr is not InvalidXLogRecPtr, try to read a record at that position. Otherwise
+ * try to read a record just after the last one previously read.
+ *
+ * If no valid record is available, returns NULL, or fails if emode is PANIC.
+ * (emode must be either PANIC or LOG)
+ *
+ * The record is copied into readRecordBuf, so that on successful return,
+ * the returned record pointer always points there.
+ */
+XLogRecord *
+XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, int emode)
+{
+ XLogRecord *record;
+ XLogRecPtr tmpRecPtr = state->EndRecPtr;
+ bool randAccess = false;
+ uint32 len,
+ total_len;
+ uint32 targetRecOff;
+ uint32 pageHeaderSize;
+ bool gotheader;
+
+ if (RecPtr == InvalidXLogRecPtr)
+ {
+ RecPtr = tmpRecPtr;
+
+ /*
+ * RecPtr is pointing to end+1 of the previous WAL record. If
+ * we're at a page boundary, no more records can fit on the current
+ * page. We must skip over the page header, but we can't do that
+ * until we've read in the page, since the header size is variable.
+ */
+ }
+ else
+ {
+ /*
+ * In this case, the passed-in record pointer should already be
+ * pointing to a valid record starting position.
+ */
+ if (!XRecOffIsValid(RecPtr))
+ ereport(PANIC,
+ (errmsg("invalid record offset at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ randAccess = true; /* allow curFileTLI to go backwards too */
+ }
+
+ /* Read the page containing the record */
+ if (!state->read_page(state, RecPtr, emode, randAccess, state->readBuf, state->private_data))
+ return NULL;
+
+ pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
+ targetRecOff = RecPtr % XLOG_BLCKSZ;
+ if (targetRecOff == 0)
+ {
+ /*
+ * At page start, so skip over page header. The Assert checks that
+ * we're not scribbling on caller's record pointer; it's OK because we
+ * can only get here in the continuing-from-prev-record case, since
+ * XRecOffIsValid rejected the zero-page-offset case otherwise.
+ * XXX: does this assert make sense now that RecPtr is not a pointer?
+ */
+ Assert(RecPtr == tmpRecPtr);
+ RecPtr += pageHeaderSize;
+ targetRecOff = pageHeaderSize;
+ }
+ else if (targetRecOff < pageHeaderSize)
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("invalid record offset at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ goto next_record_is_invalid;
+ }
+ if ((((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
+ targetRecOff == pageHeaderSize)
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("contrecord is requested by %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ goto next_record_is_invalid;
+ }
+
+ /*
+ * Read the record length.
+ *
+ * NB: Even though we use an XLogRecord pointer here, the whole record
+ * header might not fit on this page. xl_tot_len is the first field of
+ * the struct, so it must be on this page (the records are MAXALIGNed),
+ * but we cannot access any other fields until we've verified that we
+ * got the whole header.
+ */
+ record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+ total_len = record->xl_tot_len;
+
+ /*
+ * If the whole record header is on this page, validate it immediately.
+ * Otherwise do just a basic sanity check on xl_tot_len, and validate the
+ * rest of the header after reading it from the next page. The xl_tot_len
+ * check is necessary here to ensure that we enter the "Need to reassemble
+ * record" code path below; otherwise we might fail to apply
+ * ValidXLogRecordHeader at all.
+ */
+ if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
+ {
+ if (!ValidXLogRecordHeader(RecPtr, state->ReadRecPtr, record, emode, randAccess))
+ goto next_record_is_invalid;
+ gotheader = true;
+ }
+ else
+ {
+ if (total_len < SizeOfXLogRecord)
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("invalid record length at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ goto next_record_is_invalid;
+ }
+ gotheader = false;
+ }
+
+ /*
+ * Allocate or enlarge readRecordBuf as needed. To avoid useless small
+ * increases, round its size to a multiple of XLOG_BLCKSZ, and make sure
+ * it's at least 4*Max(BLCKSZ, XLOG_BLCKSZ) to start with. (That is
+ * enough for all "normal" records, but very large commit or abort records
+ * might need more space.)
+ */
+ if (total_len > state->readRecordBufSize)
+ {
+ uint32 newSize = total_len;
+
+ newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
+ newSize = Max(newSize, 4 * Max(BLCKSZ, XLOG_BLCKSZ));
+ if (state->readRecordBuf)
+ pfree(state->readRecordBuf);
+ state->readRecordBuf = (char *) palloc(newSize);
+ if (!state->readRecordBuf)
+ {
+ state->readRecordBufSize = 0;
+ /* We treat this as a "bogus data" condition */
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("record length %u at %X/%X too long",
+ total_len, (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ goto next_record_is_invalid;
+ }
+ state->readRecordBufSize = newSize;
+ }
+
+ len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
+ if (total_len > len)
+ {
+ /* Need to reassemble record */
+ char *contrecord;
+ XLogPageHeader pageHeader;
+ XLogRecPtr pagelsn;
+ char *buffer;
+ uint32 gotlen;
+
+ /* Initialize pagelsn to the beginning of the page this record is on */
+ pagelsn = (RecPtr / XLOG_BLCKSZ) * XLOG_BLCKSZ;
+
+ /* Copy the first fragment of the record from the first page. */
+ memcpy(state->readRecordBuf, state->readBuf + RecPtr % XLOG_BLCKSZ, len);
+ buffer = state->readRecordBuf + len;
+ gotlen = len;
+
+ do
+ {
+ /* Calculate pointer to beginning of next page */
+ XLByteAdvance(pagelsn, XLOG_BLCKSZ);
+ /* Wait for the next page to become available */
+ if (!state->read_page(state, pagelsn, emode, false, state->readBuf, NULL))
+ return NULL;
+
+ /* Check that the continuation on next page looks valid */
+ pageHeader = (XLogPageHeader) state->readBuf;
+ if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("there is no contrecord flag at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ goto next_record_is_invalid;
+ }
+ /*
+ * Cross-check that xlp_rem_len agrees with how much of the record
+ * we expect there to be left.
+ */
+ if (pageHeader->xlp_rem_len == 0 ||
+ total_len != (pageHeader->xlp_rem_len + gotlen))
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("invalid contrecord length %u at %X/%X",
+ pageHeader->xlp_rem_len,
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ goto next_record_is_invalid;
+ }
+
+ /* Append the continuation from this page to the buffer */
+ pageHeaderSize = XLogPageHeaderSize(pageHeader);
+ contrecord = (char *) state->readBuf + pageHeaderSize;
+ len = XLOG_BLCKSZ - pageHeaderSize;
+ if (pageHeader->xlp_rem_len < len)
+ len = pageHeader->xlp_rem_len;
+ memcpy(buffer, (char *) contrecord, len);
+ buffer += len;
+ gotlen += len;
+
+ /* If we just reassembled the record header, validate it. */
+ if (!gotheader)
+ {
+ record = (XLogRecord *) state->readRecordBuf;
+ if (!ValidXLogRecordHeader(RecPtr, state->ReadRecPtr, record, emode, randAccess))
+ goto next_record_is_invalid;
+ gotheader = true;
+ }
+ } while (pageHeader->xlp_rem_len > len);
+
+ record = (XLogRecord *) state->readRecordBuf;
+ if (!RecordIsValid(record, RecPtr, emode))
+ goto next_record_is_invalid;
+ pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) state->readBuf);
+ state->ReadRecPtr = RecPtr;
+ state->EndRecPtr = pagelsn + pageHeaderSize + MAXALIGN(pageHeader->xlp_rem_len);
+ }
+ else
+ {
+ /* Record does not cross a page boundary */
+ if (!RecordIsValid(record, RecPtr, emode))
+ goto next_record_is_invalid;
+ state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+
+ state->ReadRecPtr = RecPtr;
+ memcpy(state->readRecordBuf, record, total_len);
+ }
+
+ /*
+ * Special processing if it's an XLOG SWITCH record
+ */
+ if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
+ {
+ /* Pretend it extends to end of segment */
+ state->EndRecPtr += XLogSegSize - 1;
+ state->EndRecPtr -= state->EndRecPtr % XLogSegSize;
+ }
+ return record;
+
+next_record_is_invalid:
+ return NULL;
+}
+
+/*
+ * Validate an XLOG record header.
+ *
+ * This is just a convenience subroutine to avoid duplicated code in
+ * ReadRecord. It's not intended for use from anywhere else.
+ */
+static bool
+ValidXLogRecordHeader(XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, int emode,
+ bool randAccess)
+{
+ /*
+ * xl_len == 0 is bad data for everything except XLOG SWITCH, where it is
+ * required.
+ */
+ if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
+ {
+ if (record->xl_len != 0)
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("invalid xlog switch record at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ }
+ else if (record->xl_len == 0)
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("record with zero length at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
+ record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
+ XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("invalid record length at %X/%X",
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ if (record->xl_rmid > RM_MAX_ID)
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("invalid resource manager ID %u at %X/%X",
+ record->xl_rmid, (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ if (randAccess)
+ {
+ /*
+ * We can't exactly verify the prev-link, but surely it should be less
+ * than the record's own address.
+ */
+ if (!XLByteLT(record->xl_prev, RecPtr))
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("record with incorrect prev-link %X/%X at %X/%X",
+ (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ }
+ else
+ {
+ /*
+ * Record's prev-link should exactly match our previous location. This
+ * check guards against torn WAL pages where a stale but valid-looking
+ * WAL record starts on a sector boundary.
+ */
+ if (!XLByteEQ(record->xl_prev, PrevRecPtr))
+ {
+ ereport(emode_for_corrupt_record(emode, RecPtr),
+ (errmsg("record with incorrect prev-link %X/%X at %X/%X",
+ (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
+ (uint32) (RecPtr >> 32), (uint32) RecPtr)));
+ return false;
+ }
+ }
+
+ return true;
+}
+
+
+/*
+ * CRC-check an XLOG record. We do not believe the contents of an XLOG
+ * record (other than to the minimal extent of computing the amount of
+ * data to read in) until we've checked the CRCs.
+ *
+ * We assume all of the record (that is, xl_tot_len bytes) has been read
+ * into memory at *record. Also, ValidXLogRecordHeader() has accepted the
+ * record's header, which means in particular that xl_tot_len is at least
+ * SizeOfXlogRecord, so it is safe to fetch xl_len.
+ */
+static bool
+RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
+{
+ pg_crc32 crc;
+ int i;
+ uint32 len = record->xl_len;
+ BkpBlock bkpb;
+ char *blk;
+ size_t remaining = record->xl_tot_len;
+
+ /* First the rmgr data */
+ if (remaining < SizeOfXLogRecord + len)
+ {
+ /* ValidXLogRecordHeader() should've caught this already... */
+ ereport(emode_for_corrupt_record(emode, recptr),
+ (errmsg("invalid record length at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+ remaining -= SizeOfXLogRecord + len;
+ INIT_CRC32(crc);
+ COMP_CRC32(crc, XLogRecGetData(record), len);
+
+ /* Add in the backup blocks, if any */
+ blk = (char *) XLogRecGetData(record) + len;
+ for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
+ {
+ uint32 blen;
+
+ if (!(record->xl_info & XLR_SET_BKP_BLOCK(i)))
+ continue;
+
+ if (remaining < sizeof(BkpBlock))
+ {
+ ereport(emode_for_corrupt_record(emode, recptr),
+ (errmsg("invalid backup block size in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+ memcpy(&bkpb, blk, sizeof(BkpBlock));
+
+ if (bkpb.hole_offset + bkpb.hole_length > BLCKSZ)
+ {
+ ereport(emode_for_corrupt_record(emode, recptr),
+ (errmsg("incorrect hole size in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+ blen = sizeof(BkpBlock) + BLCKSZ - bkpb.hole_length;
+
+ if (remaining < blen)
+ {
+ ereport(emode_for_corrupt_record(emode, recptr),
+ (errmsg("invalid backup block size in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+ remaining -= blen;
+ COMP_CRC32(crc, blk, blen);
+ blk += blen;
+ }
+
+ /* Check that xl_tot_len agrees with our calculation */
+ if (remaining != 0)
+ {
+ ereport(emode_for_corrupt_record(emode, recptr),
+ (errmsg("incorrect total length in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+
+ /* Finally include the record header */
+ COMP_CRC32(crc, (char *) record, offsetof(XLogRecord, xl_crc));
+ FIN_CRC32(crc);
+
+ if (!EQ_CRC32(record->xl_crc, crc))
+ {
+ ereport(emode_for_corrupt_record(emode, recptr),
+ (errmsg("incorrect resource manager data checksum in record at %X/%X",
+ (uint32) (recptr >> 32), (uint32) recptr)));
+ return false;
+ }
+
+ return true;
+}
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index b5bfb7b..1ada664 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -229,6 +229,14 @@ extern const RmgrData RmgrTable[];
extern pg_time_t GetLastSegSwitchTime(void);
extern XLogRecPtr RequestXLogSwitch(void);
+
+/*
+ * Exported so that xlogreader.c can call this. TODO: Should be refactored
+ * into a callback, or just have xlogreader return the error string and have
+ * the caller of XLogReadRecord() do the ereport() call.
+ */
+extern int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
+
/*
* These aren't in xlog.h because I'd rather not include fmgr.h there.
*/
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
new file mode 100644
index 0000000..d475a9b
--- /dev/null
+++ b/src/include/access/xlogreader.h
@@ -0,0 +1,101 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogreader.h
+ *
+ * Generic xlog reading facility.
+ *
+ * Portions Copyright (c) 2012, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/access/xlogreader.h
+ *
+ * NOTES
+ * Check the definition of the XLogReaderState struct for instructions on
+ * how to use the XLogReader infrastructure.
+ *
+ * The basic idea is to allocate an XLogReaderState via
+ * XLogReaderAllocate, and call XLogReadRecord() until it returns NULL.
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOGREADER_H
+#define XLOGREADER_H
+
+#include "access/xlog_internal.h"
+
+struct XLogReaderState;
+
+/*
+ * The callbacks are explained in more detail inside the XLogReaderState
+ * struct.
+ */
+typedef bool (*XLogPageReadCB)(struct XLogReaderState *state,
+ XLogRecPtr RecPtr, int emode,
+ bool randAccess,
+ char *readBuf,
+ void *private_data);
+
+typedef struct XLogReaderState
+{
+ /* ----------------------------------------
+ * Public parameters
+ * ----------------------------------------
+ */
+
+ /* callbacks */
+
+ /*
+ * Data input function.
+ *
+ * This callback *has* to be implemented.
+ *
+ * Has to read XLOG_BLCKSZ bytes, starting at the location 'RecPtr', into the
+ * memory pointed to by the 'readBuf' parameter. Returns true on success,
+ * false if the page could not be read.
+ */
+ XLogPageReadCB read_page;
+
+ /*
+ * This can be used by the caller to pass state to the callbacks without
+ * using global variables or similar ugliness. It will neither be read nor
+ * set by anything but your code.
+ */
+ void *private_data;
+
+ /* from where to where are we reading */
+
+ XLogRecPtr ReadRecPtr; /* start of last record read */
+ XLogRecPtr EndRecPtr; /* end+1 of last record read */
+
+ /* ----------------------------------------
+ * private/internal state
+ * ----------------------------------------
+ */
+
+ /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
+ char *readBuf;
+
+ /* Buffer for current ReadRecord result (expandable) */
+ char *readRecordBuf;
+ uint32 readRecordBufSize;
+} XLogReaderState;
+
+/*
+ * Get a new XLogReader
+ *
+ * At least the read_page callback and the starting point have to be set
+ * before the reader can be used.
+ */
+extern XLogReaderState *XLogReaderAllocate(XLogRecPtr startpoint,
+ XLogPageReadCB pagereadfunc, void *private_data);
+
+/*
+ * Free an XLogReader
+ */
+extern void XLogReaderFree(XLogReaderState *state);
+
+/*
+ * Read the next record from xlog. Returns NULL on end-of-WAL or on failure.
+ */
+extern XLogRecord *XLogReadRecord(XLogReaderState *state, XLogRecPtr ptr, int emode);
+
+#endif /* XLOGREADER_H */
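For illustration, a minimal sketch (not part of the patch) of a client of
this facility; FilePageRead() is a hypothetical callback that reads pages
from an already-open segment file, with error handling elided:

#include <unistd.h>

static bool
FilePageRead(XLogReaderState *state, XLogRecPtr RecPtr, int emode,
             bool randAccess, char *readBuf, void *private_data)
{
	int		fd = *(int *) private_data;
	/* page-aligned offset of RecPtr within its segment */
	off_t	off = (RecPtr % XLogSegSize) - (RecPtr % XLOG_BLCKSZ);

	return pread(fd, readBuf, XLOG_BLCKSZ, off) == XLOG_BLCKSZ;
}

...

XLogReaderState *state;
XLogRecord *record;

/* fd and startpoint set up by the caller beforehand */
state = XLogReaderAllocate(startpoint, &FilePageRead, &fd);
while ((record = XLogReadRecord(state, InvalidXLogRecPtr, LOG)) != NULL)
{
	/* look at record->xl_rmid, record->xl_info, XLogRecGetData(record) */
}
XLogReaderFree(state);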
On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
On 17.09.2012 11:12, Andres Freund wrote:
On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
On 15.09.2012 03:39, Andres Freund wrote:
2. We should focus on reading WAL, I don't see the point of mixing WAL writing into this.
If you write something that filters/analyzes and then forwards WAL, and
you want to do that without a big overhead (i.e. completely reassembling
everything and then disassembling it again for writeout), it's hard to do
that without integrating both sides.

It seems really complicated to filter/analyze WAL records without
reassembling them, anyway. The user of the facility is in charge of
reading the physical data, so you can still access the raw data, for
forwarding purposes, in addition to the reassembled records.
It works ;)
Or what exactly do you mean by "completely disassembling"? I read that to
mean dealing with page boundaries, ie. if a record is split across
pages, copy parts into a contiguous temporary buffer.
Well, if you want to fully split reading and writing of records - which is a
nice goal! - you basically need the full logic of XLogInsert again to split
them apart for writing. Alternatively you need to store record boundaries
somewhere and copy that way, but in the end, if you filter, you need to
correct the CRCs anyway...
Also, I want to read records incrementally/partially just as data comes
in, which again is hard to combine with writing out the data again.

You mean, you want to start reading the first half of a record, before
the 2nd half is available? That seems complicated.
Well, I can just say again: it works ;). It makes it easy to follow something
like XLogwrtResult without having to care about record boundaries.
I'd suggest keeping it simple for now, and optimize later if necessary.
Well, yes. The API should be able to comfortably support those cases, though,
which I don't think is necessarily the case with a simple, one-call API as
proposed.
Note that before you have the whole WAL record, you cannot CRC check it, so
you don't know if it's in fact a valid WAL record.
Sure. But you can start the CRC computation without any problems and finish it
when the last part of the data comes in.
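As a sketch (using the same CRC macros RecordIsValid() uses; fragment
arrival and the final fold-in of the record header are assumed/elided):

pg_crc32	crc;

INIT_CRC32(crc);
while (have_more_fragments())		/* assumed: data arrives in order */
{
	/* fold in each fragment as it arrives; it could be forwarded here too */
	COMP_CRC32(crc, fragment, fragment_len);
}
FIN_CRC32(crc);
/* only now can crc be compared against the record's xl_crc */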
I came up with the attached. I moved ReadRecord and some supporting
functions from xlog.c to xlogreader.c, and made it operate on
XLogReaderState instead of global variables. As discussed before,
I didn't like the callback-style API, I think the consumer of the API
should rather just call ReadRecord repeatedly to get each record. So
that's what I did.

The problem with that kind of API is that, at least as far as I can see, it
can never operate on incomplete/partial input. You need to buffer larger
amounts of xlog somewhere, and you need to be aware of record boundaries.
Both are things I dislike in a more generic user than xlog.c.

I don't understand that argument. A typical large WAL record is split
across 1-2 pages, maybe 3-4 at most, for an index page split record.
That doesn't feel like much to me. In extreme cases, a WAL record can be
much larger (e.g. a commit record of a transaction with a huge number of
subtransactions), but that should be rare in practice.
Well, imagine something like the walsender that essentially follows the flush
position, ideally without regard for record boundaries. It is nice to be able to
send/analyze/filter as soon as possible without waiting till a page is full.
And it sure would be nice to be able to read the data on the other side
directly from the network, decompress it again, and only then store it to disk.
The user of the facility doesn't need to be aware of record boundaries,
that's the responsibility of the facility. I thought that's exactly the
point of generalizing this thing, to make it unnecessary for the code
that uses it to be aware of such things.
With the proposed API it seems pretty much a requirement to wait inside the
callback. That's not really nice if your process has other things to wait for
as well.
In my proposal you can simply do something like:
XLogReaderRead(state);
DoSomeOtherWork();
if (CheckForMessagesFromWalreceiver())
ProcessMessages();
else if (state->needs_input)
UseLatchOrSelectOnInputSocket();
else if (state->needs_output)
UseSelectOnOutputSocket();
but you can also do something like waiting on a Latch but *also* on other fds.
If you don't want the capability to forward/filter the data and read
partial data without regard for record constraints/buffering, your patch
seems to be quite a good start. It misses xlogreader.h though...

Ah sorry, patch with xlogreader.h attached.
Will look at it in a second.
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:
On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
On 17.09.2012 11:12, Andres Freund wrote:
On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
If you don't want the capability to forward/filter the data and read
partial data without regard for record constraints/buffering, your patch
seems to be quite a good start. It misses xlogreader.h though...

Ah sorry, patch with xlogreader.h attached.
Will look at it in a second.
It seems we would need one additional callback for both approaches, like:
->error(severity, format, ...)
to avoid having to pull in elog.c.
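Roughly like this, as a sketch (the typedef name is made up, nothing here is
existing API):

typedef void (*XLogReaderErrorCB) (int severity, const char *fmt, ...);

The backend would install a wrapper around ereport(), while a standalone
tool could install something like vfprintf(stderr, ...).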
Otherwise it looks sensible, although it takes a more minimal approach (which
might or might not be a good thing). The one thing I definitely like is that
nearly all of it is tried and true code...
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 17.09.2012 12:07, Andres Freund wrote:
On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
The user of the facility doesn't need to be aware of record boundaries,
that's the responsibility of the facility. I thought that's exactly the
point of generalizing this thing, to make it unnecessary for the code
that uses it to be aware of such things.

With the proposed API it seems pretty much a requirement to wait inside the
callback.
Or you can return false from the XLogPageRead() callback if the
requested page is not available. That will cause ReadRecord() to return
NULL, and you can retry when more WAL is available.
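A sketch of that style of callback (availableUpto is an assumed variable
tracking how much WAL the process currently has locally):

static bool
NonBlockingPageRead(XLogReaderState *state, XLogRecPtr RecPtr, int emode,
                    bool randAccess, char *readBuf, void *private_data)
{
	if (!XLByteLT(RecPtr, availableUpto))
		return false;	/* not here yet; XLogReadRecord() returns NULL */

	/* assumed helper that fills readBuf with the XLOG_BLCKSZ page */
	return read_page_from_local_store(RecPtr, readBuf);
}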
- Heikki
On 17.09.2012 13:01, Andres Freund wrote:
On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:
On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
On 17.09.2012 11:12, Andres Freund wrote:
On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
If you don't want the capability to forward/filter the data and read
partial data without regard for record constraints/buffering, your patch
seems to be quite a good start. It misses xlogreader.h though...

Ah sorry, patch with xlogreader.h attached.
Will look at it in a second.
It seems we would need one additional callback for both approaches, like:
->error(severity, format, ...)
to avoid having to pull in elog.c.
Yeah. Another approach would be to return the error string from
ReadRecord. The caller could then do whatever it pleases with it, like
ereport() it to the log or PANIC. I think I'd like that better.
- Heikki
On Monday, September 17, 2012 12:55:47 PM Heikki Linnakangas wrote:
On 17.09.2012 13:01, Andres Freund wrote:
On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:
On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
On 17.09.2012 11:12, Andres Freund wrote:
On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
If you don't want the capability to forward/filter the data and read
partial data without regard for record constraints/buffering, your patch
seems to be quite a good start. It misses xlogreader.h though...

Ah sorry, patch with xlogreader.h attached.
Will look at it in a second.
It seems we would need one additional callback for both approaches, like:
->error(severity, format, ...)
to avoid having to pull in elog.c.
Yeah. Another approach would be to return the error string from
ReadRecord. The caller could then do whatever it pleases with it, like
ereport() it to the log or PANIC. I think I'd like that better.
That seems a bit more complex from a memory management perspective, as you
probably would have to sprintf() into some buffer. We cannot rely on a backend
environment with memory contexts around et al...
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 17.09.2012 14:42, Andres Freund wrote:
On Monday, September 17, 2012 12:55:47 PM Heikki Linnakangas wrote:
On 17.09.2012 13:01, Andres Freund wrote:
On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:
On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
On 17.09.2012 11:12, Andres Freund wrote:
On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
If you don't want the capability to forward/filter the data and read
partial data without regard for record constraints/buffering your
patch seems to be quite a good start. It misses xlogreader.h
though...
Ah sorry, patch with xlogreader.h attached.
Will look at it in a second.
It seems we would need one additional callback for both approaches like:
->error(severity, format, ...)
For both to avoid having to drag in elog.c.
Yeah. Another approach would be to return the error string from
ReadRecord. The caller could then do whatever it pleases with it, like
ereport() it to the log or PANIC. I think I'd like that better.
That seems a bit more complex from a memory management perspective as you
probably would have to sprintf() into some buffer. We cannot rely on a backend
environment with memory contexts and the like...
Hmm. I was thinking that making this work in a non-backend context would
be too hard, so I didn't give that much thought, but I guess there aren't
many dependencies on backend functions after all. palloc/pfree are
straightforward to replace with malloc/free. That's what we could easily
do with the error messages too, just malloc a suitably sized buffer.
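For instance (just a sketch; the errormsg_buf member and the fixed size are
assumptions of mine, and out-of-memory handling is elided):

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_ERRORMSG_LEN 1000

/* called by the reader instead of elog() for "invalid record" complaints */
static void
report_invalid_record(XLogReaderState *state, const char *fmt, ...)
{
    va_list     args;

    if (state->errormsg_buf == NULL)
        state->errormsg_buf = malloc(MAX_ERRORMSG_LEN + 1);

    va_start(args, fmt);
    vsnprintf(state->errormsg_buf, MAX_ERRORMSG_LEN + 1, fmt, args);
    va_end(args);
}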
How does a non-backend program get access to xlogreader.c? Copy
xlogreader.c from the source tree at build time and link into the
program? Or should we turn it into a shared library?
- Heikki
On Monday, September 17, 2012 01:50:33 PM Heikki Linnakangas wrote:
On 17.09.2012 14:42, Andres Freund wrote:
On Monday, September 17, 2012 12:55:47 PM Heikki Linnakangas wrote:
On 17.09.2012 13:01, Andres Freund wrote:
On Monday, September 17, 2012 11:07:28 AM Andres Freund wrote:
On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
On 17.09.2012 11:12, Andres Freund wrote:
On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
If you don't want the capability to forward/filter the data and read
partial data without regard for record constraints/buffering your
patch seems to be quite a good start. It misses xlogreader.h
though...
Ah sorry, patch with xlogreader.h attached.
Will look at it in a second.
It seems we would need one additional callback for both approaches like:
->error(severity, format, ...)
For both to avoid having to drag in elog.c.
Yeah. Another approach would be to return the error string from
ReadRecord. The caller could then do whatever it pleases with it, like
ereport() it to the log or PANIC. I think I'd like that better.
That seems a bit more complex from a memory management perspective as you
probably would have to sprintf() into some buffer. We cannot rely on a
backend environment with memory contexts and the like...
Hmm. I was thinking that making this work in a non-backend context would
be too hard, so I didn't give that much thought, but I guess there aren't
many dependencies on backend functions after all. palloc/pfree are
straightforward to replace with malloc/free.
Hm. I thought that it was pretty much a design requirement that this be usable
outside of the backend environment?
That's what we could easily do with the error messages too, just malloc a
suitably sized buffer.
Not very comfortable though... Especially if you need to return an error from
the read_page callback...
How does a non-backend program get access to xlogreader.c? Copy
xlogreader.c from the source tree at build time and link into the
program? Or should we turn it into a shared library?
Not really sure. I thought about just putting it in pgport or such, but that
seemed ugly as well.
The bin/xlogdump hack, which I find really helpful, at first simply had a
dependency on ../../backend/access/transam/xlogreader.o, which worked fine until
it needed more because of the *_desc routines... But Alvaro has started to work
on this, although I don't know when he will be able to finish it.
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Monday, September 17, 2012 12:52:32 PM Heikki Linnakangas wrote:
On 17.09.2012 12:07, Andres Freund wrote:
On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
The user of the facility doesn't need to be aware of record boundaries,
that's the responsibility of the facility. I thought that's exactly the
point of generalizing this thing, to make it unnecessary for the code
that uses it to be aware of such things.
With the proposed API it seems pretty much a requirement to wait inside
the callback.
Or you can return false from the XLogPageRead() callback if the
requested page is not available. That will cause ReadRecord() to return
NULL, and you can retry when more WAL is available.
That requires building quite a bit of knowledge on the outside (a rough sketch
of what the caller ends up tracking follows the list):
* you need to transport the information that you need more input via some
external variable/->private_data
* you need to transport at which RecPtr you needed more data
* you need to signal, after returning, that you are not dealing with an invalid
record, given that both conditions return NULL
* you need to buffer all incoming data somewhere if it comes from the network
or similar, because at the next call XLogReadRecord will restart reading from
the beginning
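Roughly something like this, hung off ->private_data (a sketch; every name
here is mine, not from either patch):

typedef struct
{
    bool        need_data;      /* the read callback ran out of WAL */
    XLogRecPtr  missing_ptr;    /* position at which more data was needed */
    bool        invalid_record; /* NULL meant corruption, not "need more" */
    char       *netbuf;         /* all data received from the network so far */
    Size        netbuf_len;
} ReaderPrivate;                /* hung off state->private_data */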
Sorry if I sound sceptical! If I had had your patch in my hands half a year ago
I would have been very happy, but after building the more generic version that
can do all of the above (including a compatible XLogReaderReadOne(state)) it's a
bit hard to be as enthusiastic. Not sure if it's just the feeling of possibly
having wasted the time...
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 17.09.2012 13:01, Andres Freund wrote:
It seems we would need one additional callback for both approaches like:
->error(severity, format, ...)
For both to avoid having to drag in elog.c.
Yeah. Another approach would be to return the error string from
ReadRecord. The caller could then do whatever it pleases with it, like
ereport() it to the log or PANIC. I think I'd like that better.
I think it's basically insane to imagine that you can carve out a
non-trivial piece of the backend that doesn't contain any elog calls.
There's too much low-level infrastructure, such as palloc, that could
call it. Even if you managed to make it safe at the instant the feature
is committed, the odds it would stay safe over time are negligible.
Furthermore, returning enough state for useful error messages back out
of multiple layers of function call is going to be notationally messy,
and will end up requiring complicated infrastructure barely simpler than
elog anyway.
It'd be a lot better for the wal-dumping program to supply a cut-down
version of elog than to try to promise that all errors will be returned
back from ReadRecord.
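Something as small as this would probably do (a minimal sketch, assuming the
program only needs a couple of levels; the constants mirror elog.h, but
everything else here is illustrative):

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

#define LOG     15              /* values as in elog.h */
#define ERROR   20

static void
elog(int level, const char *fmt, ...)
{
    va_list     args;

    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
    fputc('\n', stderr);

    if (level >= ERROR)
        exit(1);
}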
regards, tom lane
On 17.09.2012 17:08, Tom Lane wrote:
Heikki Linnakangas<hlinnakangas@vmware.com> writes:
On 17.09.2012 13:01, Andres Freund wrote:
It seems we would need one additional callback for both approaches like:
->error(severity, format, ...)
For both to avoid having to drag in elog.c.
Yeah. Another approach would be to return the error string from
ReadRecord. The caller could then do whatever it pleases with it, like
ereport() it to the log or PANIC. I think I'd like that better.
I think it's basically insane to imagine that you can carve out a
non-trivial piece of the backend that doesn't contain any elog calls.
There's too much low-level infrastructure, such as palloc, that could
call it. Even if you managed to make it safe at the instant the feature
is committed, the odds it would stay safe over time are negligible.
I wasn't thinking that we'd completely eliminate all elog() calls from
ReadRecord and everything it calls, but only the "expected" ones that
mean we've reached the end of valid WAL. The ones that use
emode_for_corrupt_record(). Any unexpected errors like running out of
file descriptors would still use ereport() like usual.
That said, Andres' suggestion of making this facility completely
independent of any backend functions, making it usable in external
programs, doesn't actually seem that hard. ReadRecord() itself is fairly
small, as are the subroutines that validate the records. XLogReadPage(),
which goes out to fetch the right xlog page from archive or whatever, is
way more complicated. But that would live in the callback, so it would
be free to use all the normal backend facilities. However, it means that
external programs would need to supply their own (hopefully much
simpler) version of XLogReadPage(); I'm not sure how that goes with
Andres' plans on using xlogreader.
- Heikki
On Monday, September 17, 2012 04:08:01 PM Tom Lane wrote:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 17.09.2012 13:01, Andres Freund wrote:
It seems we would need one additional callback for both approaches like:
->error(severity, format, ...)
For both to avoid having to drag in elog.c.
Yeah. Another approach would be to return the error string from
ReadRecord. The caller could then do whatever it pleases with it, like
ereport() it to the log or PANIC. I think I'd like that better.
I think it's basically insane to imagine that you can carve out a
non-trivial piece of the backend that doesn't contain any elog calls.
There's too much low-level infrastructure, such as palloc, that could
call it. Even if you managed to make it safe at the instant the feature
is committed, the odds it would stay safe over time are negligible.
If you start relying on palloc all hope is gone anyway. I "only" want a
standalone XLogReader because that's just too damn annoying/hard to duplicate in
standalone code. There are several very useful utilities out there that are
incomplete and/or unreliable for that reason. And loads of others that haven't
been written because of that.
That is one of the reasons - besides finding the respective xlog.c code very
hard to read/modify/extend - why I wrote a completely standalone xlogreader.
One other factor was just learning how the hell all that works ;)
I still think the interface that something as plain as the proposed
XLogReadRecord() provides is too restrictive for many use-cases. I agree that a
wrapper with exactly such an interface for xlog.c is useful, though.
Furthermore, returning enough state for useful error messages back out
of multiple layers of function call is going to be notationally messy,
and will end up requiring complicated infrastructure barely simpler than
elog anyway.
Hm. You mean because of file/function/location?
It'd be a lot better for the wal-dumping program to supply a cut-down
version of elog than to try to promise that all errors will be returned
back from ReadRecord.
Well, I suggested a ->error() callback for exactly that reason; that seems
relatively easy to wrap.
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Monday, September 17, 2012 04:18:28 PM Heikki Linnakangas wrote:
On 17.09.2012 17:08, Tom Lane wrote:
Heikki Linnakangas<hlinnakangas@vmware.com> writes:
On 17.09.2012 13:01, Andres Freund wrote:
It seems we would need one additional callback for both approaches
like: ->error(severity, format, ...)
For both to avoid having to drag in elog.c.
Yeah. Another approach would be to return the error string from
ReadRecord. The caller could then do whatever it pleases with it, like
ereport() it to the log or PANIC. I think I'd like that better.
I think it's basically insane to imagine that you can carve out a
non-trivial piece of the backend that doesn't contain any elog calls.
There's too much low-level infrastructure, such as palloc, that could
call it. Even if you managed to make it safe at the instant the feature
is committed, the odds it would stay safe over time are negligible.
I wasn't thinking that we'd completely eliminate all elog() calls from
ReadRecord and everything it calls, but only the "expected" ones that
mean we've reached the end of valid WAL. The ones that use
emode_for_corrupt_record(). Any unexpected errors like running out of
file descriptors would still use ereport() like usual.
That said, Andres' suggestion of making this facility completely
independent of any backend functions, making it usable in external
programs, doesn't actually seem that hard. ReadRecord() itself is fairly
small, as are the subroutines that validate the records. XLogReadPage(),
which goes out to fetch the right xlog page from archive or whatever, is
way more complicated. But that would live in the callback, so it would
be free to use all the normal backend facilities. However, it means that
external programs would need to supply their own (hopefully much
simpler) version of XLogReadPage(); I'm not sure how that goes with
Andres' plans on using xlogreader.
XLogRead() from walsender.c is pretty easy to translate to backend-independent
code, so I don't think that's a problem. I don't see how the backend's version
is useful outside of the startup process anyway.
We could provide a default backend-independent variant that reads from files in
xlogreader.c; it's not much code, and it would avoid others copying it multiple
times. I used a variant of that in the places that read from disk without any
problems. Obviously not in the places that read from the network, but that's
shelved due to the different decoding approach at the moment anyway.
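For the file-reading case, something along these lines should be enough (a
simplified sketch; the names and the segment arithmetic are mine, and error
handling beyond the return value is elided):

#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define XLOG_SEG_SIZE (16 * 1024 * 1024)    /* default WAL segment size */

/*
 * Read 'count' bytes at WAL position 'startptr' into 'buf'.  Returns false
 * if the data isn't (fully) available, which the reader reports as "no
 * record yet" to its caller.  private_data carries the opened segment's fd.
 */
static bool
simple_read_page(uint64_t startptr, char *buf, int count, void *private_data)
{
    int         fd = *(int *) private_data;
    off_t       offset = (off_t) (startptr % XLOG_SEG_SIZE);

    if (lseek(fd, offset, SEEK_SET) < 0)
        return false;
    if (read(fd, buf, count) != count)
        return false;
    return true;
}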
Regards,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi all,
Attached are the .txt and the .pdf (both are imo readable and contain the same
content) with design documentation about the proposed feature.
Christian Kruse, Marko Tiikkaja and Hannu Krosing read the document and told me
about my most egregious mistakes. Thanks!
I would appreciate some feedback!
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
This time I really attached both...
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
DESIGN.pdf (application/pdf)
[binary PDF data omitted -- "High Level Design for Logical Replication in
Postgres", Andres Freund, 2ndQuadrant Ltd., produced with DocBook XSL
Stylesheets and Apache FOP, dated 2012-09-22]