[RFC][PATCH] wal decoding, attempt #2

Started by Andres Freund · over 13 years ago · 111 messages · pgsql-hackers
#1Andres Freund
andres@anarazel.de

Hi,

It took me far longer than I planned, and it's not finished, but time is running
out. I would like some feedback that I am not going astray at this point...
*I* think the general approach is sound and a good way forward that provides
the basic infrastructure for many (all?) of the scenarios we talked about
before.

Anyway, here is my next attempt at $TOPIC.

Let's start with a quick demo (via psql):

/* just so we keep a sensible xmin horizon */
ROLLBACK PREPARED 'f';
BEGIN;
CREATE TABLE keepalive();
PREPARE TRANSACTION 'f';

DROP TABLE IF EXISTS replication_example;

SELECT pg_current_xlog_insert_location();
CHECKPOINT;
CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text
varchar(120));
begin;
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
INSERT INTO replication_example(somedata, text) VALUES (1, 2);
commit;

ALTER TABLE replication_example ADD COLUMN bar int;

INSERT INTO replication_example(somedata, text, bar) VALUES (2, 1, 4);

BEGIN;
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 2, 4);
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 3, 4);
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 4, NULL);

commit;
ALTER TABLE replication_example DROP COLUMN bar;
INSERT INTO replication_example(somedata, text) VALUES (3, 1);
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (3, 2);
INSERT INTO replication_example(somedata, text) VALUES (3, 3);
commit;

ALTER TABLE replication_example RENAME COLUMN text TO somenum;

INSERT INTO replication_example(somedata, somenum) VALUES (4, 1);

ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING
(somenum::int4);

INSERT INTO replication_example(somedata, somenum) VALUES (5, 1);

SELECT pg_current_xlog_insert_location();

---- Somewhat later ----

SELECT decode_xlog('0/1893D78', '0/18BE398');

WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:1 somedata[int4]:1 text[varchar]:1
WARNING: tuple is: id[int4]:2 somedata[int4]:1 text[varchar]:2
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:3 somedata[int4]:2 text[varchar]:1 bar[int4]:4
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:4 somedata[int4]:2 text[varchar]:2 bar[int4]:4
WARNING: tuple is: id[int4]:5 somedata[int4]:2 text[varchar]:3 bar[int4]:4
WARNING: tuple is: id[int4]:6 somedata[int4]:2 text[varchar]:4 bar[int4]:
(null)
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:7 somedata[int4]:3 text[varchar]:1
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:8 somedata[int4]:3 text[varchar]:2
WARNING: tuple is: id[int4]:9 somedata[int4]:3 text[varchar]:3
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:10 somedata[int4]:4 somenum[varchar]:1
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:11 somedata[int4]:5 somenum[int4]:1
WARNING: COMMIT
decode_xlog
-------------
t
(1 row)

As you can see, the patchset can decode several changes made to a table even
though we used DDL on it. Not everything is handled yet, but it's a prototype
after all ;)

The way this works is:

A new component called SnapshotBuilder analyzes the xlog and builds a special
kind of Snapshot. This works in a somewhat similar way to the
KnownAssignedXids machinery for Hot Standby.
Whenever the - mostly unchanged - ApplyCache calls an 'apply_change' callback
for a single change (INSERT|UPDATE|DELETE), it locally overrides the normal
SnapshotNow semantics used for catalog access with one of the previously built
snapshots. These should behave just the same as a normal SnapshotNow would have
behaved when the tuple change was written to the xlog.
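The buffering side of this can be sketched roughly as follows (all names and types here are illustrative, not the patch's actual API): decoded changes are collected per transaction, and only when a commit record is seen is the whole transaction replayed through begin/change/commit callbacks:

```c
#include <assert.h>
#include <string.h>

#define MAX_XACTS   8
#define MAX_CHANGES 16

/* one in-progress transaction's buffered changes, in WAL order */
typedef struct XactBuffer
{
    unsigned    xid;
    int         nchanges;
    const char *changes[MAX_CHANGES];
} XactBuffer;

/* the three callbacks invoked per reassembled transaction */
typedef struct ApplyCallbacks
{
    void (*apply_begin)(unsigned xid);
    void (*apply_change)(unsigned xid, const char *change);
    void (*apply_commit)(unsigned xid);
} ApplyCallbacks;

static XactBuffer xacts[MAX_XACTS];
static int nxacts;

static XactBuffer *
xact_for(unsigned xid)
{
    for (int i = 0; i < nxacts; i++)
        if (xacts[i].xid == xid)
            return &xacts[i];
    xacts[nxacts].xid = xid;
    xacts[nxacts].nchanges = 0;
    return &xacts[nxacts++];
}

/* called for every decoded INSERT/UPDATE/DELETE: just buffer it */
static void
cache_add_change(unsigned xid, const char *change)
{
    XactBuffer *x = xact_for(xid);
    x->changes[x->nchanges++] = change;
}

/* called when a commit record is decoded: replay the transaction */
static void
cache_commit(unsigned xid, const ApplyCallbacks *cb)
{
    XactBuffer *x = xact_for(xid);
    cb->apply_begin(xid);
    for (int i = 0; i < x->nchanges; i++)
        cb->apply_change(xid, x->changes[i]);
    cb->apply_commit(xid);
    x->nchanges = 0;
}

/* demo callbacks that record what gets applied */
static const char *applied[MAX_CHANGES];
static int napplied;
static void demo_begin(unsigned xid) { (void) xid; }
static void demo_change(unsigned xid, const char *c) { (void) xid; applied[napplied++] = c; }
static void demo_commit(unsigned xid) { (void) xid; }
static const ApplyCallbacks demo_cb = {demo_begin, demo_change, demo_commit};
```

This is why interleaved transactions in the WAL come out as whole BEGIN/.../COMMIT blocks in the demo output above.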

This patch doesn't provide anything that uses the new infrastructure for
anything real, but I think that's good. Let's get this into something
committable and then add new things using it!

Small overview over the individual patches that will come as separate mails:

old, Alvaro is doing this properly right now, separate thread
[01]: Add embedded list interface (header only)

A new piece of infrastructure (for k-way mergesort), pretty much untested; a
good idea in general, I think, but not very interesting:
[02]: Add minimal binary heap implementation

Boring, old:
[03]: Add support for a generic wal reading facility dubbed XLogReader

Boring, old, borked:
[04]: add simple xlogdump tool

Slightly changed to use (tablespace, relfilenode), possibly similar problems
to earlier, not interesting at this point.
[05]: Add a new syscache to fetch a pg_class entry via (reltablespace, relfilenode)

Unchanged:
[06]: Log enough data into the wal to reconstruct logical changes from it if wal_level=logical

I didn't implement proper cache handling, so I need to use the big hammer...:
[07]: Make InvalidateSystemCaches public

The major piece:
[08]: has loads of deficiencies. To cite the commit message:
The snapshot building has the most critical infrastructure but misses several
important features:
* loads of docs about the internals
* improve snapshot building/distributions
* don't build them all the time, cache them
* don't increase ->xmax so slowly, it's inefficient
* refcount
* actually free them
* proper cache handling
* we can probably reuse xl_xact_commit->nmsgs
* generate new local inval messages from catalog changes?
* handle transactions with both ddl, and changes
* command_id handling
* combocid logging/handling
* Add support for declaring tables as catalog tables that are not
pg_catalog.*
* properly distribute new SnapshotNow snapshots after a transaction
commits
* loads of testing/edge cases
* provision of a consistent snapshot for pg_dump
* spill state to disk at checkpoints
* xmin handling

The decode_xlog() function is *purely* a debugging tool that I do not want to
keep in the long run. I introduced it so we can concentrate on the topic at
hand without involving even more moving parts (see the next paragraph)...

Some parts of this I would like to only discuss later, in separate threads, to
avoid cluttering this one more than necessary:
* how do we integrate this into walsender et al
* in which format do we transport changes
* how do we always keep enough wal

I have some work on top of this that handles ComboCids and CommandIds
correctly (and thus mixed ddl/dml transactions), but it's simply not finished
enough. I am pretty sure by now that it works even with those additional
complexities.

So, I am unfortunately too tired to write more than this... It will have to
suffice. I plan to release a newer version with more documentation soon.

Comments about the approach or even the general direction of the
implementation? Questions?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#2Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
[PATCH 1/8] Add embedded list interface (header only)

Adds a singly and a doubly linked list which can easily be embedded into other
data structures and can be used without any additional allocations.

Problematic: It requires USE_INLINE. It could be remade to fall back
to externally defined functions if that is not available, but that hardly
seems sensible at this day and age. Besides, the speed hit would be noticeable,
and it's only used in new code which could be disabled on machines - given they
still exist - without proper support for inline functions.
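For readers unfamiliar with the idea, an embedded (intrusive) list roughly works like this; this is a hand-written sketch, not the proposed ilist.h API:

```c
#include <assert.h>
#include <stddef.h>

/* the node lives *inside* the containing struct; no separate allocation */
typedef struct dlist_node
{
    struct dlist_node *prev;
    struct dlist_node *next;
} dlist_node;

typedef struct dlist_head
{
    dlist_node head;            /* circular list with a sentinel */
} dlist_head;

static inline void
dlist_init(dlist_head *h)
{
    h->head.prev = h->head.next = &h->head;
}

static inline void
dlist_push_head(dlist_head *h, dlist_node *n)
{
    n->next = h->head.next;
    n->prev = &h->head;
    n->next->prev = n;
    h->head.next = n;
}

/* unlinking needs only the node itself, not the list head */
static inline void
dlist_delete(dlist_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* recover the containing struct from the embedded node */
#define dlist_container(type, field, ptr) \
    ((type *) ((char *) (ptr) - offsetof(type, field)))

/* example: a struct carrying its own list linkage */
typedef struct MyItem
{
    int        value;
    dlist_node node;
} MyItem;
```

The payoff is that insertion and deletion never allocate, and an element can be removed given only a pointer to it.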
---
src/include/utils/ilist.h | 253 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 253 insertions(+)
create mode 100644 src/include/utils/ilist.h

Attachments:

0001-Add-embedded-list-interface-header-only.patch (text/x-patch, +253/-0)
#3Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
[PATCH 2/8] Add minimal binary heap implementation

This is basically untested.
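For illustration, the core of such a minimal binary heap might look roughly like this (a hand-rolled min-heap sketch, not the simpleheap.c API):

```c
#include <assert.h>

#define HEAP_MAX 64

typedef struct SimpleHeap
{
    int nelems;
    int elems[HEAP_MAX];        /* elems[0] is the smallest element */
} SimpleHeap;

static void
heap_init(SimpleHeap *h)
{
    h->nelems = 0;
}

/* append at the end, then sift up until the parent is no larger */
static void
heap_push(SimpleHeap *h, int v)
{
    int i = h->nelems++;

    h->elems[i] = v;
    while (i > 0 && h->elems[(i - 1) / 2] > h->elems[i])
    {
        int p = (i - 1) / 2;
        int tmp = h->elems[p];

        h->elems[p] = h->elems[i];
        h->elems[i] = tmp;
        i = p;
    }
}

/* remove the minimum, move the last element to the root, sift down */
static int
heap_pop(SimpleHeap *h)
{
    int top = h->elems[0];
    int i = 0;

    h->elems[0] = h->elems[--h->nelems];
    for (;;)
    {
        int l = 2 * i + 1, r = l + 1, s = i;

        if (l < h->nelems && h->elems[l] < h->elems[s]) s = l;
        if (r < h->nelems && h->elems[r] < h->elems[s]) s = r;
        if (s == i) break;
        {
            int tmp = h->elems[s];
            h->elems[s] = h->elems[i];
            h->elems[i] = tmp;
        }
        i = s;
    }
    return top;
}
```

For k-way merge, one would push the head of each of the k sorted streams and repeatedly pop the minimum, refilling from the stream it came from.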
---
src/backend/lib/Makefile | 2 +-
src/backend/lib/simpleheap.c | 255 +++++++++++++++++++++++++++++++++++++++++++
src/include/lib/simpleheap.h | 91 +++++++++++++++
3 files changed, 347 insertions(+), 1 deletion(-)
create mode 100644 src/backend/lib/simpleheap.c
create mode 100644 src/include/lib/simpleheap.h

Attachments:

0002-Add-minimal-binary-heap-implementation.patch (text/x-patch, +347/-1)
#4Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
[PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

Features:
- streaming reading/writing
- filtering
- reassembly of records

Reusing the ReadRecord infrastructure in situations where the code that wants
to do so is not tightly integrated into xlog.c is rather hard and would require
changes to rather integral parts of the recovery code which doesn't seem to be
a good idea.

Missing:
- "compressing" the stream when removing uninteresting records
- writing out correct CRCs
- separating reader/writer
---
src/backend/access/transam/Makefile | 2 +-
src/backend/access/transam/xlogreader.c | 1032 +++++++++++++++++++++++++++++++
src/include/access/xlogreader.h | 264 ++++++++
3 files changed, 1297 insertions(+), 1 deletion(-)
create mode 100644 src/backend/access/transam/xlogreader.c
create mode 100644 src/include/access/xlogreader.h

Attachments:

0003-Add-support-for-a-generic-wal-reading-facility-dubbe.patch (text/x-patch, +1297/-1)
#5Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
[PATCH 4/8] add simple xlogdump tool

---
src/bin/Makefile | 2 +-
src/bin/xlogdump/Makefile | 25 ++++
src/bin/xlogdump/xlogdump.c | 334 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 360 insertions(+), 1 deletion(-)
create mode 100644 src/bin/xlogdump/Makefile
create mode 100644 src/bin/xlogdump/xlogdump.c

Attachments:

0004-add-simple-xlogdump-tool.patch (text/x-patch, +360/-1)
#6Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
[PATCH 5/8] Add a new syscache to fetch a pg_class entry via (reltablespace, relfilenode)

This patch is problematic because formally the indexes used by syscaches need to
be unique, and this one is not, because of 0/InvalidOid relfilenode entries
for nailed/shared catalog entries. Those values cannot sensibly be queried from
the catalog anyway, though (the relmapper infrastructure needs to be used).

It might be nicer to add infrastructure to do this properly, I just don't have
a clue what the best way for this would be.
---
src/backend/utils/cache/syscache.c | 11 +++++++++++
src/include/catalog/indexing.h | 2 ++
src/include/catalog/pg_proc.h | 1 +
src/include/utils/syscache.h | 1 +
4 files changed, 15 insertions(+)

Attachments:

0005-Add-a-new-syscache-to-fetch-a-pg_class-entry-via-rel.patch (text/x-patch, +15/-0)
#7Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
[PATCH 6/8] Log enough data into the wal to reconstruct logical changes from it if wal_level=logical

This adds a new wal_level value 'logical'

Missing cases:
- heap_multi_insert
- primary key changes for updates
- no primary key
- LOG_NEWPAGE
---
src/backend/access/heap/heapam.c | 135 +++++++++++++++++++++++++++++---
src/backend/access/transam/xlog.c | 1 +
src/backend/catalog/index.c | 74 +++++++++++++++++
src/bin/pg_controldata/pg_controldata.c | 2 +
src/include/access/xlog.h | 3 +-
src/include/catalog/index.h | 4 +
6 files changed, 207 insertions(+), 12 deletions(-)

Attachments:

0006-Log-enough-data-into-the-wal-to-reconstruct-logical-.patch (text/x-patch, +207/-12)
#8Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
[PATCH 7/8] Make InvalidateSystemCaches public

Pieces of this are in commit: make relfilenode lookup (tablespace, relfilenode
---
src/backend/utils/cache/inval.c | 2 +-
src/include/utils/inval.h | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)

Attachments:

0007-Make-InvalidateSystemCaches-public.patch (text/x-patch, +3/-1)
#9Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
[PATCH 8/8] Introduce wal decoding via catalog timetravel

This introduces several things:
* applycache module which reassembles transactions from a stream of interspersed changes
* snapbuilder which builds catalog snapshots so that tuples from wal can be understood
* wal decoding into an applycache
* decode_xlog(lsn, lsn) debugging function

The applycache provides 3 major callbacks:
* apply_begin
* apply_change
* apply_commit

It is missing several parts:
- spill-to-disk
- resource usage controls
- command id handling
- passing of the correct mvcc snapshot (already has it, just doesn't pass)

The snapshot building has the most critical infrastructure but misses several
important features:
* loads of docs about the internals
* improve snapshot building/distributions
* don't build them all the time, cache them
* don't increase ->xmax so slowly, it's inefficient
* refcount
* actually free them
* proper cache handling
* we can probably reuse xl_xact_commit->nmsgs
* generate new local inval messages from catalog changes?
* handle transactions with both ddl, and changes
* command_id handling
* combocid logging/handling
* Add support for declaring tables as catalog tables that are not pg_catalog.*
* properly distribute new SnapshotNow snapshots after a transaction commits
* loads of testing/edge cases
* provision of a consistent snapshot for pg_dump
* spill state to disk at checkpoints
* xmin handling

The xlog decoding also misses several parts:
- HEAP_NEWPAGE support
- HEAP2_MULTI_INSERT support
- handling of table rewrites
---
src/backend/replication/Makefile | 2 +
src/backend/replication/logical/Makefile | 19 +
src/backend/replication/logical/applycache.c | 574 +++++++++++++
src/backend/replication/logical/decode.c | 366 +++++++++
src/backend/replication/logical/logicalfuncs.c | 237 ++++++
src/backend/replication/logical/snapbuild.c | 1045 ++++++++++++++++++++++++
src/backend/utils/time/tqual.c | 161 ++++
src/include/access/transam.h | 5 +
src/include/catalog/pg_proc.h | 3 +
src/include/replication/applycache.h | 239 ++++++
src/include/replication/decode.h | 26 +
src/include/replication/snapbuild.h | 119 +++
src/include/utils/tqual.h | 21 +-
13 files changed, 2816 insertions(+), 1 deletion(-)
create mode 100644 src/backend/replication/logical/Makefile
create mode 100644 src/backend/replication/logical/applycache.c
create mode 100644 src/backend/replication/logical/decode.c
create mode 100644 src/backend/replication/logical/logicalfuncs.c
create mode 100644 src/backend/replication/logical/snapbuild.c
create mode 100644 src/include/replication/applycache.h
create mode 100644 src/include/replication/decode.h
create mode 100644 src/include/replication/snapbuild.h

Attachments:

0008-Introduce-wal-decoding-via-catalog-timetravel.patch (text/x-patch, +2816/-1)
#10Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
git tree

Hi,

A last note:

A git tree of this is at
git://git.postgresql.org/git/users/andresfreund/postgres.git branch xlog-decoding-rebasing-cf2

checkout with:

git clone --branch xlog-decoding-rebasing-cf2
git://git.postgresql.org/git/users/andresfreund/postgres.git

Webview:

http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf2

That branch will be regularly rebased onto a new master, with fixes/new features,
and a pgindent run over the new files...

Greetings,

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#11Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#10)
Re: git tree

On Saturday, September 15, 2012 03:14:32 AM Andres Freund wrote:

That branch will be regularly rebased onto a new master, with fixes/new features,
and a pgindent run over the new files...

I fixed up the formatting of the new stuff (xlogreader and ilist are submitted
separately, no point in doing anything there).

Pushed to the repo mentioned upthread.

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#12Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#6)
Add pg_relation_by_filenode(reltbspc, filenode) admin function

Now that I have proposed a new syscache upthread, it's easily possible to provide
pg_relation_by_filenode, which I have wished for multiple times in the past when
looking at filesystem activity and wondering which table does what. You can
sort of get the same result via

SELECT oid FROM (
    SELECT oid, pg_relation_filenode(oid::regclass) filenode
    FROM pg_class WHERE relkind != 'v'
) map
WHERE map.filenode = ...;

but that's neither efficient nor obvious.

So, two patches to do this:

Did others need this in the past? I can live with the 2nd patch living in a
private extension somewhere. The first one would also be useful for some
error/debug messages during decoding...

Greetings,

Andres

#13Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#12)
[PATCH 1/2] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode

---
src/backend/utils/cache/relmapper.c | 53 +++++++++++++++++++++++++++++++++++++
src/include/utils/relmapper.h | 2 ++
2 files changed, 55 insertions(+)

Attachments:

0001-Add-a-new-relmapper.c-function-RelationMapFilenodeTo.patch (text/x-patch, +55/-0)
#14Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#12)
[PATCH 2/2] Add a new function pg_relation_by_filenode to lookup up a relation given the tablespace and the filenode OIDs

This requires the previously added RELFILENODE syscache.
---
doc/src/sgml/func.sgml | 23 ++++++++++++-
src/backend/utils/adt/dbsize.c | 78 ++++++++++++++++++++++++++++++++++++++++++
src/include/catalog/pg_proc.h | 2 ++
src/include/utils/builtins.h | 1 +
4 files changed, 103 insertions(+), 1 deletion(-)

Attachments:

0002-Add-a-new-function-pg_relation_by_filenode-to-lookup.patch (text/x-patch, +103/-1)
#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#14)
Re: [PATCH 2/2] Add a new function pg_relation_by_filenode to lookup up a relation given the tablespace and the filenode OIDs

Andres Freund <andres@2ndquadrant.com> writes:

This requires the previously added RELFILENODE syscache.

[ raised eyebrow... ] There's a RELFILENODE syscache? I don't see one,
and I doubt it would work given that the contents of
pg_class.relfilenode aren't unique (the zero entries are the problem).

regards, tom lane

#16Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#15)
Re: [PATCH 2/2] Add a new function pg_relation_by_filenode to lookup up a relation given the tablespace and the filenode OIDs

On Monday, September 17, 2012 12:35:32 AM Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

This requires the previously added RELFILENODE syscache.

[ raised eyebrow... ] There's a RELFILENODE syscache? I don't see one,
and I doubt it would work given that the contents of
pg_class.relfilenode aren't unique (the zero entries are the problem).

Well, one patch upthread ;). It mentions the problem of it not being unique due
to relfilenode in (reltablespace, relfilenode) being 0 for shared/nailed
catalogs.

I am not really sure yet how big a problem it is for the caching infrastructure
that values that shouldn't ever get queried (because the relfilenode is
actually different) are duplicated. Reading code about all that atm.

Robert suggested writing a specialized cache akin to what's done in
attoptcache.c or such.

I haven't formed an opinion on what's the way forward on that topic. But anyway,
I don't see how the wal decoding stuff can progress without some variant of
that mapping, so I sure hope I/we can build something. Changing that aspect of
the patch should be trivial...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#17Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#4)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

On 15.09.2012 03:39, Andres Freund wrote:

Features:
- streaming reading/writing
- filtering
- reassembly of records

Reusing the ReadRecord infrastructure in situations where the code that wants
to do so is not tightly integrated into xlog.c is rather hard and would require
changes to rather integral parts of the recovery code which doesn't seem to be
a good idea.

My previous objections to this approach still apply. 1. I don't want to
maintain a second copy of the code to read xlog. 2. We should focus on
reading WAL, I don't see the point of mixing WAL writing into this. 3. I
don't like the callback-style API.

I came up with the attached. I moved ReadRecord and some supporting
functions from xlog.c to xlogreader.c, and made it operate on
XLogReaderState instead of global variables. As discussed before,
I didn't like the callback-style API, I think the consumer of the API
should rather just call ReadRecord repeatedly to get each record. So
that's what I did.

There is still one callback, XLogPageRead(), to obtain a given page in
WAL. The XLogReader facility is responsible for decoding the WAL into
records, but the user of the facility is responsible for supplying the
physical bytes, via the callback.

So the usage is like this:

/*
* Callback to read the page starting at 'RecPtr' into *readBuf. It's
* up to you to do this any way you like. Typically you'd read from a
* file. The WAL recovery implementation of this in xlog.c is more
* complicated. It checks the archive, waits for streaming replication
* etc.
*/
static bool
MyXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr, char
*readBuf, void *private_data)
{
...
}

state = XLogReaderAllocate(&MyXLogPageRead);

while ((record = XLogReadRecord(state, ...)))
{
/* do something with the record */
}

XLogReaderFree(state);

- Heikki

Attachments:

xlogreader-heikki-1.patch (text/x-diff, +574/-508)
#18Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#17)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

Hi Heikki,

On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:

On 15.09.2012 03:39, Andres Freund wrote:

Features:
- streaming reading/writing
- filtering
- reassembly of records

Reusing the ReadRecord infrastructure in situations where the code that
wants to do so is not tightly integrated into xlog.c is rather hard and
would require changes to rather integral parts of the recovery code
which doesn't seem to be a good idea.

My previous objections to this approach still apply. 1. I don't want to
maintain a second copy of the code to read xlog.

Yes, I agree. And I am willing to provide an implementation of this, should
my xlogreader variant get a bit more buy-in.

2. We should focus on reading WAL, I don't see the point of mixing WAL
writing into this.

If you write something that filters/analyzes and then forwards WAL, and you want
to do that without a big overhead (i.e. completely reassembling everything, and
then disassembling it again for writeout), it's hard to do that without
integrating both sides.

Also, I want to read records incrementally/partially just as data comes in,
which again is hard to combine with writing out the data again.

3. I don't like the callback-style API.

I tried to accommodate that by providing:

extern XLogRecordBuffer* XLogReaderReadOne(XLogReaderState* state);

which does exactly that.

I came up with the attached. I moved ReadRecord and some supporting
functions from xlog.c to xlogreader.c, and made it operate on
XLogReaderState instead of global variables. As discussed before,
I didn't like the callback-style API, I think the consumer of the API
should rather just call ReadRecord repeatedly to get each record. So
that's what I did.

The problem with that kind of API is that, at least as far as I can see, it can
never operate on incomplete/partial input. You need to buffer larger amounts of
xlog somewhere and you need to be aware of record boundaries.
Both are things I dislike in a more generic user than xlog.c.

There is still one callback, XLogPageRead(), to obtain a given page in
WAL. The XLogReader facility is responsible for decoding the WAL into
records, but the user of the facility is responsible for supplying the
physical bytes, via the callback.

Makes sense.

So the usage is like this:

/*
* Callback to read the page starting at 'RecPtr' into *readBuf. It's
* up to you to do this any way you like. Typically you'd read from a
* file. The WAL recovery implementation of this in xlog.c is more
* complicated. It checks the archive, waits for streaming replication
* etc.
*/
static bool
MyXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr, char
*readBuf, void *private_data)
{
...
}

state = XLogReaderAllocate(&MyXLogPageRead);

while ((record = XLogReadRecord(state, ...)))
{
/* do something with the record */
}

If you don't want the capability to forward/filter the data and read partial
data without regard for record constraints/buffering, your patch seems to be
quite a good start. It misses xlogreader.h though...

Do my aims make any sense to you?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#19Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#18)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

On 17.09.2012 11:12, Andres Freund wrote:

On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:

On 15.09.2012 03:39, Andres Freund wrote:
2. We should focus on reading WAL, I don't see the point of mixing WAL
writing into this.

If you write something that filters/analyzes and then forwards WAL, and you want
to do that without a big overhead (i.e. completely reassembling everything, and
then disassembling it again for writeout), it's hard to do that without
integrating both sides.

It seems really complicated to filter/analyze WAL records without
reassembling them, anyway. The user of the facility is in charge of
reading the physical data, so you can still access the raw data, for
forwarding purposes, in addition to the reassembled records.

Or what exactly do you mean by "completely disassembling"? I read that to
mean dealing with page boundaries, ie. if a record is split across
pages, copy parts into a contiguous temporary buffer.

Also, I want to read records incrementally/partially just as data comes in
which again is hard to combine with writing out the data again.

You mean, you want to start reading the first half of a record, before
the 2nd half is available? That seems complicated. I'd suggest keeping
it simple for now, and optimize later if necessary. Note that before you
have the whole WAL record, you cannot CRC check it, so you don't know if
it's in fact a valid WAL record.

I came up with the attached. I moved ReadRecord and some supporting
functions from xlog.c to xlogreader.c, and made it operate on
XLogReaderState instead of global variables. As discussed before,
I didn't like the callback-style API, I think the consumer of the API
should rather just call ReadRecord repeatedly to get each record. So
that's what I did.

The problem with that kind of API is that, at least as far as I can see, it can
never operate on incomplete/partial input. You need to buffer
larger amounts of xlog somewhere and you need to be aware of record boundaries.
Both are things I dislike in a more generic user than xlog.c.

I don't understand that argument. A typical large WAL record is split
across 1-2 pages, maybe 3-4 at most, for an index page split record.
That doesn't feel like much to me. In extreme cases, a WAL record can be
much larger (e.g a commit record of a transaction with a huge number of
subtransactions), but that should be rare in practice.

The user of the facility doesn't need to be aware of record boundaries,
that's the responsibility of the facility. I thought that's exactly the
point of generalizing this thing, to make it unnecessary for the code
that uses it to be aware of such things.

If you don't want the capability to forward/filter the data and read partial
data without regard for record constraints/buffering your patch seems to be
quite a good start. It misses xlogreader.h though...

Ah sorry, patch with xlogreader.h attached.

- Heikki

Attachments:

xlogreader-heikki-2.patch (text/x-diff, +675/-508)
#20Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#19)
Re: [PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader

On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:

On 17.09.2012 11:12, Andres Freund wrote:

On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:

On 15.09.2012 03:39, Andres Freund wrote:
2. We should focus on reading WAL, I don't see the point of mixing WAL
writing into this.

If you write something that filters/analyzes and then forwards WAL, and
you want to do that without a big overhead (i.e. completely reassembling
everything, and then disassembling it again for writeout), it's hard to do
that without integrating both sides.

It seems really complicated to filter/analyze WAL records without
reassembling them, anyway. The user of the facility is in charge of
reading the physical data, so you can still access the raw data, for
forwarding purposes, in addition to the reassembled records.

It works ;)

Or what exactly do you mean by "completely disassembling"? I read that to
mean dealing with page boundaries, ie. if a record is split across
pages, copy parts into a contiguous temporary buffer.

Well, if you want to fully split reading and writing of records - which is a
nice goal! - you basically need the full logic of XLogInsert again, to split
them apart for writing them out. Alternatively you need to store record
boundaries somewhere and copy that way, but in the end, if you filter, you
need to correct CRCs...
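To make concrete what copying a split record "into a contiguous temporary buffer" involves, here is a minimal, self-contained sketch. The page size, header length, and function name are invented for the example and are not the actual PostgreSQL WAL layout (real WAL pages are XLOG_BLCKSZ bytes and carry an XLogPageHeader):

```c
#include <string.h>

#define PAGE_SIZE    32   /* invented for the example */
#define PAGE_HDR_LEN 4    /* invented; stands in for the page header */

/*
 * Copy a record of 'reclen' bytes that starts at absolute 'offset' in a
 * buffer of consecutive pages into a contiguous output buffer, skipping
 * the per-page header at the start of each continuation page.
 */
static void
reassemble_record(const char *pages, int offset, int reclen, char *out)
{
    int copied = 0;

    while (copied < reclen)
    {
        int page_remaining = PAGE_SIZE - (offset % PAGE_SIZE);
        int chunk = reclen - copied;

        if (chunk > page_remaining)
            chunk = page_remaining;

        memcpy(out + copied, pages + offset, chunk);
        copied += chunk;

        /* the record continues right after the next page's header */
        offset = (offset / PAGE_SIZE + 1) * PAGE_SIZE + PAGE_HDR_LEN;
    }
}
```

Note that this is exactly the buffering both sides of the discussion refer to: the fragments must all be on hand before the record is contiguous, which is why filtering before reassembly is awkward.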

Also, I want to read records incrementally/partially just as data comes
in, which again is hard to combine with writing out the data again.

You mean, you want to start reading the first half of a record, before
the 2nd half is available? That seems complicated.

Well, I can just say again: it works ;). It makes it easy to follow something
like XLogwrtResult without having to care about record boundaries.

I'd suggest keeping it simple for now, and optimize later if necessary.

Well, yes. The API should be able to comfortably support those cases, though,
which I don't think is necessarily the case in a simple, one-call API as
proposed.

Note that before you have the whole WAL record, you cannot CRC check it, so
you don't know if it's in fact a valid WAL record.

Sure. But you can start the CRC computation without any problems and finish it
when the last part of the data comes in.
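That incremental approach is easy to demonstrate: feeding a CRC two fragments as they arrive yields the same value as one pass over the whole record. A self-contained sketch using a plain bitwise CRC-32 (for illustration only; PostgreSQL uses its own table-driven CRC routines):

```c
#include <stddef.h>
#include <stdint.h>

#define CRC_INIT      0xFFFFFFFFu
#define CRC_FINISH(c) ((c) ^ 0xFFFFFFFFu)

/* Reflected CRC-32 (polynomial 0xEDB88320), bit by bit for brevity. */
static uint32_t
crc32_update(uint32_t crc, const unsigned char *data, size_t len)
{
    size_t i;
    int bit;

    for (i = 0; i < len; i++)
    {
        crc ^= data[i];
        for (bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}
```

The running value is simply kept in the reader state between fragments and finalized once the last part of the record has been seen.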

I came up with the attached. I moved ReadRecord and some supporting
functions from xlog.c to xlogreader.c, and made it operate on
XLogReaderState instead of global variables. As discussed before,
I didn't like the callback-style API; I think the consumer of the API
should rather just call ReadRecord repeatedly to get each record. So
that's what I did.

The problem with that kind of API is that, at least as far as I can see,
it can never operate on incomplete/partial input. You need to buffer
larger amounts of xlog somewhere and you need to be aware of record
boundaries. Both are things I dislike in a more generic user than
xlog.c.

I don't understand that argument. A typical large WAL record is split
across 1-2 pages, maybe 3-4 at most, for an index page split record.
That doesn't feel like much to me. In extreme cases, a WAL record can be
much larger (e.g. a commit record of a transaction with a huge number of
subtransactions), but that should be rare in practice.

Well, imagine something like the walsender that essentially follows the flush
position ideally without regard for record boundaries. It is nice to be able to
send/analyze/filter as soon as possible without waiting till a page is full.
And it sure would be nice to be able to read the data on the other side
directly from the network, decompress it again, and only then store it to disk.

The user of the facility doesn't need to be aware of record boundaries,
that's the responsibility of the facility. I thought that's exactly the
point of generalizing this thing, to make it unnecessary for the code
that uses it to be aware of such things.

With the proposed API it seems pretty much a requirement to wait inside the
callback. That's not really nice if your process has other things to wait for
as well.

In my proposal you can simply do something like:

XLogReaderRead(state);

DoSomeOtherWork();

if (CheckForMessagesFromWalreceiver())
    ProcessMessages();
else if (state->needs_input)
    UseLatchOrSelectOnInputSocket();
else if (state->needs_output)
    UseSelectOnOutputSocket();

but you can also do something like waiting on a Latch but *also* on other fds.
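To sketch how such a state machine could work, here is a toy model. Everything in it is invented for the example: records are a one-byte length prefix followed by that many payload bytes, and needs_input is set when the buffered input ends mid-record; none of these names are the actual proposed API.

```c
#include <string.h>

/* Toy model of an incremental, non-blocking reader.  No bounds
 * checking; illustration only. */
typedef struct SimpleReaderState
{
    unsigned char inbuf[256];   /* raw input as it arrived */
    int  inlen;                 /* bytes currently buffered */
    int  needs_input;           /* set when a record is incomplete */
    unsigned char record[256];  /* last completed record's payload */
    int  reclen;                /* its length, 0 if none completed */
} SimpleReaderState;

/* Append newly arrived bytes, e.g. straight off a network socket. */
static void
SimpleReaderFeed(SimpleReaderState *state, const unsigned char *data, int len)
{
    memcpy(state->inbuf + state->inlen, data, len);
    state->inlen += len;
}

/* Try to complete one record; never blocks waiting for input. */
static void
SimpleReaderRead(SimpleReaderState *state)
{
    state->reclen = 0;
    state->needs_input = 0;

    if (state->inlen < 1 || state->inlen < 1 + state->inbuf[0])
    {
        /* mid-record: let the caller decide how to wait for more */
        state->needs_input = 1;
        return;
    }

    state->reclen = state->inbuf[0];
    memcpy(state->record, state->inbuf + 1, state->reclen);

    /* shift any following input down for the next call */
    state->inlen -= 1 + state->reclen;
    memmove(state->inbuf, state->inbuf + 1 + state->reclen, state->inlen);
}
```

The point being argued is visible in the shape of the code: when needs_input is set, the caller chooses how to wait (a latch, select() on several fds, or doing other work first) instead of being stuck inside a callback.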

If you don't want the capability to forward/filter the data and read
partial data without regard for record constraints/buffering, your patch
seems to be quite a good start. It misses xlogreader.h though...

Ah sorry, patch with xlogreader.h attached.

Will look at it in a second.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#21Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#20)
#22Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#20)
#23Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#21)
#24Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#23)
#25Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#24)
#26Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#25)
#27Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#22)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#23)
#29Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#28)
#30Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#28)
#31Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#29)
#32Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
#33Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#32)
#34md@rpzdesign.com
md@rpzdesign.com
In reply to: Andres Freund (#33)
In reply to: Andres Freund (#9)
#36Bruce Momjian
bruce@momjian.us
In reply to: Peter Geoghegan (#35)
#37Josh Berkus
josh@agliodbs.com
In reply to: Bruce Momjian (#36)
#38Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#36)
#39Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#38)
#40Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#35)
#41Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#40)
#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#40)
#43Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#42)
#44Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#35)
#45Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#43)
#46Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#43)
#47Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#40)
#48Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#43)
#49Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#46)
#50Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#49)
#51Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#33)
#52Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#46)
#53Hannu Krosing
hannu@tm.ee
In reply to: Robert Haas (#40)
#54Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#51)
#55Josh Berkus
josh@agliodbs.com
In reply to: Bruce Momjian (#45)
#56Chris Browne
cbbrowne@acm.org
In reply to: Bruce Momjian (#45)
#57Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#43)
#58Steve Singer
steve@ssinger.info
In reply to: Josh Berkus (#55)
#59Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#51)
#60Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#59)
#61Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#60)
#62Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#61)
#63Hannu Krosing
hannu@tm.ee
In reply to: Andres Freund (#54)
#64Andres Freund
andres@anarazel.de
In reply to: Hannu Krosing (#63)
#65Hannu Krosing
hannu@tm.ee
In reply to: Robert Haas (#59)
#66Hannu Krosing
hannu@tm.ee
In reply to: Andres Freund (#64)
#67Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#62)
#68Hannu Krosing
hannu@tm.ee
In reply to: Robert Haas (#59)
In reply to: Bruce Momjian (#61)
#70Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#69)
#71Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#69)
#72Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#70)
#73Chris Browne
cbbrowne@acm.org
In reply to: Peter Geoghegan (#69)
#74Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#72)
#75Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#72)
#76Andres Freund
andres@anarazel.de
In reply to: Chris Browne (#73)
#77Chris Browne
cbbrowne@acm.org
In reply to: Andres Freund (#76)
#78Andres Freund
andres@anarazel.de
In reply to: Chris Browne (#77)
#79Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Josh Berkus (#55)
#80Steve Singer
steve@ssinger.info
In reply to: Andres Freund (#76)
#81Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#35)
#82Jan Wieck
JanWieck@Yahoo.com
In reply to: Simon Riggs (#75)
#83Jan Wieck
JanWieck@Yahoo.com
In reply to: Andres Freund (#71)
In reply to: Jan Wieck (#82)
#85Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#81)
#86Chris Browne
cbbrowne@acm.org
In reply to: Peter Geoghegan (#84)
#87Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#85)
In reply to: Chris Browne (#86)
#89Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#27)
#90Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#87)
#91Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#90)
In reply to: Andres Freund (#81)
#93Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#92)
#94Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#1)
#95Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#5)
#96Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#25)
#97Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#96)
#98Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#96)
#99Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#98)
#100Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#99)
#101Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#100)
#102Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#101)
#103Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#102)
#104Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#102)
#105Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#95)
#106Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#105)
#107Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#106)
#108Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#107)
#109Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#105)
#110Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#109)
#111Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#109)