[RFC][PATCH] wal decoding, attempt #2
Hi,
It took me far longer than I planned, its not finished, but time is running
out. I would like some feedback that I am not going astray at this point...
*I* think the general approach is sound and a good way forward that provides
the basic infrastructure for many (all?) of the scenarios we talked about
before.
Anyway, here is my next attempt at $TOPIC.
Lets start with a quick demo (via psql):
/* just so we keep a sensible xmin horizon */
ROLLBACK PREPARED 'f';
BEGIN;
CREATE TABLE keepalive();
PREPARE TRANSACTION 'f';
DROP TABLE IF EXISTS replication_example;
SELECT pg_current_xlog_insert_location();
CHECKPOINT;
CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text
varchar(120));
begin;
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
INSERT INTO replication_example(somedata, text) VALUES (1, 2);
commit;
ALTER TABLE replication_example ADD COLUMN bar int;
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 1, 4);
BEGIN;
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 2, 4);
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 3, 4);
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 4, NULL);
commit;
ALTER TABLE replication_example DROP COLUMN bar;
INSERT INTO replication_example(somedata, text) VALUES (3, 1);
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (3, 2);
INSERT INTO replication_example(somedata, text) VALUES (3, 3);
commit;
ALTER TABLE replication_example RENAME COLUMN text TO somenum;
INSERT INTO replication_example(somedata, somenum) VALUES (4, 1);
ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING
(somenum::int4);
INSERT INTO replication_example(somedata, somenum) VALUES (5, 1);
SELECT pg_current_xlog_insert_location();
---- Somewhat later ----
SELECT decode_xlog('0/1893D78', '0/18BE398');
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:1 somedata[int4]:1 text[varchar]:1
WARNING: tuple is: id[int4]:2 somedata[int4]:1 text[varchar]:2
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:3 somedata[int4]:2 text[varchar]:1 bar[int4]:4
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:4 somedata[int4]:2 text[varchar]:2 bar[int4]:4
WARNING: tuple is: id[int4]:5 somedata[int4]:2 text[varchar]:3 bar[int4]:4
WARNING: tuple is: id[int4]:6 somedata[int4]:2 text[varchar]:4 bar[int4]:
(null)
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:7 somedata[int4]:3 text[varchar]:1
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:8 somedata[int4]:3 text[varchar]:2
WARNING: tuple is: id[int4]:9 somedata[int4]:3 text[varchar]:3
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:10 somedata[int4]:4 somenum[varchar]:1
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:11 somedata[int4]:5 somenum[int4]:1
WARNING: COMMIT
decode_xlog
-------------
t
(1 row)
As you can see the patchset can decode several changes made to a table even
though we used DDL on it. Not everything is handled yet, but its a prototype
after all ;)
The way this works is:
A new component called SnapshotBuilder analyzes the xlog and build a special
kind of Snapshot. This works in a somewhat similar way to the
KnownAssignedXids machinery for Hot Standby.
Whenever the - mostly unchanged - ApplyCache calls a 'apply_change' callback
for a single change (INSERT|UPDATE|DELETE) it locally overrides the normal
SnapshotNow semantics used for catalog access with one of the previously built
snapshots. They should behave just the same as a normal SnapshotNow would have
behaved when the tuple change was written to the xlog.
This patch doesn't provide anything that uses the new infrastructure for
anything real, but I think thats good. Lets get this into something
committable and then add new things using it!
Small overview over the individual patches that will come as separate mails:
old, Alvaro is doing this properly right now, separate thread
[01]: Add embedded list interface (header only)
A new piece of infrastructure (for k-way mergesort), pretty much untested,
good idea in general I think, not very interesting:
[02]: Add minimal binary heap implementation
Boring, old.:
[03]: Add support for a generic wal reading facility dubbed XLogReader
Boring, old, borked:
[04]: add simple xlogdump tool
Slightly changed to use (tablespace, relfilenode), possibly similar problems
to earlier, not interesting at this point.
[05]: Add a new syscache to fetch a pg_class entry via (reltablespace, relfilenode)
relfilenode)
Unchanged:
[06]: Log enough data into the wal to reconstruct logical changes from it if wal_level=logical
wal_level=logical
I didn't implement proper cache handling, so I need to use the big hammer...:
[07]: Make InvalidateSystemCaches public
The major piece:
[08]: has loads of defficiencies. To cite the commit: The snapshot building has the most critical infrastructure but misses several important features: * loads of docs about the internals * improve snapshot building/distributions * don't build them all the time, cache them * don't increase ->xmax so slowly, its inefficient * refcount * actually free them * proper cache handling * we can probably reuse xl_xact_commit->nmsgs * generate new local inval messages from catalog changes? * handle transactions with both ddl, and changes * command_id handling * combocid loggin/handling * Add support for declaring tables as catalog tables that are not pg_catalog.* * properly distribute new SnapshotNow snapshots after a transaction commits * loads of testing/edge cases * provision of a consistent snapshot for pg_dump * spill state to disk at checkpoints * xmin handling
[08]: has loads of defficiencies. To cite the commit: The snapshot building has the most critical infrastructure but misses several important features: * loads of docs about the internals * improve snapshot building/distributions * don't build them all the time, cache them * don't increase ->xmax so slowly, its inefficient * refcount * actually free them * proper cache handling * we can probably reuse xl_xact_commit->nmsgs * generate new local inval messages from catalog changes? * handle transactions with both ddl, and changes * command_id handling * combocid loggin/handling * Add support for declaring tables as catalog tables that are not pg_catalog.* * properly distribute new SnapshotNow snapshots after a transaction commits * loads of testing/edge cases * provision of a consistent snapshot for pg_dump * spill state to disk at checkpoints * xmin handling
The snapshot building has the most critical infrastructure but misses
several
important features:
* loads of docs about the internals
* improve snapshot building/distributions
* don't build them all the time, cache them
* don't increase ->xmax so slowly, its inefficient
* refcount
* actually free them
* proper cache handling
* we can probably reuse xl_xact_commit->nmsgs
* generate new local inval messages from catalog changes?
* handle transactions with both ddl, and changes
* command_id handling
* combocid loggin/handling
* Add support for declaring tables as catalog tables that are not
pg_catalog.*
* properly distribute new SnapshotNow snapshots after a transaction
commits
* loads of testing/edge cases
* provision of a consistent snapshot for pg_dump
* spill state to disk at checkpoints
* xmin handling
The decode_xlog() function is *purely* a debugging tool that I do not want to
keep in the long run. I introduced it so we can concentrate on the topic at
hand without involving even more moving parts (see the next paragraph)...
Some parts of this I would like to only discuss later, in separate threads, to
avoid cluttering this one more than neccessary:
* how do we integrate this into walsender et al
* in which format do we transport changes
* how do we always keep enough wal
I have some work ontop of this, that handles ComboCid's and CommandId's
correctly (and thus mixed ddl/dml transactions), but its simply not finished
enough. I am pretty sure by now that it works even with those additional
complexities.
So, I am unfortunately too tired to write more than this... It will have to
suffice. I plan to release a newer version with more documentation soon.
Comments about the approach or even the general direction of the
implementation? Questions?
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Adds a single and a double linked list which can easily embedded into other
datastructures and can be used without any additional allocations.
Problematic: It requires USE_INLINE to be used. It could be remade to fallback
to to externally defined functions if that is not available but that hardly
seems sensibly at this day and age. Besides, the speed hit would be noticeable
and its only used in new code which could be disabled on machines - given they
still exists - without proper support for inline functions
---
src/include/utils/ilist.h | 253 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 253 insertions(+)
create mode 100644 src/include/utils/ilist.h
Attachments:
0001-Add-embedded-list-interface-header-only.patchtext/x-patch; name=0001-Add-embedded-list-interface-header-only.patchDownload+253-0
This is basically untested.
---
src/backend/lib/Makefile | 2 +-
src/backend/lib/simpleheap.c | 255 +++++++++++++++++++++++++++++++++++++++++++
src/include/lib/simpleheap.h | 91 +++++++++++++++
3 files changed, 347 insertions(+), 1 deletion(-)
create mode 100644 src/backend/lib/simpleheap.c
create mode 100644 src/include/lib/simpleheap.h
Attachments:
0002-Add-minimal-binary-heap-implementation.patchtext/x-patch; name=0002-Add-minimal-binary-heap-implementation.patchDownload+347-1
Features:
- streaming reading/writing
- filtering
- reassembly of records
Reusing the ReadRecord infrastructure in situations where the code that wants
to do so is not tightly integrated into xlog.c is rather hard and would require
changes to rather integral parts of the recovery code which doesn't seem to be
a good idea.
Missing:
- "compressing" the stream when removing uninteresting records
- writing out correct CRCs
- separating reader/writer
---
src/backend/access/transam/Makefile | 2 +-
src/backend/access/transam/xlogreader.c | 1032 +++++++++++++++++++++++++++++++
src/include/access/xlogreader.h | 264 ++++++++
3 files changed, 1297 insertions(+), 1 deletion(-)
create mode 100644 src/backend/access/transam/xlogreader.c
create mode 100644 src/include/access/xlogreader.h
Attachments:
0003-Add-support-for-a-generic-wal-reading-facility-dubbe.patchtext/x-patch; name=0003-Add-support-for-a-generic-wal-reading-facility-dubbe.patchDownload+1297-1
---
src/bin/Makefile | 2 +-
src/bin/xlogdump/Makefile | 25 ++++
src/bin/xlogdump/xlogdump.c | 334 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 360 insertions(+), 1 deletion(-)
create mode 100644 src/bin/xlogdump/Makefile
create mode 100644 src/bin/xlogdump/xlogdump.c
Attachments:
0004-add-simple-xlogdump-tool.patchtext/x-patch; name=0004-add-simple-xlogdump-tool.patchDownload+360-1
This patch is problematic because formally indexes used by syscaches needs to
be unique, this one is not though because of 0/InvalidOids relfilenode entries
for nailed/shared catalog entries. Those values cannot be sensibly queries from
the catalog anyway though (the relmapper infrastructure needs to be used).
It might be nicer to add infrastructure to do this properly, I just don't have
a clue what the best way for this would be.
---
src/backend/utils/cache/syscache.c | 11 +++++++++++
src/include/catalog/indexing.h | 2 ++
src/include/catalog/pg_proc.h | 1 +
src/include/utils/syscache.h | 1 +
4 files changed, 15 insertions(+)
Attachments:
0005-Add-a-new-syscache-to-fetch-a-pg_class-entry-via-rel.patchtext/x-patch; name=0005-Add-a-new-syscache-to-fetch-a-pg_class-entry-via-rel.patchDownload+15-0
This adds a new wal_level value 'logical'
Missing cases:
- heap_multi_insert
- primary key changes for updates
- no primary key
- LOG_NEWPAGE
---
src/backend/access/heap/heapam.c | 135 +++++++++++++++++++++++++++++---
src/backend/access/transam/xlog.c | 1 +
src/backend/catalog/index.c | 74 +++++++++++++++++
src/bin/pg_controldata/pg_controldata.c | 2 +
src/include/access/xlog.h | 3 +-
src/include/catalog/index.h | 4 +
6 files changed, 207 insertions(+), 12 deletions(-)
Attachments:
0006-Log-enough-data-into-the-wal-to-reconstruct-logical-.patchtext/x-patch; name=0006-Log-enough-data-into-the-wal-to-reconstruct-logical-.patchDownload+207-12
Pieces of this are in commit: make relfilenode lookup (tablespace, relfilenode
---
src/backend/utils/cache/inval.c | 2 +-
src/include/utils/inval.h | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)
Attachments:
0007-Make-InvalidateSystemCaches-public.patchtext/x-patch; name=0007-Make-InvalidateSystemCaches-public.patchDownload+3-1
This introduces several things:
* applycache module which reassembles transactions from a stream of interspersed changes
* snapbuilder which builds catalog snapshots so that tuples from wal can be understood
* wal decoding into an applycache
* decode_xlog(lsn, lsn) debugging function
The applycache provides 3 major callbacks:
* apply_begin
* apply_change
* apply_commit
It is missing several parts:
- spill-to-disk
- resource usage controls
- command id handling
- passing of the correct mvcc snapshot (already has it, just doesn't pass)
The snapshot building has the most critical infrastructure but misses several
important features:
* loads of docs about the internals
* improve snapshot building/distributions
* don't build them all the time, cache them
* don't increase ->xmax so slowly, its inefficient
* refcount
* actually free them
* proper cache handling
* we can probably reuse xl_xact_commit->nmsgs
* generate new local inval messages from catalog changes?
* handle transactions with both ddl, and changes
* command_id handling
* combocid loggin/handling
* Add support for declaring tables as catalog tables that are not pg_catalog.*
* properly distribute new SnapshotNow snapshots after a transaction commits
* loads of testing/edge cases
* provision of a consistent snapshot for pg_dump
* spill state to disk at checkpoints
* xmin handling
The xlog decoding also misses several parts:
- HEAP_NEWPAGE support
- HEAP2_MULTI_INSERT support
- handling of table rewrites
---
src/backend/replication/Makefile | 2 +
src/backend/replication/logical/Makefile | 19 +
src/backend/replication/logical/applycache.c | 574 +++++++++++++
src/backend/replication/logical/decode.c | 366 +++++++++
src/backend/replication/logical/logicalfuncs.c | 237 ++++++
src/backend/replication/logical/snapbuild.c | 1045 ++++++++++++++++++++++++
src/backend/utils/time/tqual.c | 161 ++++
src/include/access/transam.h | 5 +
src/include/catalog/pg_proc.h | 3 +
src/include/replication/applycache.h | 239 ++++++
src/include/replication/decode.h | 26 +
src/include/replication/snapbuild.h | 119 +++
src/include/utils/tqual.h | 21 +-
13 files changed, 2816 insertions(+), 1 deletion(-)
create mode 100644 src/backend/replication/logical/Makefile
create mode 100644 src/backend/replication/logical/applycache.c
create mode 100644 src/backend/replication/logical/decode.c
create mode 100644 src/backend/replication/logical/logicalfuncs.c
create mode 100644 src/backend/replication/logical/snapbuild.c
create mode 100644 src/include/replication/applycache.h
create mode 100644 src/include/replication/decode.h
create mode 100644 src/include/replication/snapbuild.h
Attachments:
0008-Introduce-wal-decoding-via-catalog-timetravel.patchtext/x-patch; name=0008-Introduce-wal-decoding-via-catalog-timetravel.patchDownload+2816-1
Hi,
A last note:
A git tree of this is at
git://git.postgresql.org/git/users/andresfreund/postgres.git branch xlog-
decoding-rebasing-cf2
checkout with:
git clone --branch xlog-decoding-rebasing-cf2
git://git.postgresql.org/git/users/andresfreund/postgres.git
Webview:
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-
decoding-rebasing-cf2
That branch will be regularly rebased to a new master,fixes/new features, and
a pgindent run over the new files...
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Saturday, September 15, 2012 03:14:32 AM Andres Freund wrote:
That branch will be regularly rebased to a new master,fixes/new features,
and a pgindent run over the new files...
I fixed up the formatting of the new stuff (xlogreader, ilist are submitted
separately, no point in doing anything there).
pushed to the repo mentioned upthread.
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Now that I proposed a new syscache upthread its easily possible to provide
pg_relation_by_filenode which I wished for multiple times in the past when
looking at filesystem activity and wondering which table does what. You can
sortof get the same result via
SELECT oid FROM (
SELECT oid, pg_relation_filenode(oid::regclass) filenode
FROM pg_class WHERE relkind != 'v'
) map
WHERE map.filenode = ...;
but thats neither efficient nor obvious.
So, two patches to do this:
Did others need this in the past? I can live with the 2nd patch living in a
private extension somewhere. The first one would also be useful for some
error/debug messages during decoding...
Greetings,
Andres
---
src/backend/utils/cache/relmapper.c | 53 +++++++++++++++++++++++++++++++++++++
src/include/utils/relmapper.h | 2 ++
2 files changed, 55 insertions(+)
Attachments:
0001-Add-a-new-relmapper.c-function-RelationMapFilenodeTo.patchtext/x-patch; name=0001-Add-a-new-relmapper.c-function-RelationMapFilenodeTo.patchDownload+55-0
This requires the previously added RELFILENODE syscache.
---
doc/src/sgml/func.sgml | 23 ++++++++++++-
src/backend/utils/adt/dbsize.c | 78 ++++++++++++++++++++++++++++++++++++++++++
src/include/catalog/pg_proc.h | 2 ++
src/include/utils/builtins.h | 1 +
4 files changed, 103 insertions(+), 1 deletion(-)
Attachments:
0002-Add-a-new-function-pg_relation_by_filenode-to-lookup.patchtext/x-patch; name=0002-Add-a-new-function-pg_relation_by_filenode-to-lookup.patchDownload+103-1
Andres Freund <andres@2ndquadrant.com> writes:
This requires the previously added RELFILENODE syscache.
[ raised eyebrow... ] There's a RELFILENODE syscache? I don't see one,
and I doubt it would work given that the contents of
pg_class.relfilenode aren't unique (the zero entries are the problem).
regards, tom lane
On Monday, September 17, 2012 12:35:32 AM Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
This requires the previously added RELFILENODE syscache.
[ raised eyebrow... ] There's a RELFILENODE syscache? I don't see one,
and I doubt it would work given that the contents of
pg_class.relfilenode aren't unique (the zero entries are the problem).
Well, one patch upthread ;). It mentions the problem of it not being unique due
to relfilenode in (reltablespace, relfilenode) being 0 for shared/nailed
catalogs.
I am not really sure yet how big a problem for the caching infrastructure it is
that values that shouldn't ever get queried (because the relfilenode is
actually different) are duplicated. Reading code about all that atm.
Robert suggested writing a specialized cache akin to whats done in
attoptcache.c or such.
I haven't formed an opinion on whats the way forward on that topic. But anyway,
I don't see how the wal decoding stuff can progress without some variant of
that mapping, so I sure hope I/we can build something. Changing that aspect of
the patch should be trivial...
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 15.09.2012 03:39, Andres Freund wrote:
Features:
- streaming reading/writing
- filtering
- reassembly of recordsReusing the ReadRecord infrastructure in situations where the code that wants
to do so is not tightly integrated into xlog.c is rather hard and would require
changes to rather integral parts of the recovery code which doesn't seem to be
a good idea.
My previous objections to this approach still apply. 1. I don't want to
maintain a second copy of the code to read xlog. 2. We should focus on
reading WAL, I don't see the point of mixing WAL writing into this. 3. I
don't like the callback-style API.
I came up with the attached. I moved ReadRecord and some supporting
functions from xlog.c to xlogreader.c, and made it operate on
XLogReaderState instead of global global variables. As discussed before,
I didn't like the callback-style API, I think the consumer of the API
should rather just call ReadRecord repeatedly to get each record. So
that's what I did.
There is still one callback, XLogPageRead(), to obtain a given page in
WAL. The XLogReader facility is responsible for decoding the WAL into
records, but the user of the facility is responsible for supplying the
physical bytes, via the callback.
So the usage is like this:
/*
* Callback to read the page starting at 'RecPtr' into *readBuf. It's
* up to you to do this any way you like. Typically you'd read from a
* file. The WAL recovery implementation of this in xlog.c is more
* complicated. It checks the archive, waits for streaming replication
* etc.
*/
static bool
MyXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr, char
*readBuf, void *private_data)
{
...
}
state = XLogReaderAllocate(&MyXLogPageRead);
while ((record = XLogReadRecord(state, ...)))
{
/* do something with the record */
}
XLogReaderFree(state);
- Heikki
Attachments:
xlogreader-heikki-1.patchtext/x-diff; name=xlogreader-heikki-1.patchDownload+574-508
Hi Heikki,
On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
On 15.09.2012 03:39, Andres Freund wrote:
Features:
- streaming reading/writing
- filtering
- reassembly of recordsReusing the ReadRecord infrastructure in situations where the code that
wants to do so is not tightly integrated into xlog.c is rather hard and
would require changes to rather integral parts of the recovery code
which doesn't seem to be a good idea.My previous objections to this approach still apply. 1. I don't want to
maintain a second copy of the code to read xlog.
Yes. I aggree. And I am willing to provide an implementation of this if should
my xlogreader variant gets a bit more buyin.
2. We should focus on reading WAL, I don't see the point of mixing WAL
writing into this.
If you write something that filters/analyzes and then forwards WAL and you want
to do that without a big overhead (i.e. completely reassembling everything, and
then deassembling it again for writeout) its hard to do that without
integrating both sides.
Also, I want to read records incrementally/partially just as data comes in
which again is hard to combine with writing out the data again.
3. I don't like the callback-style API.
I tried to accomodate to that by providing:
extern XLogRecordBuffer* XLogReaderReadOne(XLogReaderState* state);
which does exactly that.
I came up with the attached. I moved ReadRecord and some supporting
functions from xlog.c to xlogreader.c, and made it operate on
XLogReaderState instead of global global variables. As discussed before,
I didn't like the callback-style API, I think the consumer of the API
should rather just call ReadRecord repeatedly to get each record. So
that's what I did.
The problem with that is that kind of API is that it, at least as far as I can
see, that it never can operate on incomplete/partial input. Your need to buffer
larger amounts of xlog somewhere and you need to be aware of record boundaries.
Both are things I dislike in a more generic user than xlog.c.
There is still one callback, XLogPageRead(), to obtain a given page in
WAL. The XLogReader facility is responsible for decoding the WAL into
records, but the user of the facility is responsible for supplying the
physical bytes, via the callback.
Makes sense.
So the usage is like this:
/*
* Callback to read the page starting at 'RecPtr' into *readBuf. It's
* up to you to do this any way you like. Typically you'd read from a
* file. The WAL recovery implementation of this in xlog.c is more
* complicated. It checks the archive, waits for streaming replication
* etc.
*/
static bool
MyXLogPageRead(XLogReaderState *xlogreader, XLogRecPtr RecPtr, char
*readBuf, void *private_data)
{
...
}state = XLogReaderAllocate(&MyXLogPageRead);
while ((record = XLogReadRecord(state, ...)))
{
/* do something with the record */
}
If you don't want the capability to forward/filter the data and read partial
data without regard for record constraints/buffering your patch seems to be
quite a good start. It misses xlogreader.h though...
Do my aims make any sense to you?
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 17.09.2012 11:12, Andres Freund wrote:
On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
On 15.09.2012 03:39, Andres Freund wrote:
2. We should focus on reading WAL, I don't see the point of mixing WALwriting into this.
If you write something that filters/analyzes and then forwards WAL and you want
to do that without a big overhead (i.e. completely reassembling everything, and
then deassembling it again for writeout) its hard to do that without
integrating both sides.
It seems really complicated to filter/analyze WAL records without
reassembling them, anyway. The user of the facility is in charge of
reading the physical data, so you can still access the raw data, for
forwarding purposes, in addition to the reassembled records.
Or what exactly do you mean by "completely deassembling"? I read that to
mean dealing with page boundaries, ie. if a record is split across
pages, copy parts into a contiguous temporary buffer.
Also, I want to read records incrementally/partially just as data comes in
which again is hard to combine with writing out the data again.
You mean, you want to start reading the first half of a record, before
the 2nd half is available? That seems complicated. I'd suggest keeping
it simple for now, and optimize later if necessary. Note that before you
have the whole WAL record, you cannot CRC check it, so you don't know if
it's in fact a valid WAL record.
I came up with the attached. I moved ReadRecord and some supporting
functions from xlog.c to xlogreader.c, and made it operate on
XLogReaderState instead of global global variables. As discussed before,
I didn't like the callback-style API, I think the consumer of the API
should rather just call ReadRecord repeatedly to get each record. So
that's what I did.The problem with that is that kind of API is that it, at least as far as I can
see, that it never can operate on incomplete/partial input. Your need to buffer
larger amounts of xlog somewhere and you need to be aware of record boundaries.
Both are things I dislike in a more generic user than xlog.c.
I don't understand that argument. A typical large WAL record is split
across 1-2 pages, maybe 3-4 at most, for an index page split record.
That doesn't feel like much to me. In extreme cases, a WAL record can be
much larger (e.g a commit record of a transaction with a huge number of
subtransactions), but that should be rare in practice.
The user of the facility doesn't need to be aware of record boundaries,
that's the responsibility of the facility. I thought that's exactly the
point of generalizing this thing, to make it unnecessary for the code
that uses it to be aware of such things.
If you don't want the capability to forward/filter the data and read partial
data without regard for record constraints/buffering your patch seems to be
quite a good start. It misses xlogreader.h though...
Ah sorry, patch with xlogreader.h attached.
- Heikki
Attachments:
xlogreader-heikki-2.patchtext/x-diff; name=xlogreader-heikki-2.patchDownload+675-508
On Monday, September 17, 2012 10:30:35 AM Heikki Linnakangas wrote:
On 17.09.2012 11:12, Andres Freund wrote:
On Monday, September 17, 2012 09:40:17 AM Heikki Linnakangas wrote:
On 15.09.2012 03:39, Andres Freund wrote:
2. We should focus on reading WAL, I don't see the point of mixing WALwriting into this.
If you write something that filters/analyzes and then forwards WAL and
you want to do that without a big overhead (i.e. completely reassembling
everything, and then deassembling it again for writeout) its hard to do
that without integrating both sides.It seems really complicated to filter/analyze WAL records without
reassembling them, anyway. The user of the facility is in charge of
reading the physical data, so you can still access the raw data, for
forwarding purposes, in addition to the reassembled records.
It works ;)
Or what exactly do you mean by "completely deassembling"? I read that to
mean dealing with page boundaries, ie. if a record is split across
pages, copy parts into a contiguous temporary buffer.
Well, if you want to fully split reading and writing of records - which is a
nice goal! - you basically need the full logic of XLogInsert again to split
them apart again to write them. Alternatively you need to store record
boundaries somewhere and copy that way, but in the end if you filter you need
to correct CRCs...
Also, I want to read records incrementally/partially just as data comes
in which again is hard to combine with writing out the data again.You mean, you want to start reading the first half of a record, before
the 2nd half is available? That seems complicated.
Well, I just can say again: It works ;). Makes it easy to follow something like
XLogwrtResult without taking care about record boundaries.
I'd suggest keeping it simple for now, and optimize later if necessary.
Well, yes. The API should be able to comfortably support those cases though
which I don't think is neccesarily the case in a simple, one call API as
proposed.
Note that before you have the whole WAL record, you cannot CRC check it, so
you don't know if it's in fact a valid WAL record.
Sure. But you can start the CRC computation without any problems and finish it
when the last part of the data comes in.
I came up with the attached. I moved ReadRecord and some supporting
functions from xlog.c to xlogreader.c, and made it operate on
XLogReaderState instead of global global variables. As discussed before,
I didn't like the callback-style API, I think the consumer of the API
should rather just call ReadRecord repeatedly to get each record. So
that's what I did.The problem with that is that kind of API is that it, at least as far as
I can see, that it never can operate on incomplete/partial input. Your
need to buffer larger amounts of xlog somewhere and you need to be aware
of record boundaries. Both are things I dislike in a more generic user
than xlog.c.I don't understand that argument. A typical large WAL record is split
across 1-2 pages, maybe 3-4 at most, for an index page split record.
That doesn't feel like much to me. In extreme cases, a WAL record can be
much larger (e.g a commit record of a transaction with a huge number of
subtransactions), but that should be rare in practice.
Well, imagine something like the walsender that essentially follows the flush
position ideally without regard for record boundaries. It is nice to be able to
send/analyze/filter as soon as possible without waiting till a page is full.
And it sure would be nice to be able to read the data on the other side
directly from the network, decompress it again, and only then store it to disk.
The user of the facility doesn't need to be aware of record boundaries,
that's the responsibility of the facility. I thought that's exactly the
point of generalizing this thing, to make it unnecessary for the code
that uses it to be aware of such things.
With the proposed API it seems pretty much a requirement to wait inside the
callback. Thats not really nice if your process has other things to wait for as
well.
In my proposal you can simply do something like:
XLogReaderRead(state);
DoSomeOtherWork();
if (CheckForForMessagesFromWalreceiver())
ProcessMessages();
else if (state->needs_input)
UseLatchOrSelectOnInputSocket();
else if (state->needs_output)
UseSelectOnOutputSocket();
but you can also do something like waiting on a Latch but *also* on other fds.
If you don't want the capability to forward/filter the data and read
partial data without regard for record constraints/buffering your patch
seems to be quite a good start. It misses xlogreader.h though...Ah sorry, patch with xlogreader.h attached.
Will look at it in a second.
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services