Catalog/Metadata consistency during changeset extraction from wal

Started by Andres Freund · almost 14 years ago · 39 messages · pgsql-hackers
#1 Andres Freund
andres@anarazel.de

Hi Robert, Hi all,

Robert and I talked quite a bit about different methods of providing enough
information to extract tuples from wal. I don't think either of us is yet
really convinced of any individual method, so I want to recap our discussion
in one email so others can chime in without reading the already huge thread.

I hope I am not misrepresenting Robert's opinion here, but I am sure he will
correct me if I do ;)

To satisfy the different needs people have for changeset extraction we
currently think that for each individual tuple extracted from wal we need to
provide the following information:

a) one or more pieces of tuple data (HeapTupleData structs)
* INSERT: full new tuple
* UPDATE: full new tuple, old pkey (optionally the full old tuple)
* DELETE: old pkey (optionally the full old tuple)
b) the action performed (INSERT|UPDATE|DELETE)
c) the table on which the action was performed
d) access to the table structure (names, column types, ...) procs (*_out
functions for the individual columns)

The problem with getting that data is that at the point where we decode the WAL
the catalog may have evolved significantly from the state it was in when the
tuple was put into the WAL.
We can extract a) and b) without any problems (let's not talk about that here)
but we don't necessarily know how to make sense of the data, because a
HeapTuple cannot be properly interpreted without the knowledge of c) and d).
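To make a)-d) concrete, the information a decoded change would need to carry could be modeled roughly as below. This is an illustrative Python sketch, not actual PostgreSQL code; all names are invented:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Action(Enum):          # b) the action performed
    INSERT = "INSERT"
    UPDATE = "UPDATE"
    DELETE = "DELETE"

@dataclass
class DecodedChange:
    action: Action
    table: str                          # c) the table acted upon
    new_tuple: Optional[bytes] = None   # a) full new tuple (INSERT/UPDATE)
    old_key: Optional[bytes] = None     # a) old pkey (UPDATE/DELETE)
    # d) is the hard part: interpreting new_tuple/old_key requires the
    # table structure and the columns' *_out procs as of the moment the
    # WAL record was written, not as of decoding time.

change = DecodedChange(Action.UPDATE, "foo", new_tuple=b"...", old_key=b"...")
```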

I am of the opinion that c) is basically equivalent to solving d), because the
WAL only contains the (pg_database.oid, pg_tablespace.oid,
pg_class.relfilenode) triple of the table and not the pg_class.oid. The relfilenode
is changed by operations that rewrite the table, like ALTER TABLE ADD COLUMN
... DEFAULT ...; TRUNCATE; CLUSTER; and some others.

A single transaction can contain tuples for different relfilenodes and with
different columns:

CREATE TABLE foo(id serial primary key, data text);
BEGIN;
INSERT INTO foo ...;
TRUNCATE foo;
INSERT INTO foo ...; -- same structure, different relfilenode

ALTER TABLE foo ADD COLUMN bar text;
INSERT INTO foo ...; -- same relfilenode, different table structure

ALTER TABLE foo ADD COLUMN zaphod text DEFAULT '';
INSERT INTO foo ...; -- different relfilenode, different table structure
COMMIT;

There are several additional complex scenarios.

In http://archives.postgresql.org/message-id/201206192023.20589.andres@2ndquadrant.com
I listed which options I see for reaching that goal.

A refined version of that list:

1.)
Decode on a different, possibly catalog-only, pg instance kept in sync using
the command trigger infrastructure (but not necessarily user-level defined
command triggers)

If the command/event trigger infrastructure logs into a system catalog table,
keeping the catalog in the correct state is relatively easy: when
replaying/converting a reassembled transaction, every time an INSERT into that
system table happens, the contained DDL gets performed.
The locking on the generating side takes care of the concurrency aspects.

Advantages:
* minimal overhead (space, performance)
* allows additional tables/indexes/triggers if you take care with oid
allocation
* easy transactionally correct catalog behaviour
* the decoding instance can be used to store all data in a highly efficient
manner (no decoding, no full detoasting, ...)
* the decoding instance is fully writable without problems if you don't
generate conflicts (separate tables, non-overlapping writes, whatever)
* implementable in a pretty unintrusive way

Disadvantages:
* the table structure of replicated tables needs to be exactly the same
* the type definitions + support procs need to be similar enough to read the
data
* error checking of the above isn't easy but probably possible
* full version/architecture compatibility required
* a proxy instance is required even if you want to replicate into some other
system/architecture/version

2.)
Keep the decoding site up to date by replicating the catalog via the normal HS
recovery mechanisms.

Advantages:
* most of the technology is already there
* minimal overhead (space, performance)
* no danger of out of sync catalogs
* no support required for command triggers that can keep a catalog in sync,
including oids

Disadvantages:
* driving the catalog recovery that way requires some somewhat intricate code
as it needs to be done in lockstep with decoding the wal-stream
* requires an additional feature to guarantee HS always has enough information
to be queryable after a crash/shutdown
* some complex logic/low-level fudging required to keep the transactional
behaviour sensible when querying the catalog
* full version/architecture compatibility required
* the decoding site will only ever be read-only

3)
Multi-Versioned catalog

Below are two possible implementation strategies for that concept

Advantages:
* Decoding is done on the master in an asynchronous fashion
* low overhead during normal DML execution, not much additional code in that
path
* can be very efficient if architecture/version are the same
* version/architecture compatibility can be done transparently by falling back
to textual versions on mismatch

Disadvantages:
* decoding probably has to happen on the master, which might not be what people
want performance-wise

3a)
Change the system catalogs to be versioned

Advantages:
* catalog access is easy
* might be interesting for other users

Disadvantages:
* catalog versioning is complex to implement
* space overhead for all users, even without using logical replication
* I can't see -hackers signing off

3b)
Ensure that enough information in the catalog remains by fudging the xmin
horizon. Then reassemble an appropriate snapshot to read the catalog as the
tuple in question has seen it.

Advantages:
* should be implementable with low impact to general code

Disadvantages:
* requires some complex code for assembling snapshots
* it might be hard to guarantee that we always have enough information to
reassemble a snapshot (subxid overflows ...)
* impacts vacuum if replication to some site is slow
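Reassembling "an appropriate snapshot" for 3b) boils down to answering the standard MVCC visibility question for catalog rows. A loose Python sketch of the core rule, ignoring subtransactions, ComboCIDs, hint bits and explicit commit-status checks (any xid below the snapshot's xmax that isn't in the running set is assumed committed; all names are invented for illustration):

```python
def tuple_visible(xmin, xmax, snapshot_xmax, running_xids):
    """Simplified MVCC rule: a catalog tuple version is visible if its
    inserting transaction committed before the snapshot was taken and
    the tuple was not yet deleted as of that snapshot."""
    # The inserter must have committed before the snapshot.
    if xmin >= snapshot_xmax or xmin in running_xids:
        return False
    # Not deleted, or deleted by a transaction the snapshot doesn't see.
    if xmax is None:
        return True
    return xmax >= snapshot_xmax or xmax in running_xids

# A catalog row inserted by xid 90 and never deleted, as seen by a snapshot
# where xids >= 100 are in the future and xid 95 was still running:
assert tuple_visible(90, None, 100, {95})
# The same row deleted by already-committed xid 96 is no longer visible:
assert not tuple_visible(90, 96, 100, {95})
```

The difficulty the option describes is not this rule itself but guaranteeing (by holding back the xmin horizon) that the tuple versions the rule needs are still present.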

4.)
Log enough information in the walstream to make decoding possible using only
the walstream.

Advantages:
* Decoding can optionally be done on the master
* No catalog syncing/access required
* it's possible to make this architecture-independent

Disadvantages:
* high to very high implementation overhead depending on efficiency aims
* high space overhead in the wal because at least all the catalog information
needs to be logged in a transactional manner repeatedly
* misuses wal far more than other methods
* significant new complexity in somewhat critical code paths (heapam.c)
* insanely high space overhead if the decoding should be
architecture-independent

5.)
The actually good idea. Yours?

-----

I think 3a) is not likely to fly and I think 4) is too complex, although Robert
isn't convinced of the latter argument.

In my opinion either 3b) or 1) is our best option, because they seem to
support most of the use cases without huge costs in complexity and
runtime/space for users not using changeset extraction. 3b) seems
preferable because IMO it's the most flexible choice and doesn't require a
second instance.

Any arguments against/for those?

I am going to talk about implementing 3b) in a separate email in a bit.

Greetings,

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#2 Florian Pflug
fgp@phlo.org
In reply to: Andres Freund (#1)
Re: Catalog/Metadata consistency during changeset extraction from wal

On Jun21, 2012, at 13:41 , Andres Freund wrote:

3b)
Ensure that enough information in the catalog remains by fudging the xmin
horizon. Then reassemble an appropriate snapshot to read the catalog as the
tuple in question has seen it.

The ComboCID machinery makes that quite a bit harder, I fear. If a tuple is
updated multiple times by the same transaction, you cannot decide whether a
tuple was visible in a certain snapshot unless you have access to the updating
backend's ComboCID hash.
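To illustrate why this bites: a ComboCID is a backend-local index into a hash that maps to the real (cmin, cmax) pair, so a decoder that only sees the combo value stored in the tuple header cannot compare it against a snapshot's command id. A simplified, hypothetical Python sketch:

```python
# Backend-local state: combo id -> (cmin, cmax). This mapping lives only in
# the updating backend's memory; it is never written to WAL or to disk.
combocid_map = {0: (4, 7)}   # combo 0: "inserted at cmd 4, deleted at cmd 7"

def cmin_of(tuple_cid, is_combo):
    """Recover the real cmin from the cid field of a tuple header."""
    if not is_combo:
        return tuple_cid
    # Without access to the updating backend's hash, this lookup is
    # impossible for an external decoder -- which is the point above.
    return combocid_map[tuple_cid][0]

assert cmin_of(4, is_combo=False) == 4
assert cmin_of(0, is_combo=True) == 4
```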

best regards,
Florian Pflug

#3 Florian Pflug
fgp@phlo.org
In reply to: Andres Freund (#1)
Re: Catalog/Metadata consistency during changeset extraction from wal

On Jun21, 2012, at 13:41 , Andres Freund wrote:

5.)
The actually good idea. Yours?

What about a mixture of (3b) and (4), which writes the data not to the WAL
but to a separate logical replication log? More specifically:

There's a per-backend queue of change notifications.

Whenever a non-catalog tuple is modified, we queue a TUPLE_MODIFIED
record containing (xid, databaseoid, tableoid, old xmin, old ctid, new ctid)

Whenever a table (or something that a table depends on) is modified we
wait until all references to that table's oid have vanished from the queue,
then queue a DDL record containing (xid, databaseoid, tableoid, text). Other
backend cannot concurrently add further TUPLE_MODIFIED records since we alreay
hold an exclusive lock on the table at that point.

A background process continually processes these queues. If the front of the
queue is a TUPLE_MODIFIED record, it fetches the old and the new tuple
based on their ctids and writes the old tuple's PK and the full new tuple
to the logical replication log. Since table modifications always wait for
all previously queued TUPLE_MODIFIED records referencing that table to be
processed *before* altering the catalog, tuples can always be interpreted
according to the current (SnapshotNow) catalog contents.

Upon transaction COMMIT and ROLLBACK, we queue COMMIT and ROLLBACK records,
which are also written to the log by the background process. The background
process may decide to wait until a backend commits before processing that
backend's log. In that case, rolled-back transactions don't leave a trace in
the logical replication log. Should a backend, however, issue a DDL statement,
the background process *must* process that backend's queue immediately, since
otherwise there's a deadlock.

The background process also maintains a value in shared memory which
contains the oldest value in any of the queue's xid or "old xmin" fields.
VACUUM and the like must not remove tuples whose xmin is >= that value.
Hint bits *may* be set for newer tuples though, provided that the background
process ignores hint bits when fetching the old and new tuples.
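The queue discipline described above could be sketched roughly like this (an illustrative toy, record layouts and all names invented; the real thing would live in shared memory and fetch tuples by ctid):

```python
from collections import deque

queue = deque()   # per-backend change-notification queue (simplified: one queue)
log = []          # the logical replication log written by the background process

def process_one():
    """Background process step: in reality this would fetch the old/new
    tuples by ctid and write the old PK plus the full new tuple."""
    log.append(queue.popleft())

def on_tuple_modified(xid, dboid, tableoid, old_xmin, old_ctid, new_ctid):
    queue.append(("TUPLE_MODIFIED", xid, dboid, tableoid,
                  old_xmin, old_ctid, new_ctid))

def on_ddl(xid, dboid, tableoid, ddl_text):
    # Wait until all references to this table's oid have left the queue;
    # here we simply drain, as the background process would have to.
    while any(r[0] == "TUPLE_MODIFIED" and r[3] == tableoid for r in queue):
        process_one()
    queue.append(("DDL", xid, dboid, tableoid, ddl_text))

on_tuple_modified(10, 1, 42, 5, (0, 1), (0, 2))
on_ddl(11, 1, 42, "ALTER TABLE foo ADD COLUMN bar text")
# The tuple change was forced out before the DDL record was queued, so the
# background process can always interpret tuples with the pre-DDL catalog.
```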

best regards,
Florian Pflug

#4 Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#1)
Re: Catalog/Metadata consistency during changeset extraction from wal

On 21 June 2012 12:41, Andres Freund <andres@2ndquadrant.com> wrote:

3)
Multi-Versioned catalog

Below are two possible implementation strategies for that concept

Advantages:
* Decoding is done on the master in an asynchronous fashion
* low overhead during normal DML execution, not much additional code in that
path
* can be very efficient if architecture/version are the same
* version/architecture compatibility can be done transparently by falling back
to textual versions on mismatch

Disadvantages:
* decoding probably has to happen on the master which might not be what people
want performancewise

3a)
Change the system catalogs to be versioned

Advantages.
* catalog access is easy
* might be interesting for other users

Disadvantages:
* catalog versioning is complex to implement
* space overhead for all users, even without using logical replication
* I can't see -hackers signing off

Hmm, there's all sorts of stuff mixed up there in your description.

ISTM we should maintain a lookup table on target system that has the
minimal required information on it.

There is no need to version the whole catalog. (Complete overkill - I
would oppose it ;-)

If we keep the lookup table on the target as a normal table, we can
insert new rows into it as changes occur. If we need to perform
recovery then the earlier-version rows will still be there and we just
use those. Versioning is easy to implement: just use the LSN as an additional
key in the table, then look up based on key and LSN. If a transaction
that makes DDL changes aborts, then the changes will be automatically
backed out.
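The LSN-keyed lookup amounts to "find the newest version at or before this LSN". A minimal Python sketch of that rule (in SQL terms it would be something like `WHERE key = ? AND lsn <= ? ORDER BY lsn DESC LIMIT 1`; the table contents here are invented):

```python
# (key, lsn) -> row; a new version is inserted whenever DDL happens,
# existing versions are never updated in place.
lookup = {
    ("foo", 100): {"columns": ["id", "data"]},
    ("foo", 250): {"columns": ["id", "data", "bar"]},
}

def lookup_as_of(key, target_lsn):
    """Return the row version with the greatest lsn <= target_lsn."""
    candidates = [(lsn, row) for (k, lsn), row in lookup.items()
                  if k == key and lsn <= target_lsn]
    if not candidates:
        return None
    return max(candidates)[1]

assert lookup_as_of("foo", 180) == {"columns": ["id", "data"]}
assert lookup_as_of("foo", 300) == {"columns": ["id", "data", "bar"]}
```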

Only keep the lookup table if using logical replication, so zero
overhead otherwise. We just need to set up the initial state carefully,
so it matches what's in the database, but that sounds OK.

So I don't see any of the disadvantages you have there. It's just darn
simple, and hence will probably work. It's also a very similar
solution to the other lookups required in memory by the apply process.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#5 Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#4)
Re: Catalog/Metadata consistency during changeset extraction from wal

Hi,

On Thursday, June 21, 2012 04:39:21 PM Simon Riggs wrote:

On 21 June 2012 12:41, Andres Freund <andres@2ndquadrant.com> wrote:

3)
Multi-Versioned catalog

Below are two possible implementation strategies for that concept

Advantages:
* Decoding is done on the master in an asynchronous fashion
* low overhead during normal DML execution, not much additional code in
that path
* can be very efficient if architecture/version are the same
* version/architecture compatibility can be done transparently by falling
back to textual versions on mismatch

Disadvantages:
* decoding probably has to happen on the master which might not be what
people want performancewise

3a)
Change the system catalogs to be versioned

Advantages.
* catalog access is easy
* might be interesting for other users

Disadvantages:
* catalog versioning is complex to implement
* space overhead for all users, even without using logical replication
* I can't see -hackers signing off

Hmm, there's all sorts of stuff mixed up there in your description.

Sure, I tried to compress a complex topic discussed in a long thread ;)

ISTM we should maintain a lookup table on target system that has the
minimal required information on it.

You need just about the whole catalog because the *_out procs might need to
look up types, operators and such again.
Unless you want to rewrite those functions you need to provide a normal
execution environment.

I don't see how your idea works because of that? Am I missing something?

Yes, that would be easier if we didn't want to support conversion to text and
similar, but I don't see that flying. And even if it were acceptable, you
would need enough information to construct a btree ScanKey, which means
you already need a lot of the catalogs.

There is no need to version the whole catalog. (Complete overkill - I
would oppose it ;-)

Hey, that originally was your idea :P. But I definitely agree, it's not a good
idea.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#6 Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#5)
Re: Catalog/Metadata consistency during changeset extraction from wal

On 21 June 2012 15:53, Andres Freund <andres@2ndquadrant.com> wrote:

ISTM we should maintain a lookup table on target system that has the
minimal required information on it.

You need just about the whole catalog because the *_out procs might need to
lookup types, operators and such again.
Unless you want to rewrite those functions you need to provide a normal
execution environment.

OK, so it's more tables than I first thought, but it's not all rows and
columns of all catalog tables.

I don't see how your idea works because of that? Am I missing something?

Why does the number/size of the tables required make that not work?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#7 Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#6)
Re: Catalog/Metadata consistency during changeset extraction from wal

On Thursday, June 21, 2012 05:05:04 PM Simon Riggs wrote:

On 21 June 2012 15:53, Andres Freund <andres@2ndquadrant.com> wrote:

ISTM we should maintain a lookup table on target system that has the
minimal required information on it.

You need just about the whole catalog because the *_out procs might need
to lookup types, operators and such again.
Unless you want to rewrite those functions you need to provide a normal
execution environment.

OK, so its more tables than I first thought, but its not all rows and
columns of all catalog tables.

Sure, there are a few you probably can leave out (pg_database, pg_auth*,
pg_*acl, pg_(sh)?depend, pg_tablespace, ...) but it's not many.

I don't see how your idea works because of that? Am I missing something?

Why does the number/size of the tables required make that not work?

The number of tables itself isn't a fundamental problem although it would make
stuff harder.
The problem is that the out functions expect a normal operating environment
and might e.g. do catalog lookups themselves. I don't see how we can do
anything here without providing a (nearly) full catalog.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#8 Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#7)
Re: Catalog/Metadata consistency during changeset extraction from wal

On 21 June 2012 16:13, Andres Freund <andres@2ndquadrant.com> wrote:

On Thursday, June 21, 2012 05:05:04 PM Simon Riggs wrote:

On 21 June 2012 15:53, Andres Freund <andres@2ndquadrant.com> wrote:

ISTM we should maintain a lookup table on target system that has the
minimal required information on it.

You need just about the whole catalog because the *_out procs might need
to lookup types, operators and such again.
Unless you want to rewrite those functions you need to provide a normal
execution environment.

OK, so its more tables than I first thought, but its not all rows and
columns of all catalog tables.

Sure, there are a few you probably can leave out (pg_database, pg_auth*,
pg_*acl, pg_(sh)?depend, pg_database, pg_tablespace, ...) but its not many.

That's a start. Leaving out the shared catalogs makes me smile already.

I don't see how your idea works because of that? Am I missing something?

Why does the number/size of the tables required make that not work?

The number of tables itself isn't a fundamental problem although it would make
stuff harder.
The problem is that the out functions expect a normal operating environment
and might e.g. do catalog lookups themselves. I don't see how we can do
anything here without providing a (nearly) full catalog.

I accept that there could be pathological functions in there. We're
not trying to make it work with any conceivable datatype/operator, so
forcing logical replication to follow sensible rules makes sense. Are
there any out functions that anybody uses that do that?

It's too much change to actually version the main catalog. Keeping a
separate copy of a versioned catalog for use by replication sounds
much more likely to fly.

In any case, I think we'll have to go back through the list and do
more work on evaluation. When the options look like that, it's typical
to have ruled out the final winner early on, but that doesn't mean it
isn't in there somewhere.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#9 Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#8)
Re: Catalog/Metadata consistency during changeset extraction from wal

On Thursday, June 21, 2012 05:25:41 PM Simon Riggs wrote:

On 21 June 2012 16:13, Andres Freund <andres@2ndquadrant.com> wrote:

On Thursday, June 21, 2012 05:05:04 PM Simon Riggs wrote:

On 21 June 2012 15:53, Andres Freund <andres@2ndquadrant.com> wrote:

ISTM we should maintain a lookup table on target system that has the
minimal required information on it.

You need just about the whole catalog because the *_out procs might
need to lookup types, operators and such again.
Unless you want to rewrite those functions you need to provide a
normal execution environment.

OK, so its more tables than I first thought, but its not all rows and
columns of all catalog tables.

Sure, there are a few you probably can leave out (pg_database, pg_auth*,
pg_*acl, pg_(sh)?depend, pg_database, pg_tablespace, ...) but its not
many.

That's a start. Leaving out the shared catalogs makes me smile already.

I don't see how your idea works because of that? Am I missing
something?

Why does the number/size of the tables required make that not work?

The number of tables itself isn't a fundamental problem although it would
make stuff harder.
The problem is that the out functions expect a normal operating
environment and might e.g. do catalog lookups themselves. I don't see
how we can do anything here without providing a (nearly) full catalog.

I accept that there could be pathological functions in there. We're
not trying to make it work with any conceivable datatype/operator, so
forcing logical replication to follow sensible rules makes sense. Are
there any out functions that anybody uses that do that?

Loads. enum_out, record_out, array_out are examples I can think of without
even looking. I am pretty sure there are more. But IMO this list already shows
it's prohibitive.

It's too much change to actually version the main catalog. Keeping a
separate copy of a versioned catalog for use by replication sounds
much more likely to fly.

I don't yet see how that should work, given that oids and everything are quite
possibly hardcoded in those functions. You could start switching out the
catalogs at a lower level, but I think at that point it's getting too ugly.

In any case, I think we'll have to go back through the list and do
more work on evaluation. When the options look like that, its typical
to have ruled out the final winner early on, but that doesn't mean it
isn't in there somewhere.

I hope we have, but I am not convinced that there is an elegant solution...

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10 Andres Freund
andres@anarazel.de
In reply to: Florian Pflug (#2)
Re: Catalog/Metadata consistency during changeset extraction from wal

On Thursday, June 21, 2012 03:56:54 PM Florian Pflug wrote:

On Jun21, 2012, at 13:41 , Andres Freund wrote:

3b)
Ensure that enough information in the catalog remains by fudging the xmin
horizon. Then reassemble an appropriate snapshot to read the catalog as
the tuple in question has seen it.

The ComboCID machinery makes that quite a bit harder, I fear. If a tuple is
updated multiple times by the same transaction, you cannot decide whether a
tuple was visible in a certain snapshot unless you have access to the
updating backend's ComboCID hash.

That's a very good point. Not sure how I forgot that.

I think it might be possible to reconstruct a sensible ComboCID mapping from
the WAL stream. Let me think about it for a while...

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#11 Andres Freund
andres@anarazel.de
In reply to: Florian Pflug (#3)
Re: Catalog/Metadata consistency during changeset extraction from wal

On Thursday, June 21, 2012 04:05:54 PM Florian Pflug wrote:

On Jun21, 2012, at 13:41 , Andres Freund wrote:

5.)
The actually good idea. Yours?

What about a mixure of (3b) and (4), which writes the data not to the WAL
but to a separate logical replication log. More specifically:

There's a per-backend queue of change notifications.

Whenever a non-catalog tuple is modified, we queue a TUPLE_MODIFIED
record containing (xid, databaseoid, tableoid, old xmin, old ctid, new
ctid)

Whenever a table (or something that a table depends on) is modified we
wait until all references to that table's oid have vanished from the queue,
then queue a DDL record containing (xid, databaseoid, tableoid, text).
Other backend cannot concurrently add further TUPLE_MODIFIED records since
we alreay hold an exclusive lock on the table at that point.

A background process continually processes these queues. If the front of
the queue is a TUPLE_MODIFIED record, it fetches the old and the new tuple
based on their ctids and writes the old tuple's PK and the full new tuple
to the logical replication log. Since table modifications always wait for
all previously queued TUPLE_MODIFIED records referencing that table to be
processes *before* altering the catalog, tuples can always be interpreted
according to the current (SnapshotNow) catalog contents.

Upon transaction COMMIT and ROLLBACK, we queue COMMIT and ROLLBACK records,
which are also written to the log by the background process. The background
process may decide to wait until a backend commits before processing that
backend's log. In that case, rolled back transaction don't leave a trace in
the logical replication log. Should a backend, however, issue a DDL
statement, the background process *must* process that backend's queue
immediately, since otherwise there's a dead lock.

The background process also maintains a value in shared memory which
contains the oldest value in any of the queue's xid or "old xmin" fields.
VACUUM and the like must not remove tuples whose xmin is >= that value.
Hit bits *may* be set for newest tuples though, provided that the
background process ignores hint bits when fetching the old and new tuples.

I think that's too complicated to fly. Getting that to recover cleanly in case
of a crash would mean you'd need another WAL.

I think if it comes to that, going for 1) is more realistic...

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#12 Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#1)
Re: Catalog/Metadata consistency during changeset extraction from wal

Andres Freund Sent: Thursday, June 21, 2012 5:11 PM

4.)
Log enough information in the walstream to make decoding possible using only
the walstream.

What I understood is that "enough information" means the catalog data. Is that
right, or something else?

Advantages:
* Decoding can optionally be done on the master
* No catalog syncing/access required
* its possible to make this architecture independent

Disadvantage:
* high to very high implementation overhead depending on efficiency aims
* high space overhead in the wal because at least all the catalog information

The multi-versioned approach will also have space overhead to maintain
multiple versions, irrespective of which approach you use.

needs to be logged in a transactional manner repeatedly

Why does it need to be logged repeatedly? Once we log the catalog
information in the WAL, we can disallow/block other DDL during that
time, and after that, changes to the catalog information can be
retrieved from the WAL alone.

* misuses wal far more than other methods

What is the misuse in this? I believe it can later be used for
log-mining purposes as well.

* significant new complexity in somewhat critical code paths (heapam.c)
* insanely high space overhead if the decoding should be possible
architecture-independent

Option 4 seems better compared to the others with respect to the top-level
approach to solving the problem.
Some other databases also use a similar approach for use cases similar to
what you have described.

With Regards,
Amit Kapila.

#13 Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#1)
Re: Catalog/Metadata consistency during changeset extraction from wal

On 21 June 2012 12:41, Andres Freund <andres@2ndquadrant.com> wrote:

2.)
Keep the decoding site up2date by replicating the catalog via normal HS
recovery
mechanisms.

Advantages:
* most of the technology is already there
* minimal overhead (space, performance)
* no danger of out of sync catalogs
* no support for command triggers required that can keep a catalog in sync,
including oids

Disadvantages:
* driving the catalog recovery that way requires some somewhat intricate code
as it needs to be done in lockstep with decoding the wal-stream
* requires an additional feature to guarantee HS always has enough information
to be queryable after a crash/shutdown
* some complex logic/low-level fudging required to keep the transactional
behaviour sensible when querying the catalog
* full version/architecture compatibility required
* the decoding site will always ever be only readable

My initial reaction was "this won't work", but that requires
qualification since this is a complex topic: you can use this
approach as long as you realise that the catalog it gives you can never be
rewound.

So the generic approach to "construct me a catalog as of this LSN"
would need to start with a base backup of the catalog and then roll
forward to the appropriate LSN. Which means a generic user of this
approach would need to be able to construct an initial catalog using a
PITR.

Constructing a decoding site requires you to
a) take a partial base backup of the catalog
b) apply WAL records to bring that forwards to the correct LSN, which
would require some alteration of the recovery code to skip the files
missing in a)

So taking the approach of a decoding site means we have to modify
recovery code, and even when we do that we still end up with a
difficult-to-deploy option in the real world. Difficult to deploy
because we need a whole new instance of Postgres, plus we need all of
the WAL files, which could easily be impractical.

The overall approach is good, but the architecture is wrong. What we
need is a "catalog base backup" and a means of rolling forward to the
appropriate LSN. Rolling forward using WAL is too bulky, so we need a
separate log of DDL changes to the catalog. So what we need is a
"catalog base backup" plus a "ddl-log".

And we need to be able to reconstruct the correct catalog on the
target server directly.

To translate the WAL we maintain a secondary set of catalog tables,
which only exist for logical replication. These are normal,
non-versioned tables, but held in a new schema pg_logical or similar.
One reason why this must be a secondary catalog is to allow the
translation to take place on the target server, and to allow
translation of WAL from a prior version of the catalog - so we can
allow online upgrades across different catalog versions (and possibly
major versions).

The secondary catalog is used in place of the normal catalog during
InitCatalogCache() in the apply process. All the normal caches exist,
they just point to secondary relations rather than the normal ones.

When we initialise replication we take a copy of the appropriate
tables, columns and rows in a catalog-base-backup, using something
like pg_dump. Overall, this is much smaller than normal catalog since
it avoids temp tables, and anything not related to WAL translation.

On each non-temp change to the database we record changes as SQL in
the ddl-log, together with the LSN of the change.

When the number of changes in the ddl-log hits a limit, we take a new
catalog-base-backup. This process is similar to a checkpoint, but much
less frequent; let's call it a ddl-checkpoint.

When we start to read WAL logs to translate them, we start by
truncating/re-bootstrapping and reloading the secondary catalog from
the base backup. We then apply all changes from the ddl-log (which is
just a sequence of SQL statements) up until the LSN at the start of
WAL. The secondary catalog is then an exact copy of the catalog as of
that LSN.

As we read through the WAL we apply further changes to the secondary catalog
so that it stays in lockstep with the WAL we are currently reading.
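The reconstruction procedure above could be sketched as follows (illustrative Python; the catalog is a toy dict and `apply_ddl` stands in for executing a logged SQL statement — all names and values are invented):

```python
def rebuild_catalog_as_of(base_backup, ddl_log, target_lsn):
    """Rebuild the secondary catalog: start from a base backup taken at or
    before target_lsn, then replay the ddl-log entries up to target_lsn."""
    # Truncate/re-bootstrap and reload from the base backup (deep copy).
    catalog = {name: list(cols) for name, cols in base_backup["tables"].items()}
    for lsn, apply_ddl in sorted(ddl_log):          # replay in LSN order
        if base_backup["lsn"] < lsn <= target_lsn:
            apply_ddl(catalog)
    return catalog

# Hypothetical state: backup taken at LSN 100, two DDL statements logged later.
backup = {"lsn": 100, "tables": {"foo": ["id", "data"]}}
ddl = [
    (150, lambda cat: cat["foo"].append("bar")),     # ALTER TABLE foo ADD COLUMN bar
    (400, lambda cat: cat["foo"].append("zaphod")),  # replayed only if lsn <= target
]
assert rebuild_catalog_as_of(backup, ddl, 300) == {"foo": ["id", "data", "bar"]}
```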

Having the ddl-base-backup and ddl-log allows reconstruction of the
catalog without needing to put the whole catalog into the WAL at each checkpoint.
We can truncate old WAL segments and yet still recreate the DDL needed
to translate current WAL data. As a result, ddl-checkpoints are much
less frequent, perhaps weekly or monthly, rather than every few
minutes.

The whole process is similar in algorithm to recovery, but is just
normal userspace tables and SQL.

Constructing the correct catalog seems to be the heart of this
problem, so it is likely to take a while and look complex. Getting the
HS initial state was around 50% of the effort in making it all work,
so I guess it's similar here.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#14Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#13)
Re: Catalog/Metadata consistency during changeset extraction from wal

Hi,

On Friday, June 22, 2012 08:48:41 AM Simon Riggs wrote:

On 21 June 2012 12:41, Andres Freund <andres@2ndquadrant.com> wrote:

2.)
Keep the decoding site up2date by replicating the catalog via normal HS
recovery
mechanisms.

Advantages:
* most of the technology is already there
* minimal overhead (space, performance)
* no danger of out of sync catalogs
* no support for command triggers required that can keep a catalog in
sync, including oids

Disadvantages:
* driving the catalog recovery that way requires some somewhat intricate
code as it needs to be done in lockstep with decoding the wal-stream
* requires an additional feature to guarantee HS always has enough
information to be queryable after a crash/shutdown
* some complex logic/low-level fudging required to keep the transactional
behaviour sensible when querying the catalog
* full version/architecture compatibility required
* the decoding site will only ever be read-only

My initial reaction was "this won't work", but that requires
qualification since this is a complex topic: You can use this
approach as long as you realise that the catalog it gives can never be
rewound.

Well, only as far as the min recovery point has been advanced. That's advanced
less frequently than we apply xlog records.

So the generic approach to "construct me a catalog as of this LSN"
would need to start with a base backup of the catalog and then roll
forward to the appropriate LSN. Which means a generic user of this
approach would need to be able to construct an initial catalog using a
PITR.

Constructing a decoding site requires you to
a) take a partial base backup of the catalog
b) apply WAL records to bring that forwards to the correct LSN, which
would require some alteration of the recovery code to skip the files
missing in a)

So taking the approach of a decoding site means we have to modify
recovery code, and even when we do that we still end up with a
difficult to deploy option in the real world. Difficult to deploy
because we need a whole new instance of Postgres, plus we need all of
the WAL files, which could easily be impractical.

The overall approach is good, but the architecture is wrong. What we
need is a "catalog base backup" and a means of rolling forward to the
appropriate LSN. Rolling forward using WAL is too bulky, so we need a
separate log of DDL changes to the catalog. So what we need is a
"catalog base backup" plus a "ddl-log".

The idea was to store the applycache to disk every time UpdateMinRecoveryPoint
is called. That way you wouldn't have to scroll back, even if the database
crashes/is stopped hard.
But I agree, I don't like the architecture that much either.

To translate the WAL we maintain a secondary set of catalog tables,
which only exist for logical replication. These are normal,
non-versioned tables, but held in a new schema pg_logical or similar.
One reason why this must be a secondary catalog is to allow the
translation to take place on the target server, and to allow
translation of WAL from a prior version of the catalog - so we can
allow online upgrades across different catalog versions (and possibly
major versions).
The secondary catalog is used in place of the normal catalog during
InitCatalogCache() in the apply process. All the normal caches exist,
they just point to secondary relations rather than the normal ones.
When we initialise replication we take a copy of the appropriate
tables, columns and rows in a catalog-base-backup, using something
like pg_dump. Overall, this is much smaller than normal catalog since
it avoids temp tables, and anything not related to WAL translation.

On each non-temp change to the database we record changes as SQL in
the ddl-log, together with the LSN of the change.

When the number of changes in the ddl-log hits a limit we take a new
catalog-base-backup. This process is similar to a checkpoint, but much
less frequent; let's call it a ddl-checkpoint.

When we start to read WAL logs to translate them, we start by
truncating/re-bootstrapping and reloading the secondary catalog from
the base backup. We then apply all changes from the ddl-log (which is
just a sequence of SQL statements) up until the LSN at the start of
WAL. The secondary catalog is then an exact copy of the catalog as of
that LSN.

As we read through WAL we apply further changes to secondary catalog
so it stays in lockstep with the WAL we are currently reading.

I can't see how that's going to fly because the *_out functions use the
syscache and also access catalog tables directly. We would have to completely
map oids to the alternative catalog.
For one, I think that mapping would involve far too many places (shared
catalogs/relmapper, smgr, fd, syscache, ...). For another, you need to access
those tables in a completely normal fashion from non-recovery backends, which
means that we cannot just have duplicated oids hidden away somewhere.

Constructing the correct catalog seems to be the heart of this
problem, so it is likely to take a while and look complex. Getting the
HS initial state was around 50% of the effort in making it all work,
so I guess it's similar here.

Yes.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#15Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#10)
Re: Catalog/Metadata consistency during changeset extraction from wal

On Thursday, June 21, 2012 05:40:08 PM Andres Freund wrote:

On Thursday, June 21, 2012 03:56:54 PM Florian Pflug wrote:

On Jun21, 2012, at 13:41 , Andres Freund wrote:

3b)
Ensure that enough information in the catalog remains by fudging the
xmin horizon. Then reassemble an appropriate snapshot to read the
catalog as the tuple in question has seen it.

The ComboCID machinery makes that quite a bit harder, I fear. If a tuple
is updated multiple times by the same transaction, you cannot decide
whether a tuple was visible in a certain snapshot unless you have access
to the updating backend's ComboCID hash.

That's a very good point. Not sure how I forgot that.

I think it might be possible to reconstruct a sensible ComboCID mapping
from the WAL stream. Let me think about it for a while...

I have a very, very preliminary thing which seems to work somewhat. I just log
(cmin, cmax) additionally for every modified catalog tuple into the wal and so
far that seems to be enough.
Do you happen to have suggestions for other problematic things to look into
before I put more time into it?

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#16Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#15)
Re: Catalog/Metadata consistency during changeset extraction from wal

On Friday, June 22, 2012 03:22:03 PM Andres Freund wrote:

On Thursday, June 21, 2012 05:40:08 PM Andres Freund wrote:

On Thursday, June 21, 2012 03:56:54 PM Florian Pflug wrote:

On Jun21, 2012, at 13:41 , Andres Freund wrote:

3b)
Ensure that enough information in the catalog remains by fudging the
xmin horizon. Then reassemble an appropriate snapshot to read the
catalog as the tuple in question has seen it.

The ComboCID machinery makes that quite a bit harder, I fear. If a
tuple is updated multiple times by the same transaction, you cannot
decide whether a tuple was visible in a certain snapshot unless you
have access to the updating backend's ComboCID hash.

That's a very good point. Not sure how I forgot that.

I think it might be possible to reconstruct a sensible ComboCID mapping
from the WAL stream. Let me think about it for a while...

I have a very, very preliminary thing which seems to work somewhat. I just
log (cmin, cmax) additionally for every modified catalog tuple into the
wal and so far that seems to be enough.
Do you happen to have suggestions for other problematic things to look into
before I put more time into it?

I'm continuing to play around with this. The tricky bit so far is
subtransaction handling in transactions which modify the catalog (plus possibly
tables which are marked as being required for decoding, like a pg_enum
equivalent).

Would somebody fundamentally object to one of the following things:
1.
replace

#define IsMVCCSnapshot(snapshot) \
((snapshot)->satisfies == HeapTupleSatisfiesMVCC)

with something like

#define IsMVCCSnapshot(snapshot) \
	((snapshot)->satisfies == HeapTupleSatisfiesMVCC || \
	 (snapshot)->satisfies == HeapTupleSatisfiesMVCCDuringDecode)

The define is only used sparingly and none of the code paths look so hot that
this could make a difference.

2.
Set SnapshotNowData.satisfies to HeapTupleSatisfiesNowDuringRecovery while
reading the catalog for decoding.

It's possible to go on without both, but faking up the data gets quite a bit
more complex.

The problem that makes replacing SnapshotNow.satisfies useful is that there is
no convenient way to represent subtransactions of the current transaction
which have already committed according to the TransactionLog but aren't yet
visible at the current lsn because they only started afterwards. It's
relatively easy to fake this in an MVCC snapshot but way harder for
SnapshotNow because you cannot mark transactions as in-progress.

Thanks,

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#17Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
Re: Catalog/Metadata consistency during changeset extraction from wal

On Thursday, June 21, 2012 01:41:25 PM Andres Freund wrote:

Below are two possible implementation strategies for that concept

Advantages:
* Decoding is done on the master in an asynchronous fashion
* low overhead during normal DML execution, not much additional code in
that path
* can be very efficient if architecture/version are the same
* version/architecture compatibility can be done transparently by falling
back to textual versions on mismatch

Disadvantages:
* decoding probably has to happen on the master which might not be what
people want performancewise

3b)
Ensure that enough information in the catalog remains by fudging the xmin
horizon. Then reassemble an appropriate snapshot to read the catalog as
the tuple in question has seen it.

Advantages:
* should be implementable with low impact to general code

Disadvantages:
* requires some complex code for assembling snapshots
* it might be hard to guarantee that we always have enough information to
reassemble a snapshot (subxid overflows ...)
* impacts vacuum if replication to some site is slow

There are some interesting problems related to locking and snapshots here. Not
sure if they are resolvable:

We need to restrict SnapshotNow to the view it had back when the WAL record
we're currently decoding was created. Otherwise we would possibly get wrong
column types and similar. As we're working in the past, locking doesn't
protect us against much here. I have that working (mostly, and inefficiently).

One interesting problem is table rewrites (TRUNCATE, CLUSTER, some ALTER
TABLEs) and dropping tables. Because we nudge SnapshotNow to the past view it
had back when the WAL record was created, we get the old relfilenode, which
might have been dropped as part of the transaction cleanup...
With most types that's not a problem. Even things like records and arrays
aren't problematic. More interesting cases include VACUUM FULL $systable (e.g.
pg_enum) and running VACUUM FULL on a table which is used in the *_out
function of a type (like a user-level pg_enum implementation).

The only theoretical way I see around that problem would be to postpone all
relation unlinks until everything that could possibly read them has finished.
That doesn't seem too alluring, although it would be needed if we ever move
more things off SnapshotNow.

Input/Ideas/Opinions?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#18Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#17)
Re: Catalog/Metadata consistency during changeset extraction from wal

On 24 June 2012 22:11, Andres Freund <andres@2ndquadrant.com> wrote:

One interesting problem is table rewrites (TRUNCATE, CLUSTER, some ALTER
TABLEs) and dropping tables. Because we nudge SnapshotNow to the past view it
had back when the WAL record was created, we get the old relfilenode, which
might have been dropped as part of the transaction cleanup...
With most types that's not a problem. Even things like records and arrays
aren't problematic. More interesting cases include VACUUM FULL $systable (e.g.
pg_enum) and running VACUUM FULL on a table which is used in the *_out
function of a type (like a user-level pg_enum implementation).

That's only a problem if you are generating changes to the relfilenode
rather than the relid.

ISTM that this step differs depending upon whether we are generating
portable SQL, or whether we are generating changes for immediate
apply. If it is the latter, then it should never actually happen
because if a table rewrite occurred and then committed we would never
need to re-read earlier WAL.

So treating this as a generic problem leads to some weird results that
we don't need to worry about cos they can't actually happen.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#19Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#16)
Re: Catalog/Metadata consistency during changeset extraction from wal

On 22 June 2012 20:30, Andres Freund <andres@2ndquadrant.com> wrote:

The problem that makes replacing SnapshotNow.satisfies useful is that there is
no convenient way to represent subtransactions of the current transaction
which have already committed according to the TransactionLog but aren't yet
visible at the current lsn because they only started afterwards. It's
relatively easy to fake this in an MVCC snapshot but way harder for
SnapshotNow because you cannot mark transactions as in-progress.

I'm starting to like these ideas now.

We can solve many things by emitting a new WAL record type in any
subtransaction that issues catalog changes. That wasn't possible in
Hot Standby for performance reasons, but since we only care about
catalog changes those things are much rarer and wouldn't be a problem.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#20Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#18)
Re: Catalog/Metadata consistency during changeset extraction from wal

On Sunday, June 24, 2012 11:37:26 PM Simon Riggs wrote:

On 24 June 2012 22:11, Andres Freund <andres@2ndquadrant.com> wrote:

One interesting problem is table rewrites (TRUNCATE, CLUSTER, some ALTER
TABLEs) and dropping tables. Because we nudge SnapshotNow to the past
view it had back when the WAL record was created, we get the old
relfilenode, which might have been dropped as part of the transaction
cleanup...
With most types that's not a problem. Even things like records and arrays
aren't problematic. More interesting cases include VACUUM FULL $systable
(e.g. pg_enum) and running VACUUM FULL on a table which is used in the
*_out function of a type (like a user-level pg_enum implementation).

That's only a problem if you are generating changes to the relfilenode
rather than the relid.

Hm. I can't follow so far. Could you paraphrase?

ISTM that this step differs depending upon whether we are generating
portable SQL, or whether we are generating changes for immediate
apply.

I fear generating changes only for immediate, low-level apply is not going to
fly, given the various interests people have voiced.

If it is the latter, then it should never actually happen because if a table
rewrite occurred and then committed we would never need to re-read earlier
WAL.
So treating this as a generic problem leads to some weird results that
we don't need to worry about cos they can't actually happen.

Well, even if it were true that we don't need to worry about the state before
a full-table rewrite - I don't think it is - we still need to be able to cope
with CLUSTER or VACUUM FULL...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#21Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#20)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#17)
#23Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#22)
#24Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#22)
#25Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#24)
#26Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#25)
#27Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#26)
#28Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#25)
#29Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#28)
#30Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#27)
#31Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Andres Freund (#30)
#32Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#29)
#33Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kevin Grittner (#31)
#34Andres Freund
andres@anarazel.de
In reply to: Kevin Grittner (#31)
#35Simon Riggs
simon@2ndQuadrant.com
In reply to: Kevin Grittner (#26)
#36Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#32)
#37David Fetter
david@fetter.org
In reply to: Kevin Grittner (#31)
#38Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: David Fetter (#37)
#39David Fetter
david@fetter.org
In reply to: Kevin Grittner (#38)