Future In-Core Replication

Started by Simon Riggs, almost 14 years ago; 49 messages; hackers
#1Simon Riggs
simon@2ndQuadrant.com

I'm beginning to work on advanced additions to in-core replication for
PostgreSQL.

There are a number of additional features for existing single-master
replication still to achieve, but the key topics to be addressed are
major leaps forward in functionality. I hope to add useful features in
9.3, though realise that many things could take two or even more
release cycles to achieve. (The last set of features took 8 years, so
I'm hoping to do this a little faster).

Some of my 2ndQuadrant colleagues will be committing themselves to the
project also and we hope to work with the community in the normal way
to create new features. I mention this only to say that major skills
and resources will be devoted to this for the next release(s), not
that this is a private project.

Some people have talked about the need for "multi-master replication",
whereby 2+ databases communicate changes to one another. This topic
has been discussed in some depth in Computer Science academic papers,
most notably, "The Dangers of Replication and a Solution" by the late
Jim Gray. I've further studied this to the point where I have a
mathematical model of this that allows me to predict what our likely
success will be from implementing that. Without meaning to worry you,
MM replication alone is not a solution for large data or the general
case. For the general case, single master replication will continue to
be the most viable option. For large and distributed data sets, some
form of partitioning/sharding is required simply because full
multi-master replication just isn't viable at both volume and scale.
So my take on this is that MM is desirable, but is not the only thing
we need - we also need partial/filtered replication to make large
systems practical. Hence why I've been calling this the
"Bi-Directional Replication" project. I'm aware that paragraph alone
requires lots of explanation, which I hope to do both in writing and
in person at the forthcoming developer conference.

My starting point for designs is to focus on a key aspect: massive
change to the code base is not viable and any in-core solution must
look at minimally invasive changes. And of course, if it is in-core we
must also add robust, clear code with reasonable performance that does
not impede non-replication usage.

The use cases we will address are not focused on any one project or
user. I've distilled these points so far from talking to a wide
variety of users, from major enterprises to startups.

1. GEOGRAPHICALLY DISTRIBUTED - Large users require both High
Availability, which necessitates multiple nodes, as well as Disaster
Recovery, which necessitates geographically distributed nodes. So my
focus is not on "clustering" in the sense of Hadoop or Oracle
RAC, since those technologies require additional technologies to
provide DR, so my aim is to arrive at a coherent set of technologies
that provide all that we want. I'm aware that other projects *are*
focused on clustering, so even more reason not to try to
simultaneously invent the wheel.

2. COHERENT - With regard to the coherence, I note this thinking is
similar to the way that Oracle replication is evolving, where they
have multiple kinds of in-core replication and various purchased
technologies. We have a similar issue with regard to various external
projects. I very much hope that we can utilise the knowledge, code and
expertise of those other projects in the way we move forwards.

3. ONLINE UPGRADE - highly available distributed systems must have a
mechanism for online upgrade, otherwise they won't stay HA for long.
This challenge must be part of the solution, and incidentally should
be a useful goal in itself.

4. MULTI-MASTER - the ability to update data from a variety of locations

5. WRITE-SCALEABLE - the ability to partition data across nodes in a
way that allows the solution to improve beyond the write rate of a
single node.

Those are the basic requirements that I am trying to address. There
are a great many important details, but the core of this is probably
what I would call "logical replication", that is shipping changes to
other nodes in a way that does not tie us to the same physical
representation that recovery/streaming replication does now. Of
course, non-physical replication can take many forms.

The assumption of consistency across nodes is considered optional at
this point, and I hope to support both eagerly consistent and
eventually consistent approaches.

I'm aware that this is a broad topic and many people will want input
on this, and am also aware that will take much time. This post is more
about announcing the project, than discussing specific details.

My strategy for doing this is to come up with some designs and
prototypes of a few things that might be the best way forwards. By
building prototypes we will more quickly be able to address the key
questions before us. So there is currently work on research-based
development to allow wider discussion based upon something more than
just whiteboards. I'll be the first to explain things that don't work.
I also very much agree that "one size fits all" is the wrong strategy.
So there will be implementation options and parameters, and possibly
even multiple replication techniques.

I will also be organising a small-medium sized "Future of In-Core
Replication" meeting in Ottawa on Wed 16 May, 6-10pm. To avoid this
becoming an unworkably large meeting, this will be limited but is open
to highly technical PostgreSQL users who share these requirements, any
attendee of the main developer's meeting that wishes to attend and
other developers working on PostgreSQL replication/related topics.
That will also allow me to order enough pizza for everyone too. I'll
send out private invites to people whom I know (no spam) and I think
may be interested, but you are welcome to email me to get access.
(This will take me a day or two, so don't ping me back if you didn't get
your invite).

I'm going to do my best to include the right set of features for the
majority of people, all focused on submissions to PostgreSQL core, not
any external project.

Best Regards

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#2Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#1)
Re: Future In-Core Replication

On Thu, Apr 26, 2012 at 1:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I will also be organising a small-medium sized "Future of In-Core
Replication" meeting in Ottawa on Wed 16 May, 6-10pm.

Thanks for such rapid response. I've put up a wiki page and will be
adding names as they come through

http://wiki.postgresql.org/wiki/PgCon2012CanadaInCoreReplicationMeeting

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#3Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Simon Riggs (#1)
Re: Future In-Core Replication

On 4/26/12 7:41 AM, Simon Riggs wrote:

5. WRITE-SCALEABLE - the ability to partition data across nodes in a
way that allows the solution to improve beyond the write rate of a
single node.

It would be valuable to look at READ-SCALEABLE as well; specifically a second form of "synchronous" replication where you can read from a slave "immediately" after transaction commit and have the changes be visible. That ability would make it trivial to spread reads off of the master database.

My hope is this wouldn't be horribly painful to achieve if we relaxed the need to fsync the corresponding WAL on the slave; kind of the opposite of the semi-synchronous mode we have now. My theory is that thanks to full page writes a slave should normally have the necessary pages to handle a WAL record in cache, so actually applying the WAL change shouldn't be horribly slow.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

#4Josh Berkus
josh@agliodbs.com
In reply to: Jim Nasby (#3)
Re: Future In-Core Replication

Simon,

So the idea is that you'll present briefly your intentions for 9.3 at
the developer meeting, and then have this in-depth afterwards? Sounds
great.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#5Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#4)
Re: Future In-Core Replication

On Thu, Apr 26, 2012 at 6:48 PM, Josh Berkus <josh@agliodbs.com> wrote:

So the idea is that you'll present briefly your intentions for 9.3 at
the developer meeting, and then have this in-depth afterwards?  Sounds
great.

I really, really do not want the developer meeting to turn into a
series of presentations. That's what we had last year, and it was
boring and unproductive. Furthermore, it excludes from the
conversation everyone who doesn't fit into a room in Ottawa. I think
that plans should be presented here, on pgsql-hackers, and the
developer meeting should be reserved for discussion of issues with
which everyone who will be there is already familiar. If a
presentation is required, the developer meeting is the wrong forum,
IMHO of course.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Josh Berkus
josh@agliodbs.com
In reply to: Simon Riggs (#1)
Re: Future In-Core Replication

Simon,

I'm beginning to work on advanced additions to in-core replication for
PostgreSQL.

...

Those are the basic requirements that I am trying to address. There
are a great many important details, but the core of this is probably
what I would call "logical replication", that is shipping changes to
other nodes in a way that does not tie us to the same physical
representation that recovery/streaming replication does now. Of
course, non-physical replication can take many forms.

So, I'm a bit confused. You talk about this as "additions to in-core
replication", but then you talk about implementing logical replication,
which would NOT be an addition to Binary replication. Can you explain
what you mean?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#7Simon Riggs
simon@2ndQuadrant.com
In reply to: Josh Berkus (#6)
Re: Future In-Core Replication

On Fri, Apr 27, 2012 at 1:26 AM, Josh Berkus <josh@agliodbs.com> wrote:

Simon,

I'm beginning to work on advanced additions to in-core replication for
PostgreSQL.

...

Those are the basic requirements that I am trying to address. There
are a great many important details, but the core of this is probably
what I would call "logical replication", that is shipping changes to
other nodes in a way that does not tie us to the same physical
representation that recovery/streaming replication does now. Of
course, non-physical replication can take many forms.

So, I'm a bit confused.  You talk about this as "additions to in-core
replication", but then you talk about implementing logical replication,
which would NOT be an addition to Binary replication.  Can you explain
what you mean?

The key point is that there is a specific objective of including
additional features in-core. That places some restrictions, but also
offers some opportunities. Tight integration allows performance
improvements, as well as ease of use etc.

I'm not sure what you mean by "would not be an addition to binary
replication". Yes, for reasons most elegantly explained by Robert here
[http://rhaas.blogspot.co.uk/2011/02/case-for-logical-replication.html],
physical/binary replication puts too many restrictions on us and we
cannot solve all of the problems that way. I was unaware of Robert's
post, but it sets the scene clearly.

So the future of in-core replication, IMHO, is some form of
non-physical replication. There are various options there, but
anything that goes in would reuse significant parts of the existing
replication setup that already works so well. Put that another way:
the infrastructure for the secure and efficient transport of
replication messages is already in place. Reuse is also what makes
something useful be achievable in a reasonable timescale.

What we need to consider is the form of those new non-physical WAL
messages, how they are built on the sender and how they are handled at
the receiving end.

What I'm hoping to do is to build a basic prototype of logical
replication using WAL translation, so we can inspect it to see what
the downsides are. It's an extremely non-trivial problem and so I
expect there to be mountains to climb. There are other routes to
logical replication, with messages marshalled in a similar way to
Slony/Londiste/Bucardo/Mammoth(?). So there are options, with
measurements to be made and discussions to be had.

It will take time for people to believe this is possible and longer to
analyse and agree the options.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#8Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#5)
Re: Future In-Core Replication

On Fri, Apr 27, 2012 at 12:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Apr 26, 2012 at 6:48 PM, Josh Berkus <josh@agliodbs.com> wrote:

So the idea is that you'll present briefly your intentions for 9.3 at
the developer meeting, and then have this in-depth afterwards?  Sounds
great.

I really, really do not want the developer meeting to turn into a
series of presentations.  That's what we had last year, and it was
boring and unproductive.  Furthermore, it excludes from the
conversation everyone who doesn't fit into a room in Ottawa.  I think
that plans should be presented here, on pgsql-hackers, and the
developer meeting should be reserved for discussion of issues with
which everyone who will be there is already familiar.  If a
presentation is required, the developer meeting is the wrong forum,
IMHO of course.

I agree.

Obviously, the word presentation is the issue here. If one person
speaks and everybody else is silent that is not a good use of
anybody's time.

On any topic, I expect the introducer of that topic to set the scene,
establish the major questions and encourage discussion.

I would follow that form myself: I want to hear the wisdom of others
and alter my plans accordingly.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#9Josh Berkus
josh@agliodbs.com
In reply to: Simon Riggs (#7)
Re: Future In-Core Replication

What I'm hoping to do is to build a basic prototype of logical
replication using WAL translation, so we can inspect it to see what
the downsides are.

Sounds like Mammoth. You take a look at that?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#10Simon Riggs
simon@2ndQuadrant.com
In reply to: Josh Berkus (#9)
Re: Future In-Core Replication

On Fri, Apr 27, 2012 at 6:59 PM, Josh Berkus <josh@agliodbs.com> wrote:

What I'm hoping to do is to build a basic prototype of logical
replication using WAL translation, so we can inspect it to see what
the downsides are.

Sounds like Mammoth.  You take a look at that?

Well, they all sound similar. My info was that Mammoth was not WAL-based.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#11Joshua D. Drake
jd@commandprompt.com
In reply to: Simon Riggs (#10)
Re: Future In-Core Replication

On 04/27/2012 11:33 AM, Simon Riggs wrote:

Well, they all sound similar. My info was that Mammoth was not WAL-based.

Mammoth was transaction log based but not WAL based.

#12Chris Browne
cbbrowne@acm.org
In reply to: Simon Riggs (#7)
Re: Future In-Core Replication

On Fri, Apr 27, 2012 at 4:11 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

What I'm hoping to do is to build a basic prototype of logical
replication using WAL translation, so we can inspect it to see what
the downsides are. It's an extremely non-trivial problem and so I
expect there to be mountains to climb. There are other routes to
logical replication, with messages marshalled in a similar way to
Slony/Londiste/Bucardo/Mammoth(?). So there are options, with
measurements to be made and discussions to be had.

I'll note that the latest version of Slony, expected to be 2.2 (which
generally seems to work, but we're stuck at the moment waiting to get
free cycles to QA it) has made a substantial change to its data
representation.

The triggers used to cook data into a sort of "fractional WHERE
clause," transforming an I/U/D into a string that you'd trivially
combine with the string INSERT INTO/UPDATE/DELETE FROM to get the
logical update. If there was need to do anything fancier, you'd be
left having to have a "fractional SQL parser" to split the data out by
hand.

New in 2.2 is that the log data is split out into an array of text
values which means that if someone wanted to do some transformation,
such as filtering on value, or filtering out columns, they could
modify the application-of-updates code to query for the data that they
want to fiddle with. No parser needed.
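
A minimal sketch of why an array of text values is easier to manipulate than a pre-cooked SQL fragment. The flat name/value layout, the function names and the sample data are all assumptions made for illustration; this is not Slony's actual log schema or code.

```python
from typing import Dict, List, Set


def as_columns(args: List[str]) -> Dict[str, str]:
    """Read a flat [name, value, name, value, ...] array as a column map."""
    if len(args) % 2 != 0:
        raise ValueError("expected alternating column names and values")
    return {args[i]: args[i + 1] for i in range(0, len(args), 2)}


def drop_columns(args: List[str], excluded: Set[str]) -> List[str]:
    """Filter columns out of a change record; no SQL parsing involved."""
    kept: List[str] = []
    for i in range(0, len(args), 2):
        if args[i] not in excluded:
            kept.extend(args[i:i + 2])
    return kept


def matches(args: List[str], column: str, value: str) -> bool:
    """Filter on value: does this change carry column = value?"""
    return as_columns(args).get(column) == value


# One captured INSERT (hypothetical data), held as text values
# rather than as a fragment of SQL.
change = ["id", "42", "region", "eu", "note", "internal only"]
shipped = drop_columns(change, {"note"})
```

With the string form, the same column filter would need the "fractional SQL parser" described above; here it is a loop over pairs.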

It's doubtless worthwhile to take a peek at that to make sure it
informs your data representation appropriately. It's important to
have data represented in a fashion that is amenable to manipulation,
and that decidedly wasn't the case pre-2.2.

I wonder if a meaningful transport mechanism might involve combining:
a) A trigger that indicates that some data needs to be captured in a
"logical" form (rather than the presently pretty purely physical form
of WAL)
b) Perhaps a way of capturing logical updates in WAL
c) One of the old ideas that fell through was to try to capture commit
timestamps via triggers. Doing it directly turned out to be too
controversial to get in. Perhaps that's something that could be
captured via some process that parses WAL.

Something seems wrong about that in that it mixes together updates of
multiple forms into WAL, physical *and* logical, and perhaps that
implies that there should be an altogether separate "logical updates
log." (LUL?) That still involves capturing updates in a duplicative
fashion, e.g. - WAL + LUL, which seems somehow wrong. Or perhaps I'm
tilting at a windmill here. With Slony/Londiste/Bucardo, we're
capturing "LUL" in some tables, meaning that it gets written both to
the tables' data files as well as WAL. Adding a binary LUL eliminates
those table files and attendant WAL updates, thus providing some
savings.

[Insert a LULCATS joke here...]

Perhaps I've just had too much coffee...
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

#13Simon Riggs
simon@2ndQuadrant.com
In reply to: Chris Browne (#12)
Re: Future In-Core Replication

On Fri, Apr 27, 2012 at 11:50 PM, Christopher Browne <cbbrowne@gmail.com> wrote:

On Fri, Apr 27, 2012 at 4:11 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

What I'm hoping to do is to build a basic prototype of logical
replication using WAL translation, so we can inspect it to see what
the downsides are. It's an extremely non-trivial problem and so I
expect there to be mountains to climb. There are other routes to
logical replication, with messages marshalled in a similar way to
Slony/Londiste/Bucardo/Mammoth(?). So there are options, with
measurements to be made and discussions to be had.

I'll note that the latest version of Slony ...has made a substantial change to its data
representation....

The basic model I'm working to is that "logical replication" will ship
Logical Change Records (LCRs) using the same transport mechanism that
we built for WAL.

How the LCRs are produced and how they are applied is a subject for
debate and measurement. We're lucky enough to have a variety of
mechanisms to compare, Slony 1.0/2.0, Slony 2.2/Londiste/Bucardo and
it's worth adding WAL translation there also. My initial thought is
that WAL translation has many positive aspects to it and we are
investigating. There are also some variants on those themes, such as
the one you discussed above.

You probably won't recognise this as such, but I hope that people
might see that I'm hoping to build Slony 3.0, Londiste++ etc. At some
point, we'll all say "that's not Slony", but we'll also say (Josh
already did) "that's not binary replication". But it will be the
descendant of all.

Backwards compatibility is not a goal, please note, but only because
that will complicate matters intensely.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#14Hannu Krosing
hannu@tm.ee
In reply to: Simon Riggs (#13)
Re: Future In-Core Replication

On Sat, 2012-04-28 at 09:36 +0100, Simon Riggs wrote:

On Fri, Apr 27, 2012 at 11:50 PM, Christopher Browne <cbbrowne@gmail.com> wrote:

On Fri, Apr 27, 2012 at 4:11 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

What I'm hoping to do is to build a basic prototype of logical
replication using WAL translation, so we can inspect it to see what
the downsides are. It's an extremely non-trivial problem and so I
expect there to be mountains to climb. There are other routes to
logical replication, with messages marshalled in a similar way to
Slony/Londiste/Bucardo/Mammoth(?). So there are options, with
measurements to be made and discussions to be had.

I'll note that the latest version of Slony ...has made a substantial change to its data
representation....

The basic model I'm working to is that "logical replication" will ship
Logical Change Records (LCRs) using the same transport mechanism that
we built for WAL.

One outcome of this LCR approach is probably that you will be shipping
changes as they are made and on slave have to either apply them in N
parallel transactions and commit each transaction when the LCR for the
corresponding transaction says so, or you have to collect the LCR-s
before applying and then apply and commit committed-on-master
transactions in commit order and throw away the aborted ones.

The optimal approach will probably be some combination of these, that is
collect and apply short ones, start replay in separate transaction if
commit does not arrive in N ms.
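
The "collect, then apply in commit order" option can be sketched roughly as follows. The record shape `(xid, kind, payload)` and all names are hypothetical, chosen only for illustration; the hybrid variant with an N ms timeout is not shown.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical stream entry: (transaction id, kind, payload),
# where kind is "change", "commit" or "abort".
Record = Tuple[int, str, str]


def replay_in_commit_order(stream: List[Record]) -> List[Tuple[int, List[str]]]:
    """Buffer changes per transaction; release them when the commit arrives.

    Aborted transactions are thrown away without being applied.
    """
    pending: Dict[int, List[str]] = defaultdict(list)
    applied: List[Tuple[int, List[str]]] = []
    for xid, kind, payload in stream:
        if kind == "change":
            pending[xid].append(payload)
        elif kind == "commit":
            applied.append((xid, pending.pop(xid, [])))
        elif kind == "abort":
            pending.pop(xid, None)
        else:
            raise ValueError(f"unknown record kind: {kind!r}")
    return applied


# Three interleaved transactions: 2 commits first, 3 aborts, 1 commits last.
stream = [
    (1, "change", "INSERT a"),
    (2, "change", "INSERT b"),
    (3, "change", "DELETE c"),
    (2, "commit", ""),
    (1, "change", "UPDATE a"),
    (3, "abort", ""),
    (1, "commit", ""),
]
result = replay_in_commit_order(stream)
```

The cost of this variant is memory on the receiving side: a long-running transaction stays buffered until its commit or abort record shows up.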

As to what LCRs should contain, it will probably be logical equivalents
of INSERT, UPDATE ... LIMIT 1, DELETE ... LIMIT 1, TRUNCATE and all DDL.

The DDL could actually stay "raw" (as in LCRs for system tables) on
generator side as hopefully the rule that "system tables can't have
triggers" does not apply when generating the LCR-s on WAL path.
If we need to go back to ALTER TABLE ... commands, then this is probably
the wisest to leave for client. Client here could also be some xReader
like middleman.

I would even go as far as propose a variant for DML-WITH-LIMIT-1 to be
added to postgresql's SQL syntax so that the LCRs could be converted to
SQL text for some tasks and thus should be easy to process using generic
text-based tools.
The DML-WITH-LIMIT-1 is required to do single logical updates on tables
with non-unique rows.
And as for any logical updates we will have huge performance problem
when doing UPDATE or DELETE on large table with no indexes, but
fortunately this problem is on slave, not master ;)
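
To illustrate the semantics meant by DML-WITH-LIMIT-1, here is a small sketch using an in-memory list as a stand-in for a table without a unique key. The function name and data are hypothetical, and this models the intended behaviour only, not any existing SQL syntax.

```python
from typing import List, Tuple

Row = Tuple[str, int]


def delete_limit_1(rows: List[Row], old_row: Row) -> bool:
    """Remove exactly one row equal to old_row; report whether one was found.

    list.remove() deletes only the first match and scans linearly, which
    mirrors the full-table scan needed when no index exists.
    """
    try:
        rows.remove(old_row)
    except ValueError:
        return False
    return True


# A table with duplicate rows: one logical DELETE must remove only one copy.
table = [("a", 1), ("a", 1), ("b", 2)]
removed = delete_limit_1(table, ("a", 1))
```

A plain DELETE with the same predicate would remove both copies of `("a", 1)`, so the replica would drift from the origin after a single-row delete.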

Generating and shipping the LCR-s at WAL-generation time or perhaps even
a bit earlier will have a huge performance benefit of not doing double
writing of captured events on the master which currently is needed for
several reasons, the main one being the determining of which
transactions do commit and in what order. (this can't be solved on master
without a local event log table as we don't have commit/rollback
triggers)

If we delegate that part out of the master then this alone enables us to
be almost as fast as WAL based replica in most cases, even when we have
different logical structure on slaves.

How the LCRs are produced and how they are applied is a subject for
debate and measurement. We're lucky enough to have a variety of
mechanisms to compare, Slony 1.0/2.0, Slony 2.2/Londiste/Bucardo and
it's worth adding WAL translation there also. My initial thought is
that WAL translation has many positive aspects to it and we are
investigating. There are also some variants on those themes, such as
the one you discussed above.

You probably won't recognise this as such, but I hope that people
might see that I'm hoping to build Slony 3.0, Londiste++ etc. At some
point, we'll all say "that's not Slony", but we'll also say (Josh
already did) "that's not binary replication". But it will be the
descendant of all.

If we get efficient and flexible logical change event generation on the
master, then I'm sure the current trigger-based logical replication
providers will switch (for full replication) or at least add an extra
LCR source. It may still make sense to leave some flexibility to the
master side, so that some decisions - possibly even complex ones - could
be made when generating the LCR-s

What I would like is to have some of it exposed to userspace via
function which could be used by developers to push their own LCRs.

As mentioned above, a significant part of this approach can be prototyped
from user-level triggers as soon as we have triggers on commit and
rollback, even though at slightly reduced performance. That is, it
will still have the trigger overhead, but we can omit all the extra
writing and then re-reading and event-table management on the master.

Wanting to play with Streaming Logical Replication (as opposed to
current Chunked Logical Replication) is also one of the reasons that I
complained when the "command triggers" patch was kicked out from 9.2.

Backwards compatibility is not a goal, please note, but only because
that will complicate matters intensely.

Currently there really is nothing similar enough that this could be
backward compatible with :)


--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#15Hannu Krosing
hannu@tm.ee
In reply to: Hannu Krosing (#14)
Re: Future In-Core Replication

On Sat, 2012-04-28 at 21:40 +0200, Hannu Krosing wrote:

On Sat, 2012-04-28 at 09:36 +0100, Simon Riggs wrote:

On Fri, Apr 27, 2012 at 11:50 PM, Christopher Browne <cbbrowne@gmail.com> wrote:

On Fri, Apr 27, 2012 at 4:11 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

What I'm hoping to do is to build a basic prototype of logical
replication using WAL translation, so we can inspect it to see what
the downsides are. It's an extremely non-trivial problem and so I
expect there to be mountains to climb. There are other routes to
logical replication, with messages marshalled in a similar way to
Slony/Londiste/Bucardo/Mammoth(?). So there are options, with
measurements to be made and discussions to be had.

I'll note that the latest version of Slony ...has made a substantial change to its data
representation....

Btw, Londiste has also moved from Slony-like trigger with partial SQL to
more general format a few versions ago.

I sure hope they will move to JSON as the data representation now that
we'll have JSON in core in 9.2

--
-------
Hannu Krosing
PostgreSQL Unlimited Scalability and Performance Consultant
2ndQuadrant Nordic
PG Admin Book: http://www.2ndQuadrant.com/books/

#16Simon Riggs
simon@2ndQuadrant.com
In reply to: Hannu Krosing (#14)
Re: Future In-Core Replication

On Sat, Apr 28, 2012 at 8:40 PM, Hannu Krosing <hannu@krosing.net> wrote:

As to what LCRs should contain, it will probably be logical equivalents
of INSERT, UPDATE ... LIMIT 1, DELETE ... LIMIT 1, TRUNCATE and all DDL.

Yeh

I would even go as far as propose a variant for DML-WITH-LIMIT-1 to be
added to postgresql's SQL syntax so that the LCRs could be converted to
SQL text for some tasks and thus should be easy to process using generic
text-based tools.
The DML-WITH-LIMIT-1 is required to do single logical updates on tables
with non-unique rows.
And as for any logical updates we will have huge performance problem
when doing UPDATE or DELETE on large table with no indexes, but
fortunately this problem is on slave, not master ;)

While that is possible, I would favour the do-nothing approach. By
making the default replication mode = none, we then require a PK to be
assigned before allowing replication mode = on for a table. Trying to
replicate tables without PKs is a problem that can wait basically.
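
A sketch of that rule, with an entirely hypothetical in-memory `Table` object standing in for catalog state. The attribute and function names are assumptions for illustration, not a proposed API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Table:
    name: str
    primary_key: List[str] = field(default_factory=list)
    replication_mode: str = "none"  # default: the table is not replicated


def enable_replication(table: Table) -> None:
    """Switch replication on, but only for tables that have a primary key."""
    if not table.primary_key:
        raise ValueError(
            f"table {table.name!r} needs a primary key before "
            "replication can be enabled"
        )
    table.replication_mode = "on"


orders = Table("orders", primary_key=["id"])
enable_replication(orders)   # allowed: a PK exists

logs = Table("logs")         # no PK, so it stays at the default mode
```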

Generating and shipping the LCR-s at WAL-generation time or perhaps even
a bit earlier will have a huge performance benefit of not doing double
writing of captured events on the master which currently is needed for
several reasons, the main one being the determining of which
transactions do commit and in what order. (this can't be solved on master
without a local event log table as we don't have commit/rollback
triggers)

If we delegate that part out of the master then this alone enables us to
be almost as fast as WAL based replica in most cases, even when we have
different logical structure on slaves.

Agreed

How the LCRs are produced and how they are applied is a subject for
debate and measurement. We're lucky enough to have a variety of
mechanisms to compare, Slony 1.0/2.0, Slony 2.2/Londiste/Bucardo and
it's worth adding WAL translation there also. My initial thought is
that WAL translation has many positive aspects to it and we are
investigating. There are also some variants on those themes, such as
the one you discussed above.

You probably won't recognise this as such, but I hope that people
might see that I'm hoping to build Slony 3.0, Londiste++ etc. At some
point, we'll all say "that's not Slony", but we'll also say (Josh
already did) "that's not binary replication". But it will be the
descendant of all.

If we get efficient and flexible logical change event generation on the
master, then I'm sure the current trigger-based logical replication
providers will switch (for full replication) or at least add an extra
LCR source. It may still make sense to leave some flexibility to the
master side, so that some decisions - possibly even complex ones - could
be made when generating the LCR-s

What I would like is to have some of it exposed to userspace via
function which could be used by developers to push their own LCRs.

Yes, I see that one coming.

That use case is not something I'm focused on, but I do recognise
others wish to pursue that.

The bit I'm not sure about is whether we have custom handler code as well.

As mentioned above, a significant part of this approach can be prototyped
from user-level triggers as soon as we have triggers on commit and
rollback, even though at slightly reduced performance. That is, it
will still have the trigger overhead, but we can omit all the extra
writing and then re-reading and event-table management on the master.

Agreed.

Wanting to play with Streaming Logical Replication (as opposed to
current Chunked Logical Replication) is also one of the reasons that I
complained when the "command triggers" patch was kicked out from 9.2.

Yeh. It's clear that project needs to move forwards quickly in 9.3 if
we are to make this Just Work in the way we hope.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#17Hannu Krosing
hannu@tm.ee
In reply to: Simon Riggs (#16)
Re: Future In-Core Replication

On Sun, 2012-04-29 at 12:03 +0100, Simon Riggs wrote:

On Sat, Apr 28, 2012 at 8:40 PM, Hannu Krosing <hannu@krosing.net> wrote:

As to what LCRs should contain, it will probably be logical equivalents
of INSERT, UPDATE ... LIMIT 1, DELETE ... LIMIT 1, TRUNCATE and all DDL.

Yeh

I would even go as far as propose a variant for DML-WITH-LIMIT-1 to be
added to postgresql's SQL syntax so that the LCRs could be converted to
SQL text for some tasks and thus should be easy to process using generic
text-based tools.
The DML-WITH-LIMIT-1 is required to do single logical updates on tables
with non-unique rows.
And as for any logical updates we will have huge performance problem
when doing UPDATE or DELETE on large table with no indexes, but
fortunately this problem is on slave, not master ;)

While that is possible, I would favour the do-nothing approach. By
making the default replication mode = none, we then require a PK to be
assigned before allowing replication mode = on for a table. Trying to
replicate tables without PKs is a problem that can wait basically.

While this is a good approach in most cases, there is a large use case
for pk-less / indexless tables in large logfiles, where you may want to
do INSERT-only replication, perhaps with some automatic partitioning on
logdate. Allowing this is probably something to look at in the first
release, even though I'm not sure what would happen on violation of this
insert-only policy.
Should it
* refuse to continue and roll back the transaction (probably not)
* fail silently
* succeed but log the change locally
* succeed with some special flags so the other side can treat it specially
  without having to look up stuff in the system catalog
* (if we mark the unique / pk fields in some special way anyway, then
  the previous one is free :)
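
The options listed above can be sketched as a small policy switch. This is only one possible reading of them, not a design from the thread: the names ViolationPolicy, LCR, InsertOnlyViolation and filter_for_insert_only are all hypothetical, and "fail silently" is interpreted here as "the change is simply not replicated".

```python
from dataclasses import dataclass, replace
from enum import Enum


class ViolationPolicy(Enum):
    REJECT = "reject"        # refuse to continue, caller rolls back
    IGNORE = "ignore"        # fail silently: change is not replicated
    LOG_LOCAL = "log_local"  # succeed, but only record the change locally
    FLAG = "flag"            # succeed and forward with a special marker


@dataclass
class LCR:
    op: str          # "INSERT", "UPDATE", "DELETE", ...
    table: str
    row: dict
    flagged: bool = False


class InsertOnlyViolation(Exception):
    pass


def filter_for_insert_only(lcr, policy, outbound, local_log):
    """Route one change on an insert-only table according to `policy`."""
    if lcr.op == "INSERT":
        outbound.append(lcr)
        return
    if policy is ViolationPolicy.REJECT:
        raise InsertOnlyViolation(f"{lcr.op} on insert-only table {lcr.table}")
    if policy is ViolationPolicy.IGNORE:
        return
    if policy is ViolationPolicy.LOG_LOCAL:
        local_log.append(lcr)
        return
    # FLAG: the receiver sees the marker without a catalog lookup.
    outbound.append(replace(lcr, flagged=True))


outbound, local_log = [], []
filter_for_insert_only(LCR("INSERT", "applog", {"msg": "a"}),
                       ViolationPolicy.FLAG, outbound, local_log)
filter_for_insert_only(LCR("DELETE", "applog", {"msg": "a"}),
                       ViolationPolicy.LOG_LOCAL, outbound, local_log)
filter_for_insert_only(LCR("UPDATE", "applog", {"msg": "b"}),
                       ViolationPolicy.FLAG, outbound, local_log)
```

With the FLAG policy the violating change still travels to the other side, which matches the remark that it comes for free if unique / pk fields are already marked in the change record.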

--
-------
Hannu Krosing
PostgreSQL Unlimited Scalability and Performance Consultant
2ndQuadrant Nordic
PG Admin Book: http://www.2ndQuadrant.com/books/

#18Simon Riggs
simon@2ndQuadrant.com
In reply to: Hannu Krosing (#17)
Re: Future In-Core Replication

On Sun, Apr 29, 2012 at 6:20 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:

On Sun, 2012-04-29 at 12:03 +0100, Simon Riggs wrote:

On Sat, Apr 28, 2012 at 8:40 PM, Hannu Krosing <hannu@krosing.net> wrote:

As to what LCRs should contain, it will probably be logical equivalents
of INSERT, UPDATE ... LIMIT 1, DELETE ... LIMIT 1, TRUNCATE and all DDL.

Yeh

I would even go as far as to propose a variant for DML-WITH-LIMIT-1 to be
added to PostgreSQL's SQL syntax so that the LCRs could be converted to
SQL text for some tasks and thus be easy to process using generic
text-based tools.
The DML-WITH-LIMIT-1 is required to do single logical updates on tables
with non-unique rows.
And as for any logical updates, we will have a huge performance problem
when doing UPDATE or DELETE on a large table with no indexes, but
fortunately this problem is on the slave, not the master ;)

While that is possible, I would favour the do-nothing approach. By
making the default replication mode = none, we then require a PK to be
assigned before allowing replication mode = on for a table. Trying to
replicate tables without PKs is a problem that can wait basically.

While this is a good approach in most cases, there is a large use case
for pk-less / indexless tables in large logfiles, where you may want to
do INSERT-only replication, perhaps with some automatic partitioning on
logdate. Allowing this is probably something to look at in the first
release, even though I'm not sure what would happen on violation of this
insert-only policy.
Should it
* refuse to continue and roll back the transaction (probably not)
* fail silently
* succeed but log the change locally
* succeed with some special flags so the other side can treat it specially
  without having to look up stuff in the system catalog
* (if we mark the unique / pk fields in some special way anyway, then
  the previous one is free :)

OK, I think an insert-only replication mode would allow that.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#19Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Simon Riggs (#2)
Re: Future In-Core Replication

Excerpts from Simon Riggs's message of jue abr 26 11:10:09 -0300 2012:

On Thu, Apr 26, 2012 at 1:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I will also be organising a small-medium sized "Future of In-Core
Replication" meeting in Ottawa on Wed 16 May, 6-10pm.

Thanks for such rapid response. I've put up a wiki page and will be
adding names as they come through

http://wiki.postgresql.org/wiki/PgCon2012CanadaInCoreReplicationMeeting

How is this not redundant with the Cluster Summit?
http://wiki.postgresql.org/wiki/PgCon2012CanadaClusterSummit

... oh, you're also already enlisted in that one. Sigh.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#20Dave Page
dpage@pgadmin.org
In reply to: Alvaro Herrera (#19)
Re: Future In-Core Replication

On Sun, Apr 29, 2012 at 11:20 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

Excerpts from Simon Riggs's message of jue abr 26 11:10:09 -0300 2012:

On Thu, Apr 26, 2012 at 1:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I will also be organising a small-medium sized "Future of In-Core
Replication" meeting in Ottawa on Wed 16 May, 6-10pm.

Thanks for such rapid response. I've put up a wiki page and will be
adding names as they come through

http://wiki.postgresql.org/wiki/PgCon2012CanadaInCoreReplicationMeeting

How is this not redundant with the Cluster Summit?
http://wiki.postgresql.org/wiki/PgCon2012CanadaClusterSummit

... oh, you're also already enlisted in that one.  Sigh.

My understanding is that the agenda for the cluster meeting is almost
entirely dedicated to Postgres-XC.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
