Re: xReader, double-effort (was: Temporary tables under hot standby)
On Thu, Apr 26, 2012 at 3:40 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
[resending because of postgresql.org bounces on first try]
Simon Riggs <simon@2ndquadrant.com> wrote:
Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
The GSoC xReader project is intended to be a major step toward
that, by providing a way to translate the WAL stream to a series
of notifications of logical events to clients which register with
xReader.
<sigh> Yes, it appears very similar. I didn't know about this.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Apr 26, 2012 at 11:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
<sigh> Yes, it appears very similar. I didn't know about this.
I don't recall seeing any previous postings about the 2Q project,
either. We should try to keep these discussions on-list.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Apr 26, 2012 at 5:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I don't recall seeing any previous postings about the 2Q project,
either. We should try to keep these discussions on-list.
While that's true, it wasn't much of a stretch to think that I might
be planning to do more work on replication since I've done that for
years and years now. Also, I do keep requesting that GSoC not put such
major tasks out there, but not much heed is taken.
Beautiful illustration of a multi-master conflict case though.
Not sure just yet what the resolution is, and it will probably take a
few weeks to resolve that happily.
Glad we're all facing in the same direction, which is the most important thing.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
<sigh> Yes, it appears very similar. I didn't know about this.
This isn't a problem. In fact, it's the *exact opposite* of a problem.
The student wanted to do more advanced things with the xlog stream, and
we convinced him to narrow the scope in order to make the project
achievable.
With 2Q having done some/most of the basic work, the student can focus
on more advanced applications of the data again. Win!
However, it would be great if someone from 2Q co-mentored him.
--Josh
Sure Kevin, will get the wiki page ready asap, and reply back. Thanks.
On Thu, Apr 26, 2012 at 8:10 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
[resending because of postgresql.org bounces on first try]
Simon Riggs <simon@2ndquadrant.com> wrote:
Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
The GSoC xReader project is intended to be a major step toward
that, by providing a way to translate the WAL stream to a series
of notifications of logical events to clients which register with
xReader.
This is already nearly finished in prototype and will be published
in May. Andres Freund is working on it, copied here.
URL?
It looks like there is significant overlap there.
Hard for me to know without more information. It sounds like there
is at least some overlap. I hope that can involve cooperation, with
the efforts of Andres forming the basis of Aakash's GSoC effort.
That might leave him more time to polish up the user filters.
Aakash: It seems like we need that Wiki page rather sooner than
later. Can you get to that quickly? I would think that just
copying the text from your approved GSoC proposal would be a very
good start. If you need help figuring out how to embed the images
from your proposal, let me know.
-Kevin
All, the wiki page is now up at http://wiki.postgresql.org/wiki/XReader.
[replaced bad email address for Josh (which was my fault)]
Aakash Goel <aakash.bits@gmail.com> wrote:
All, the wiki page is now up at
http://wiki.postgresql.org/wiki/XReader.
Note that the approach Aakash is taking doesn't involve changes to
the backend code; it is strictly a standalone executable which
functions as a proxy to a hot standby and to which clients like
replication systems connect. There is a possible additional
configuration which wouldn't require a hot standby, if time permits.
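To make that concrete, a minimal sketch of the proxy arrangement from
the standby's side (host, port, and user name here are hypothetical):
the hot standby's recovery.conf simply points at xReader instead of at
the master.
# recovery.conf on the hot standby -- xReader, not the master, is upstream
standby_mode = 'on'
primary_conninfo = 'host=xreader-host port=5433 user=replicator'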
I am not clear on whether 2nd Quadrant's code takes this approach
or builds it into the server. I think we need to know that much
before we can get very far in discussion.
-Kevin
On Friday, April 27, 2012 11:04:04 PM Kevin Grittner wrote:
I am not clear on whether 2nd Quadrant's code takes this approach
or builds it into the server. I think we need to know that much
before we can get very far in discussion.
In the current prototype state there is one component that's integrated into
the server (because it needs information that's only available there). That
component is layered on top of a totally generic xlog reading/parsing library
that doesn't care at all where it's running. It's also used in another
cluster to read the received (filtered) stream.
I plan to submit the XLogReader (that's what it's called atm) before
everything else, so everybody can take a look as soon as possible.
I took a *very* short glance over the current wiki description of xReader and
from that it seems to me it would benefit from trying to make it
architecturally more similar to the rest of pg. I also would suggest reviewing
how the current walreceiver/sender, and their protocol, work.
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> wrote:
In the current prototype state there is one component that's
integrated into the server (because it needs information that's
only available there).
The xReader design was based on the idea that it would be nice not
to cause load on the master machine, and that by proxying the WAL
stream to the HS, using synchronous replication style to write from
xReader to the HS, you could use the HS for a source for that data
with it being at exactly the right point in time to query it.
I'm not convinced that I would rather see the logic fixed inside the
master as opposed to being deployable on the master's machine, the
slave machine, or even on its own machine in between.
That component is layered on top of a totally generic xlog
reading/parsing library that doesn't care at all where it's
running.
That's cool.
It's also used in another cluster to read the received (filtered)
stream.
I don't quite follow what you're saying there.
I plan to submit the XLogReader (that's what it's called atm)
before everything else, so everybody can take a look as soon as
possible.
Great! That will allow more discussion and planning.
I took a *very* short glance over the current wiki description of
xReader and from that it seems to me it would benefit from trying
to make it architecturally more similar to the rest of pg.
We're planning on using existing protocol to talk between pieces.
Other than breaking it out so that it can run somewhere other than
inside the server, and allowing clients to connect to xReader to
listen to WAL events of interest, are you referring to anything
else?
I also would suggest reviewing how the current walreceiver/sender,
and their protocol, work.
Of course! The first "inch-stone" in the GSoC project plan
basically consists of creating an executable that functions as a
walreceiver and a walsender to just pass things through from the
master to the slave. We build from there by allowing clients to
connect (again, over existing protocol) and register for events of
interest, and then recognizing different WAL records to generate
events. The project was just going to create a simple client to
dump the information to disk, but with the time saved by adopting
what you've already done, that might leave more time for generating
a useful client.
Aakash, when you get a chance, could you fill in the "inch-stones"
from the GSoC proposal page onto the Wiki page? I think the
descriptions of those interim steps would help people understand
your proposal better. Obviously, some of the particulars of tasks
and the dates may need adjustment based on the new work which is
expected to appear before you start, but what's there now would be a
helpful reference.
-Kevin
On 04/26/2012 12:57 PM, Robert Haas wrote:
I don't recall seeing any previous postings about the 2Q project,
either. We should try to keep these discussions on-list.
First brought up two months ago to the list that it seemed most relevant
to at the time:
http://archives.postgresql.org/pgsql-cluster-hackers/2012-02/msg00001.php and
the solid commentary started coming out yesterday, right near the
approach of May as promised. There hasn't been much posted on hackers
about this out of respect for the extended CommitFest.
I personally suggested Aakash vent ideas out on pgsql-students (or here)
at the beginning of the month:
http://archives.postgresql.org/pgsql-students/2012-04/msg00002.php but
the next time I heard about the subject was the acceptance notice.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Hi Kevin, Hi Aakash,
On Saturday, April 28, 2012 12:18:38 AM Kevin Grittner wrote:
Andres Freund <andres@2ndquadrant.com> wrote:
In the current prototype state there is one component that's
integrated into the server (because it needs information that's
only available there).
The xReader design was based on the idea that it would be nice not
to cause load on the master machine, and that by proxying the WAL
stream to the HS, using synchronous replication style to write from
xReader to the HS, you could use the HS for a source for that data
with it being at exactly the right point in time to query it.
Yes, that does make sense for some workloads. I don't think it's viable for
everything though; that's why we're not aiming for that ourselves atm.
I'm not convinced that I would rather see the logic fixed inside the
master as opposed to being deployable on the master's machine, the
slave machine, or even on its own machine in between.
I don't think that you can do everything apart from the master. We currently
need shared memory for coordination between the moving parts; that's why we
have it inside the master.
It also has the advantage of being easier to set up.
That component is layered on top of a totally generic xlog
reading/parsing library that doesn't care at all where it's
running.
That's cool.
It's also used in another cluster to read the received (filtered)
stream.
I don't quite follow what you're saying there.
To interpret the xlog back into something that can be used for replication you
need to read it again. After filtering we again write valid WAL, so we can use
the same library on the sending|filtering side and on the receiving side.
But that's actually off topic for this thread ;)
I took a *very* short glance over the current wiki description of
xReader and from that it seems to me it would benefit from trying
to make it architecturally more similar to the rest of pg.
We're planning on using existing protocol to talk between pieces.
Other than breaking it out so that it can run somewhere other than
inside the server, and allowing clients to connect to xReader to
listen to WAL events of interest, are you referring to anything
else?
It sounds like the xReader is designed to be one multiplexing process. While
this definitely has some advantages resource-usage-wise, it doesn't seem to
fit the rest of the design that well. The advantages might outweigh
everything else, but I am not sure about that.
Something like registering/deregistering also doesn't fit that well with the
way walsender works as far as I understand it.
Greetings,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Greg Smith <greg@2ndQuadrant.com> wrote:
On 04/26/2012 12:57 PM, Robert Haas wrote:
I don't recall seeing any previous postings about the 2Q project,
either. We should try to keep these discussions on-list.
First brought up two months ago to the list that it seemed most
relevant to at the time:
http://archives.postgresql.org/pgsql-cluster-hackers/2012-02/msg00001.php
and the solid commentary started coming out yesterday, right near
the approach of May as promised. There hasn't been much posted on
hackers about this out of respect for the extended CommitFest.
I personally suggested Aakash vent ideas out on pgsql-students (or
here) at the beginning of the month:
http://archives.postgresql.org/pgsql-students/2012-04/msg00002.php
but the next time I heard about the subject was the acceptance
notice.
That's largely my fault. Aakash's original proposal was naively
assuming that logical records were available in our xlog stream and
wanted to work on tools to do something useful with that data.
There was some confusion among the various core team members,
committers, and other contributors who volunteered to review the
proposals about whether he was talking about analyzing the server
log or xlog. We had a wide-ranging discussion on the freenode
#postgresql channel about it, and brainstormed to the current idea
-- at which point Aakash had about a week and a half to rewrite his
proposal along the lines discussed on irc. A week after that the
acceptance notices came out.
If we had had any idea that 2Q was working on something along these
lines, we would have coordinated or guided him in another direction.
The post you cite from a couple months ago was way too vague to
have even an inkling that it had anything to do with providing feeds
of the logical interpretation of the WAL stream. I don't recall
having heard the idea even mentioned before.
The timing in regard to the release process seems to have had an
effect on both groups here, since I suggested he read up on
walsender and walreceiver and a few other areas, put together a Wiki
page with the plan, and *then* describe the preliminary plan to the
community (a month or so before starting to code), so that details
could be hashed out in public before time was spent coding. Had
there not been so many people heads-down trying to get the release
out, I might have advised him to post a hand-wavy description of the
effort before he got the Wiki page set up and reviewed the relevant
code.
As Josh points out, his first choice was to write utilities to *do*
something with the logical xlog information, and he only switched
over to working on *producing* it when he heard that no such thing
was currently available. I think there's a lot of room to
coordinate these efforts so that there *isn't* a waste of effort as
long as communication is open enough.
-Kevin
Andres Freund <andres@2ndquadrant.com> wrote:
Something like registering/deregistering also doesn't fit that
well with the way walsender works as far as I understand it.
If you look at the diagrams on the xReader Wiki page, the lines
labeled "XLOG stream" are the ones using walsender/walreceiver. The
green arrows represent normal connections to the database, to run
queries to retrieve metadata needed to interpret the WAL records,
and the lines labeled "Listener n" are expected to use the pg
protocol to connect, but won't be talking page-oriented WAL -- they
will be dealing with logical interpretation of the WAL -- the sort
of data which could be fed to a database which doesn't have the same
page images, as Slony et al do.
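For illustration, the metadata retrieval over those green-arrow
connections would be ordinary catalog queries, something along these
lines (the relfilenode value is a placeholder for one pulled from a
WAL record):
-- resolve a WAL record's relfilenode to names and types
SELECT c.relname, n.nspname, a.attname, t.typname
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_attribute a ON a.attrelid = c.oid
JOIN pg_type t ON t.oid = a.atttypid
WHERE c.relfilenode = 16384  -- placeholder taken from the WAL record
  AND a.attnum > 0 AND NOT a.attisdropped
ORDER BY a.attnum;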
Perhaps, given other points you made, the library for interpreting
the WAL records could be shared, and hopefully a protocol for the
clients, although that seems a lot more muddy to me at this point.
If we can share enough code, there may be room for both approaches
with minimal code duplication.
-Kevin
On 04/27/2012 06:48 PM, Kevin Grittner wrote:
That's largely my fault.
It may not have come out that way, but I was trying more to point out
the complexity of the story rather than assign blame. I find it hard to
point at any obvious "don't do that again" target here.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Fri, Apr 27, 2012 at 11:18 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
Andres Freund <andres@2ndquadrant.com> wrote:
I'm not convinced that I would rather see the logic fixed inside the
master as opposed to being deployable on the master's machine, the
slave machine, or even on its own machine in between.
There are use cases where the translation from WAL to logical takes
place on the master, the standby or other locations.
It's becoming clear that filtering records on the source is important
in high bandwidth systems, so the initial work focuses on putting that
on the "master", i.e. the source. Which was not my first thought
either. If you use cascading, this would still allow you to have
master -> standby -> logical.
Translating WAL is a very hard task. Some time ago, I did also think
an external tool would help (my initial design was called xfilter),
but I no longer think that is likely to work very well apart from very
simple cases.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Aakash, when you get a chance, could you fill in the "inch-stones"
from the GSoC proposal page onto the Wiki page?
Sure, http://wiki.postgresql.org/wiki/XReader updated.
Hello Andres,
The xReader design was based on the idea that it would be nice not
to cause load on the master machine, and that by proxying the WAL
stream to the HS, using synchronous replication style to write from
xReader to the HS, you could use the HS for a source for that data
with it being at exactly the right point in time to query it.
Yes, that does make sense for some workloads. I don't think it's viable for
everything though; that's why we're not aiming for that ourselves atm.
Regarding the above, what would be a case where querying the HS will not
suffice?
Simon Riggs <simon@2ndQuadrant.com> writes:
Translating WAL is a very hard task.
No kidding. I would think it's impossible on its face. Just for
starters, where will you get table and column names from? (Looking at
the system catalogs is cheating, and will not work reliably anyway.)
IMO, if we want non-physical replication, we're going to need to build
it in at a higher level than after-the-fact processing of WAL.
I foresee wasting quite a lot of effort on the currently proposed
approaches before we admit that they're unworkable.
regards, tom lane
Tom Lane wrote:
Simon Riggs writes:
Translating WAL is a very hard task.
No kidding. I would think it's impossible on its face. Just for
starters, where will you get table and column names from? (Looking
at the system catalogs is cheating, and will not work reliably
anyway.)
Well, the xReader approach (Aakash's GSoC project) is to serve as a
proxy for a WAL stream going to a hot standby, to interpret each
incoming WAL record by "cheating" and querying the HS before passing
the WAL along, and then using sync rep to the HS regardless of
whether the incoming stream is sync or async. What reliability
problem do you see with this approach?
-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
Tom Lane wrote:
Simon Riggs writes:
Translating WAL is a very hard task.
No kidding. I would think it's impossible on its face.
Well, the xReader approach (Aakash's GSoC project) is to serve as a
proxy for a WAL stream going to a hot standby, to interpret each
incoming WAL record by "cheating" and querying the HS before passing
the WAL along, and then using sync rep to the HS regardless of
whether the incoming stream is sync or async. What reliability
problem do you see with this approach?
Well, first off, it requires a replica that *is* a physical replica
(else it won't have the same table OIDs, for instance). I'm not sure I
get the point of logical replication that requires a physical replica as
a prerequisite.
Next, it breaks immediately in the face of DDL:
CREATE TABLE foo (f1 int);
BEGIN;
ALTER TABLE foo RENAME COLUMN f1 TO f2;
INSERT INTO foo (f2) VALUES (1);
COMMIT;
The standby is not going to think that the ALTER is committed, so it
will not report the right column name when it comes time to translate
the INSERT. Actually, you can break it even more easily than that:
CREATE TABLE bar AS SELECT ...
What will you do with the insertions executed by this CREATE? They have
to be executed before the creation of table bar is committed.
Also, you'd need a query connection per database (at least, maybe one
per session if you were trying to track DDL effects), which seems rather
a lot of load on the HS slave. That together with the requirement for
synchronous operation seems absolutely catastrophic from a performance
standpoint.
If you want to spend the summer building a toy, fine, but I don't see
this going anywhere for production purposes. It would be enormously
more performant for the master to be emitting logical replication
records to start with, since it already has all the right names etc
at hand at basically no cost.
regards, tom lane
Hello Tom,
I'm not sure I
get the point of logical replication that requires a physical replica as
a prerequisite.
It would be enormously
more performant for the master to be emitting logical replication
records to start with, since it already has all the right names etc
at hand at basically no cost.
I want to emphasize that any system which uses the logical reader setup,
including a replication system, tends to be very computationally intensive
on the database which it queries so often. In most environments,
the source database is the main database, and as such, any performance
degradation on this database is very bad.
On the other hand, if we offload almost all the work to the physical
replica, our source database, which is the main database, still functions
at the same throughput.
Thus, at the cost of having the replication system as a whole run a little
slower than it could if it were using the main database, we have made sure
that our performance-critical main source database is not affected at all.
On Sat, Apr 28, 2012 at 4:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
Tom Lane wrote:
Simon Riggs writes:
Translating WAL is a very hard task.
No kidding. I would think it's impossible on its face.
Well, the xReader approach (Aakash's GSoC project) is to serve as a
proxy for a WAL stream going to a hot standby, to interpret each
incoming WAL record by "cheating" and querying the HS before passing
the WAL along, and then using sync rep to the HS regardless of
whether the incoming stream is sync or async. What reliability
problem do you see with this approach?Well, first off, it requires a replica that *is* a physical replica
(else it won't have the same table OIDs, for instance).
It's even harder than that because the physical replica needs to be
able to be shut down and restarted. We currently restart WAL replay
from the last checkpoint, but if changes have already been made and
committed then the catalog will be incorrect when we replay the WAL
again. So, I think saying "very hard" is justified.
We basically have a choice of shipping full metadata with every change
or finding some way to avoid doing that. Investigating the latter
seems like a worthwhile task but there are limits, as you say.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Tom Lane wrote:
"Kevin Grittner" writes:
Well, the xReader approach (Aakash's GSoC project) is to serve as
a proxy for a WAL stream going to a hot standby, to interpret each
incoming WAL record by "cheating" and querying the HS before
passing the WAL along, and then using sync rep to the HS
regardless of whether the incoming stream is sync or async. What
reliability problem do you see with this approach?Well, first off, it requires a replica that *is* a physical replica
(else it won't have the same table OIDs, for instance). I'm not
sure I get the point of logical replication that requires a
physical replica as a prerequisite.
Maybe I can help with that by describing what the Wisconsin court
system does for circuit court data. Due to state law, state supreme
court rules, and the general suckiness of the WAN available to us,
each of Wisconsin's 72 counties has a database server in the county
seat which is the "official" data of record. We have two parallel
forms of backup -- WAL-based and logical.
For WAL-based replication we maintain a copy of the last two weekly
base backups and all WAL files needed to recover from either of them
to any point in time. One copy of this is required to be in the
county, on a separate system with separate drives. The other copy is
maintained centrally. One rather large server has a hot standby of
all county databases, currently fed through copying WAL files
(although we plan to move to streaming this year). These "physical"
backups are the primary source for recovering from catastrophic
failure of a county server.
We have software to "top off" the recovered database with
transactions from the logical transaction stream as needed, since
that is more up-to-date. We have also found this combination of
physical and logical transactions useful when someone trying to do a
data fix with direct SQL statements against a production database
mangled things badly. We use PITR recovery up to just before the
damage, then apply logical transactions from that point forward,
skipping the problem transactions.
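For reference, the PITR step here is ordinary point-in-time recovery;
a recovery.conf sketch, with a made-up archive path and target time:
restore_command = 'cp /wal_archive/%f %p'
recovery_target_time = '2012-04-20 14:59:00'   # stop just before the damage
recovery_target_inclusive = 'false'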
Until recently our logical transaction stream was generated from our
Java middle tier running on the database servers. This has recently
been converted to use a generalized PostgreSQL trigger, written in C,
which is attached to all tables for which we want logical replication
(about 90% of them). The trigger captures data into transaction
tables in the database as part of the same transaction they are
representing, so that the generation of the logical transaction data
is atomic with the represented data. We didn't switch to one of the
existing solutions because of all the code downstream which uses the
logical transaction streams -- we didn't want to change the format or
delivery method of the logical steam because it's simply not feasible
to change everything at once.
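A rough PL/pgSQL sketch of the capture pattern (the real trigger is
written in C, and all names here are made up), just to show the shape
of it:
CREATE TABLE txn_log (
    id         bigserial PRIMARY KEY,
    txn_time   timestamptz NOT NULL DEFAULT now(),
    table_name text NOT NULL,
    op         text NOT NULL,
    row_data   text
);
CREATE FUNCTION capture_change() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    -- runs inside the same transaction as the change it records
    IF TG_OP = 'DELETE' THEN
        INSERT INTO txn_log (table_name, op, row_data)
            VALUES (TG_TABLE_NAME, TG_OP, OLD::text);
        RETURN OLD;
    END IF;
    INSERT INTO txn_log (table_name, op, row_data)
        VALUES (TG_TABLE_NAME, TG_OP, NEW::text);
    RETURN NEW;
END;
$$;
CREATE TRIGGER capture AFTER INSERT OR UPDATE OR DELETE ON some_table
    FOR EACH ROW EXECUTE PROCEDURE capture_change();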
The way the logical stream is currently delivered is still through
the aforementioned Java middle tier. It reads the logical
transaction log tables and sends the same data in non-persistent JMS
messages through queues to the "TRAN client" at the central location,
just as we did for years before we ever heard of PostgreSQL.
Normally, the TRAN client reads each logical transaction once, and
feeds many "replication targets" at the central site. These include:
- four "Central Information Repository" (CIR) databases each of
which contains data from all 72 counties (two machines each at
two different sites)
- four "Database Transaction Repository" (DTR) databases on the
same servers as the CIR databases (but separate RAIDs). This is
used as a supplementary source for recovery (as mentioned
above), as well as for auditing what happened on a case when
there is any question, and certain activity reports.
- a filter to identify transactions of possible interest to the
publishers we use for interfaces to other agencies (district
attorneys, public defenders, department of corrections, police
agencies, social service agencies, etc.) Transactions of
interest are queued for closer review by the publisher software,
which actually posts messages if the transaction is found to
actually be of interest.
- other databases for the state court's "central services" for
such things as court interpreters used for non-English speakers
in the various counties
The number and varied nature of the targets is significant, because
the "fan out" and timing differences need to be considered. The TRAN
client normally receives one copy of the logical data for each
transaction, and about 99% of the time the transaction's data is
replicated to all databases (and queued for review by publishers)
within a second of someone pressing Enter 300 miles away. It's that
other 1% that gets really interesting.
To deal with the possibility that one or more central targets were
down or running very slowly, the TRAN client will give up on keeping
them at the same point in the transaction stream after a
(configurable) delay. Since this replication is important for
disaster recovery, we don't want problems on one server to keep
others from staying up-to-date. As currently configured, for every
set of targets which falls behind, there is a redundant stream of
logical transactions across the WAN.
Let's talk about the WAN. We're somewhat "compelled" to use the WAN
offered by the state's Department of Administration. It is
expensive, so we use the slowest speed that will work for each
county. That can be as low as 750 Kbps up to 10 Mbps, and even at
those pitiful rates, this was costing us (last I heard) over $80,000
per month -- so boosting WAN speed is a last resort. Besides being
the conduit for both the WAL and logical transaction streams, this
same WAN is used for file server backups, access to the Internet for
court staff, and inter-county database access (such as District Court
Administrators or Chief Judges access all counties in their
respective districts).
Sometimes a WAN pipe gets saturated and the transaction streams for a
county fall behind. This can be aggravated by other traffic (such as
when all court employees in a county are asked to view a training
video via the Internet), or by large raster-scanned documents being
added to the database. In these cases the "let the lagging target
spawn its own transaction stream" approach currently becomes
pathological. On the saturated connection an ACK of a message from
one or more targets may be delayed long enough that it splits the
stream and makes a bad situation even worse.
We have an item on our annual plan to try to address these problems,
which two people were assigned to start working on next month. The
xReader technology proposed by Aakash is so much better than what we
otherwise can do, that when management heard about it, they delayed
that project in hopes that by waiting a few months, it can be done
with xReader. The goal is to move to streaming replication and have
that be the one stream of transaction data flowing over the WAN from
the counties to the central location. From this stream we want to
generate the logical transactions to feed all the other systems,
while maintaining the WAL-based HS as the primary source for disaster
recovery. It is important that, at least initially, we can produce
logical transactions in the format accepted by all these other
internal systems.
The plan *was* to continue to fire the triggers to capture the data
on the county servers, but read those tables on the HS. That has
many obvious disadvantages compared to either of the proposed logical
replication systems. But for our purposes, generating logical
transactions at the central site rather than at the original database
is far superior, in that it means we only send the data over the WAN
once, not twice.
Next, it breaks immediately in the face of DDL:
CREATE TABLE foo (f1 int);
BEGIN;
ALTER TABLE foo RENAME COLUMN f1 TO f2;
INSERT INTO foo (f2) VALUES (1);
COMMIT;
The standby is not going to think that the ALTER is committed, so
it will not report the right column name when it comes time to
translate the INSERT. Actually, you can break it even more easily
than that:
CREATE TABLE bar AS SELECT ...
What will you do with the insertions executed by this CREATE? They
have to be executed before the creation of table bar is committed.
Also, you'd need a query connection per database (at least, maybe
one per session if you were trying to track DDL effects), which
seems rather a lot of load on the HS slave. That together with the
requirement for synchronous operation seems absolutely catastrophic
from a performance standpoint.
Good points, but I think they all yield to one solution. If we can
generate the appropriate snapshot based on the WAL record we are
looking at, those problems all evaporate, don't they? It seems to me
that we could build on the "shared snapshot" work to provide a means
to this end.
If you want to spend the summer building a toy, fine, but I don't
see this going anywhere for production purposes.
I, the whole DBA group here, and the managers are very excited about
putting such a technology into production here. Enough to probably
be willing to find a way to do it under version 9.1 this year if at
all possible.
It would be enormously more performant for the master to be
emitting logical replication records to start with, since it
already has all the right names etc at hand at basically no cost.
Not when the consumers are across a WAN, and that WAN is the biggest
performance bottleneck and the most expensive resource involved.
I will cheerfully agree that both approaches have valuable use-cases.
-Kevin
On Sun, Apr 29, 2012 at 3:27 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
It would be enormously more performant for the master to be
emitting logical replication records to start with, since it
already has all the right names etc at hand at basically no cost.
Not when the consumers are across a WAN, and that WAN is the biggest
performance bottleneck and the most expensive resource involved.
I agree that the WAN is important, for both bandwidth and response time.
Though it isn't a given that logical change records (LCRs) will
require more bandwidth than physical WAL. WAL contains full page
images, index changes and other information that would be absent from
the LCR stream. It also depends upon the specification of the LCRs -
what metadata is included and whether the LCRs use text or binary.
Those choices have other impacts as well, so measurements and detailed
analysis are required to justify how to proceed. Which is what is in
progress now.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs wrote:
Though it isn't a given that logical change records (LCRs) will
require more bandwidth than physical WAL.
It would help if you could share a little more of what you have in
mind. It's not yet clear to me whether you're talking about adding
logical records to the WAL stream, replacing page-oriented records in
the WAL stream with logical records, or leaving the WAL stream alone
and generating a second, new stream of logical data.
WAL contains full page images, index changes and other information
that would be absent from the LCR stream.
Other than full page images, which could be eliminated in other ways
before streaming the WAL, what is in the WAL file that would not be
needed to maintain an exact replica of the cluster, suitable for
quick deployment in place of a lost cluster for disaster recovery? Or
do you foresee eliminating some of the current page-image oriented
WAL records in favor of logical WAL records, with an expectation that
the logical records will be smaller?
measurements and detailed analysis are required to justify how to
proceed. Which is what is in progress now.
Could you clarify? Are you referring to this discussion or some
effort at 2Q that hasn't yet been shared with the community?
-Kevin
On Sun, Apr 29, 2012 at 7:36 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
It would help if you could share a little more of what you have in
mind. It's not yet clear to me whether you're talking about adding
logical records to the WAL stream, replacing page-oriented records in
the WAL stream with logical records, or leaving the WAL stream alone
and generating a second, new stream of logical data.
Could you clarify? Are you referring to this discussion or some
effort at 2Q that hasn't yet been shared with the community?
These are early days yet and I've been deliberately vague on the
format and content of the LCRs. That is to allow the idea that we have
choices to bed in a little, so we can discuss them. The one thing that
does stand clear in my mind at this stage is that the physical
transport of LCRs should be the current streaming infrastructure. So
the different aspects of the design are at differing stages of
certainty (in my mind).
Much of today has been spent on a presentation explaining the thought
processes and options available to us. Deciding between those options
needs to be fact based rather than just a matter of opinion and I want
to ensure we make the right choices, whatever they are.
I've mentioned a few times already that we're working on prototypes
all of which will be shared with the community. R&D, in that order.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Apr 28, 2012 at 11:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
Translating WAL is a very hard task.
No kidding. I would think it's impossible on its face. Just for
starters, where will you get table and column names from? (Looking at
the system catalogs is cheating, and will not work reliably anyway.)
IMO, if we want non-physical replication, we're going to need to build
it in at a higher level than after-the-fact processing of WAL.
I foresee wasting quite a lot of effort on the currently proposed
approaches before we admit that they're unworkable.
I think the question we should be asking ourselves is not whether WAL
as it currently exists is adequate for logical replication, but rather
or not it could be made adequate. For example, suppose that we were
to arrange things so that the first insert,
update, or delete record for a given relfilenode after each checkpoint
emits a special WAL record that contains the relation name, schema
OID, attribute names, and attribute type OIDs. Well, now we are much
closer to being able to do some meaningful decoding of the tuple data,
and it really doesn't cost us that much. Handling DDL (and manual
system catalog modifications) seems pretty tricky, but I'd be very
reluctant to give up on it without banging my head against the wall
pretty hard. The trouble with giving up on WAL completely and moving
to a separate replication log is that it means a whole lot of
additional I/O, which is bound to have a negative effect on
performance.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, 2012-04-29 at 16:33 -0400, Robert Haas wrote:
On Sat, Apr 28, 2012 at 11:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
Translating WAL is a very hard task.
No kidding. I would think it's impossible on its face. Just for
starters, where will you get table and column names from? (Looking at
the system catalogs is cheating, and will not work reliably anyway.)
IMO, if we want non-physical replication, we're going to need to build
it in at a higher level than after-the-fact processing of WAL.
I foresee wasting quite a lot of effort on the currently proposed
approaches before we admit that they're unworkable.
I think the question we should be asking ourselves is not whether WAL
as it currently exists is adequate for logical replication, but rather
or not it could be made adequate.
Agreed.
For example, suppose that we were
to arrange things so that the first insert,
update, or delete record for a given relfilenode after each checkpoint
emits a special WAL record that contains the relation name, schema
OID, attribute names, and attribute type OIDs.
Not just the first after checkpoint, but also the first after a schema
change; even though this will duplicate the WAL entries for changes to
the system catalog, it is likely much cheaper overall to always have a
fresh structure in the WAL stream.
And if we really want to do WAL-->logical-->SQL_text conversion on a
host separate from the master, we also need to insert there the type
definitions of user-defined types, together with at least the types'
output functions in some form.
So you basically need a large part of postgres for reliably making sense
of WAL.
Well, now we are much
closer to being able to do some meaningful decoding of the tuple data,
and it really doesn't cost us that much. Handling DDL (and manual
system catalog modifications) seems pretty tricky, but I'd be very
reluctant to give up on it without banging my head against the wall
pretty hard.
The most straightforward way is to have a more or less full copy of
pg_catalog also on the "WAL-filtering / WAL-conversion" node, and to use
it in 1:1 replicas of transactions recreated from the WAL.
This way we can avoid recreating any alternate views of the master's
schema.
Then again, we could do it all on the master, inside the WAL-writing
transaction, and thus avoid a large chunk of the problems.
If the receiving side is also PostgreSQL with the same catalog structure
(i.e. same major version) then we don't actually need to "handle DDL" in
any complicated way; it would be enough to just carry over the changes
to the system tables.
The main reason we don't do it currently for trigger-based logical
replication is the restriction of not being able to have triggers on
system tables.
I hope it is much easier to have the triggerless record generation also
work on system tables.
The trouble with giving up on WAL completely and moving
to a separate replication log is that it means a whole lot of
additional I/O, which is bound to have a negative effect on
performance.
Why would you give up WAL?
Or do you mean that the new "logical WAL" needs to have the same
commit-time behaviour as WAL to be reliable?
I'd envision a scenario where the logical WAL is sent to the slave or
distribution hub directly and not written at the local host at all.
An optional sync mode similar to current sync WAL replication could be
configured. I hope this would run mostly in parallel with local WAL
generation so not much extra wall-clock time would be wasted.
--
-------
Hannu Krosing
PostgreSQL Unlimited Scalability and Performance Consultant
2ndQuadrant Nordic
PG Admin Book: http://www.2ndQuadrant.com/books/
On Sun, Apr 29, 2012 at 6:00 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
I think the question we should be asking ourselves is not whether WAL
as it currently exists is adequate for logical replication, but rather
or not it could be made adequate.
Agreed.
And of course I meant "but rather whether or not it could be made
adequate", but I dropped a word.
For example, suppose that we were
to arrange things so that the first insert,
update, or delete record for a given relfilenode after each checkpoint
emits a special WAL record that contains the relation name, schema
OID, attribute names, and attribute type OIDs.
Not just the first after checkpoint, but also the first after a schema
change; even though this will duplicate the WAL entries for changes to
the system catalog, it is likely much cheaper overall to always have a
fresh structure in the WAL stream.
Yes.
And if we really want to do WAL-->logical-->SQL_text conversion on a
host separate from the master, we also need to insert there the type
definitions of user-defined types, together with at least the types'
output functions in some form.
Yes.
So you basically need a large part of postgres for reliably making sense
of WAL.
Agreed, but I think that's a problem we need to fix and not a
tolerable situation at all. If a user can create a type-output
function that goes and looks at the state of the database to determine
what to output, then we are completely screwed, because that basically
means you would need to have a whole Hot Standby instance up and
running just to make it possible to run type output functions. Now
you might be able to build a mechanism around that that is useful to
some people in some situations, but wow does that sound painful. What
I want is for the master to be able to cheaply rattle off the tuples
that got inserted, updated, or deleted as those things happen; needing
a whole second copy of the database just to do that does not meet my
definition of "cheap". Furthermore, it's not really clear that it's
sufficient anyway, since there are problems with what happens before
the HS instance reaches consistency, what happens when it crashes and
restarts, and how do we handle the case when the system catalog we
need to examine to generate the logical replication records is
access-exclusive-locked? Seems like a house of cards.
Some of this might be possible to mitigate contractually, by putting
limits on what type input/output functions are allowed to do. Or we
could invent a new analog of type input/output functions that is
explicitly limited in this way, and support only types that provide
it. But I think the real key is that we can't rely on catalog access:
the WAL stream has to have enough information to allow the reader to
construct some set of in-memory hash tables with sufficient detail to
reliably decode WAL. Or at least that's what I'm thinking.
The most straightforward way is to have a more or less full copy of
pg_catalog also on the "WAL-filtering / WAL-conversion" node, and to use
it in 1:1 replicas of transactions recreated from the WAL.
This way we can avoid recreating any alternate views of the master's
schema.
See above; I have serious doubts that this can ever be made to work robustly.
Then again, we could do it all on the master, inside the WAL-writing
transaction, and thus avoid a large chunk of the problems.
If the receiving side is also PostgreSQL with the same catalog structure
(i.e. same major version) then we don't actually need to "handle DDL" in
any complicated way; it would be enough to just carry over the changes
to the system tables.
I agree it'd be preferable to handle DDL in terms of system catalog
updates, rather than saying, well, this is an ALTER TABLE .. RENAME.
But you need to be able to decode tuples using the right tuple
descriptor, even while that's changing under you.
Why would you give up WAL?
For lack of ability to make it work. Don't underestimate how hard
it's going to be to nail this down.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Sun, Apr 29, 2012 at 6:00 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
So you basically need a large part of postgres for reliably making sense
of WAL.
Agreed, but I think that's a problem we need to fix and not a
tolerable situation at all. If a user can create a type-output
function that goes and looks at the state of the database to determine
what to output, then we are completely screwed, because that basically
means you would need to have a whole Hot Standby instance up and
running just to make it possible to run type output functions.
You mean like enum_out? Or for that matter array_out, record_out,
range_out?
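To spell that out with a throwaway example: an enum datum is stored as
just an OID, and enum_out has to consult pg_enum to produce the label --
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
CREATE TABLE moods (m mood);
INSERT INTO moods VALUES ('happy');
SELECT m FROM moods;  -- printing 'happy' requires a pg_enum lookup
SELECT enumlabel FROM pg_enum WHERE enumtypid = 'mood'::regtype;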
regards, tom lane
On Sun, Apr 29, 2012 at 11:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
You mean like enum_out? Or for that matter array_out, record_out,
range_out?
Yeah, exactly. :-(
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 4/29/12 9:27 AM, Kevin Grittner wrote:
Maybe I can help with that by describing what the Wisconsin court
system does for circuit court data.
Thanks for the write-up, it was insightful.
One thing I wanted to mention is that non-binary replication has an
added advantage over binary from a DR standpoint: with binary
replication, corruption that occurs on a master is more likely to make
it into your replicas, thanks to full page writes. You might want to
consider that depending on how sensitive your data is.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
Yeah, we've seen this a few times. We just recently had to rescue a
client from HS-wide corruption using Slony.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus <josh@agliodbs.com> wrote:
Yeah, we've seen this a few times. We just recently had to rescue
a client from HS-wide corruption using Slony.
That's an interesting point. Out of curiosity, how did the
corruption originate?
It suggests a couple questions:
(1) Was Slony running before the corruption occurred? If not, how
was Slony helpful? I know that in our environment, where we have
both going through separate streams, with a repository of the
logical transactions, we would use PITR recovery to get to the
latest known good state which we could easily identify, and then
replay the logical transactions to "top it off" to get current. If
necessary we could skip logical transactions which were problematic
results of the corruption.
(2) If logical transactions had been implemented as additions to
the WAL stream, and Slony was using that, do you think they would
still have been usable for this recovery?
Perhaps sending both physical and logical transaction streams over
the WAN isn't such a bad thing, if it gives us more independent
recovery mechanisms. That's fewer copies than we're sending with
current trigger-based techniques. It would be particularly
attractive if we could omit (filter out) certain tables before going
across the WAN. I would be willing to risk sending the big
raster-scanned documents through just the physical channel so long
as I had a nightly compare of md5sum values on both sides so we can
resend any corrupted data (or tell people to rescan).
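The nightly compare could be driven by a per-table checksum query along
these lines (table and key names are illustrative), run on both sides
and diffed:
-- run on the master and on the replica, then compare the results
SELECT md5(string_agg(d::text, '' ORDER BY doc_id)) AS doc_checksum
FROM documents d;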
-Kevin
That's an interesting point. Out of curiosity, how did the
corruption originate?
We're still not sure. It appears to be in the system catalogs, though.
Note that the original master developed memory issues.
It suggests a couple questions:
(1) Was Slony running before the corruption occurred?
No.
If not, how
was Slony helpful?
Install, replicate DB logically, new DB works fine.
(2) If logical transactions had been implemented as additions to
the WAL stream, and Slony was using that, do you think they would
still have been usable for this recovery?
Quite possibly not.
Perhaps sending both physical and logical transaction streams over
the WAN isn't such a bad thing, if it gives us more independent
recovery mechanisms. That's fewer copies than we're sending with
current trigger-based techniques.
Frankly, there's nothing wrong with the Slony model for replication
except for the overhead of:
1. triggers
2. queues
3. Running DDL
However, the three above are really big issues.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 5/3/12 2:54 PM, Josh Berkus wrote:
(2) If logical transactions had been implemented as additions to
the WAL stream, and Slony was using that, do you think they would
still have been usable for this recovery?
Quite possibly not.
The key advantage that I see in londiste/slony replication is that your
data stream has absolutely nothing to do with anything binary or internal
to Postgres. That means that the only way corruption will travel from a
master to a slave is if the corruption is in the actual fields being
updated, and even that's not a given (i.e. updating a field to a
completely new value would not propagate corruption even if the old value
of the field was corrupted).
So, embedding a logical stream into WAL is not inherently bad... what
would be bad is if that "logical" stream was susceptible to corruption
due to something like full page writes. Simply embedding the exact same
info slony or londiste captures into the WAL should be fine (though
likely defeats the purpose). Translating binary WAL data into DML
statements would very likely allow corruption to travel from master to
slave.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net