Avoid full page images in streaming replication?

Started by Jim Nasbyover 10 years ago20 messageshackers
Jump to latest
#1Jim Nasby
Jim.Nasby@BlueTreble.com

ISTM it should be possible to avoid sending full page writes to a
streaming replica once the replica has reached a consistent state. I
assume that the replica would still need to write full pages to it's
disk in case of a crash, but the sender could insert special WAL records
to tell it when to do so, instead of sending the full page image.
Presumably this would be a big win for replication over a WAN.

Am I missing something? I see that pglesslog is no longer supported but
couldn't find any particular reason for that...
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#1)
Re: Avoid full page images in streaming replication?

Jim Nasby <Jim.Nasby@BlueTreble.com> writes:

ISTM it should be possible to avoid sending full page writes to a
streaming replica once the replica has reached a consistent state. I
assume that the replica would still need to write full pages to it's
disk in case of a crash, but the sender could insert special WAL records
to tell it when to do so, instead of sending the full page image.
Presumably this would be a big win for replication over a WAN.

How are you going to make that work without LSNs in the WAL received by
the replica diverging from those in the master's WAL?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#2)
Re: Avoid full page images in streaming replication?

On Oct 22, 2015 23:38, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

Jim Nasby <Jim.Nasby@BlueTreble.com> writes:

ISTM it should be possible to avoid sending full page writes to a
streaming replica once the replica has reached a consistent state. I
assume that the replica would still need to write full pages to it's
disk in case of a crash, but the sender could insert special WAL records
to tell it when to do so, instead of sending the full page image.
Presumably this would be a big win for replication over a WAN.

How are you going to make that work without LSNs in the WAL received by
the replica diverging from those in the master's WAL?

We could in theory send a "this would be been a fpi but it's skipped"
record which would only exist in streaming and just make the standby write
a noop of some kind? It would still be on the standby but it would at least
not consume the bandwidth. Just skip sending the actual contents of the
fpi.

/Magnus

#4Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Magnus Hagander (#3)
Re: Avoid full page images in streaming replication?

On 10/22/15 4:42 PM, Magnus Hagander wrote:

How are you going to make that work without LSNs in the WAL received by
the replica diverging from those in the master's WAL?

We could in theory send a "this would be been a fpi but it's skipped"
record which would only exist in streaming and just make the standby
write a noop of some kind? It would still be on the standby but it would
at least not consume the bandwidth. Just skip sending the actual
contents of the fpi.

I don't think it can be a noop on the receiver though... doesn't the
receiver still need full page images in case of a crash? (Assuming
full_page_writes is enabled...)

The other issue is chained replicas, where one of the children may need
full page writes (during initial copy).
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Andres Freund
andres@anarazel.de
In reply to: Jim Nasby (#1)
Re: Avoid full page images in streaming replication?

On 2015-10-22 16:34:38 -0500, Jim Nasby wrote:

ISTM it should be possible to avoid sending full page writes to a streaming
replica once the replica has reached a consistent state. I assume that the
replica would still need to write full pages to it's disk in case of a
crash, but the sender could insert special WAL records to tell it when to do
so, instead of sending the full page image. Presumably this would be a big
win for replication over a WAN.

Note that FPIs are often pretty good for replay performance, avoiding
lots of synchronous random reads.

I think FPI compression is a better solution for now. I found it to be
extremely effective in some benchmarks I recently ran.

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Robert Haas
robertmhaas@gmail.com
In reply to: Jim Nasby (#4)
Re: Avoid full page images in streaming replication?

On Thu, Oct 22, 2015 at 5:57 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

We could in theory send a "this would be been a fpi but it's skipped"
record which would only exist in streaming and just make the standby
write a noop of some kind? It would still be on the standby but it would
at least not consume the bandwidth. Just skip sending the actual
contents of the fpi.

I don't think it can be a noop on the receiver though... doesn't the
receiver still need full page images in case of a crash? (Assuming
full_page_writes is enabled...)

Yes. If the standby is in the middle of writing a page updated by a
WAL record and crashes, it can end up with a torn page. We restart
from a restartpoint at a location where the master checkpointed so
that we can be certain that replay of the FPI will fix the problem.
If you got rid of the FPIs, you'd be dead.

This is true both before and after reaching a consistent state, which
seems like a fatal flaw in this plan.

It's true that if the standby didn't have the master's FPIs, it could
generate its own in some side location that behaves like a separate
WAL stream or a double-write buffer. But that would be a heck of a
lot of work to implement for an uncertain benefit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Andres Freund (#5)
Re: Avoid full page images in streaming replication?

On 10/22/15 5:03 PM, Andres Freund wrote:

On 2015-10-22 16:34:38 -0500, Jim Nasby wrote:

ISTM it should be possible to avoid sending full page writes to a streaming
replica once the replica has reached a consistent state. I assume that the
replica would still need to write full pages to it's disk in case of a
crash, but the sender could insert special WAL records to tell it when to do
so, instead of sending the full page image. Presumably this would be a big
win for replication over a WAN.

Note that FPIs are often pretty good for replay performance, avoiding
lots of synchronous random reads.

Right. You'd only want to use this if you're streaming over a slow link
(like a WAN).

I think FPI compression is a better solution for now. I found it to be
extremely effective in some benchmarks I recently ran.

What I'm wondering is how compressible a 'normal' FPI is. Certainly if
the hole is zero'd out and the page is mostly empty you'll get great
compression. What about other workloads? For reference, if a 'FPI
placeholder' WAL record is 16 bytes, that's 51,200% compression. If it's
12 bytes, it's 68,200% compression. (I'm assuming we write the hole too,
but maybe that's not true?)

FWIW, I started thinking about this when a client overwhelmed a remote
slave doing VACUUM FREEZE after a Slony upgrade to 9.4. Granted, that's
not normal, but it looks like normal vacuuming generates 2-6 bytes per
modified tuple (depending on what was done). So even if you vacuumed 100
rows on a page (which seems pretty high for most cases) that's only
~200-600 bytes, compared to ~8200 bytes for the FPI.

The other interesting thing is that even their local slaves (with 20Gbps
bandwidth) fell behind with vacuum_cost_delay=0, because replay was
CPU-bound. But presumably that's not due to FPIs.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Robert Haas (#6)
Re: Avoid full page images in streaming replication?

On 10/22/15 5:11 PM, Robert Haas wrote:

It's true that if the standby didn't have the master's FPIs, it could
generate its own in some side location that behaves like a separate
WAL stream or a double-write buffer. But that would be a heck of a
lot of work to implement for an uncertain benefit.

If the receiver didn't write the WAL before processing it then it can
just stick the page image into the WAL it's writing for itself. Probably
not good for syncrep, but I don't think you'd want this on for syncrep
anyway.

But yes, this is all very hand-wavy without any actual data on what
percentage of the WAL stream is FPIs. Looks like pageinspect doesn't
work for WAL... does anyone have a script/tool that breaks out what
percentage of a WAL file is FPIs?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Andres Freund
andres@anarazel.de
In reply to: Jim Nasby (#8)
Re: Avoid full page images in streaming replication?

On 2015-10-22 17:47:01 -0500, Jim Nasby wrote:

On 10/22/15 5:11 PM, Robert Haas wrote:

It's true that if the standby didn't have the master's FPIs, it could
generate its own in some side location that behaves like a separate
WAL stream or a double-write buffer. But that would be a heck of a
lot of work to implement for an uncertain benefit.

If the receiver didn't write the WAL before processing it then it can just
stick the page image into the WAL it's writing for itself. Probably not good
for syncrep, but I don't think you'd want this on for syncrep anyway.

To me this sounds like a recipe for disaster (i.e. complex bugs). WAL
(and thus CRC checksums) differing between nodes. Ugh.

But yes, this is all very hand-wavy without any actual data on what
percentage of the WAL stream is FPIs. Looks like pageinspect doesn't work
for WAL... does anyone have a script/tool that breaks out what percentage of
a WAL file is FPIs?

pg_xlogdump --stats

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jim Nasby (#8)
Re: Avoid full page images in streaming replication?

Jim Nasby wrote:

But yes, this is all very hand-wavy without any actual data on what
percentage of the WAL stream is FPIs. Looks like pageinspect doesn't work
for WAL... does anyone have a script/tool that breaks out what percentage of
a WAL file is FPIs?

pg_xlogdump -z

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Andres Freund (#9)
Re: Avoid full page images in streaming replication?

On 10/22/15 5:52 PM, Andres Freund wrote:

If the receiver didn't write the WAL before processing it then it can just

stick the page image into the WAL it's writing for itself. Probably not good
for syncrep, but I don't think you'd want this on for syncrep anyway.

To me this sounds like a recipe for disaster (i.e. complex bugs). WAL
(and thus CRC checksums) differing between nodes. Ugh.

The WAL would *not* differ. This would only affect streaming
replication, and only the stream itself.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Andres Freund
andres@anarazel.de
In reply to: Jim Nasby (#11)
Re: Avoid full page images in streaming replication?

On 2015-10-22 17:59:06 -0500, Jim Nasby wrote:

The WAL would *not* differ.

It would. Hint bits and all.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Alvaro Herrera (#10)
Re: Avoid full page images in streaming replication?

On 10/22/15 5:53 PM, Alvaro Herrera wrote:

Jim Nasby wrote:

But yes, this is all very hand-wavy without any actual data on what
percentage of the WAL stream is FPIs. Looks like pageinspect doesn't work
for WAL... does anyone have a script/tool that breaks out what percentage of
a WAL file is FPIs?

pg_xlogdump -z

Hrm, any option for 9.4? I was hoping to get numbers from some real
workloads...
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jim Nasby (#13)
Re: Avoid full page images in streaming replication?

Jim Nasby wrote:

On 10/22/15 5:53 PM, Alvaro Herrera wrote:

Jim Nasby wrote:

But yes, this is all very hand-wavy without any actual data on what
percentage of the WAL stream is FPIs. Looks like pageinspect doesn't work
for WAL... does anyone have a script/tool that breaks out what percentage of
a WAL file is FPIs?

pg_xlogdump -z

Hrm, any option for 9.4? I was hoping to get numbers from some real
workloads...

You can run the new one in old pg_xlog ...

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#14)
Re: Avoid full page images in streaming replication?

On 2015-10-22 20:27:20 -0300, Alvaro Herrera wrote:

You can run the new one in old pg_xlog ...

You can? The xlog format between 9.4 and 9.5 changed, so I can't see how
that'd work?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Michael Paquier
michael@paquier.xyz
In reply to: Jim Nasby (#7)
Re: Avoid full page images in streaming replication?

On Fri, Oct 23, 2015 at 7:37 AM, Jim Nasby wrote:

What I'm wondering is how compressible a 'normal' FPI is. Certainly if the
hole is zero'd out and the page is mostly empty you'll get great
compression. What about other workloads? For reference, if a 'FPI
placeholder' WAL record is 16 bytes, that's 51,200% compression. If it's 12
bytes, it's 68,200% compression. (I'm assuming we write the hole too, but
maybe that's not true?)

Well, to begin with FPI usually avoid to include the page hole in the
middle. Now, regarding the compressibility of a page taken without its
hole, that's highly schema-dependent. Based on some measurements I did
some time ago a page with repetitive data could compress up to 40%,
with less compressible stuff like UUID I recall it to be 20~25%. I
hacked out for the FPW compression patch a module able to work
directly on raw pages to test their compressibility:
https://github.com/michaelpq/pg_plugins/tree/master/compress_test
get_raw_page() has been taken from pageinspect and I added to it an
option to remove the hole in the middle of the page. Using that you
are able to guess how much pages can get compressed.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#15)
Re: Avoid full page images in streaming replication?

On Fri, Oct 23, 2015 at 8:42 AM, Andres Freund <andres@anarazel.de> wrote:

On 2015-10-22 20:27:20 -0300, Alvaro Herrera wrote:

You can run the new one in old pg_xlog ...

You can? The xlog format between 9.4 and 9.5 changed, so I can't see how
that'd work?

That's not going to work.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#15)
Re: Avoid full page images in streaming replication?

Andres Freund wrote:

On 2015-10-22 20:27:20 -0300, Alvaro Herrera wrote:

You can run the new one in old pg_xlog ...

You can? The xlog format between 9.4 and 9.5 changed, so I can't see how
that'd work?

Oh, crap. Must have been some other cross-version trial run I did,
then. I would hope it's at least not terribly difficult to back-patch
that commit locally, anyway.

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Bernd Helmle
mailings@oopsware.de
In reply to: Alvaro Herrera (#18)
Re: Avoid full page images in streaming replication?

--On 22. Oktober 2015 22:23:58 -0300 Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

You can? The xlog format between 9.4 and 9.5 changed, so I can't see how
that'd work?

Oh, crap. Must have been some other cross-version trial run I did,
then. I would hope it's at least not terribly difficult to back-patch
that commit locally, anyway.

You might want to try

<https://github.com/credativ/pg_xlog_analyzer&gt;

Adrian created it basically to track down archive replay performance issues
a while ago. Maybe it's useful somehow.

--
Thanks

Bernd

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Bernd Helmle
mailings@oopsware.de
In reply to: Andres Freund (#5)
Re: Avoid full page images in streaming replication?

--On 23. Oktober 2015 00:03:30 +0200 Andres Freund <andres@anarazel.de>
wrote:

Note that FPIs are often pretty good for replay performance, avoiding
lots of synchronous random reads.

That's a very import argument, i think. The difference can be significant,
even if you have a decent storage, rendering a replica unusable. We had
examples in the past where only optimizing aggressive writes of FPIs lead
to an acceptable lag of the replica.

--
Thanks

Bernd

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers