AdvanceXLInsertBuffer vs. WAL segment compressibility

Started by Chapman Flack over 9 years ago · 49 messages
#1Chapman Flack
chap@anastigmatix.net

Teaser: change made in 9.4 to simplify WAL segment compression
made it easier to compress a low-activity-period WAL segment
from 16 MB to about 27 kB ... but much harder to do better than
that, as I was previously doing (about two orders of magnitude
better).

At $work, we have a usually-low-activity PG database, so that almost
always the used fraction of each 16 MB WAL segment is far smaller
than 16 MB, and so it's a big win for archived-WAL storage space
if an archive-command can be written that compresses those files
effectively.

Our database was also running on a pre-9.4 version, and I'm
currently migrating to 9.5.3. As I understand it, 9.4 was where
commit 9a20a9b landed, which changed what happens in the unwritten
'tail' of log segments.

In my understanding, before 9.4, the 'tail' of any log segment
on disk just wasn't written, and so (as segment recycling simply
involves renaming a file that held some earlier segment), the
remaining content was simply whatever had been there before
recycling. That was never a problem for recovery (which could
tell when it reached the end of real data), but was not well
compressible with a generic tool like gzip. Specialized tools
like pg_clearxlogtail existed, but had to know too much about
the internal format, and ended up unmaintained and therefore
difficult to trust.

The change in 9.4 included this, from the git comment:

This has one user-visible change: switching to a new WAL segment
with pg_switch_xlog() now fills the remaining unused portion of
the segment with zeros.

... thus making the segments easily compressible with bog standard
tools. So I can just point gzip at one of our WAL segments from a
light-activity period and it goes from 16 MB down to about 27 kB.
Nice, right?

But why does it break my earlier approach, which was doing about
two orders of magnitude better, getting low-activity WAL segments
down to 200 to 300 *bytes*? (Seriously: my last solid year of
archived WAL is contained in a 613 MB zip file.)

That approach was based on using rsync (also bog standard) to
tease apart the changed and unchanged bits of the newly-archived
segment and the last-seen content of the file with the same
i-number. You would expect that to work just as well when the
tail is always zeros as it was working before, right?

And what's breaking it now is the tiny bit of fine
print that's in the code comment for AdvanceXLInsertBuffer but
not in the git comment above:

* ... Any new pages are initialized to zeros, with pages headers
* initialized properly.

That innocuous "headers initialized" means that the tail of the
file is *almost* all zeros, but every 8 kB there is a tiny header,
and in each tiny header, there is *one byte* that differs from
its value in the pre-recycle content at the same i-node, because
that one byte in each header reflects the WAL segment number.
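(A synthetic illustration of the effect: the single varying byte below stands in for the segment-number byte in each real page header; real headers are of course larger than one byte.)

```shell
# Build two fake "segments" of four zeroed 8 kB pages whose only
# per-page difference is one header byte, standing in for the byte
# of the page header that reflects the WAL segment number.
mkpages () {                       # $1 = the byte that varies
  for p in 1 2 3 4; do
    printf '%s' "$1"               # 1-byte stand-in "header"
    head -c 8191 /dev/zero         # rest of the 8 kB page
  done
}
mkpages A > seg_old
mkpages B > seg_new
cmp -l seg_old seg_new | wc -l     # one differing byte per page: 4
```

Those regularly spaced single-byte differences are exactly what defeat a block-based delta tool.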

Before the 9.4 change, I see there were still headers there,
and they did contain a byte matching the segment number, but in
the unwritten portion of course it matched the pre-recycle
segment number, and rsync easily detected the whole unchanged
tail of the file. Now there is one changed byte every 8 kB,
and the rsync output, instead of being 100x better than vanilla
gzip, is about 3x worse.

Taking a step back, isn't overwriting the whole unused tail of
each 16 MB segment really just an I/O intensive way of communicating
to the archive-command where the valid data ends? Could that not
be done more efficiently by adding another code, say %e, in
archive-command, that would be substituted by the offset of the
end of the XLOG_SWITCH record? That way, however archive-command
is implemented, it could simply know how much of the file to
copy.
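(To make the idea concrete: a hypothetical archive_command using the proposed %e, which exists in no released PostgreSQL, might look like the comment below; %p and %f are the existing path and file-name substitutions, and the simulation under it just uses a made-up valid length.)

```shell
# Hypothetical use of the proposed %e, the byte offset where valid
# WAL data ends (no released PostgreSQL implements %e):
#
#   archive_command = 'head -c %e %p | gzip -9 > /archive/%f.gz'
#
# Simulated with a 16 KiB stand-in segment and a made-up offset:
head -c 16384 /dev/zero > fake_segment
e=1234                              # what %e would be replaced with
head -c "$e" fake_segment | gzip -9 > fake_segment.gz
gunzip -c fake_segment.gz | wc -c   # 1234: only the valid prefix
```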

Would it then be possible to go back to the old behavior (or make
it selectable) of not overwriting the full 16 MB every time?
Or did the 9.4 changes also change enough other logic that stuff
would now break if that isn't done?

-Chap

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Chapman Flack (#1)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On Sat, Jul 23, 2016 at 3:32 AM, Chapman Flack <chap@anastigmatix.net> wrote:

Would it then be possible to go back to the old behavior (or make
it selectable) of not overwriting the full 16 MB every time?

I don't see going back to the old behaviour as an improvement because,
as you pointed out above, the new behaviour improves the compression
ratio of WAL files for tools like gzip, and it doesn't seem advisable
to lose that capability. I think providing an option to select the old
behaviour could be one choice, but the use case seems narrow to me
considering there are tools (pglesslog) to clear the tail. Have you
found any problems with that tool that make you think it is not
reliable?

Or did the 9.4 changes also change enough other logic that stuff
would now break if that isn't done?

The changes related to this seem to be isolated (mainly in
CopyXLogRecordToWAL()) and don't look like they impact other parts of
the system, although some more analysis is needed to confirm that. In
any case, the point of making it optional doesn't seem convincing to me.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#3Chapman Flack
chap@anastigmatix.net
In reply to: Amit Kapila (#2)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On 07/23/2016 08:25 AM, Amit Kapila wrote:

On Sat, Jul 23, 2016 at 3:32 AM, Chapman Flack <chap@anastigmatix.net> wrote:

Would it then be possible to go back to the old behavior (or make
it selectable) of not overwriting the full 16 MB every time?

I don't see going back to the old behaviour as an improvement because,
as you pointed out above, the new behaviour improves the compression
ratio of WAL files for tools like gzip, and it doesn't seem advisable
to lose that capability. I think providing an option to select the old
behaviour could be one choice, but the use case seems narrow to me
considering there are tools (pglesslog) to clear the tail. Have you
found any problems with that tool that make you think it is not
reliable?

It was a year or so ago when I was surveying tools that attempted
to do that. I had found pg_clearxlogtail, and I'm sure I also found
pglesslog / pg_compresslog ... my notes from then simply refer to
"contrib efforts like pg_clearxlogtail" and observed either a dearth
of recent search results for them, or a predominance of results
of the form "how do I get this to compile for PG x.x?"

pg_compresslog is mentioned in a section, Compressed Archive Logs,
of the PG 9.1 manual:
https://www.postgresql.org/docs/9.1/static/continuous-archiving.html#COMPRESSED-ARCHIVE-LOGS

That section is absent in the docs any version > 9.1.

The impression that leaves is of tools that relied too heavily
on internal format knowledge to be viable outside of core, which
have had at least periods of incompatibility with newer PG versions,
and whose current status, if indeed any are current, isn't easy
to find out.

It seems a bit risky (to me, anyway) to base a backup strategy
on having a tool in the pipeline that depends so heavily on
internal format knowledge, can become uncompilable between PG
releases, and isn't part of core and officially supported.

And that, I assume, was also the motivation to put the zeroing
in AdvanceXLInsertBuffer, eliminating the need for one narrow,
specialized tool like pg{_clear,_compress,less}log{,tail}, so
the job can be done with ubiquitous, bog standard (and therefore
*very* exhaustively tested) tools like gzip.

So it's just kind of unfortunate that there used to be a *further*
factor of 100 (nothing to sneeze at) possible using rsync
(another non-PG-specific, ubiquitous, exhaustively tested tool)
but a trivial feature of the new behavior has broken that.

Factors of 100 are enough to change the sorts of things you think
about, like possibly retaining years-long unbroken histories of
transactions in WAL.

What would happen if the overwriting of the log tail were really
done with just zeros, as the git comment implied, rather than zeros
with initialized headers? Could the log-reading code handle that
gracefully? That would support all forms of non-PG-specific,
ubiquitous tools used for compression; it would not break the rsync
approach.

Even so, it still seems to me that a cheaper solution is a %e
substitution in archive_command: just *tell* the command where
the valid bytes end. Accomplishes the same thing as ~ 16 MB
of otherwise-unnecessary I/O at the time of archiving each
lightly-used segment.

Then the actual zeroing could be suppressed to save I/O, maybe
with a GUC variable, or maybe just when archive_command is seen
to contain a %e. Commands that don't have a %e continue to work
and compress effectively because of the zeroing.

-Chap

#4Michael Paquier
michael.paquier@gmail.com
In reply to: Chapman Flack (#3)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On Mon, Jul 25, 2016 at 11:21 PM, Chapman Flack <chap@anastigmatix.net> wrote:

The impression that leaves is of tools that relied too heavily
on internal format knowledge to be viable outside of core, which
have had at least periods of incompatibility with newer PG versions,
and whose current status, if indeed any are current, isn't easy
to find out.

The WAL format has gone through a lot of changes in 9.4 as well. 9.3
also introduced xlogreader.c, which is what *any* client trying to
decode WAL into an understandable format should use.

And that, I assume, was also the motivation to put the zeroing
in AdvanceXLInsertBuffer, eliminating the need for one narrow,
specialized tool like pg{_clear,_compress,less}log{,tail}, so
the job can be done with ubiquitous, bog standard (and therefore
*very* exhaustively tested) tools like gzip.

Exactly, and honestly this has been a huge win to make such segments
more compressible.

Even so, it still seems to me that a cheaper solution is a %e
substitution in archive_command: just *tell* the command where
the valid bytes end. Accomplishes the same thing as ~ 16 MB
of otherwise-unnecessary I/O at the time of archiving each
lightly-used segment.

Then the actual zeroing could be suppressed to save I/O, maybe
with a GUC variable, or maybe just when archive_command is seen
to contain a %e. Commands that don't have a %e continue to work
and compress effectively because of the zeroing.

This is over-complicating things for little gain. The new behavior of
filling in with zeros the tail of a segment makes things far better
when using gzip in archive_command.
--
Michael

#5Chapman Flack
chap@anastigmatix.net
In reply to: Michael Paquier (#4)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On 07/25/16 22:09, Michael Paquier wrote:

This is over-complicating things for little gain. The new behavior of
filling in with zeros the tail of a segment makes things far better
when using gzip in archive_command.

Then how about filling with actual zeros, instead of with mostly-zeros
as is currently done? That would work just as well for gzip, and would
not sacrifice the ability to do 100x better than gzip.

-Chap

#6Amit Kapila
amit.kapila16@gmail.com
In reply to: Chapman Flack (#5)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On Tue, Jul 26, 2016 at 9:02 AM, Chapman Flack <chap@anastigmatix.net> wrote:

On 07/25/16 22:09, Michael Paquier wrote:

This is over-complicating things for little gain. The new behavior of
filling in with zeros the tail of a segment makes things far better
when using gzip in archive_command.

Then how about filling with actual zeros, instead of with mostly-zeros
as is currently done? That would work just as well for gzip, and would
not sacrifice the ability to do 100x better than gzip.

There is a flag XLP_BKP_REMOVABLE for the purpose of ignoring empty
blocks; any external tool(s) relying on it could break if we make
everything zero. Now, it might be possible to selectively initialize
only the fields that don't harm the archiving methodology you are
using, considering there is no other impact on the code. However,
that doesn't look like a neat way to implement the requirement. In
general, if you have very low WAL activity, then the final size of
the compressed WAL shouldn't be much even if you use gzip. It seems
your main concern is that the size of the WAL, even though not high,
is more than what you were previously getting for your archived data.
I think that is a legitimate concern, but I don't see many options
apart from providing some selective way to not initialize everything
in the WAL page headers, or having some tool like pg_lesslog shipped
as a contrib module. I am not sure whether your use case is important
enough to proceed with one of those options, or whether to consider
some other approach. Does anybody else see the use case reported by
Chapman as important enough that we try to have some solution for it
in-core?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#7Chapman Flack
chap@anastigmatix.net
In reply to: Amit Kapila (#6)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On 07/26/2016 08:48 AM, Amit Kapila wrote:

general, if you have very low WAL activity, then the final size of
the compressed WAL shouldn't be much even if you use gzip. It seems your

9.5 pg_xlog, low activity test cluster (segment switches forced
only by checkpoint timeouts), compressed with gzip -9:

$ for i in 0*; do echo -n "$i " && gzip -9 <$i | wc -c; done
000000010000000100000042 27072
000000010000000100000043 27075
000000010000000100000044 27077
000000010000000100000045 27073
000000010000000100000046 27075

Log from live pre-9.4 cluster, low-activity time of day, delta
compression using rsync:

2016-07-26 03:54:02 EDT (walship) INFO: using 2.39s user, 0.4s system, 9.11s on wall:
  231 byte 000000010000004600000029_000000010000004600000021_fwd
...
2016-07-26 04:54:01 EDT (walship) INFO: using 2.47s user, 0.4s system, 8.43s on wall:
  232 byte 00000001000000460000002A_000000010000004600000022_fwd
...
2016-07-26 05:54:02 EDT (walship) INFO: using 2.56s user, 0.29s system, 9.44s on wall:
  230 byte 00000001000000460000002B_000000010000004600000023_fwd

So when I say "factor of 100", I'm understating slightly. (Those
timings, for the curious, include sending a copy offsite via ssh.)

everything zero. Now, it might be possible to selectively initialize
the fields that doesn't harm the methodology for archive you are using
considering there is no other impact of same in code. However, it

Indeed, it is only the one header field that duplicates the low-
order part of the (hex) file name that breaks delta compression,
because it has always been incremented even when nothing else is
different, and it's scattered 2000 times through the file.
Would it break anything for *that* to be zero in dummy blocks?

-Chap

#8Robert Haas
robertmhaas@gmail.com
In reply to: Chapman Flack (#1)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On Fri, Jul 22, 2016 at 6:02 PM, Chapman Flack <chap@anastigmatix.net> wrote:

At $work, we have a usually-low-activity PG database, so that almost
always the used fraction of each 16 MB WAL segment is far smaller
than 16 MB, and so it's a big win for archived-WAL storage space
if an archive-command can be written that compresses those files
effectively.

I'm kind of curious WHY you are using archiving and forcing regular
segment switches rather than just using streaming replication.
Pre-9.0, use of archive_timeout was routine, since there was no other
way to ensure that the data ended up someplace other than your primary
with reasonable regularity. But, AFAIK, streaming replication
essentially obsoleted that use case. You can just dribble the
individual bytes over the wire a few at a time to the standby or, with
pg_receivexlog, to an archive location. If it takes 6 months to fill
up a WAL segment, you don't care: you'll always have all the bytes
that were generated more than a fraction of a second before the master
melted into a heap of slag.

I'm not saying you don't have a good reason for doing what you are
doing, just that I cannot think of one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Chapman Flack
chap@anastigmatix.net
In reply to: Robert Haas (#8)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On 07/26/2016 04:21 PM, Robert Haas wrote:

I'm kind of curious WHY you are using archiving and forcing regular
segment switches rather than just using streaming replication.
... AFAIK, streaming replication
essentially obsoleted that use case. You can just dribble the
individual bytes over the wire a few at a time to the standby or, with
pg_receivexlog, to an archive location. If it takes 6 months to fill
up a WAL segment, you don't care: you'll always have all the bytes

Part of it is just the legacy situation: at the moment, the offsite
host is of a different architecture and hasn't got PostgreSQL
installed (but it's easily ssh'd to for delivering compressed WAL
segments). We could change that down the road, and pg_receivexlog
would work for getting the bytes over there.

My focus for the moment was just on migrating a cluster to 9.5
without changing the surrounding arrangements all at once.
Seeing how much worse our compression ratio will be, though,
maybe I need to revisit that plan.

Even so, I'd be curious whether it would break anything to have
xlp_pageaddr simply set to InvalidXLogRecPtr in the dummy zero
pages written to fill out a segment. At least until it's felt
that archive_timeout has been so decidedly obsoleted by streaming
replication that it is removed, and the log-tail zeroing code
with it.

That at least would eliminate the risk of anyone else repeating
my astonishment. :) I had read that 9.4 added built-in log-zeroing
code, and my first reaction was "cool! that may make the compression
technique we're using unnecessary, but certainly can't make it worse"
only to discover that it did, by ~ 300x, becoming now 3x *worse* than
plain gzip, which itself is ~ 100x worse than what we had.

-Chap

#10Michael Paquier
michael.paquier@gmail.com
In reply to: Amit Kapila (#6)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On Tue, Jul 26, 2016 at 9:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Does anybody else see the use case reported by Chapman as important
enough that we try to have some solution for it in-core?

The lack of updates in the pg_lesslog project is a sign that it is not
used that much. It does not seem a good idea to bring in-core a tool
not used that much by users.
--
Michael

#11Chapman Flack
chap@anastigmatix.net
In reply to: Michael Paquier (#10)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On 07/26/16 20:01, Michael Paquier wrote:

On Tue, Jul 26, 2016 at 9:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Does anybody else see the use case reported by Chapman as important
enough that we try to have some solution for it in-core?

The lack of updates in the pg_lesslog project is a sign that it is not
used that much. It does not seem a good idea to bring in-core a tool
not used that much by users.

Effectively, it already was brought in-core in commit 9a20a9b.
Only, that change had an unintended consequence that *limits*
compressibility - and it would not have that consequence, if
it were changed to simply set xlp_pageaddr to InvalidXLogRecPtr
in the dummy zero pages written to fill out a segment.

-Chap

#12Bruce Momjian
bruce@momjian.us
In reply to: Chapman Flack (#9)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On Tue, Jul 26, 2016 at 05:42:43PM -0400, Chapman Flack wrote:

Even so, I'd be curious whether it would break anything to have
xlp_pageaddr simply set to InvalidXLogRecPtr in the dummy zero
pages written to fill out a segment. At least until it's felt
that archive_timeout has been so decidedly obsoleted by streaming
replication that it is removed, and the log-tail zeroing code
with it.

That at least would eliminate the risk of anyone else repeating
my astonishment. :) I had read that 9.4 added built-in log-zeroing
code, and my first reaction was "cool! that may make the compression
technique we're using unnecessary, but certainly can't make it worse"
only to discover that it did, by ~ 300x, becoming now 3x *worse* than
plain gzip, which itself is ~ 100x worse than what we had.

My guess is that the bytes are there to detect problems where a 512-byte
disk sector is zeroed by a disk failure. I don't see us changing that
for the use-case you have described.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +

#13Chapman Flack
chap@anastigmatix.net
In reply to: Bruce Momjian (#12)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On 08/02/2016 02:33 PM, Bruce Momjian wrote:

My guess is that the bytes are there to detect problems where
a 512-byte disk sector is zeroed by a disk failure.

Does that seem plausible? (a) there is only one such header for
every 16 512-byte disk sectors, so it only affords a 6% chance of
detecting a zeroed sector, and (b) the header contains other
non-zero values in fields other than xlp_pageaddr, so the use
of a fixed value for _that field_ in zeroed tail blocks would
not prevent (or even reduce the 6% probability of) detecting
a sector zeroed by a defect.

-Chap

#14Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#12)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On Tue, Aug 2, 2016 at 2:33 PM, Bruce Momjian <bruce@momjian.us> wrote:

On Tue, Jul 26, 2016 at 05:42:43PM -0400, Chapman Flack wrote:

Even so, I'd be curious whether it would break anything to have
xlp_pageaddr simply set to InvalidXLogRecPtr in the dummy zero
pages written to fill out a segment. At least until it's felt
that archive_timeout has been so decidedly obsoleted by streaming
replication that it is removed, and the log-tail zeroing code
with it.

That at least would eliminate the risk of anyone else repeating
my astonishment. :) I had read that 9.4 added built-in log-zeroing
code, and my first reaction was "cool! that may make the compression
technique we're using unnecessary, but certainly can't make it worse"
only to discover that it did, by ~ 300x, becoming now 3x *worse* than
plain gzip, which itself is ~ 100x worse than what we had.

My guess is that the bytes are there to detect problems where a 512-byte
disk sector is zeroed by a disk failure. I don't see us changing that
for the use-case you have described.

Is there actually any code that makes such a check?

I'm inclined to doubt that was the motivation, though admittedly we're
both speculating about the contents of Heikki's brain, a tricky
proposition on a good day.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Robert Haas (#14)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

(I'm cleaning up my inbox, hence the delayed reply)

On 08/02/2016 10:51 PM, Robert Haas wrote:

On Tue, Aug 2, 2016 at 2:33 PM, Bruce Momjian <bruce@momjian.us> wrote:

On Tue, Jul 26, 2016 at 05:42:43PM -0400, Chapman Flack wrote:

Even so, I'd be curious whether it would break anything to have
xlp_pageaddr simply set to InvalidXLogRecPtr in the dummy zero
pages written to fill out a segment. At least until it's felt
that archive_timeout has been so decidedly obsoleted by streaming
replication that it is removed, and the log-tail zeroing code
with it.

That at least would eliminate the risk of anyone else repeating
my astonishment. :) I had read that 9.4 added built-in log-zeroing
code, and my first reaction was "cool! that may make the compression
technique we're using unnecessary, but certainly can't make it worse"
only to discover that it did, by ~ 300x, becoming now 3x *worse* than
plain gzip, which itself is ~ 100x worse than what we had.

My guess is that the bytes are there to detect problems where a 512-byte
disk sector is zeroed by a disk failure. I don't see us changing that
for the use-case you have described.

Is there actually any code that makes such a check?

I'm inclined to doubt that was the motivation, though admittedly we're
both speculating about the contents of Heikki's brain, a tricky
proposition on a good day.

Given that we used to just leave them as garbage, it seems pretty safe
to zero them out now.

It's kind of nice that all the XLOG pages have valid page headers. One
way to think of the WAL switch record is that it's a very large WAL
record that just happens to consume the rest of the WAL segment. Except
that it's not actually represented like that; the xl_tot_len field of an
XLOG switch record does not include the zeroed out portion. Instead,
there's special handling in the reader code, that skips to the end of
the segment when it sees a switch record. So that point is moot.

When I wrote that code, I don't remember if I realized that we're
initializing the page headers, or if I thought that it's good enough
even if we do. I guess I didn't realize it, because a comment would've
been in order if it was intentional.

So +1 on fixing that, a patch would be welcome. In the meanwhile, have
you tried using a different compression program? Something else than
gzip might do a better job at the almost zero pages.

- Heikki

#16Chapman Flack
chap@anastigmatix.net
In reply to: Heikki Linnakangas (#15)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On 06/21/17 04:51, Heikki Linnakangas wrote:

(I'm cleaning up my inbox, hence the delayed reply)

I had almost completely forgotten ever bringing it up. :)

When I wrote that code, I don't remember if I realized that we're
initializing the page headers, or if I thought that it's good enough even if
we do. I guess I didn't realize it, because a comment would've been in order
if it was intentional.

So +1 on fixing that, a patch would be welcome.

Ok, that sounds like something I could take a whack at. Overall, xlog.c
is a bit daunting, but this particular detail seems fairly approachable.

In the meanwhile, have you
tried using a different compression program? Something else than gzip might
do a better job at the almost zero pages.

Well, gzip was doing pretty well; it could get a 16 MB segment file down
to under 27 kB, or less than 14 bytes for each of 2000 pages, when a page
header is what, 20 bytes, it looks like? I'm not sure how much better
I'd expect a (non-custom) compression scheme to do. The real difference
comes between compressing (even well) a large unchanged area, versus being
able to recognize (again with a non-custom tool) that the whole area is
unchanged.
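(A back-of-envelope check of those figures, using the sizes quoted upthread:)

```shell
# A 16 MB segment holds 2048 pages of 8 kB, so the ~27 kB of gzip
# output reported above is roughly 13 bytes per page -- already
# smaller than a page header, hence little room for a generic
# compressor to do much better.
echo $(( 16 * 1024 * 1024 / 8192 ))   # 2048 pages per segment
echo $(( 27072 / 2048 ))              # ~13 bytes of output per page
```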

-Chap

#17Chapman Flack
chap@anastigmatix.net
In reply to: Chapman Flack (#16)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

I notice CopyXLogRecordToWAL contains this loop (in the case where
the record being copied is a switch):

    while (CurrPos < EndPos)
    {
        /* initialize the next page (if not initialized already) */
        WALInsertLockUpdateInsertingAt(CurrPos);
        AdvanceXLInsertBuffer(CurrPos, false);
        CurrPos += XLOG_BLCKSZ;
    }

in which it calls, one page at a time, AdvanceXLInsertBuffer, which contains
its own loop able to do a sequence of pages. A comment explains why:

    /*
     * We do this one page at a time, to make sure we don't deadlock
     * against ourselves if wal_buffers < XLOG_SEG_SIZE.
     */

I want to make sure I understand what the deadlock potential is
in this case. AdvanceXLInsertBuffer will call WaitXLogInsertionsToFinish
before writing any dirty buffer, and we do hold insertion slot locks
(all of 'em, in the case of a log switch, because that makes
XlogInsertRecord call WALInsertLockAcquireExclusive instead of just
WALInsertLockAcquire for other record types).

Does not the fact we hold all the insertion slots exclude the possibility
that any dirty buffer (preceding the one we're touching) needs to be checked
for in-flight insertions?

I've been thinking along the lines of another parameter to
AdvanceXLInsertBuffer to indicate when the caller is exactly this loop
filling out the tail after a log switch (originally, to avoid filling
in page headers). It now seems to me that, if AdvanceXLInsertBuffer
has that information, it could also be safe for it to skip the
WaitXLogInsertionsToFinish in that case. Would that eliminate the
deadlock potential, and allow the loop in CopyXLogRecordToWAL to be
replaced with a single call to AdvanceXLInsertBuffer and a single
WALInsertLockUpdateInsertingAt ?

Or have I overlooked some other subtlety?

-Chap

#18Chapman Flack
chap@anastigmatix.net
In reply to: Chapman Flack (#17)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On 06/25/17 21:20, Chapman Flack wrote:

I want to make sure I understand what the deadlock potential is
in this case. AdvanceXLInsertBuffer will call WaitXLogInsertionsToFinish
...
Does not the fact we hold all the insertion slots exclude the possibility
that any dirty buffer (preceding the one we're touching) needs to be checked
for in-flight insertions? [in the filling-out-the-log-tail case only]

Anyone?

Or have I not even achieved 'wrong' yet?

-Chap


#19Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Chapman Flack (#17)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On 06/26/2017 04:20 AM, Chapman Flack wrote:

I notice CopyXLogRecordToWAL contains this loop (in the case where
the record being copied is a switch):

while (CurrPos < EndPos)
{
/* initialize the next page (if not initialized already) */
WALInsertLockUpdateInsertingAt(CurrPos);
AdvanceXLInsertBuffer(CurrPos, false);
CurrPos += XLOG_BLCKSZ;
}

in which it calls, one page at a time, AdvanceXLInsertBuffer, which contains
its own loop able to do a sequence of pages. A comment explains why:

/*
* We do this one page at a time, to make sure we don't deadlock
* against ourselves if wal_buffers < XLOG_SEG_SIZE.
*/

I want to make sure I understand what the deadlock potential is
in this case. AdvanceXLInsertBuffer will call WaitXLogInsertionsToFinish
before writing any dirty buffer, and we do hold insertion slot locks
(all of 'em, in the case of a log switch, because that makes
XLogInsertRecord call WALInsertLockAcquireExclusive instead of just
WALInsertLockAcquire for other record types).

Does not the fact we hold all the insertion slots exclude the possibility
that any dirty buffer (preceding the one we're touching) needs to be checked
for in-flight insertions?

Hmm. That's not the problem, though. Imagine that instead of the loop
above, you do just:

WALInsertLockUpdateInsertingAt(CurrPos);
AdvanceXLInsertBuffer(EndPos, false);

AdvanceXLInsertBuffer() will call XLogWrite(), to flush out any pages
before EndPos, to make room in the wal_buffers for the new pages. Before
doing that, it will call WaitXLogInsertionsToFinish() to wait for any
insertions to those pages to be completed. But the backend itself is
advertising the insertion position CurrPos, and it will therefore wait
for itself, forever.
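To make the self-wait concrete, here is a toy model (Python; page numbers stand in for LSNs and a fixed-size ring stands in for wal_buffers — this illustrates the bookkeeping, it is not PostgreSQL's actual code):

```python
WAL_BUFFERS = 4  # ring of 4 pages, i.e. wal_buffers < XLOG_SEG_SIZE

def init_pages(first, count, one_page_at_a_time):
    """Initialize pages [first, first+count); return True if it completes,
    False if the backend would end up waiting on its own advertised
    insertion position (the self-deadlock)."""
    # WALInsertLockUpdateInsertingAt: advertise where we are inserting
    advertised = first
    for p in range(first, first + count):
        if one_page_at_a_time:
            advertised = p          # re-advertised before each page
        evict = p - WAL_BUFFERS     # page whose buffer slot must be reused
        if evict >= 0 and advertised <= evict:
            # XLogWrite -> WaitXLogInsertionsToFinish would wait for a
            # position this very backend is advertising: it waits forever
            return False
    return True

# the loop in CopyXLogRecordToWAL completes; a single big call does not
assert init_pages(0, 10, one_page_at_a_time=True)
assert not init_pages(0, 10, one_page_at_a_time=False)
```

The model shows why advancing the advertised position one page at a time keeps the backend's own position ahead of every buffer it needs to evict.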

I've been thinking along the lines of another parameter to
AdvanceXLInsertBuffer to indicate when the caller is exactly this loop
filling out the tail after a log switch (originally, to avoid filling
in page headers). It now seems to me that, if AdvanceXLInsertBuffer
has that information, it could also be safe for it to skip the
WaitXLogInsertionsToFinish in that case. Would that eliminate the
deadlock potential, and allow the loop in CopyXLogRecordToWAL to be
replaced with a single call to AdvanceXLInsertBuffer and a single
WALInsertLockUpdateInsertingAt ?

Or have I overlooked some other subtlety?

The most straightforward solution would be to just clear each page with
memset() in the loop. It's a bit wasteful to clear the page again, just
after AdvanceXLInsertBuffer() has initialized it, but this isn't
performance-critical.

- Heikki


#20Chapman Flack
chap@anastigmatix.net
In reply to: Heikki Linnakangas (#19)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On 07/03/2017 09:39 AM, Heikki Linnakangas wrote:

The most straightforward solution would be to just clear each page with
memset() in the loop. It's a bit wasteful to clear the page again, just
after AdvanceXLInsertBuffer() has initialized it, but this isn't
performance-critical.

And in that straightforward approach, I imagine it would suffice to
memset just the length of a (short) page header; the page content
is already zeroed, and there isn't going to be a switch at the very
start of a segment, so a long header won't be encountered ... will it?

-Chap


#21Chapman Flack
chap@anastigmatix.net
In reply to: Heikki Linnakangas (#19)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On 07/03/2017 09:39 AM, Heikki Linnakangas wrote:

Hmm. That's not the problem, though. Imagine that instead of the loop
above, you do just:

WALInsertLockUpdateInsertingAt(CurrPos);
AdvanceXLInsertBuffer(EndPos, false);

AdvanceXLInsertBuffer() will call XLogWrite(), to flush out any pages
before EndPos, to make room in the wal_buffers for the new pages. Before
doing that, it will call WaitXLogInsertionsToFinish()

Although it's moot in the straightforward approach of re-zeroing in
the loop, it would still help my understanding of the system to know
if there is some subtlety that would have broken what I proposed
earlier, which was an extra flag to AdvanceXLInsertBuffer() that
would tell it not only to skip initializing headers, but also to
skip the WaitXLogInsertionsToFinish() check ... because I have
the entire region reserved and I hold all the writer slots
at that moment, it seems safe to assure AdvanceXLInsertBuffer()
that there are no outstanding writes to wait for.

I suppose it's true there's not much performance to gain; it would
save a few pairs of lock operations per empty page to be written,
but then, the more empty pages there are at the time of a log switch,
the less busy the system is, so the less it matters.

-Chap


#22Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Chapman Flack (#21)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On 07/03/2017 06:30 PM, Chapman Flack wrote:

On 07/03/2017 09:39 AM, Heikki Linnakangas wrote:

Hmm. That's not the problem, though. Imagine that instead of the loop
above, you do just:

WALInsertLockUpdateInsertingAt(CurrPos);
AdvanceXLInsertBuffer(EndPos, false);

AdvanceXLInsertBuffer() will call XLogWrite(), to flush out any pages
before EndPos, to make room in the wal_buffers for the new pages. Before
doing that, it will call WaitXLogInsertionsToFinish()

Although it's moot in the straightforward approach of re-zeroing in
the loop, it would still help my understanding of the system to know
if there is some subtlety that would have broken what I proposed
earlier, which was an extra flag to AdvanceXLInsertBuffer() that
would tell it not only to skip initializing headers, but also to
skip the WaitXLogInsertionsToFinish() check ... because I have
the entire region reserved and I hold all the writer slots
at that moment, it seems safe to assure AdvanceXLInsertBuffer()
that there are no outstanding writes to wait for.

Yeah, I suppose that would work, too.

- Heikki


#23Michael Paquier
michael.paquier@gmail.com
In reply to: Chapman Flack (#16)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On Fri, Jun 23, 2017 at 6:08 AM, Chapman Flack <chap@anastigmatix.net> wrote:

Well, gzip was doing pretty well; it could get a 16 MB segment file down
to under 27 kB, or less than 14 bytes for each of 2000 pages, when a page
header is what, 20 bytes, it looks like? I'm not sure how much better
I'd expect a (non-custom) compression scheme to do. The real difference
comes between compressing (even well) a large unchanged area, versus being
able to recognize (again with a non-custom tool) that the whole area is
unchanged.

Have you tried lz4 as well for your case? It runs faster than
gzip at minimum compression and compresses less in general, but I am really
wondering whether it actually performs better on almost-zero pages.
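lz4 isn't measured in this thread, but the shape of the question is easy to sketch with Python's stdlib compressors (zlib at level 1 standing in, loosely, for a fast compressor in the lz4 mold — an assumption, not a measurement of lz4 itself):

```python
import bz2
import lzma
import zlib

SEG_SIZE = 16 * 1024 * 1024

# A stand-in for a light-activity WAL segment: a little real-looking
# data at the front, zeros for the rest.
segment = bytes(range(256)) * 2 + b"\x00" * (SEG_SIZE - 512)

sizes = {
    "zlib-1": len(zlib.compress(segment, 1)),  # fast setting, lz4-ish role
    "zlib-9": len(zlib.compress(segment, 9)),  # gzip-like
    "bz2": len(bz2.compress(segment)),
    "xz": len(lzma.compress(segment)),
}
print(sizes)

# Even a fast compressor shrinks near-zero input by orders of magnitude,
# so for this workload the interesting comparison is speed, not ratio.
assert all(n < 100_000 for n in sizes.values())
```

On near-zero input, bz2's run-length front end tends to win on ratio while the fast deflate setting wins on time, which matches the intuition that almost any compressor handles this case well.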
--
Michael


#24Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#22)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On Thu, Jul 6, 2017 at 3:48 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 07/03/2017 06:30 PM, Chapman Flack wrote:

Although it's moot in the straightforward approach of re-zeroing in
the loop, it would still help my understanding of the system to know
if there is some subtlety that would have broken what I proposed
earlier, which was an extra flag to AdvanceXLInsertBuffer() that
would tell it not only to skip initializing headers, but also to
skip the WaitXLogInsertionsToFinish() check ... because I have
the entire region reserved and I hold all the writer slots
at that moment, it seems safe to assure AdvanceXLInsertBuffer()
that there are no outstanding writes to wait for.

Yeah, I suppose that would work, too.

FWIW, I would rather see any optimization done in
AdvanceXLInsertBuffer() instead of seeing a second memset re-zeroing
the WAL page header after its data has been initialized by
AdvanceXLInsertBuffer() once. That's too late for 10, but you still
have time for a patch to be integrated in 11.
--
Michael


#25Chapman Flack
chap@anastigmatix.net
In reply to: Michael Paquier (#24)
Re: AdvanceXLInsertBuffer vs. WAL segment compressibility

On 07/17/17 11:29, Michael Paquier wrote:

FWIW, I would rather see any optimization done in
AdvanceXLInsertBuffer() instead of seeing a second memset re-zeroing
the WAL page header after its data has been initialized by
AdvanceXLInsertBuffer() once.

Is that an aesthetic 'rather', or is there a technical advantage you
have in mind?

I also began by looking at how to stop AdvanceXLInsertBuffer()
initializing headers and taking locks when neither is needed.
But Heikki's just-rezero-them suggestion has a definite simplicity
advantage. It can be implemented entirely with a tight group of
lines added to CopyXLogRecordToWAL, as opposed to modifying
AdvanceXLInsertBuffer in a few distinct places, adding a parameter,
and changing its call sites.

There's a technical appeal to making the changes in AdvanceXLInsertBuffer
(who wants to do unnecessary initialization and locking?), but the amount
of unnecessary work that can be avoided is proportional to the number of
unused pages at switch time, meaning it is largest when the system
is least busy, and may be of little practical concern.

Moreover, optimizing AdvanceXLInsertBuffer would reveal one more
complication: some of the empty pages about to be written out may
have been initialized opportunistically in earlier calls to
AdvanceXLInsertBuffer, so those already have populated headers, and
would need rezeroing anyway. And not necessarily just an insignificant
few of them: if XLOGChooseNumBuffers chose the maximum, it could even
be all of them.

That might also be handled by yet another conditional within
AdvanceXLInsertBuffer. But with all of that in view, maybe it is
just simpler to have one loop in CopyXLogRecordToWAL simply zero them all,
and leave AdvanceXLInsertBuffer alone, so no complexity is added when it
is called from other sites that are arguably hotter.

Zeroing SizeOfXLogShortPHD bytes doesn't cost a whole lot.

-Chap


#26Chapman Flack
chap@anastigmatix.net
In reply to: Michael Paquier (#24)
1 attachment(s)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

On 07/17/17 11:29, Michael Paquier wrote:

On Thu, Jul 6, 2017 at 3:48 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 07/03/2017 06:30 PM, Chapman Flack wrote:

Although it's moot in the straightforward approach of re-zeroing in
the loop, it would still help my understanding of the system to know
if there is some subtlety that would have broken what I proposed
earlier, which was an extra flag to AdvanceXLInsertBuffer() that
...

Yeah, I suppose that would work, too.

FWIW, I would rather see any optimization done in
AdvanceXLInsertBuffer() instead of seeing a second memset re-zeroing
the WAL page header after its data has been initialized by
AdvanceXLInsertBuffer() once.

Here is a patch implementing the simpler approach Heikki suggested
(though my original leaning had been to wrench on AdvanceXLInsertBuffer
as Michael suggests). The sheer simplicity of the smaller change
eventually won me over, unless there's a strong objection.

As noted before, the cost of the extra small MemSet is proportional
to the number of unused pages in the segment, and that is an indication
of how busy the system *isn't*. I don't have a time benchmark of the
patch's impact; if I should, what would be a good methodology?

Before the change, what some common compression tools can achieve on
a mostly empty (select pg_switch_wal()) segment on my hardware are:

gzip 27052 in 0.145s
xz 5852 in 0.678s
lzip 5747 in 1.254s
bzip2 1445 in 0.261s

bzip2 is already the clear leader (I don't have lz4 handy to include in
the comparison) at around 1/20th the size gzip can achieve, and that's before
this patch. After:

gzip 16393 in 0.143s
xz 2640 in 0.520s
lzip 2535 in 1.198s
bzip2 147 in 0.238s

The patch gives gzip almost an extra factor of two, and about the same
for xz and lzip, and bzip2 gains nearly another order of magnitude.
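The effect can be mimicked offline; a rough Python sketch with a 16-byte synthetic per-page header (not the real XLogPageHeaderData layout) shows why zeroing the headers helps generic compressors:

```python
import struct
import zlib

XLOG_BLCKSZ = 8192
NPAGES = 2048  # a 16 MB segment

def segment(zero_headers: bool) -> bytes:
    pages = []
    for i in range(NPAGES):
        page = bytearray(XLOG_BLCKSZ)
        if not zero_headers:
            # stand-in for a preinitialized page header: magic, info,
            # timeline, and a per-page address (the nonconstant part)
            struct.pack_into("<HHIQ", page, 0, 0xD093, 0, 1, i * XLOG_BLCKSZ)
        pages.append(bytes(page))
    return b"".join(pages)

with_headers = len(zlib.compress(segment(zero_headers=False), 9))
zeroed = len(zlib.compress(segment(zero_headers=True), 9))
print(with_headers, zeroed)
assert zeroed < with_headers  # all-zero pages compress strictly better
```

The varying page address is the part a generic compressor cannot fold away, which is exactly what the patch's MemSet removes.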

I considered adding a regression test for unfilled-segment compressibility,
but wasn't sure where it would most appropriately go. I assume a TAP test
would be the way to do it.

-Chap

Attachments:

0001-Zero-headers-of-unused-pages-after-WAL-switch.patch (text/x-patch)
From 35684bdaa1d27c3f4b7198541f3c92bb2b4cb2f4 Mon Sep 17 00:00:00 2001
From: Chapman Flack <chap@anastigmatix.net>
Date: Sun, 25 Feb 2018 11:44:47 -0500
Subject: [PATCH] Zero headers of unused pages after WAL switch.

When writing zeroed pages to the remainder of a WAL segment
after a WAL switch, ensure that the headers of those pages are
also zeroed, as their initialized values otherwise reduce the
compressibility of the WAL segment file by general tools.
---
 src/backend/access/transam/xlog.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 47a6c4d..a91ec7b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1556,7 +1556,16 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
 		{
 			/* initialize the next page (if not initialized already) */
 			WALInsertLockUpdateInsertingAt(CurrPos);
-			AdvanceXLInsertBuffer(CurrPos, false);
+			/*
+			 * Fields in the page header preinitialized by AdvanceXLInsertBuffer
+			 * to nonconstant values reduce the compressibility of WAL segments,
+			 * and aren't needed in the freespace following a switch record.
+			 * Re-zero that header area. This is not performance-critical, as
+			 * the more empty pages there are for this loop to touch, the less
+			 * busy the system is.
+			 */
+			currpos = GetXLogBuffer(CurrPos);
+			MemSet(currpos, 0, SizeOfXLogShortPHD);
 			CurrPos += XLOG_BLCKSZ;
 		}
 	}
-- 
2.7.3

#27Daniel Gustafsson
daniel@yesql.se
In reply to: Chapman Flack (#26)
1 attachment(s)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

On 25 Feb 2018, at 18:22, Chapman Flack <chap@anastigmatix.net> wrote:

Here is a patch implementing the simpler approach Heikki suggested
(though my original leaning had been to wrench on AdvanceXLInsertBuffer
as Michael suggests). The sheer simplicity of the smaller change
eventually won me over, unless there's a strong objection.

I had a look at this patch today. The patch applies on current HEAD, and has
ample documentation in the included comment. All existing tests pass, but
there are no new tests added (more on that below). While somewhat outside my
wheelhouse, the implementation is the simple solution with no apparent traps
that I can think of.

Regarding the problem statement of the patch: the headers on the zeroed
pages are likely an unintended side effect, and serve no real purpose.
While they aren’t a problem for postgres, they do reduce the compressibility
of the resulting files, as evidenced by the patch author.

As noted before, the cost of the extra small MemSet is proportional
to the number of unused pages in the segment, and that is an indication
of how busy the system *isn't*. I don't have a time benchmark of the
patch's impact; if I should, what would be a good methodology?

This codepath doesn’t seem performance critical enough to warrant holding off
the patch awaiting a benchmark IMO.

I considered adding a regression test for unfilled-segment compressibility,
but wasn't sure where it would most appropriately go. I assume a TAP test
would be the way to do it.

Adding a test that actually compresses files seems hard to make stable, and adds
a dependency on external tools, which is best to avoid when possible. I took a
stab at this and added a test that ensures the last block of the
switched-away-from file is completely zeroed out, which in a sense tests
compressibility.

The attached patch adds the test, and a necessary extension to check_pg_config
to allow extracting values from pg_config.h, as opposed to just returning
the number of regex matches (needed for XLOG_BLCKSZ).

That being said, I am not entirely convinced that the test is adding much
value. It’s however easier to reason about a written patch than an idea so I
figured I’d add it here either way.

Setting this to Ready for Committer and offering my +1 on getting this in.

cheers ./daniel

Attachments:

wal_zeroed_test.patch (application/octet-stream)
From a7cbcde7518e2f95673ec5ddba32b7ea6e0b84bb Mon Sep 17 00:00:00 2001
From: Daniel Gustafsson <daniel@yesql.se>
Date: Fri, 16 Mar 2018 21:32:32 +0100
Subject: [PATCH] Add test for ensuring WAL segment is zeroed out

Switch to a new WAL file and ensure the last segment in the now
switched-away-from file is completely zeroed out.

This adds a mode to check_pg_config() where the match from the
regexp can be returned, not just the literal count of matches.
Only one match is returned (the first), but given the structure
of pg_config.h that's likely to be enough.
---
 src/test/perl/TestLib.pm             | 18 ++++++++++++++++--
 src/test/recovery/t/002_archiving.pl | 20 +++++++++++++++++++-
 2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 7d017d49bd..8b9796c4f6 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -227,15 +227,29 @@ sub append_to_file
 # that.
 sub check_pg_config
 {
-	my ($regexp) = @_;
+	my ($regexp, $value) = @_;
 	my ($stdout, $stderr);
+	my $match;
 	my $result = IPC::Run::run [ 'pg_config', '--includedir' ], '>',
 	  \$stdout, '2>', \$stderr
 	  or die "could not execute pg_config";
 	chomp($stdout);
 
 	open my $pg_config_h, '<', "$stdout/pg_config.h" or die "$!";
-	my $match = (grep {/^$regexp/} <$pg_config_h>);
+	if (defined $value)
+	{
+		while (<$pg_config_h>)
+		{
+			$_ =~ m/^$regexp/;
+			next unless defined $1;
+			$match = $1;
+			last;
+		}
+	}
+	else
+	{
+		$match = (grep {/^$regexp/} <$pg_config_h>);
+	}
 	close $pg_config_h;
 	return $match;
 }
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index e1bd3c95cc..c235e98904 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 1;
+use Test::More tests => 2;
 use File::Copy;
 
 # Initialize master node, doing archives
@@ -49,3 +49,21 @@ $node_standby->poll_query_until('postgres', $caughtup_query)
 my $result =
   $node_standby->safe_psql('postgres', "SELECT count(*) FROM tab_int");
 is($result, qq(1000), 'check content from archives');
+
+# Add some content, switch WAL file and read the last segment of the WAL
+# file which should be zeroed out
+my $current_file = $node_master->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn())");
+$node_master->safe_psql('postgres', "INSERT INTO tab_int VALUES (1)");
+$node_master->safe_psql('postgres', "SELECT pg_switch_wal()");
+
+my $xlog_blcksz = check_pg_config('#define XLOG_BLCKSZ (\d+)', 1);
+
+open my $fh, '<:raw', $node_master->data_dir . '/pg_wal/' . $current_file;
+# SEEK_END is defined as 2, but is not allowed when using strict mode
+seek $fh, ($xlog_blcksz * -1), 2;
+my $len = read($fh, my $bytes, $xlog_blcksz);
+close $fh;
+
+my ($wal_segment) = unpack 'N' . $xlog_blcksz, $bytes;
+is($wal_segment, 0, 'make sure wal segment is zeroed');
-- 
2.14.1.145.gb3622a4ee

#28Chapman Flack
chap@anastigmatix.net
In reply to: Daniel Gustafsson (#27)
2 attachment(s)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

On 03/16/18 17:14, Daniel Gustafsson wrote:

The attached patch adds the test, and a necessary extension to check_pg_config
to allow extracting values from pg_config.h, as opposed to just returning
the number of regex matches (needed for XLOG_BLCKSZ).

Thanks for the review. I notice that cfbot has now flagged the patch as
failing, and when I look into it, it appears that cfbot is building with
your test patch, and without the xlog.c patch, and so the test naturally
fails. Does the cfbot require both patches to be attached to the same
email, in order to include them both? I'll try attaching both to this one,
and see what it does.

This is good confirmation that the test is effective. :)

-Chap

Attachments:

0001-Zero-headers-of-unused-pages-after-WAL-switch.patch (text/plain; charset=UTF-8)
From 35684bdaa1d27c3f4b7198541f3c92bb2b4cb2f4 Mon Sep 17 00:00:00 2001
From: Chapman Flack <chap@anastigmatix.net>
Date: Sun, 25 Feb 2018 11:44:47 -0500
Subject: [PATCH] Zero headers of unused pages after WAL switch.

When writing zeroed pages to the remainder of a WAL segment
after a WAL switch, ensure that the headers of those pages are
also zeroed, as their initialized values otherwise reduce the
compressibility of the WAL segment file by general tools.
---
 src/backend/access/transam/xlog.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 47a6c4d..a91ec7b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1556,7 +1556,16 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
 		{
 			/* initialize the next page (if not initialized already) */
 			WALInsertLockUpdateInsertingAt(CurrPos);
-			AdvanceXLInsertBuffer(CurrPos, false);
+			/*
+			 * Fields in the page header preinitialized by AdvanceXLInsertBuffer
+			 * to nonconstant values reduce the compressibility of WAL segments,
+			 * and aren't needed in the freespace following a switch record.
+			 * Re-zero that header area. This is not performance-critical, as
+			 * the more empty pages there are for this loop to touch, the less
+			 * busy the system is.
+			 */
+			currpos = GetXLogBuffer(CurrPos);
+			MemSet(currpos, 0, SizeOfXLogShortPHD);
 			CurrPos += XLOG_BLCKSZ;
 		}
 	}
-- 
2.7.3

wal_zeroed_test.patch (text/plain; charset=UTF-8)
From a7cbcde7518e2f95673ec5ddba32b7ea6e0b84bb Mon Sep 17 00:00:00 2001
From: Daniel Gustafsson <daniel@yesql.se>
Date: Fri, 16 Mar 2018 21:32:32 +0100
Subject: [PATCH] Add test for ensuring WAL segment is zeroed out

Switch to a new WAL file and ensure the last segment in the now
switched-away-from file is completely zeroed out.

This adds a mode to check_pg_config() where the match from the
regexp can be returned, not just the literal count of matches.
Only one match is returned (the first), but given the structure
of pg_config.h that's likely to be enough.
---
 src/test/perl/TestLib.pm             | 18 ++++++++++++++++--
 src/test/recovery/t/002_archiving.pl | 20 +++++++++++++++++++-
 2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 7d017d49bd..8b9796c4f6 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -227,15 +227,29 @@ sub append_to_file
 # that.
 sub check_pg_config
 {
-	my ($regexp) = @_;
+	my ($regexp, $value) = @_;
 	my ($stdout, $stderr);
+	my $match;
 	my $result = IPC::Run::run [ 'pg_config', '--includedir' ], '>',
 	  \$stdout, '2>', \$stderr
 	  or die "could not execute pg_config";
 	chomp($stdout);
 
 	open my $pg_config_h, '<', "$stdout/pg_config.h" or die "$!";
-	my $match = (grep {/^$regexp/} <$pg_config_h>);
+	if (defined $value)
+	{
+		while (<$pg_config_h>)
+		{
+			$_ =~ m/^$regexp/;
+			next unless defined $1;
+			$match = $1;
+			last;
+		}
+	}
+	else
+	{
+		$match = (grep {/^$regexp/} <$pg_config_h>);
+	}
 	close $pg_config_h;
 	return $match;
 }
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index e1bd3c95cc..c235e98904 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 1;
+use Test::More tests => 2;
 use File::Copy;
 
 # Initialize master node, doing archives
@@ -49,3 +49,21 @@ $node_standby->poll_query_until('postgres', $caughtup_query)
 my $result =
   $node_standby->safe_psql('postgres', "SELECT count(*) FROM tab_int");
 is($result, qq(1000), 'check content from archives');
+
+# Add some content, switch WAL file and read the last segment of the WAL
+# file which should be zeroed out
+my $current_file = $node_master->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn())");
+$node_master->safe_psql('postgres', "INSERT INTO tab_int VALUES (1)");
+$node_master->safe_psql('postgres', "SELECT pg_switch_wal()");
+
+my $xlog_blcksz = check_pg_config('#define XLOG_BLCKSZ (\d+)', 1);
+
+open my $fh, '<:raw', $node_master->data_dir . '/pg_wal/' . $current_file;
+# SEEK_END is defined as 2, but is not allowed when using strict mode
+seek $fh, ($xlog_blcksz * -1), 2;
+my $len = read($fh, my $bytes, $xlog_blcksz);
+close $fh;
+
+my ($wal_segment) = unpack 'N' . $xlog_blcksz, $bytes;
+is($wal_segment, 0, 'make sure wal segment is zeroed');
-- 
2.14.1.145.gb3622a4ee

#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Chapman Flack (#28)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

Chapman Flack <chap@anastigmatix.net> writes:

Thanks for the review. I notice that cfbot has now flagged the patch as
failing, and when I look into it, it appears that cfbot is building with
your test patch, and without the xlog.c patch, and so the test naturally
fails. Does the cfbot require both patches to be attached to the same
email, in order to include them both?

I believe so --- AFAIK it does not know anything about dependencies
between different patches, and will just try to build whatever patch(es)
appear in the latest email on a given thread. Munro might be able to
provide more detail.

regards, tom lane

#30Daniel Gustafsson
daniel@yesql.se
In reply to: Tom Lane (#29)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

On 18 Mar 2018, at 03:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Chapman Flack <chap@anastigmatix.net> writes:

Thanks for the review. I notice that cfbot has now flagged the patch as
failing, and when I look into it, it appears that cfbot is building with
your test patch, and without the xlog.c patch, and so the test naturally
fails. Does the cfbot require both patches to be attached to the same
email, in order to include them both?

I believe so --- AFAIK it does not know anything about dependencies
between different patches, and will just try to build whatever patch(es)
appear in the latest email on a given thread. Munro might be able to
provide more detail.

Right, I should’ve realized that when I didn’t include your original patch as
well, sorry about that. Now we know that the proposed test fails without the
patch applied and passes with it, so that was at least an interesting side
effect =)

cheers ./daniel

#31Chapman Flack
chap@anastigmatix.net
In reply to: Daniel Gustafsson (#30)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

On 03/18/18 16:56, Daniel Gustafsson wrote:

sorry about that. Now we know that the proposed test fails without the patch
applied and clears with it, that was at least an interesting side effect =)

It was, and it got me looking at the test, and even though it does detect
the difference between patch-applied and patch-not-applied, I sort of wonder
if it does what it claims to. It seems to me that unpack('N8192', ...)
would want to return 8192 32-bit ints (in array context), but really only
be able to return 2048 of them (because it's only got 8192 bytes to unpack),
and then being used in scalar context, it only returns the first one anyway,
so the test only hinges on whether the first four bytes of the block are
zero or not. Which turns out to be enough to catch a non-zeroed header. :)

What would you think about replacing the last two lines with just

ok($bytes =~ /\A\x00*+\z/, 'make sure wal segment is zeroed');
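(For comparison, the same all-zero check is easy to state outside Perl; a Python sketch, with XLOG_BLCKSZ hardcoded to the default 8192 here where the TAP test reads it from pg_config.h:)

```python
import os
import tempfile

XLOG_BLCKSZ = 8192  # assumed default; the real test extracts it from pg_config.h

def last_block_is_zeroed(path):
    """Seek to XLOG_BLCKSZ bytes before EOF and verify every byte is
    zero -- the same assertion as the proposed regex, without a regex."""
    with open(path, "rb") as fh:
        fh.seek(-XLOG_BLCKSZ, os.SEEK_END)
        block = fh.read(XLOG_BLCKSZ)
    return len(block) == XLOG_BLCKSZ and block.count(0) == XLOG_BLCKSZ

# demo on a throwaway file standing in for a WAL segment
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x01" * 100 + b"\x00" * XLOG_BLCKSZ)
try:
    assert last_block_is_zeroed(tmp.name)
finally:
    os.unlink(tmp.name)
```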

-Chap

#32Daniel Gustafsson
daniel@yesql.se
In reply to: Chapman Flack (#31)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

On 18 Mar 2018, at 22:54, Chapman Flack <chap@anastigmatix.net> wrote:

On 03/18/18 16:56, Daniel Gustafsson wrote:

sorry about that. Now we know that the proposed test fails without the patch
applied and clears with it, that was at least an interesting side effect =)

It was, and it got me looking at the test, and even though it does detect
the difference between patch-applied and patch-not-applied, I sort of wonder
if it does what it claims to. It seems to me that unpack('N8192', ...)
would want to return 8192 32-bit ints (in array context), but really only
be able to return 2048 of them (because it's only got 8192 bytes to unpack),
and then being used in scalar context, it only returns the first one anyway,
so the test only hinges on whether the first four bytes of the block are
zero or not. Which turns out to be enough to catch a non-zeroed header. :)

Good point, that's what I get for hacking without enough coffee.

What would you think about replacing the last two lines with just

ok($bytes =~ /\A\x00*+\z/, 'make sure wal segment is zeroed');

It seems expensive to regex over BLCKSZ, but it’s probably the safest option
and it’s not a performance critical codepath. Feel free to whack the test
patch over the head with the above diff.

cheers ./daniel

#33Chapman Flack
chap@anastigmatix.net
In reply to: Daniel Gustafsson (#32)
2 attachment(s)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

On 03/18/18 19:28, Daniel Gustafsson wrote:

It seems expensive to regex over BLCKSZ, but it’s probably the safest option
and it’s not a performance critical codepath. Feel free to whack the test
patch over the head with the above diff.

Both patches in a single email for cfbot's enjoyment, coming right up.

-Chap

Attachments:

0002-Add-test-for-ensuring-WAL-segment-is-zeroed-out.patch (text/x-patch)
From b2d36e3e4f4cd003cd7564a217bb15c19e1da5e2 Mon Sep 17 00:00:00 2001
From: Chapman Flack <chap@anastigmatix.net>
Date: Sun, 18 Mar 2018 20:03:04 -0400
Subject: [PATCH] Add test for ensuring WAL segment is zeroed out

Switch to a new WAL file and ensure the last block in the now
switched-away-from file is completely zeroed out.

This adds a mode to check_pg_config() where the match from the
regexp can be returned, not just the literal count of matches.
Only one match is returned (the first), but given the structure
of pg_config.h that's likely to be enough.
---
 .travis.yml                          | 47 ++++++++++++++++++++++++++++++++++++
 src/test/perl/TestLib.pm             | 18 ++++++++++++--
 src/test/recovery/t/002_archiving.pl | 19 ++++++++++++++-
 3 files changed, 81 insertions(+), 3 deletions(-)
 create mode 100644 .travis.yml

diff --git a/.travis.yml b/.travis.yml
new file mode 100644
index 0000000..386bb60
--- /dev/null
+++ b/.travis.yml
@@ -0,0 +1,47 @@
+
+sudo: required
+addons:
+  apt:
+    packages:
+      - gdb
+      - lcov
+      - libipc-run-perl
+      - libperl-dev
+      - libpython-dev
+      - tcl-dev
+      - libldap2-dev
+      - libicu-dev
+      - docbook
+      - docbook-dsssl
+      - docbook-xsl
+      - libxml2-utils
+      - openjade1.3
+      - opensp
+      - xsltproc
+language: c
+cache: ccache
+before_install:
+  - echo '/tmp/%e-%s-%p.core' | sudo tee /proc/sys/kernel/core_pattern
+  - echo "deb http://archive.ubuntu.com/ubuntu xenial main" | sudo tee /etc/apt/sources.list.d/xenial.list > /dev/null
+  - |
+    sudo tee -a /etc/apt/preferences.d/trusty > /dev/null <<EOF
+    Package: *
+    Pin: release n=xenial
+    Pin-Priority: 1
+    
+    Package: make
+    Pin: release n=xenial
+    Pin-Priority: 500
+    EOF
+  - sudo apt-get update && sudo apt-get install make
+script: ./configure --enable-debug --enable-cassert --enable-coverage --enable-tap-tests --with-tcl --with-python --with-perl --with-ldap --with-icu && make -j4 all contrib docs && make -Otarget -j3 check-world
+after_success:
+  - bash <(curl -s https://codecov.io/bash)
+after_failure:
+  - for f in ` find . -name regression.diffs ` ; do echo "========= Contents of $f" ; head -1000 $f ; done
+  - |
+    for corefile in $(find /tmp/ -name '*.core' 2>/dev/null) ; do
+      binary=$(gdb -quiet -core $corefile -batch -ex 'info auxv' | grep AT_EXECFN | perl -pe "s/^.*\"(.*)\"\$/\$1/g")
+      echo dumping $corefile for $binary
+      gdb --batch --quiet -ex "thread apply all bt full" -ex "quit" $binary $corefile
+    done
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 7d017d4..8b9796c 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -227,15 +227,29 @@ sub append_to_file
 # that.
 sub check_pg_config
 {
-	my ($regexp) = @_;
+	my ($regexp, $value) = @_;
 	my ($stdout, $stderr);
+	my $match;
 	my $result = IPC::Run::run [ 'pg_config', '--includedir' ], '>',
 	  \$stdout, '2>', \$stderr
 	  or die "could not execute pg_config";
 	chomp($stdout);
 
 	open my $pg_config_h, '<', "$stdout/pg_config.h" or die "$!";
-	my $match = (grep {/^$regexp/} <$pg_config_h>);
+	if (defined $value)
+	{
+		while (<$pg_config_h>)
+		{
+			$_ =~ m/^$regexp/;
+			next unless defined $1;
+			$match = $1;
+			last;
+		}
+	}
+	else
+	{
+		$match = (grep {/^$regexp/} <$pg_config_h>);
+	}
 	close $pg_config_h;
 	return $match;
 }
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index e1bd3c9..9551db1 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 1;
+use Test::More tests => 2;
 use File::Copy;
 
 # Initialize master node, doing archives
@@ -49,3 +49,20 @@ $node_standby->poll_query_until('postgres', $caughtup_query)
 my $result =
   $node_standby->safe_psql('postgres', "SELECT count(*) FROM tab_int");
 is($result, qq(1000), 'check content from archives');
+
+# Add some content, switch WAL file and read the last segment of the WAL
+# file which should be zeroed out
+my $current_file = $node_master->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn())");
+$node_master->safe_psql('postgres', "INSERT INTO tab_int VALUES (1)");
+$node_master->safe_psql('postgres', "SELECT pg_switch_wal()");
+
+my $xlog_blcksz = check_pg_config('#define XLOG_BLCKSZ (\d+)', 1);
+
+open my $fh, '<:raw', $node_master->data_dir . '/pg_wal/' . $current_file;
+# SEEK_END is defined as 2, but is not allowed when using strict mode
+seek $fh, ($xlog_blcksz * -1), 2;
+my $len = read($fh, my $bytes, $xlog_blcksz);
+close $fh;
+
+ok($bytes =~ /\A\x00*+\z/, 'make sure wal segment is zeroed');
-- 
2.7.3

0001-Zero-headers-of-unused-pages-after-WAL-switch.patch (text/plain)
From 35684bdaa1d27c3f4b7198541f3c92bb2b4cb2f4 Mon Sep 17 00:00:00 2001
From: Chapman Flack <chap@anastigmatix.net>
Date: Sun, 25 Feb 2018 11:44:47 -0500
Subject: [PATCH] Zero headers of unused pages after WAL switch.

When writing zeroed pages to the remainder of a WAL segment
after a WAL switch, ensure that the headers of those pages are
also zeroed, as their initialized values otherwise reduce the
compressibility of the WAL segment file by general tools.
---
 src/backend/access/transam/xlog.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 47a6c4d..a91ec7b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1556,7 +1556,16 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
 		{
 			/* initialize the next page (if not initialized already) */
 			WALInsertLockUpdateInsertingAt(CurrPos);
-			AdvanceXLInsertBuffer(CurrPos, false);
+			/*
+			 * Fields in the page header preinitialized by AdvanceXLInsertBuffer
+			 * to nonconstant values reduce the compressibility of WAL segments,
+			 * and aren't needed in the freespace following a switch record.
+			 * Re-zero that header area. This is not performance-critical, as
+			 * the more empty pages there are for this loop to touch, the less
+			 * busy the system is.
+			 */
+			currpos = GetXLogBuffer(CurrPos);
+			MemSet(currpos, 0, SizeOfXLogShortPHD);
 			CurrPos += XLOG_BLCKSZ;
 		}
 	}
-- 
2.7.3

#34Stephen Frost
sfrost@snowman.net
In reply to: Chapman Flack (#33)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

Greetings,

* Chapman Flack (chap@anastigmatix.net) wrote:

On 03/18/18 19:28, Daniel Gustafsson wrote:

It seems expensive to regex over BLCKSZ, but it’s probably the safest option
and it’s not a performance critical codepath. Feel free to whack the test
patch over the head with the above diff.

Both patches in a single email for cfbot's enjoyment, coming right up.

Thanks for this. Having read through the thread, I'm also of the
opinion that it was an unintentional side-effect to have the headers in
the otherwise-zero'd end of the WAL segment, and that re-zero'ing the
end makes sense. While it's a bit of extra time, it's not on a
performance-critical path, given this only happens on a less-busy
system to begin with.

I'm mildly disappointed to see that gzip has only a 2x improvement with
the change, but that's certainly not this patch's fault.
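For a rough sense of why the per-page headers hurt general-purpose compression, here is an illustrative sketch with synthetic data (the 24-byte header size and the field layout are assumptions for illustration, not real WAL contents):

```python
import bz2
import struct

SEG_SIZE = 16 * 1024 * 1024   # WAL segment size
BLCKSZ = 8192                 # XLOG_BLCKSZ

# Tail pages as written before the fix: zeros except for a small header
# whose fields (here: fake magic/info/tli/pageaddr/rem_len) differ from
# page to page.
pages = []
for pageno in range(SEG_SIZE // BLCKSZ):
    header = struct.pack("<HHIQQ", 0xD093, 0, 1, pageno * BLCKSZ, 0)
    pages.append(header + bytes(BLCKSZ - len(header)))
with_headers = b"".join(pages)

all_zero = bytes(SEG_SIZE)    # tail pages after the fix

# The zero runs interrupted every BLCKSZ by varying bytes compress
# measurably worse than one uninterrupted run of zeros.
assert len(bz2.compress(all_zero)) < len(bz2.compress(with_headers))
```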

Reviewing the patch itself..

.travis.yml | 47 ++++++++++++++++++++++++++++++++++++

... not something that I think we're going to add into the main tree.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 47a6c4d..a91ec7b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1556,7 +1556,16 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
{
/* initialize the next page (if not initialized already) */
WALInsertLockUpdateInsertingAt(CurrPos);
-			AdvanceXLInsertBuffer(CurrPos, false);
+			/*
+			 * Fields in the page header preinitialized by AdvanceXLInsertBuffer
+			 * to nonconstant values reduce the compressibility of WAL segments,
+			 * and aren't needed in the freespace following a switch record.
+			 * Re-zero that header area. This is not performance-critical, as
+			 * the more empty pages there are for this loop to touch, the less
+			 * busy the system is.
+			 */
+			currpos = GetXLogBuffer(CurrPos);
+			MemSet(currpos, 0, SizeOfXLogShortPHD);
CurrPos += XLOG_BLCKSZ;
}
}

AdvanceXLInsertBuffer() does quite a bit, so I'm a bit surprised to see
this simply removing that call, you're confident there's nothing done
which still needs doing..?

Thanks!

Stephen

#35Michael Paquier
michael@paquier.xyz
In reply to: Stephen Frost (#34)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

On Sun, Mar 25, 2018 at 11:27:31PM -0400, Stephen Frost wrote:

AdvanceXLInsertBuffer() does quite a bit, so I'm a bit surprised to see
this simply removing that call, you're confident there's nothing done
which still needs doing..?

Have a look at BKP_REMOVABLE then. This was moved to page headers in
2dd9322; still, it seems to me that the small benefits outlined in this
thread don't justify breaking tools relying on this flag being set,
especially if there is no replacement for it.

So I would vote to not commit this patch.
--
Michael

#36Chapman Flack
chap@anastigmatix.net
In reply to: Stephen Frost (#34)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

On 03/25/18 23:27, Stephen Frost wrote:

.travis.yml | 47 ++++++++++++++++++++++++++++++++++++

... not something that I think we're going to add into the main tree.

Looks like that got in by mistake, sorry.

-			AdvanceXLInsertBuffer(CurrPos, false);
...
+			currpos = GetXLogBuffer(CurrPos);

AdvanceXLInsertBuffer() does quite a bit, so I'm a bit surprised to see
this simply removing that call, you're confident there's nothing done
which still needs doing..?

My belief from looking at the code was that AdvanceXLInsertBuffer() is among
the things GetXLogBuffer() does, so calling both would result in two calls
to the former (which I don't believe would hurt; the second call would
only do enough work to determine it had already been done).

However, it is done *conditionally* within GetXLogBuffer(), so it doesn't
hurt to have extra eyes reviewing my belief that the condition will be true
in this case (looping through tail blocks that haven't been touched yet).

-Chap

#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Chapman Flack (#36)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

Chapman Flack <chap@anastigmatix.net> writes:

On 03/25/18 23:27, Stephen Frost wrote:

AdvanceXLInsertBuffer() does quite a bit, so I'm a bit surprised to see
this simply removing that call, you're confident there's nothing done
which still needs doing..?

My belief from looking at the code was that AdvanceXLInsertBuffer() is among
the things GetXLogBuffer() does, so calling both would result in two calls
to the former (which I don't believe would hurt; the second call would
only do enough work to determine it had already been done).

If that's the argument, why is the WALInsertLockUpdateInsertingAt(CurrPos)
call still there? GetXLogBuffer() would do that too.

In any case, the new comment seems very poorly written, as it fails to
explain what's going on, and the reference to a function that is not
actually called from the vicinity of the comment doesn't help. I'd
suggest something like "GetXLogBuffer() will fetch and initialize the
next WAL page for us. But it initializes the page header to nonzero,
which is undesirable because X, so we then reset the header to zero".
It might also be worth explaining how you know that the new page's
header is short not long.

regards, tom lane

#38Chapman Flack
chap@anastigmatix.net
In reply to: Tom Lane (#37)
2 attachment(s)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

On 03/26/18 12:34, Tom Lane wrote:

If that's the argument, why is the WALInsertLockUpdateInsertingAt(CurrPos)
call still there? GetXLogBuffer() would do that too.

"Because I hadn't noticed that," he said, sheepishly.

In any case, the new comment ... fails to
explain what's going on, and the reference to a function that is not
actually called from the vicinity of the comment ...
suggest something like "GetXLogBuffer() will fetch and initialize the
next WAL page for us. ... worth explaining how you know that the new
page's header is short not long.

Here are patches responding to that (and also fixing the unintended
inclusion of .travis.yml).

What I have not done here is respond to Michael's objection, which
I haven't had time to think more about yet.

-Chap

Attachments:

0001-Zero-headers-of-unused-pages-after-WAL-switch.patch (text/x-patch)
From 3606cccfd584654970f56e909798d0d163fa7e96 Mon Sep 17 00:00:00 2001
From: Chapman Flack <chap@anastigmatix.net>
Date: Mon, 26 Mar 2018 23:08:26 -0400
Subject: [PATCH 1/2] Zero headers of unused pages after WAL switch.

When writing zeroed pages to the remainder of a WAL segment
after a WAL switch, ensure that the headers of those pages are
also zeroed, as their initialized values otherwise reduce the
compressibility of the WAL segment file by general tools.
---
 src/backend/access/transam/xlog.c | 45 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cb9c2a2..cf4eaa1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1552,11 +1552,50 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
 		/* Use up all the remaining space on the first page */
 		CurrPos += freespace;
 
+		/*
+		 * Cause all remaining pages in the segment to be flushed, leaving the
+		 * XLog position where it should be at the start of the next segment. In
+		 * the process, the pages will be initialized and physically written to
+		 * the file. That costs a bit of I/O (compared to simply leaving the
+		 * rest of the file unwritten, as was once done), but has an advantage
+		 * that the tail of the file will contain predictable (ideally constant)
+		 * data, so that general-purpose compression tools perform well on it.
+		 * (If left unwritten, the tail contains whatever is left over from the
+		 * file's last use as an earlier segment, and may compress poorly.) The
+		 * I/O cost is of little concern because any period when WAL segments
+		 * are being switched out (for, e.g., checkpoint timeout reasons) before
+		 * they are filled is clearly a low-workload period.
+		 */
 		while (CurrPos < EndPos)
 		{
-			/* initialize the next page (if not initialized already) */
-			WALInsertLockUpdateInsertingAt(CurrPos);
-			AdvanceXLInsertBuffer(CurrPos, false);
+			/*
+			 * This loop only touches full pages that follow the last actually-
+			 * used data in the segment. It will never touch the first page of a
+			 * segment; we would not be here to switch out a segment to which
+			 * nothing at all had been written.
+			 *
+			 * The minimal actions to flush the page would be to call
+			 * WALInsertLockUpdateInsertingAt(CurrPos) followed by
+			 * AdvanceXLInsertBuffer(...). The page would be left initialized
+			 * mostly to zeros, except for the page header (always the short
+			 * variant, as this is never a segment's first page).
+			 *
+			 * The large vistas of zeros are good for compressibility, but
+			 * the short headers interrupting them every XLOG_BLCKSZ (and,
+			 * worse, with values that differ from page to page) are not. The
+			 * effect varies with compression tool, but bzip2 (which compressed
+			 * best among several common tools on scantly-filled WAL segments)
+			 * compresses about an order of magnitude worse if those headers
+			 * are left in place.
+			 *
+			 * Rather than complicating AdvanceXLInsertBuffer itself (which is
+			 * called in heavily-loaded circumstances as well as this lightly-
+			 * loaded one) with variant behavior, we just use GetXLogBuffer
+			 * (which itself calls the two methods we need) to get the pointer
+			 * and re-zero the short header.
+			 */
+			currpos = GetXLogBuffer(CurrPos);
+			MemSet(currpos, 0, SizeOfXLogShortPHD);
 			CurrPos += XLOG_BLCKSZ;
 		}
 	}
-- 
2.7.3

0002-Add-test-for-ensuring-WAL-segment-is-zeroed-out.patch (text/x-patch)
From f9549ec41d84e60964887d4784abe4562b0bda0f Mon Sep 17 00:00:00 2001
From: Chapman Flack <chap@anastigmatix.net>
Date: Mon, 26 Mar 2018 23:09:04 -0400
Subject: [PATCH 2/2] Add test for ensuring WAL segment is zeroed out

Switch to a new WAL file and ensure the last block in the now
switched-away-from file is completely zeroed out.

This adds a mode to check_pg_config() where the match from the
regexp can be returned, not just the literal count of matches.
Only one match is returned (the first), but given the structure
of pg_config.h that's likely to be enough.
---
 src/test/perl/TestLib.pm             | 18 ++++++++++++++++--
 src/test/recovery/t/002_archiving.pl | 19 ++++++++++++++++++-
 2 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index b686268..0f28bc8 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -245,15 +245,29 @@ sub append_to_file
 # that.
 sub check_pg_config
 {
-	my ($regexp) = @_;
+	my ($regexp, $value) = @_;
 	my ($stdout, $stderr);
+	my $match;
 	my $result = IPC::Run::run [ 'pg_config', '--includedir' ], '>',
 	  \$stdout, '2>', \$stderr
 	  or die "could not execute pg_config";
 	chomp($stdout);
 
 	open my $pg_config_h, '<', "$stdout/pg_config.h" or die "$!";
-	my $match = (grep {/^$regexp/} <$pg_config_h>);
+	if (defined $value)
+	{
+		while (<$pg_config_h>)
+		{
+			$_ =~ m/^$regexp/;
+			next unless defined $1;
+			$match = $1;
+			last;
+		}
+	}
+	else
+	{
+		$match = (grep {/^$regexp/} <$pg_config_h>);
+	}
 	close $pg_config_h;
 	return $match;
 }
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index e1bd3c9..9551db1 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 1;
+use Test::More tests => 2;
 use File::Copy;
 
 # Initialize master node, doing archives
@@ -49,3 +49,20 @@ $node_standby->poll_query_until('postgres', $caughtup_query)
 my $result =
   $node_standby->safe_psql('postgres', "SELECT count(*) FROM tab_int");
 is($result, qq(1000), 'check content from archives');
+
+# Add some content, switch WAL file and read the last segment of the WAL
+# file which should be zeroed out
+my $current_file = $node_master->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn())");
+$node_master->safe_psql('postgres', "INSERT INTO tab_int VALUES (1)");
+$node_master->safe_psql('postgres', "SELECT pg_switch_wal()");
+
+my $xlog_blcksz = check_pg_config('#define XLOG_BLCKSZ (\d+)', 1);
+
+open my $fh, '<:raw', $node_master->data_dir . '/pg_wal/' . $current_file;
+# SEEK_END is defined as 2, but is not allowed when using strict mode
+seek $fh, ($xlog_blcksz * -1), 2;
+my $len = read($fh, my $bytes, $xlog_blcksz);
+close $fh;
+
+ok($bytes =~ /\A\x00*+\z/, 'make sure wal segment is zeroed');
-- 
2.7.3

#39Chapman Flack
chap@anastigmatix.net
In reply to: Michael Paquier (#35)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

On 03/26/18 01:43, Michael Paquier wrote:

Have a look at BKP_REMOVABLE then. This was moved to page headers in
2dd9322; still, it seems to me that the small benefits outlined in this
thread don't justify breaking tools relying on this flag being set,
especially if there is no replacement for it.

Is 2dd9322 a commit? I'm having difficulty finding it:

https://git.postgresql.org/gitweb/?p=postgresql.git&a=search&h=HEAD&st=commit&s=2dd9322

Am I searching wrong?

I probably won't have more time to look at this tonight, but could
I ask in advance for examples of tools that would need this bit when
encountering unwritten pages at the end of a segment?

I don't mean that as snark; it's just nonobvious to me and I might need
a little nudge where to look.

-Chap

#40Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Chapman Flack (#39)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

On 03/28/2018 12:32 AM, Chapman Flack wrote:

On 03/26/18 01:43, Michael Paquier wrote:

Have a look at BKP_REMOVABLE then. This was moved to page headers in
2dd9322; still, it seems to me that the small benefits outlined in this
thread don't justify breaking tools relying on this flag being set,
especially if there is no replacement for it.

Is 2dd9322 a commit? I'm having difficulty finding it:

https://git.postgresql.org/gitweb/?p=postgresql.git&a=search&h=HEAD&st=commit&s=2dd9322

Am I searching wrong?

Not sure what's up with gitweb, but git finds it without any issue:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2dd9322ba6eea76800b38bfea0599fbc459458f2

I probably won't have more time to look at this tonight, but could
I ask in advance for examples of tools that would need this bit when
encountering unwritten pages at the end of a segment?

I don't mean that as snark; it's just nonobvious to me and I might need
a little nudge where to look.

-Chap

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#41Michael Paquier
michael@paquier.xyz
In reply to: Chapman Flack (#39)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

On Tue, Mar 27, 2018 at 06:32:08PM -0400, Chapman Flack wrote:

Is 2dd9322 a commit? I'm having difficulty finding it:

https://git.postgresql.org/gitweb/?p=postgresql.git&a=search&h=HEAD&st=commit&s=2dd9322

Am I searching wrong?

I probably won't have more time to look at this tonight, but could
I ask in advance for examples of tools that would need this bit when
encountering unwritten pages at the end of a segment?

Here you go for one example:
https://sourceforge.net/projects/pglesslog/
In the end we may want to just drop the flag, but that's quite a hard
call, as there could be people who use it but aren't around to tell us...
--
Michael

#42Chapman Flack
chap@anastigmatix.net
In reply to: Tomas Vondra (#40)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

On 03/27/18 20:09, Tomas Vondra wrote:

Not sure what's up with gitweb, but git finds it without any issue:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2dd9322ba6eea76800b38bfea0599fbc459458f2

Thanks, that worked.

On 03/27/18 22:10, Michael Paquier wrote:

Here you go for one example:
https://sourceforge.net/projects/pglesslog/

So far, I have been able to study the commit pertaining to
XLP_BKP_REMOVABLE. Again, for some odd reason, I am striking out on
finding pglesslog code to study. Using the clone URL offered by SourceForge:

$ git clone https://git.code.sf.net/p/pglesslog/code pglesslog-code
Cloning into 'pglesslog-code'...
warning: You appear to have cloned an empty repository.
Checking connectivity... done.

and there's a Files tab, but it tells me "This project has no files."

I can find 1.4.2 beta on pgFoundry, but that predates the BKP_REMOVABLE
commit.

In any case, from my study of the commit, it is hard for me to see an issue.
The code comment says: "mark the header to indicate that WAL records
beginning in this page have removable backup blocks."

In the only case where this patch will zero a header--in the unused space
following the switch record in a segment--there are no "WAL records
beginning in this page". There will not be another WAL record of any kind
until the next valid page (with valid xlp_magic, xlp_tli, and xlp_pageaddr),
which will be at the start of the next segment, and that page will have
XLP_BKP_REMOVABLE if it ought to, and that will tell the reader what it
needs to know.

Am I overlooking something?

-Chap

#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Chapman Flack (#42)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

Chapman Flack <chap@anastigmatix.net> writes:

On 03/27/18 22:10, Michael Paquier wrote:

Here you go for one example:
https://sourceforge.net/projects/pglesslog/

In any case, from my study of the commit, it is hard for me to see an issue.
The code comment says: "mark the header to indicate that WAL records
beginning in this page have removable backup blocks."

Yeah, that commit just moved a flag from individual WAL records to page
headers, arguing that it was okay to assume that the same flag value
applies to all records on a page. If there are no records in the page,
it doesn't matter what you think the flag value is.

A potentially stronger complaint is that WAL-reading tools might fail
outright on a page with an invalid header, but I'd say that's a robustness
issue that they'd need to address anyway. There's never been any
guarantee that the trailing pages of a WAL segment are valid.

regards, tom lane
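The robustness Tom describes can be sketched as follows (illustrative only: the magic value and header layout are assumptions here, and a real reader such as xlogreader.c validates far more, including pageaddr, rem_len, and record CRCs):

```python
import struct

XLOG_PAGE_MAGIC = 0xD093  # illustrative value; it changes across versions

def page_looks_valid(page: bytes) -> bool:
    """Minimal header sniff: a robust WAL reader treats a page whose
    header doesn't validate as end-of-WAL instead of failing outright."""
    if len(page) < 8:
        return False
    magic, _info, _tli = struct.unpack_from("<HHI", page, 0)
    return magic == XLOG_PAGE_MAGIC

# A zeroed trailing page fails validation, so reading stops cleanly:
assert not page_looks_valid(bytes(8192))
# A page with a plausible header passes the sniff:
assert page_looks_valid(struct.pack("<HHI", XLOG_PAGE_MAGIC, 0, 1) + bytes(8184))
```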

#44Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#43)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Chapman Flack <chap@anastigmatix.net> writes:

On 03/27/18 22:10, Michael Paquier wrote:

Here you go for one example:
https://sourceforge.net/projects/pglesslog/

In any case, from my study of the commit, it is hard for me to see an issue.
The code comment says: "mark the header to indicate that WAL records
beginning in this page have removable backup blocks."

Yeah, that commit just moved a flag from individual WAL records to page
headers, arguing that it was okay to assume that the same flag value
applies to all records on a page. If there are no records in the page,
it doesn't matter what you think the flag value is.

A potentially stronger complaint is that WAL-reading tools might fail
outright on a page with an invalid header, but I'd say that's a robustness
issue that they'd need to address anyway. There's never been any
guarantee that the trailing pages of a WAL segment are valid.

Agreed, I don't buy off that tools which fall apart when reading a page
with an invalid header should block this from moving forward- those
tools need to be fixed to not rely on trailing/unused WAL pages to be
valid.

Thanks!

Stephen

#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#44)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

Stephen Frost <sfrost@snowman.net> writes:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

A potentially stronger complaint is that WAL-reading tools might fail
outright on a page with an invalid header, but I'd say that's a robustness
issue that they'd need to address anyway. There's never been any
guarantee that the trailing pages of a WAL segment are valid.

Agreed, I don't buy off that tools which fall apart when reading a page
with an invalid header should block this from moving forward- those
tools need to be fixed to not rely on trailing/unused WAL pages to be
valid.

Yup. Pushed with some rewriting of the comments.

I did not like the proposed test case too much, particularly not its
undocumented API change for check_pg_config, so I did not push that.
We already have test coverage for pg_switch_wal() so it doesn't seem
very critical to have more.

regards, tom lane

#46Chapman Flack
chap@anastigmatix.net
In reply to: Tom Lane (#45)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

On 03/30/18 16:21, Tom Lane wrote:

Yup. Pushed with some rewriting of the comments.

Thanks!

I did not like the proposed test case too much, particularly not its
undocumented API change for check_pg_config,

Other than that API change, was there something the test case could have
done differently to make you like it more?

-Chap

#47Tom Lane
tgl@sss.pgh.pa.us
In reply to: Chapman Flack (#46)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

Chapman Flack <chap@anastigmatix.net> writes:

On 03/30/18 16:21, Tom Lane wrote:

I did not like the proposed test case too much, particularly not its
undocumented API change for check_pg_config,

Other than that API change, was there something the test case could have
done differently to make you like it more?

Well, if that'd been properly documented I'd probably have pushed it
without complaint. But I did wonder whether it could've been folded
into one of the existing tests of pg_switch_wal(). This doesn't seem
like a property worth spending a lot of cycles on testing.

regards, tom lane

#48Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#47)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

On Fri, Mar 30, 2018 at 07:11:02PM -0400, Tom Lane wrote:

Chapman Flack <chap@anastigmatix.net> writes:

On 03/30/18 16:21, Tom Lane wrote:

I did not like the proposed test case too much, particularly not its
undocumented API change for check_pg_config,

Other than that API change, was there something the test case could have
done differently to make you like it more?

Well, if that'd been properly documented I'd probably have pushed it
without complaint. But I did wonder whether it could've been folded
into one of the existing tests of pg_switch_wal(). This doesn't seem
like a property worth spending a lot of cycles on testing.

Sorry for coming in late. I have been busy doing some net-archeology to
look for tools using XLP_BKP_REMOVABLE. One is pglesslog, which we
already know about. However, I have to be honest: I have not been able
to find its source code, nor have I seen another tool making use of
XLP_BKP_REMOVABLE. Could we just remove the flag then?
--
Michael

#49Chapman Flack
chap@anastigmatix.net
In reply to: Michael Paquier (#48)
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

On 03/30/18 19:57, Michael Paquier wrote:

look for tools using XLP_BKP_REMOVABLE. One is pglesslog that we
already know about. However I have to be honest, I have not been able
to find its source code,

I'm glad it wasn't just me.

-Chap