UUID-OSSP Contrib Module Compilation Issue

Started by Bruce McAlister · about 17 years ago · 65 messages
#1 Bruce McAlister
bruce.mcalister@blueface.ie

Hi All,

I am trying to build the uuid-ossp contrib module for PostgreSQL 8.3.4.
I am building on Solaris x86 with Sun Studio 12.

I built the ossp-uuid version 1.6.2 libraries and installed them,
however, whenever I attempt to build the contrib module I always end up
with the following error:

----------------------
+ cd contrib
+ cd uuid-ossp
+ make all
sed 's,MODULE_PATHNAME,$libdir/uuid-ossp,g' uuid-ossp.sql.in >uuid-ossp.sql
/usr/bin/cc -Xa -I/usr/sfw/include -KPIC -I. -I../../src/include
-I/usr/sfw/include   -c -o uuid-ossp.o uuid-ossp.c
"uuid-ossp.c", line 29: #error: OSSP uuid.h not found
cc: acomp failed for uuid-ossp.c
make: *** [uuid-ossp.o] Error 2
----------------------

I have the ossp uuid libraries and headers in the standard locations
(/usr/include, /usr/lib), but the checks within the contrib module don't
appear to find the ossp uuid headers I have installed.

Am I missing something here, or could the #ifdefs have something to do
with it not picking up the newer ossp uuid definitions?

Any suggestions would be greatly appreciated.

Thanks
Bruce

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce McAlister (#1)
Re: UUID-OSSP Contrib Module Compilation Issue

Bruce McAlister <bruce.mcalister@blueface.ie> writes:

I am trying to build the uuid-ossp contrib module for PostgreSQL 8.3.4.
I am building on Solaris x86 with Sun Studio 12.

I built the ossp-uuid version 1.6.2 libraries and installed them,
however, whenever I attempt to build the contrib module I always end up
with the following error:
"uuid-ossp.c", line 29: #error: OSSP uuid.h not found

Um ... did you run PG's configure script with --with-ossp-uuid?
It looks like either you didn't do that, or configure doesn't know
to look in the place where you put the ossp-uuid header files.

regards, tom lane

#3 Bruce McAlister
bruce.mcalister@blueface.ie
In reply to: Tom Lane (#2)
Re: UUID-OSSP Contrib Module Compilation Issue

Um ... did you run PG's configure script with --with-ossp-uuid?
It looks like either you didn't do that, or configure doesn't know
to look in the place where you put the ossp-uuid header files.

Doh, I missed that. However, I have now included that option, and it
still does not find the libraries that I have installed.

My configure options are:

./configure --prefix=/opt/postgresql-v8.3.4 \
--with-openssl \
--without-readline \
--with-perl \
--enable-integer-datetimes \
--enable-thread-safety \
--enable-dtrace \
--with-ossp-uuid

When I run configure with the above options, I end up with the following
configure error:

checking for uuid_export in -lossp-uuid... no
checking for uuid_export in -luuid... no
configure: error: library 'ossp-uuid' or 'uuid' is required for OSSP-UUID

The uuid library that I built was obtained from the following url as
mentioned in the documentation:

http://www.ossp.org/pkg/lib/uuid/

I've built and installed version 1.6.2 and the libraries/headers built
are installed in: /usr/lib and /usr/include, the cli tool is in /usr/bin.

ll /usr/lib/*uuid* | grep 'Oct 28'
-rw-r--r-- 1 root bin   81584 Oct 28 15:33 /usr/lib/libuuid_dce.a
-rw-r--r-- 1 root bin     947 Oct 28 15:33 /usr/lib/libuuid_dce.la
lrwxrwxrwx 1 root root     22 Oct 28 15:34 /usr/lib/libuuid_dce.so -> libuuid_dce.so.16.0.22
lrwxrwxrwx 1 root root     22 Oct 28 15:34 /usr/lib/libuuid_dce.so.16 -> libuuid_dce.so.16.0.22
-rwxr-xr-x 1 root bin   80200 Oct 28 15:33 /usr/lib/libuuid_dce.so.16.0.22
-rw-r--r-- 1 root bin   77252 Oct 28 15:33 /usr/lib/libuuid.a
-rw-r--r-- 1 root bin     919 Oct 28 15:33 /usr/lib/libuuid.la
lrwxrwxrwx 1 root root     18 Oct 28 15:34 /usr/lib/libuuid.so -> libuuid.so.16.0.22
lrwxrwxrwx 1 root root     18 Oct 28 15:34 /usr/lib/libuuid.so.16 -> libuuid.so.16.0.22
-rwxr-xr-x 1 root bin   76784 Oct 28 15:33 /usr/lib/libuuid.so.16.0.22

Do I need to use a specific version of the ossp-uuid libraries for this
module?

Thanks
Bruce

#4 Hiroshi Saito
z-saito@guitar.ocn.ne.jp
In reply to: Bruce McAlister (#1)
Re: UUID-OSSP Contrib Module Compilation Issue

Hi.

Um, you need to reconfigure PostgreSQL then. It is necessary to specify --with-ossp-uuid.

Regards,
Hiroshi Saito


#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce McAlister (#3)
Re: UUID-OSSP Contrib Module Compilation Issue

Bruce McAlister <bruce.mcalister@blueface.ie> writes:

When I run configure with the above options, I end up with the following
configure error:

checking for uuid_export in -lossp-uuid... no
checking for uuid_export in -luuid... no
configure: error: library 'ossp-uuid' or 'uuid' is required for OSSP-UUID

Huh. Nothing obvious in your info about why it wouldn't work. I think
you'll need to dig through the config.log output to see why these link
tests are failing. (They'll be a few hundred lines above the end of the
log, because the last part of the log is always a dump of configure's
internal variables.)
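
For example, something like this pulls out the relevant link test (a
sketch; the grep pattern is just a guess based on the test names in the
error above):

grep -n -B 2 -A 20 'uuid_export' config.log | less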

regards, tom lane

#6 Hiroshi Saito
z-saito@guitar.ocn.ne.jp
In reply to: Bruce McAlister (#1)
Re: UUID-OSSP Contrib Module Compilation Issue

Do I need to use a specific version of the ossp-uuid libraries for this
module?

The 1.6.2 stable version which you are using is the right one.

Regards,
Hiroshi Saito

#7 Bruce McAlister
bruce.mcalister@blueface.ie
In reply to: Tom Lane (#5)
Re: UUID-OSSP Contrib Module Compilation Issue

Huh. Nothing obvious in your info about why it wouldn't work. I think
you'll need to dig through the config.log output to see why these link
tests are failing. (They'll be a few hundred lines above the end of the
log, because the last part of the log is always a dump of configure's
internal variables.)

In addition to the missing configure option, it turned out I was missing
LDFLAGS parameters. I just added -L/usr/lib to LDFLAGS and it all builds
successfully now.
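
For the record, the working invocation would look something like this
(reconstructed from the options quoted earlier; only the LDFLAGS setting
is new):

LDFLAGS="-L/usr/lib" ./configure --prefix=/opt/postgresql-v8.3.4 \
--with-openssl \
--without-readline \
--with-perl \
--enable-integer-datetimes \
--enable-thread-safety \
--enable-dtrace \
--with-ossp-uuid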

Thanks for the pointers :)

#8 Bruce McAlister
bruce.mcalister@blueface.ie
In reply to: Hiroshi Saito (#6)
Re: UUID-OSSP Contrib Module Compilation Issue

The 1.6.2 stable version which you use is right.

Thanks, we managed to get it working now. Thanks for the pointers.

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce McAlister (#7)
Re: UUID-OSSP Contrib Module Compilation Issue

Bruce McAlister <bruce.mcalister@blueface.ie> writes:

In addition to the missing configure option, it turned out to be missing
LDFLAGS parameters, I just added -L/usr/lib to LDFLAGS and it all built
successfully now.

Bizarre ... I've never heard of a Unix system that didn't consider that
a default place to look. Unless this is a 64-bit machine and uuid
should have installed itself in /usr/lib64?

regards, tom lane

#10 Jason Long
mailing.list@supernovasoftware.com
In reply to: Tom Lane (#9)
Decreasing WAL size effects

I am planning on setting up PITR for my application.

It does not see much traffic and it looks like the 16 MB log files
switch about every 4 hours or so during business hours.
I am also about to roll out functionality to store documents in a bytea
column. This should make the logs roll faster.

I also have to ship them off site using a T1 so setting the time to
automatically switch files will just waste bandwidth if they are still
going to be 16 MB anyway.
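
(For context, a minimal postgresql.conf sketch of the setup being
described; the archive path and shipping command are hypothetical, and
archive_timeout is the setting that forces the time-based switches
mentioned above:)

archive_mode = on
archive_command = 'gzip < %p > /archive/%f.gz'
archive_timeout = 14400   # force a segment switch at most every 4 hours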

1. What is the effect of recompiling and reducing the default size of
the WAL files?
2. What is the minimum suggested size?
3. If I reduce the size, how will this work if I try to save a document
that is larger than the WAL size?

Any other suggestions would be most welcome.

Thank you for your time,

Jason Long
CEO and Chief Software Engineer
BS Physics, MS Chemical Engineering
http://www.octgsoftware.com
HJBug Founder and President
http://www.hjbug.com

#11 Joshua D. Drake
jd@commandprompt.com
In reply to: Jason Long (#10)
Re: Decreasing WAL size effects

Jason Long wrote:

I am planning on setting up PITR for my application.

I also have to ship them off site using a T1 so setting the time to
automatically switch files will just waste bandwidth if they are still
going to be 16 MB anyway.

1. What is the effect of recompiling and reducing the default size of
the WAL files?

Increased I/O

2. What is the minimum suggested size?

16 megs, the default.

3. If I reduce the size how will this work if I try to save a document
that is larger than the WAL size?

You will create more segments.
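
(For context: in 8.3 the segment size is a compile-time constant, so
"recompiling" means editing a header and rebuilding; a sketch, assuming
the 8.3-era source layout, and note a size change also requires initdb
because the cluster records its segment size:)

grep XLOG_SEG_SIZE src/include/pg_config_manual.h
# expected: #define XLOG_SEG_SIZE (16 * 1024 * 1024)
# edit that value, rebuild, and re-initdb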

Joshua D. Drake

#12 Bruce McAlister
bruce.mcalister@blueface.ie
In reply to: Tom Lane (#9)
Re: UUID-OSSP Contrib Module Compilation Issue

Bizarre ... I've never heard of a Unix system that didn't consider that
a default place to look. Unless this is a 64-bit machine and uuid
should have installed itself in /usr/lib64?

It is a rather peculiar issue; I also assumed that it would check the
standard locations, but I thought I would try it anyway and see what
happens.

The box is indeed a 64-bit system, but the packages being built are all
32-bit, and therefore all libraries being built are in the standard
locations.

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce McAlister (#12)
Re: UUID-OSSP Contrib Module Compilation Issue

Bruce McAlister <bruce.mcalister@blueface.ie> writes:

Bizarre ... I've never heard of a Unix system that didn't consider that
a default place to look. Unless this is a 64-bit machine and uuid
should have installed itself in /usr/lib64?

It is a rather peculiar issue; I also assumed that it would check the
standard locations, but I thought I would try it anyway and see what
happens.

The box is indeed a 64-bit system, but the packages being built are all
32-bit, and therefore all libraries being built are in the standard
locations.

Hmm ... it sounds like some part of the compile toolchain didn't get the
word about wanting to build 32-bit. Perhaps the switch you really need
is along the lines of CFLAGS=-m32.
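
Something along these lines, for example (-m32 is the Sun Studio 12
spelling; older Studio releases used -xarch flags instead):

CC=/usr/bin/cc CFLAGS=-m32 LDFLAGS=-L/usr/lib ./configure --with-ossp-uuid ...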

regards, tom lane

#14 Greg Smith
gsmith@gregsmith.com
In reply to: Jason Long (#10)
Re: Decreasing WAL size effects

On Tue, 28 Oct 2008, Jason Long wrote:

I also have to ship them off site using a T1 so setting the time to
automatically switch files will just waste bandwidth if they are still going
to be 16 MB anyway.

The best way to handle this is to clear the unused portion of the WAL file
and then compress it before sending it over the link. There is a utility
named pg_clearxlogtail, available at
http://www.2ndquadrant.com/replication.htm, that handles the first part
of that; you may find it useful here.

This reminds me yet again that pg_clearxlogtail should probably get added
to the next commitfest for inclusion into 8.4; it's really essential for a
WAN-based PITR setup and it would be nice to include it with the
distribution.
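
(Since it works as a stdin-to-stdout filter, the usual pattern is to
pipe through it in archive_command; the archive path here is
hypothetical:)

archive_command = 'pg_clearxlogtail < %p | gzip > /archive/%f.gz'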

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#15 Joshua D. Drake
jd@commandprompt.com
In reply to: Greg Smith (#14)
Re: Decreasing WAL size effects

On Wed, 2008-10-29 at 09:05 -0400, Greg Smith wrote:

On Tue, 28 Oct 2008, Jason Long wrote:

I also have to ship them off site using a T1 so setting the time to
automatically switch files will just waste bandwidth if they are still going
to be 16 MB anyway.

The best way to handle this is to clear the unused portion of the WAL file
and then compress it before sending over the link. There is a utility
named pg_clearxlogtail available at
http://www.2ndquadrant.com/replication.htm that handles the first part of
that you may find useful here.

This reminds me yet again that pg_clearxlogtail should probably get added
to the next commitfest for inclusion into 8.4; it's really essential for a
WAN-based PITR setup and it would be nice to include it with the
distribution.

What is to be gained over just using rsync with -z?
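
(That is, something along these lines run periodically; paths and host
are hypothetical:)

rsync -az /var/lib/pgsql/wal_archive/ standby:/var/lib/pgsql/wal_archive/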

Joshua D. Drake

#16 Greg Smith
gsmith@gregsmith.com
In reply to: Joshua D. Drake (#15)
Re: Decreasing WAL size effects

On Thu, 30 Oct 2008, Joshua D. Drake wrote:

This reminds me yet again that pg_clearxlogtail should probably get added
to the next commitfest for inclusion into 8.4; it's really essential for a
WAN-based PITR setup and it would be nice to include it with the
distribution.

What is to be gained over just using rsync with -z?

When a new XLOG segment is created, it gets zeroed out first, so that
there's no chance it can accidentally look like a valid segment. But when
an existing segment is recycled, it gets a new header and that's it--the
rest of the 16MB is still left behind from whatever was in that segment
before. That means that even if you only write, say, 1MB of new data to a
recycled segment before a timeout that causes you to ship it somewhere
else, there will still be a full 15MB worth of junk from its previous life
which may or may not be easy to compress.
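
(A quick way to see the effect on any one segment; the file name is
hypothetical:)

gzip -c 000000010000000000000042 | wc -c                       # recycled junk tail: compresses poorly
pg_clearxlogtail < 000000010000000000000042 | gzip -c | wc -c  # zeroed tail: far smaller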

I just noticed that this project has recently been pushed onto pgFoundry;
it's at
http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/clearxlogtail/clearxlogtail/

What clearxlogtail does is look inside the WAL segment and clear the
"tail" behind the portion of it that is really used. So our example file
would end up with just the 1MB of useful data, followed by 15MB of zeros
that will compress massively. Since it needs to know how XLogPageHeader
is formatted and if it makes a mistake your archive history will be
silently corrupted, it's kind of a scary utility to just download and use.
That's why I'd like to see it turn into a more official contrib module, so
that it will never lose sync with the page header format and be available
to anyone using PITR.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#17 Jason Long
mailing.list@supernovasoftware.com
In reply to: Greg Smith (#16)
Re: Decreasing WAL size effects

Greg Smith wrote:

On Thu, 30 Oct 2008, Joshua D. Drake wrote:

This reminds me yet again that pg_clearxlogtail should probably get added
to the next commitfest for inclusion into 8.4; it's really essential for a
WAN-based PITR setup and it would be nice to include it with the
distribution.

What is to be gained over just using rsync with -z?

When a new XLOG segment is created, it gets zeroed out first, so that
there's no chance it can accidentally look like a valid segment. But
when an existing segment is recycled, it gets a new header and that's
it--the rest of the 16MB is still left behind from whatever was in
that segment before. That means that even if you only write, say, 1MB
of new data to a recycled segment before a timeout that causes you to
ship it somewhere else, there will still be a full 15MB worth of junk
from its previous life which may or may not be easy to compress.

I just noticed that recently this project has been pushed into
pgfoundry, it's at
http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/clearxlogtail/clearxlogtail/

What clearxlogtail does is look inside the WAL segment and clear
the "tail" behind the portion of it that is really used. So our example
file would end up with just the 1MB of useful data, followed by 15MB
of zeros that will compress massively. Since it needs to know how
XLogPageHeader is formatted and if it makes a mistake your archive
history will be silently corrupted, it's kind of a scary utility to
just download and use.

I would really like to add something like this to my application.
1. Should I be scared or is it just scary in general?
2. Is this safe to use with 8.3.4?
3. Any pointers on how to install and configure this?


That's why I'd like to see it turn into a more official contrib
module, so that it will never lose sync with the page header format
and be available to anyone using PITR.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#18 Kyle Cordes
kyle@kylecordes.com
In reply to: Greg Smith (#16)
Re: Decreasing WAL size effects

Greg Smith wrote:

there's no chance it can accidentally look like a valid segment. But
when an existing segment is recycled, it gets a new header and that's
it--the rest of the 16MB is still left behind from whatever was in that
segment before. That means that even if you only write, say, 1MB of new

[...]

What clearxlogtail does is look inside the WAL segment, and it clears
the "tail" behind the portion of that is really used. So our example
file would end up with just the 1MB of useful data, followed by 15MB of

It sure would be nice if there was a way for PG itself to zero the
unused portion of logs as they are completed; perhaps this will make it
in as part of the ideas discussed on this list a while back to make a
more "out of the box" log-ship mechanism?

--
Kyle Cordes
http://kylecordes.com

#19 Jason Long
mailing.list@supernovasoftware.com
In reply to: Kyle Cordes (#18)
Re: Decreasing WAL size effects

Kyle Cordes wrote:

Greg Smith wrote:

there's no chance it can accidentally look like a valid segment. But
when an existing segment is recycled, it gets a new header and that's
it--the rest of the 16MB is still left behind from whatever was in
that segment before. That means that even if you only write, say,
1MB of new

[...]

What clearxlogtail does is look inside the WAL segment and clear
the "tail" behind the portion of it that is really used. So our example
file would end up with just the 1MB of useful data, followed by 15MB of

It sure would be nice if there was a way for PG itself to zero the
unused portion of logs as they are completed; perhaps this will make
it in as part of the ideas discussed on this list a while back to make
a more "out of the box" log-ship mechanism?

I agree totally. I looked at the code for clearxlogtail and it seems
short and not very complex. Hopefully something like this will at least
be a trivial-to-set-up option in 8.4.

#20 Greg Smith
gsmith@gregsmith.com
In reply to: Kyle Cordes (#18)
Re: Decreasing WAL size effects

On Thu, 30 Oct 2008, Kyle Cordes wrote:

It sure would be nice if there was a way for PG itself to zero the unused
portion of logs as they are completed, perhaps this will make it in as part
of the ideas discussed on this list a while back to make a more "out of the
box" log-ship mechanism?

The overhead of clearing out the whole thing is just large enough that it
can be disruptive on systems generating lots of WAL traffic, so you don't
want the main database processes bothering with that. A related fact is
that there is a noticeable slowdown to clients that need a segment switch
on a newly initialized and fast system that has to create all its WAL
segments, compared to one that has been active long enough to only be
recycling them. That's why this sort of thing has been getting pushed
into the archive_command path; nothing performance-sensitive that can slow
down clients is happening there, so long as your server is powerful enough
to handle that in parallel with everything else going on.

Now, it would be possible to have that less sensitive archive code path
zero things out, but you'd need to introduce a way to note when it's been
done (so you don't do it for a segment twice) and a way to turn it off so
everybody doesn't go through that overhead (which probably means another
GUC). That's a bit much trouble to go through just for a feature with a
fairly limited use-case that can easily live outside of the engine
altogether.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#21 Kyle Cordes
kyle@kylecordes.com
In reply to: Greg Smith (#20)
Re: Decreasing WAL size effects

Greg Smith wrote:

On Thu, 30 Oct 2008, Kyle Cordes wrote:

It sure would be nice if there was a way for PG itself to zero the
unused portion of logs as they are completed; perhaps this will make

The overhead of clearing out the whole thing is just large enough that
it can be disruptive on systems generating lots of WAL traffic, so you

Hmm. My understanding is that it wouldn't need to clear out the whole
thing, just the unused portion at the end. This wouldn't add any
initialization effort at startup / segment creation at all, right? The
unused portions at the end only happen when a WAL segment needs to be
finished "early" for some reason. I'd expect that in a heavily loaded
system, PG would be filling each segment, not ending them early.

However, there could easily be some reason that I am not familiar with,
that would cause a busy PG to nonetheless end a lot of segments early.

--
Kyle Cordes
http://kylecordes.com

#22 Gregory Stark
stark@enterprisedb.com
In reply to: Greg Smith (#20)
Re: Decreasing WAL size effects

Greg Smith <gsmith@gregsmith.com> writes:

Now, it would be possible to have that less sensitive archive code path zero
things out, but you'd need to introduce a way to note when it's been done (so
you don't do it for a segment twice) and a way to turn it off so everybody
doesn't go through that overhead (which probably means another GUC). That's a
bit much trouble to go through just for a feature with a fairly limited
use-case that can easily live outside of the engine altogether.

Wouldn't it be just as good to indicate to the archive command the amount of
real data in the wal file and have it only bother copying up to that point?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!

#23 Christophe
xof@thebuild.com
In reply to: Gregory Stark (#22)
Re: Decreasing WAL size effects

On Oct 30, 2008, at 2:54 PM, Gregory Stark wrote:

Wouldn't it be just as good to indicate to the archive command the
amount of real data in the wal file and have it only bother copying
up to that point?

Hm! Interesting question: Can the WAL files be truncated, rather
than zeroed, safely?

#24 Kyle Cordes
kyle@kylecordes.com
In reply to: Gregory Stark (#22)
Re: Decreasing WAL size effects

Gregory Stark wrote:

Wouldn't it be just as good to indicate to the archive command the amount of
real data in the wal file and have it only bother copying up to that point?

That sounds like a great solution to me; ideally it would be done in a
way that is always on (i.e. no setting, etc.).

On the log-recovery side, PG would need to be willing to accept
shorter-than-usual segments, if it's not already willing.

--
Kyle Cordes
http://kylecordes.com

#25 Greg Smith
gsmith@gregsmith.com
In reply to: Gregory Stark (#22)
Re: Decreasing WAL size effects

On Thu, 30 Oct 2008, Gregory Stark wrote:

Wouldn't it be just as good to indicate to the archive command the amount of
real data in the wal file and have it only bother copying up to that point?

That pushes the problem of writing a little chunk of code that reads only
the right amount of data and doesn't bother compressing the rest onto the
person writing the archive command. Seems to me that leads back towards
wanting to bundle a good implementation of that with the software as a
contrib module. The whole tail-clearing bit is in the same situation
pg_standby was circa 8.2: the software is available, and it works, but it
seems kind of sketchy to those not familiar with the source of the code.
Bundling it into the software as a contrib module just makes that problem
go away for end-users.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#26 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#25)
Re: Decreasing WAL size effects

Greg Smith <gsmith@gregsmith.com> writes:

That pushes the problem of writing a little chunk of code that reads only
the right amount of data and doesn't bother compressing the rest onto the
person writing the archive command. Seems to me that leads back towards
wanting to bundle a good implementation of that with the software as a
contrib module. The whole tail-clearing bit is in the same situation
pg_standby was circa 8.2: the software is available, and it works, but it
seems kind of sketchy to those not familiar with the source of the code.
Bundling it into the software as a contrib module just makes that problem
go away for end-users.

The real reason not to put that functionality into core (or even
contrib) is that it's a stopgap kluge. What the people who want this
functionality *really* want is continuous (streaming) log-shipping, not
WAL-segment-at-a-time shipping. Putting functionality like that into
core is infinitely more interesting than putting band-aids on a
segmented approach.

regards, tom lane

#27 Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#26)
Re: Decreasing WAL size effects

On Thu, 30 Oct 2008, Tom Lane wrote:

The real reason not to put that functionality into core (or even
contrib) is that it's a stopgap kluge. What the people who want this
functionality *really* want is continuous (streaming) log-shipping, not
WAL-segment-at-a-time shipping.

Sure, and that's why I didn't care when this got kicked out of the March
CommitFest; was hoping a better one would show up. But if 8.4 isn't going
out the door with the feature people really want, it would be nice to at
least make the stopgap kludge more easily available.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#28 Jason Long
mailing.list@supernovasoftware.com
In reply to: Greg Smith (#27)
Re: Decreasing WAL size effects

Greg Smith wrote:

On Thu, 30 Oct 2008, Tom Lane wrote:

The real reason not to put that functionality into core (or even
contrib) is that it's a stopgap kluge. What the people who want this
functionality *really* want is continuous (streaming) log-shipping, not
WAL-segment-at-a-time shipping.

Sure, and that's why I didn't care when this got kicked out of the
March CommitFest; was hoping a better one would show up. But if 8.4
isn't going out the door with the feature people really want, it would
be nice to at least make the stopgap kludge more easily available.

+1
Sure, I would rather have synchronous WAL shipping, but if that is going
to be a while, or synchronous would slow down my application, I can get
comfortably close enough for my purposes with some highly compressible WALs.


#29 Kyle Cordes
kyle@kylecordes.com
In reply to: Jason Long (#28)
Re: Decreasing WAL size effects

Jason Long wrote:

Sure, I would rather have synchronous WAL shipping, but if that is going
to be a while, or synchronous would slow down my application, I can get
comfortably close enough for my purposes with some highly compressible
WALs.

I'm way out here on the outskirts (just a user with a small pile of
servers running PG)... I would also find any improvements in WAL
shipping helpful, between now and when continuous streaming is ready.

--
Kyle Cordes
http://kylecordes.com

#30 Craig Ringer
craig@postnewspapers.com.au
In reply to: Jason Long (#28)
Re: Decreasing WAL size effects

Jason Long wrote:

Greg Smith wrote:

On Thu, 30 Oct 2008, Tom Lane wrote:

The real reason not to put that functionality into core (or even
contrib) is that it's a stopgap kluge. What the people who want this
functionality *really* want is continuous (streaming) log-shipping, not
WAL-segment-at-a-time shipping.

Sure, and that's why I didn't care when this got kicked out of the
March CommitFest; was hoping a better one would show up. But if 8.4
isn't going out the door with the feature people really want, it would
be nice to at least make the stopgap kludge more easily available.

+1
Sure, I would rather have synchronous WAL shipping, but if that is going
to be a while, or synchronous would slow down my application, I can get
comfortably close enough for my purposes with some highly compressible
WALs.

I also tend to agree; it'd be really handy. pg_clearxlogtail (which I
use) makes me nervous despite the restore tests I've done.

If Pg truncated the WAL files before calling archive_command, and would
accept truncated WAL files on restore, that'd be really useful. Failing
that, packaging pg_clearxlogtail so it was kept in sync with the main Pg
code would be a big step.

--
Craig Ringer

#31 Craig Ringer
craig@postnewspapers.com.au
In reply to: Craig Ringer (#30)
Re: Decreasing WAL size effects

If Pg truncated the WAL files before calling archive_command, and would
accept truncated WAL files on restore, that'd be really useful.

On second thought - that'd prevent reuse of WAL files, or at least force
the filesystem to potentially allocate new storage for the part that was
truncated.

Is it practical or sane to pass another argument to the archive_command:
a byte offset within the WAL file that is the last byte that must be
copied? That way, the archive_command could just avoid reading any
garbage in the first place, and write a truncated WAL file to the
archive, but Pg wouldn't have to do anything to the original files.
There'd be no need for a tool like pg_clearxlogtail, as the core server
would just report what it already knows about the WAL file.

Sound practical / sane?
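
(For instance, with a hypothetical %l placeholder carrying that byte
offset; no such placeholder exists, this just illustrates the idea:)

archive_command = 'head -c %l %p | gzip > /archive/%f.gz'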

--
Craig Ringer

#32 Magnus Hagander
magnus@hagander.net
In reply to: Greg Smith (#27)
Re: Decreasing WAL size effects

On 31 Oct 2008, at 02:18, Greg Smith <gsmith@gregsmith.com> wrote:

On Thu, 30 Oct 2008, Tom Lane wrote:

The real reason not to put that functionality into core (or even
contrib) is that it's a stopgap kluge. What the people who want this
functionality *really* want is continuous (streaming) log-shipping, not
WAL-segment-at-a-time shipping.

Sure, and that's why I didn't care when this got kicked out of the
March CommitFest; was hoping a better one would show up. But if 8.4
isn't going out the door with the feature people really want, it
would be nice to at least make the stopgap kludge more easily
available.

+1.

It's not like we haven't had kludges in contrib before. We just need
to be careful to label it as temporary and say it will go away. As
long as it can be safe, that is. To me it sounds like passing the size
as a param and shipping a tool in contrib that makes use of it would be a
reasonable compromise, but I'm not deeply familiar with the code so I
could be wrong.

/Magnus

#33 Aidan Van Dyk
aidan@highrise.ca
In reply to: Greg Smith (#20)
1 attachment(s)
Re: Decreasing WAL size effects

* Greg Smith <gsmith@gregsmith.com> [081001 00:00]:

The overhead of clearing out the whole thing is just large enough that it
can be disruptive on systems generating lots of WAL traffic, so you don't
want the main database processes bothering with that. A related fact is
that there is a noticeable slowdown to clients that need a segment switch
on a newly initialized and fast system that has to create all its WAL
segments, compared to one that has been active long enough to only be
recycling them. That's why this sort of thing has been getting pushed
into the archive_command path; nothing performance-sensitive that can
slow down clients is happening there, so long as your server is powerful
enough to handle that in parallel with everything else going on.

Now, it would be possible to have that less sensitive archive code path
zero things out, but you'd need to introduce a way to note when it's been
done (so you don't do it for a segment twice) and a way to turn it off so
everybody doesn't go through that overhead (which probably means another
GUC). That's a bit much trouble to go through just for a feature with a
fairly limited use-case that can easily live outside of the engine
altogether.

Remember that the place where this benefit is big is on a generally idle
server. Is it possible to make the "time-based WAL switch" zero the tail? You
don't even need to fsync it for durability (although you may want to, to
hopefully prevent a larger fsync delay on the next commit).

<timid experience=none>
How about something like the attached? It's been spun up quickly, passed
regression tests, and some simple hand tests on REL8_3_STABLE. It seems like
HEAD can't initdb on my machine (quad opteron with SW raid1); I tried a few
revisions in the last few days, and initdb dies on them all...

I'm no expert in the PG code; I just grepped around what looked like reasonable
functions in xlog.c until I (hopefully) figured out the basic flow of switching
to new xlog segments. I *think* I'm using openLogFile and openLogOff
correctly.
</timid>

With archiving set up, an archive_timeout of 30s, and a few hand-run
pg_start_backup/pg_stop_backup calls, you can see it *really* does make things
compressible...

Its output is like:
Archiving 000000010000000000000002
Archiving 000000010000000000000003
Archiving 000000010000000000000004
Archiving 000000010000000000000005
Archiving 000000010000000000000006
LOG: checkpoints are occurring too frequently (10 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
Archiving 000000010000000000000007
Archiving 000000010000000000000008
Archiving 000000010000000000000009
LOG: checkpoints are occurring too frequently (7 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
Archiving 00000001000000000000000A
Archiving 00000001000000000000000B
Archiving 00000001000000000000000C
LOG: checkpoints are occurring too frequently (6 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
Archiving 00000001000000000000000D
LOG: ZEROING xlog file 0 segment 14 from 12615680 - 16777216 [4161536 bytes]
STATEMENT: SELECT pg_stop_backup();
Archiving 00000001000000000000000E
Archiving 00000001000000000000000E.00C07098.backup
LOG: ZEROING xlog file 0 segment 15 from 8192 - 16777216 [16769024 bytes]
STATEMENT: SELECT pg_stop_backup();
Archiving 00000001000000000000000F
Archiving 00000001000000000000000F.00000C60.backup
LOG: ZEROING xlog file 0 segment 16 from 8192 - 16777216 [16769024 bytes]
STATEMENT: SELECT pg_stop_backup();
Archiving 000000010000000000000010.00000F58.backup
Archiving 000000010000000000000010
LOG: ZEROING xlog file 0 segment 17 from 8192 - 16777216 [16769024 bytes]
STATEMENT: SELECT pg_stop_backup();
Archiving 000000010000000000000011
Archiving 000000010000000000000011.00000020.backup
LOG: ZEROING xlog file 0 segment 18 from 6815744 - 16777216 [9961472 bytes]
Archiving 000000010000000000000012
LOG: ZEROING xlog file 0 segment 19 from 8192 - 16777216 [16769024 bytes]
Archiving 000000010000000000000013
LOG: ZEROING xlog file 0 segment 20 from 16384 - 16777216 [16760832 bytes]
Archiving 000000010000000000000014
LOG: ZEROING xlog file 0 segment 23 from 8192 - 16777216 [16769024 bytes]
STATEMENT: SELECT pg_switch_xlog();
Archiving 000000010000000000000017
LOG: ZEROING xlog file 0 segment 24 from 8192 - 16777216 [16769024 bytes]
Archiving 000000010000000000000018
LOG: ZEROING xlog file 0 segment 25 from 8192 - 16777216 [16769024 bytes]
Archiving 000000010000000000000019

You can see that when DB activity was heavy enough to fill an xlog segment
before the timeout (or interactive forced switch), it didn't zero anything. It
only zeroed on a timeout switch or a forced switch (pg_switch_xlog/pg_stop_backup).

And compressed xlog segments:
-rw-r--r-- 1 mountie mountie 18477 2008-10-31 14:44 000000010000000000000010.gz
-rw-r--r-- 1 mountie mountie 16394 2008-10-31 14:44 000000010000000000000011.gz
-rw-r--r-- 1 mountie mountie 2721615 2008-10-31 14:52 000000010000000000000012.gz
-rw-r--r-- 1 mountie mountie 16588 2008-10-31 14:52 000000010000000000000013.gz
-rw-r--r-- 1 mountie mountie 19230 2008-10-31 14:52 000000010000000000000014.gz
-rw-r--r-- 1 mountie mountie 4920063 2008-10-31 14:52 000000010000000000000015.gz
-rw-r--r-- 1 mountie mountie 5024705 2008-10-31 14:52 000000010000000000000016.gz
-rw-r--r-- 1 mountie mountie 18082 2008-10-31 14:52 000000010000000000000017.gz
-rw-r--r-- 1 mountie mountie 18477 2008-10-31 14:52 000000010000000000000018.gz
-rw-r--r-- 1 mountie mountie 16394 2008-10-31 14:52 000000010000000000000019.gz
-rw-r--r-- 1 mountie mountie 2721615 2008-10-31 15:02 00000001000000000000001A.gz
-rw-r--r-- 1 mountie mountie 16588 2008-10-31 15:02 00000001000000000000001B.gz
-rw-r--r-- 1 mountie mountie 19230 2008-10-31 15:02 00000001000000000000001C.gz

And yes, even the non-zeroed segments compress well here, because
my test load is pretty simple:
CREATE TABLE TEST
(
a numeric,
b numeric,
c numeric,
i bigint not null
);

INSERT INTO test (a,b,c,i)
SELECT random(),random(),random(),s FROM generate_series(1,1000000) s;

a.

--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.

Attachments:

wip-xlog-switch-zero.patch (text/x-diff; charset=us-ascii)
commit 3916c54126ffade0baad4609467393d9a1b53e37
Author: Aidan Van Dyk <aidan@highrise.ca>
Date:   Fri Oct 31 12:35:24 2008 -0400

    WIP: Zero xlog tail on a forced switch

    If XLogWrite is called with xlog_switch, an XLog switch has been forced, either
    by a timeout-based switch (archive_timeout) or an interactive forced xlog
    switch (pg_switch_xlog/pg_stop_backup).  In those cases, we assume we can
    afford a little extra IO bandwidth to make xlogs so much more compressible

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8bc46da..a8d945d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1548,6 +1548,30 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 			 */
 			if (finishing_seg || (xlog_switch && last_iteration))
 			{
+				/*
+				 * If we've had an xlog switch forced, then we want to zero
+				 * out the rest of the segment.  We zero it out here because at the
+				 * force switch time, IO bandwidth isn't a problem.
+				 *   -- AIDAN
+				 */
+				if (xlog_switch)
+				{
+					char buf[1024];
+					uint32 left = (XLogSegSize - openLogOff);
+					ereport(LOG,
+						(errmsg("ZEROING xlog file %u segment %u from %u - %u [%u bytes]",
+								openLogId, openLogSeg,
+								openLogOff, XLogSegSize, left)
+						 ));
+					memset(buf, 0, sizeof(buf));
+					while (left > 0)
+					{
+						size_t len = (left > sizeof(buf)) ? sizeof(buf) : left;
+						write(openLogFile, buf, len);
+						left -= len;
+					}
+				}
+
 				issue_xlog_fsync();
 				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
 
#34 Aidan Van Dyk
aidan@highrise.ca
In reply to: Aidan Van Dyk (#33)
Re: Decreasing WAL size effects

* Aidan Van Dyk <aidan@highrise.ca> [081031 15:11]:

Archiving 000000010000000000000012
Archiving 000000010000000000000013
Archiving 000000010000000000000014

Archiving 000000010000000000000017
Archiving 000000010000000000000018
Archiving 000000010000000000000019

Just in case anybody noticed the skip in the above sequence: the missing few
got caught up in me actually using the terminal there, which made copy-pasting
a mess... I just didn't try to copy/paste them...

--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.

#35 Aidan Van Dyk
aidan@highrise.ca
In reply to: Aidan Van Dyk (#33)
1 attachment(s)
Re: Decreasing WAL size effects

* Aidan Van Dyk <aidan@highrise.ca> [081031 15:11]:

How about something like the attached? It's been spun up quickly, passed
regression tests, and some simple hand tests on REL8_3_STABLE. It seems like
HEAD can't initdb on my machine (quad opteron with SW raid1); I tried a few
revisions in the last few days, and initdb dies on them all...

OK, HEAD does work; I don't know what was going on previously... Attached is my
patch against HEAD.

I'll try and pull out some machines on Monday to really thrash/crash this but
I'm running out of time today to set that up.

But in running HEAD, I've come across this:
regression=# SELECT pg_stop_backup();
WARNING: pg_stop_backup still waiting for archive to complete (60 seconds elapsed)
WARNING: pg_stop_backup still waiting for archive to complete (120 seconds elapsed)
WARNING: pg_stop_backup still waiting for archive to complete (240 seconds elapsed)

My archive script is *not* running; it ran and exited:
mountie@pumpkin:~/projects/postgresql/PostgreSQL/src/test/regress$ ps -ewf | grep post
mountie 2904 1 0 16:31 pts/14 00:00:00 /home/mountie/projects/postgresql/PostgreSQL/src/test/regress/tmp_check/install/usr/local/pgsql
mountie 2906 2904 0 16:31 ? 00:00:01 postgres: writer process
mountie 2907 2904 0 16:31 ? 00:00:00 postgres: wal writer process
mountie 2908 2904 0 16:31 ? 00:00:00 postgres: archiver process last was 00000001000000000000001F
mountie 2909 2904 0 16:31 ? 00:00:01 postgres: stats collector process
mountie 2921 2904 1 16:31 ? 00:00:18 postgres: mountie regression 127.0.0.1(56455) idle

Those all match up:
mountie@pumpkin:~/projects/postgresql/PostgreSQL/src/test/regress$ pstree -acp 2904
postgres,2904 -D/home/mountie/projects/postgres
├─postgres,2906
├─postgres,2907
├─postgres,2908
├─postgres,2909
└─postgres,2921

strace on the "archiver process" postgres:
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
getppid() = 2904
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
getppid() = 2904
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
getppid() = 2904
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
getppid() = 2904
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
getppid() = 2904

It *does* finally finish; the postmaster log looks like this ("Archiving ..." is what my
archive script prints; bytes is the gzip'ed size):
Archiving 000000010000000000000016 [16397 bytes]
Archiving 000000010000000000000017 [4405457 bytes]
Archiving 000000010000000000000018 [3349243 bytes]
Archiving 000000010000000000000019 [3349505 bytes]
LOG: ZEROING xlog file 0 segment 27 from 7954432 - 16777216 [8822784 bytes]
Archiving 00000001000000000000001A [3349590 bytes]
Archiving 00000001000000000000001B [1596676 bytes]
LOG: ZEROING xlog file 0 segment 28 from 8192 - 16777216 [16769024 bytes]
Archiving 00000001000000000000001C [16398 bytes]
LOG: ZEROING xlog file 0 segment 29 from 8192 - 16777216 [16769024 bytes]
Archiving 00000001000000000000001D [16397 bytes]
LOG: ZEROING xlog file 0 segment 30 from 8192 - 16777216 [16769024 bytes]
Archiving 00000001000000000000001E [16393 bytes]
Archiving 00000001000000000000001E.00000020.backup [146 bytes]
WARNING: pg_stop_backup still waiting for archive to complete (60 seconds elapsed)
WARNING: pg_stop_backup still waiting for archive to complete (120 seconds elapsed)
WARNING: pg_stop_backup still waiting for archive to complete (240 seconds elapsed)
LOG: ZEROING xlog file 0 segment 31 from 8192 - 16777216 [16769024 bytes]
Archiving 00000001000000000000001F [16395 bytes]

So what's this "pg_stop_backup still waiting for archive to complete" state
that lasts 5 minutes? I've not seen that before (running 8.2 and 8.3).

a.
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.

Attachments:

wip-xlog-switch-zero-HEAD.patch (text/x-diff; charset=us-ascii)
commit fba38257e52564276bb106d55aef14d0de481169
Author: Aidan Van Dyk <aidan@highrise.ca>
Date:   Fri Oct 31 12:35:24 2008 -0400

    WIP: Zero xlog tail on a forced switch

    If XLogWrite is called with xlog_switch, an XLog switch has been forced, either
    by a timeout-based switch (archive_timeout) or an interactive forced xlog
    switch (pg_switch_xlog/pg_stop_backup).  In those cases, we assume we can
    afford a little extra IO bandwidth to make xlogs so much more compressible

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 003098f..c6f9c79 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1600,6 +1600,30 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 			 */
 			if (finishing_seg || (xlog_switch && last_iteration))
 			{
+				/*
+				 * If we've had an xlog switch forced, then we want to zero
+				 * out the rest of the segment.  We zero it out here because at the
+				 * force switch time, IO bandwidth isn't a problem.
+				 *   -- AIDAN
+				 */
+				if (xlog_switch)
+				{
+					char buf[1024];
+					uint32 left = (XLogSegSize - openLogOff);
+					ereport(LOG,
+						(errmsg("ZEROING xlog file %u segment %u from %u - %u [%u bytes]",
+								openLogId, openLogSeg,
+								openLogOff, XLogSegSize, left)
+						 ));
+					memset(buf, 0, sizeof(buf));
+					while (left > 0)
+					{
+						size_t len = (left > sizeof(buf)) ? sizeof(buf) : left;
+						write(openLogFile, buf, len);
+						left -= len;
+					}
+				}
+
 				issue_xlog_fsync();
 				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
 
#36 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#26)
Re: Decreasing WAL size effects

Tom Lane wrote:

Greg Smith <gsmith@gregsmith.com> writes:

That pushes the problem of writing a little chunk of code that reads only
the right amount of data and doesn't bother compressing the rest onto the
person writing the archive command. Seems to me that leads back towards
wanting to bundle a good implementation of that with the software as a
contrib module. The whole tail-clearing bit is in the same situation
pg_standby was circa 8.2: the software is available, and it works, but it
seems kind of sketchy to those not familiar with the source of the code.
Bundling it into the software as a contrib module just makes that problem
go away for end-users.

The real reason not to put that functionality into core (or even
contrib) is that it's a stopgap kluge. What the people who want this
functionality *really* want is continuous (streaming) log-shipping, not
WAL-segment-at-a-time shipping. Putting functionality like that into
core is infinitely more interesting than putting band-aids on a
segmented approach.

Well, I realize we want streaming archive logs, but there are still
going to be people who are archiving for point-in-time recovery, and I
assume a good number of them are going to compress their WAL files to
save space, because they have to store a lot of them. Wouldn't zeroing
out the trailing bytes of WAL still help those people?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#37 Bruce Momjian
bruce@momjian.us
In reply to: Aidan Van Dyk (#35)
1 attachment(s)
Improving compressibility of WAL files

The attached patch from Aidan Van Dyk zeros out the end of WAL files to
improve their compressibility. (The patch was originally sent to
'general' which explains why it was lost until now.)

Would someone please eyeball it? It is useful for compressing PITR
logs even if we find a better solution for replication streaming.

As for why this patch is useful:

The real reason not to put that functionality into core (or even
contrib) is that it's a stopgap kluge. What the people who want this
functionality *really* want is continuous (streaming) log-shipping, not
WAL-segment-at-a-time shipping. Putting functionality like that into
core is infinitely more interesting than putting band-aids on a
segmented approach.

Well, I realize we want streaming archive logs, but there are still
going to be people who are archiving for point-in-time recovery, and I
assume a good number of them are going to compress their WAL files to
save space, because they have to store a lot of them. Wouldn't zeroing
out the trailing bytes of WAL still help those people?


--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Attachments:

/rtmp/wip-xlog-switch-zero-HEAD.patch (text/plain)
 
#38 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#37)
Re: Improving compressibility of WAL files

Bruce Momjian <bruce@momjian.us> writes:

The attached patch from Aidan Van Dyk zeros out the end of WAL files to
improve their compressibility. (The patch was originally sent to
'general' which explains why it was lost until now.)

Isn't this redundant given the existence of pglesslog?

regards, tom lane

#39 Aidan Van Dyk
aidan@highrise.ca
In reply to: Bruce Momjian (#37)
Re: Improving compressibility of WAL files

* Bruce Momjian <bruce@momjian.us> [090108 16:43]:

The attached patch from Aidan Van Dyk zeros out the end of WAL files to
improve their compressibility. (The patch was originally sent to
'general' which explains why it was lost until now.)

Would someone please eyeball it? It is useful for compressing PITR
logs even if we find a better solution for replication streaming.

The reason I didn't push it was that people claimed it would chew up too
much WAL bandwidth (causing a large commit latency) when the new 0's are
all written/fsynced at once...

I don't necessarily buy it, because the force_switch is usually either a
1) timed occurrence at an otherwise idle time
2) user-forced switch (i.e. a forced checkpoint/pg_backup), so your IO is
going to be hammered anyway...

But that's why I didn't follow up on it...

There are a few other possible ways to do it, such as zeroing the WAL on
recycling (but not fsyncing it), and hoping most of the zeros get
trickled out by the OS before it comes down to a single 16MB fsync, but
not many people seemed too enthused about the whole WAL compressibility
subject...

But, the way I see things going on -hackers, I must admit, sync-rep (WAL
streaming) looks like it's a long way off and possibly not even going to
do what I want, so *I* would really like this WAL zeroing...

If anybody has any specific things with the patch they think need
changing, I'll try and accommodate, but I do note that I never
submitted it for the CommitFest...

a.

--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.

#40 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Aidan Van Dyk (#39)
Re: Improving compressibility of WAL files

Aidan Van Dyk <aidan@highrise.ca> 01/08/09 5:02 PM >>>

*I* would really like this WAL zeroing...

pg_clearxlogtail (on pgFoundry) does exactly the same zeroing of the
tail, as a filter. If you pipe through it on the way to gzip, there
is no increase in disk I/O over a straight gzip, and often an I/O
savings. Benchmarks of the final version showed no measurable
performance cost, even with full WAL files.
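
For example, an archive_command along these lines (a rough sketch only;
it assumes pg_clearxlogtail is used as a stdin-to-stdout filter, and the
archive directory is illustrative):

archive_command = 'pg_clearxlogtail < %p | gzip > /mnt/server/archivedir/%f.gz'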

It's not as convenient to use as what your patch does, but it's not
all that hard either. There is also pglesslog, although we had
pg_clearxlogtail working before we found the other, so we've never
checked it out. Perhaps it does even better.

-Kevin

#41Hannu Krosing
hannu@krosing.net
In reply to: Aidan Van Dyk (#39)
Re: Improving compressibility of WAL files

On Thu, 2009-01-08 at 18:02 -0500, Aidan Van Dyk wrote:

* Bruce Momjian <bruce@momjian.us> [090108 16:43]:

The attached patch from Aidan Van Dyk zeros out the end of WAL files to
improve their compressibility. (The patch was originally sent to
'general' which explains why it was lost until now.)

Would someone please eyeball it? It is useful for compressing PITR
logs even if we find a better solution for replication streaming.

The reason I didn't push it was that people claimed it would chew up too
much WAL bandwidth (causing large commit latency) when the new zeros are
all written/fsynced at once...

I don't necessarily buy it, because the force_switch is usually either:
1) a timed occurrence at an otherwise idle time, or
2) user-forced (i.e. a forced checkpoint/pg_backup), so your IO is going
to be hammered anyway...

But that's why I didn't follow up on it...

There are possibly a few other ways to do it, such as zeroing the WAL on
recycling (but not fsyncing it), and hopefully most of the zeros get
trickled out by the OS before it comes down to a single 16MB fsync, but
not many people seemed too enthused about the whole WAL compressibility
subject...

But, the way I see things going on -hackers, I must admit, sync-rep (WAL
streaming) looks like it's a long way off and possibly not even going to
do what I want, so *I* would really like this WAL zeroing...

If anybody has any specific things with the patch they think need
changing, I'll try to accommodate them, but I do note that I never
submitted it for the Commitfest...

Wouldn't it still be easier, less intrusive on core functionality, and
more flexible to just record end-of-valid-WAL somewhere, and then let the
compressor discard the invalid part when compressing and recreate it
with zeros on decompression?
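
To sketch what I mean (untested, and assuming hypothetically that the
valid length of each completed segment were recorded in a ".len"
sidecar file next to it):

# compress: keep only the valid part of segment $f
head -c "$(cat "$f.len")" "$f" | gzip > "$f.gz"
# decompress: restore, then pad back out to a full 16MB with zeros
gunzip -c "$f.gz" > "$f"
head -c $(( 16*1024*1024 - $(stat -c%s "$f") )) /dev/zero >> "$f"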

-------------------
Hannu

#42Hannu Krosing
hannu@2ndQuadrant.com
In reply to: Hannu Krosing (#41)
Re: Improving compressibility of WAL files

On Fri, 2009-01-09 at 01:29 +0200, Hannu Krosing wrote:

On Thu, 2009-01-08 at 18:02 -0500, Aidan Van Dyk wrote:

...

There are possibly a few other ways to do it, such as zeroing the WAL on
recycling (but not fsyncing it), and hopefully most of the zeros get
trickled out by the OS before it comes down to a single 16MB fsync, but
not many people seemed too enthused about the whole WAL compressibility
subject...

But, the way I see things going on -hackers, I must admit, sync-rep (WAL
streaming) looks like it's a long way off and possibly not even going to
do what I want, so *I* would really like this WAL zeroing...

If anybody has any specific things with the patch they think need
changing, I'll try to accommodate them, but I do note that I never
submitted it for the Commitfest...

Wouldn't it still be easier, less intrusive on core functionality, and
more flexible to just record end-of-valid-WAL somewhere, and then let the
compressor discard the invalid part when compressing and recreate it
with zeros on decompression?

And some of the functionality already exists for in-progress WAL files,
in the form of pg_current_xlog_location() and
pg_current_xlog_insert_location(); recording end-of-data in the WAL file
just extends this to completed log files.
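
For example (untested; the locations come back as text, something like
0/A0003C8):

psql -Atc 'SELECT pg_current_xlog_location(), pg_current_xlog_insert_location()'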

--
------------------------------------------
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training

#43Greg Smith
gsmith@gregsmith.com
In reply to: Hannu Krosing (#41)
Re: Improving compressibility of WAL files

On Fri, 9 Jan 2009, Hannu Krosing wrote:

Wouldn't it still be easier, less intrusive on core functionality, and
more flexible to just record end-of-valid-WAL somewhere, and then let the
compressor discard the invalid part when compressing and recreate it
with zeros on decompression?

I thought at one point that the direction this was going toward was to
provide the size of the WAL file as a parameter you can use in the
archive_command: %p provides the path, %f the file name, and now %l the
length. That makes an example archive command something like:

head -c "%l" "%p" | gzip > /mnt/server/archivedir/"%f"

Expanding it back to always be 16MB on the other side might require some
trivial script; I can't think of a standard UNIX tool suitable for that,
but it's easy enough to write. I'm assuming I'm just remembering someone
else's suggestion here; maybe I just invented the above. You don't want
to just modify pg_standby to accept small files, because then you've made
it harder to make absolutely sure when the file is ready to be processed
if a non-atomic copy is being done. And it may make sense to provide some
simple C implementations of the clear/expand tools in contrib even with
the %l addition, mainly to help out Windows users.
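
For instance, a minimal sketch of such an expansion helper (untested; it
assumes a 16MB segment size and GNU stat, and the script name is
illustrative):

#!/bin/sh
# pad_segment.sh -- pad a shortened WAL segment back out to a full
# 16MB segment by appending zeros
f="$1"
seg=$((16 * 1024 * 1024))
size=$(stat -c%s "$f")
[ "$size" -lt "$seg" ] && head -c $(( seg - size )) /dev/zero >> "$f"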

To reiterate the choices I remember popping up in the multiple rounds
this has come up in, possible implementations that would work for this
general requirement include:

1) Provide the length as part of the archive command
2) Add a more explicit end-of-WAL delimiter
3) Write zeros to the unused portion in the server
4) pglesslog
5) pg_clearxlogtail

With "(6) use sync rep" being not quite a perfect answer; there are
certainly WAN-based use cases where you don't want full sync rep but do
want the WAL to compress as much as possible.

I think (1) is a better solution than most of these in the context of an
improvement to core, with (4) pglesslog being the main other contender
because of how it provides additional full-page write improvements.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#44Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#38)
Re: Improving compressibility of WAL files

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

The attached patch from Aidan Van Dyk zeros out the end of WAL files to
improve their compressibility. (The patch was originally sent to
'general' which explains why it was lost until now.)

Isn't this redundant given the existence of pglesslog?

It does the same as pglesslog, but is simpler to use because it is
automatic.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#44)
Re: Improving compressibility of WAL files

Bruce Momjian <bruce@momjian.us> writes:

Tom Lane wrote:

Isn't this redundant given the existence of pglesslog?

It does the same as pglesslog, but is simpler to use because it is
automatic.

Which also means that everyone pays the performance penalty whether
they get any benefit or not. The point of the external solution
is to do the work only in installations that get some benefit.
We've been over this ground before...

regards, tom lane

#46Zeugswetter Andreas OSB sIT
Andreas.Zeugswetter@s-itsolutions.at
In reply to: Greg Smith (#43)
Re: Improving compressibility of WAL files

You don't want to just
modify pg_standby to accept small files, because then you've made it
harder to make absolutely sure when the file is ready to be
processed if a non-atomic copy is being done.

It is hard, but I think it is the right way forward.
Anyway, I think the size is not robust at all, because some (most? e.g.
win32) non-atomic copy implementations will also show the final size
right from the beginning.

Could we use stricter file locking when opening WAL for recovery?

Or implement a wait loop when the CRC check (+ a basic validity check)
for the next record fails (and the next record is on a 512-byte
boundary?). I think standby and restore recovery can be treated
differently from startup recovery, because a copied WAL file, even if
the copy is not atomic, will not have trailing valid WAL records from a
recycled WAL. (A solution that recycles WAL files for restore/standby
would need to make sure it renames the files *after* restoring the
content.)

Btw, how do we detect end of WAL when restoring a backup and WAL after
a PANIC?

1) Provide the length as part of the archive command

+1

Andreas

#47Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#45)
Re: Improving compressibility of WAL files

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Tom Lane wrote:

Isn't this redundant given the existence of pglesslog?

It does the same as pglesslog, but is simpler to use because it is
automatic.

Which also means that everyone pays the performance penalty whether
they get any benefit or not. The point of the external solution
is to do the work only in installations that get some benefit.
We've been over this ground before...

If there is a performance penalty, you are right, but if the zeroing is
done as part of the archiving, it seems near enough to zero cost to do it
all the time, no?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#48Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Greg Smith (#43)
Re: Improving compressibility of WAL files

Greg Smith <gsmith@gregsmith.com> wrote:

I thought at one point that the direction this was going toward was to
provide the size of the WAL file as a parameter you can use in the
archive_command: %p provides the path, %f the file name, and now %l the
length. That makes an example archive command something like:

head -c "%l" "%p" | gzip > /mnt/server/archivedir/"%f"

Hard to beat for performance. I thought there was some technical
snag.

Expanding it back to always be 16MB on the other side might require some
trivial script, can't think of a standard UNIX tool suitable for that
but it's easy enough to write.

Untested, but it seems like something close to this would work:

( cat "$p"; dd if=/dev/zero bs=$(( (16 * 1024 * 1024) - $(stat -c%s "$p") )) count=1 )

-Kevin

#49Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#47)
Re: Improving compressibility of WAL files

Bruce Momjian <bruce@momjian.us> writes:

Tom Lane wrote:

Which also means that everyone pays the performance penalty whether
they get any benefit or not. The point of the external solution
is to do the work only in installations that get some benefit.
We've been over this ground before...

If there is a performance penalty, you are right, but if the zeroing is
done as part of the archiving, it seems near enough to zero cost to do it
all the time, no?

It's the same cost no matter which process does it.

regards, tom lane

#50Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#48)
Re: Improving compressibility of WAL files

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

Greg Smith <gsmith@gregsmith.com> wrote:

I thought at one point that the direction this was going toward was to
provide the size of the WAL file as a parameter you can use in the
archive_command:

Hard to beat for performance. I thought there was some technical
snag.

Yeah: the archiver process doesn't have that information available.

regards, tom lane

#51Aidan Van Dyk
aidan@highrise.ca
In reply to: Kevin Grittner (#48)
Re: Improving compressibility of WAL files

All that is useless until we get a %l in archive_command...

*I* didn't see an easy way to get at the "written" size later on in the
chain (i.e. in the actual archiving), so I took the path of least
resistance.

The reason *I* shy away from pg_lesslog and pg_clearxlogtail is that
they seem possibly frail... I'm just scared of something changing
in PG some time, my pg_clearxlogtail not knowing about it, me forgetting
to upgrade, and me not doing enough testing of actually restoring my
backups...

Sure, it's all me being negligent, but the simpler, the better...

If I wrapped this zeroing in a GUC, so people could choose whether to pay
the penalty or not, would that satisfy anyone?

Again, *I* think that the force_switch case is going to happen when the
admin's quite happy to pay that penalty... But obviously not
everyone...

a.

* Kevin Grittner <Kevin.Grittner@wicourts.gov> [090109 11:01]:

Greg Smith <gsmith@gregsmith.com> wrote:

I thought at one point that the direction this was going toward was to
provide the size of the WAL file as a parameter you can use in the
archive_command: %p provides the path, %f the file name, and now %l the
length. That makes an example archive command something like:

head -c "%l" "%p" | gzip > /mnt/server/archivedir/"%f"

Hard to beat for performance. I thought there was some technical
snag.

Expanding it back to always be 16MB on the other side might require some
trivial script, can't think of a standard UNIX tool suitable for that
but it's easy enough to write.

Untested, but it seems like something close to this would work:

( cat "$p"; dd if=/dev/zero bs=$(( (16 * 1024 * 1024) - $(stat -c%s "$p") )) count=1 )

-Kevin


--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.

#52Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#47)
Re: Improving compressibility of WAL files

On Fri, 2009-01-09 at 09:31 -0500, Bruce Momjian wrote:

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Tom Lane wrote:

Isn't this redundant given the existence of pglesslog?

It does the same as pglesslog, but is simpler to use because it is
automatic.

Which also means that everyone pays the performance penalty whether
they get any benefit or not. The point of the external solution
is to do the work only in installations that get some benefit.
We've been over this ground before...

If there is a performance penalty, you are right, but if the zeroing is
done as part of the archiving, it seems near enough to zero cost to do it
all the time, no?

It can already be done as part of the archiving, using an external tool
as Tom notes.

Yes, we could make the archiver do this, but I see no big advantage over
having it done externally. It's not faster, safer, or easier. Not easier
because we would want a parameter to turn it off when not wanted.

The patch as stands is IMHO not acceptable because the work to zero the
file is performed by the unlucky backend that hits EOF on the current
WAL file, which is bad enough, but it is also performed while holding
WALWriteLock.

I like Greg Smith's analysis of this and his conclusion that we could
provide a %l option, but even that would require work to have that info
passed to the archiver. Perhaps inside the notification file which is
already written and read by the write processes. But whether that can or
should be done for this release is a different debate.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#53Aidan Van Dyk
aidan@highrise.ca
In reply to: Simon Riggs (#52)
Re: Improving compressibility of WAL files

* Simon Riggs <simon@2ndQuadrant.com> [090109 11:33]:

The patch as stands is IMHO not acceptable because the work to zero the
file is performed by the unlucky backend that hits EOF on the current
WAL file, which is bad enough, but it is also performed while holding
WALWriteLock.

Agreed, but note that the extra zeroing work is conditional on the
force_switch, meaning that commits back up behind that WALWriteLock only
during forced xlog switches (like archive_timeout and pg_backup). I
actually did look through and verify that when I made the patch, although
I don't claim that verification to be something anybody else should
believe ;-) But the output I showed with the stats/lines/etc did
demonstrate that.

I like Greg Smith's analysis of this and his conclusion that we could
provide a %l option, but even that would require work to have that info
passed to the archiver. Perhaps inside the notification file which is
already written and read by the write processes. But whether that can or
should be done for this release is a different debate.

It's certainly not already in this commitfest, just like this patch. I
thought the initial reaction after I posted it made it pretty clear it
wasn't something people (other than a few of us) were willing to
allow...

a.
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.

#54Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Aidan Van Dyk (#51)
Re: Improving compressibility of WAL files

Aidan Van Dyk <aidan@highrise.ca> 01/09/09 10:22 AM >>>

The reason *I* shy away from pg_lesslog and pg_clearxlogtail is that
they seem possibly frail... I'm just scared of something changing
in PG some time, my pg_clearxlogtail not knowing about it, me
forgetting to upgrade, and me not doing enough testing of actually
restoring my backups...

A fair concern. I can't speak about pglesslog, but pg_clearxlogtail
goes out of its way to minimize this risk. Changes to log records
themselves can't break it; it is only dependent on the partitioning.
It bails with a message to stderr and a non-zero return code if it
finds anything obviously wrong. It also checks the WAL format for
which it was compiled against the WAL format on which it was invoked,
and issues a warning if they don't match. We ran into this once on a
machine running multiple releases of PostgreSQL where the archive
script invoked the wrong executable. It worked correctly in spite of
the warning, but the warning was enough to alert us to the mismatch
and change the path in the archive script.

-Kevin

#55Richard Huxton
dev@archonet.com
In reply to: Tom Lane (#50)
Re: Improving compressibility of WAL files

Tom Lane wrote:

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

Greg Smith <gsmith@gregsmith.com> wrote:

I thought at one point that the direction this was going toward was to
provide the size of the WAL file as a parameter you can use in the
archive_command:

Hard to beat for performance. I thought there was some technical
snag.

Yeah: the archiver process doesn't have that information available.

Am I being really dim here - why isn't the first record in the WAL file
a fixed-length record containing e.g. txid_start, time_start, txid_end,
time_end, length? Write it once when you start using the file and once
when it's finished.

--
Richard Huxton
Archonet Ltd

#56Aidan Van Dyk
aidan@highrise.ca
In reply to: Richard Huxton (#55)
Re: Improving compressibility of WAL files

* Richard Huxton <dev@archonet.com> [090109 12:22]:

Yeah: the archiver process doesn't have that information available.

Am I being really dim here - why isn't the first record in the WAL file
a fixed-length record containing e.g. txid_start, time_start, txid_end,
time_end, length? Write it once when you start using the file and once
when it's finished.

It would break the WAL "write-block/sync-block" forward-only progress of
the xlog, which avoids the whole torn-page problem that the heap has.

a.
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.

#57Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#50)
Re: Improving compressibility of WAL files

Tom Lane wrote:

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

Greg Smith <gsmith@gregsmith.com> wrote:

I thought at one point that the direction this was going toward was to
provide the size of the WAL file as a parameter you can use in the
archive_command:

Hard to beat for performance. I thought there was some technical
snag.

Yeah: the archiver process doesn't have that information available.

OK, thanks, I understand now.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#58Richard Huxton
dev@archonet.com
In reply to: Aidan Van Dyk (#56)
Re: Improving compressibility of WAL files

Aidan Van Dyk wrote:

* Richard Huxton <dev@archonet.com> [090109 12:22]:

Yeah: the archiver process doesn't have that information available.

Am I being really dim here - why isn't the first record in the WAL file
a fixed-length record containing e.g. txid_start, time_start, txid_end,
time_end, length? Write it once when you start using the file and once
when it's finished.

It would break the WAL "write-block/sync-block" forward only progress of
the xlog, which avoids the whole torn-page problem that the heap has.

I thought that only applied when the filesystem page-size was less than
the data we were writing?

--
Richard Huxton
Archonet Ltd

#59Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#52)
Re: Improving compressibility of WAL files

Simon Riggs <simon@2ndQuadrant.com> writes:

Yes, we could make the archiver do this, but I see no big advantage over
having it done externally. It's not faster, safer, easier. Not easier
because we would want a parameter to turn it off when not wanted.

And the other question to ask is how much effort and code should we be
putting into the concept anyway. AFAICS, file-at-a-time WAL shipping
is a stopgap implementation that will be dead as a doornail once the
current efforts towards realtime replication are finished. There will
still be some use for forced log switches in connection with backups,
but that's not going to occur often enough to be important to optimize.

regards, tom lane

#60Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#59)
Re: Improving compressibility of WAL files

Tom Lane <tgl@sss.pgh.pa.us> wrote:

AFAICS, file-at-a-time WAL shipping
is a stopgap implementation that will be dead as a doornail once the
current efforts towards realtime replication are finished.

As long as there is a way to rsync log data to multiple targets not
running replicas, with compression because of low-speed WAN
connections, I'm happy. Doesn't matter whether that is using existing
techniques or the new realtime techniques.

-Kevin

#61Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#59)
Re: Improving compressibility of WAL files

On Fri, 2009-01-09 at 13:22 -0500, Tom Lane wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

Yes, we could make the archiver do this, but I see no big advantage over
having it done externally. It's not faster, safer, easier. Not easier
because we would want a parameter to turn it off when not wanted.

And the other question to ask is how much effort and code should we be
putting into the concept anyway. AFAICS, file-at-a-time WAL shipping
is a stopgap implementation that will be dead as a doornail once the
current efforts towards realtime replication are finished. There will
still be some use for forced log switches in connection with backups,
but that's not going to occur often enough to be important to optimize.

Agreed.

Half-filled WAL files were necessary to honour archive_timeout. With
continuous streaming all WAL files will be 100% full before we switch,
for most purposes.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#62Greg Smith
gsmith@gregsmith.com
In reply to: Simon Riggs (#61)
Re: Improving compressibility of WAL files

On Fri, 9 Jan 2009, Simon Riggs wrote:

Half-filled WAL files were necessary to honour archive_timeout. With
continuous streaming all WAL files will be 100% full before we switch,
for most purposes.

The main use case I'm concerned about losing support for is:

1) Two systems connected by a WAN with significant transmit latency
2) The secondary system runs a warm standby aimed at disaster recovery
3) Business requirements want the standby to never be more than (say) 5
minutes behind the primary, presuming the WAN is up
4) WAN traffic is "expensive" (money==bandwidth, one of the two is scarce)

This seems a pretty common scenario in my experience. Right now, this
case is served quite well like this:

-archive_timeout='5 minutes'
-[pglesslog|pg_clearxlogtail] | gzip | rsync

The main concern I have with switching to a more synchronous scheme is
that network efficiency drops as the payload breaks into smaller pieces.
I haven't had enough time to keep up with all the sync rep advances
recently to know for sure if there's a configuration there that's suitable
for this case. If that can be configured to send only in relatively large
chunks, while still never letting things lag too far behind, then I'd
agree completely that the case for any of these WAL cleaner utilities is
dead--presuming said support makes it into the next release.

If that's not available, say because the only useful option sends in very
small pieces, there may still be a need for some utility to fill in for
this particular requirement. Luckily there are many to choose from if it
comes to that.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#63Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Greg Smith (#62)
Re: Improving compressibility of WAL files

Greg Smith <gsmith@gregsmith.com> wrote:

The main use case I'm concerned about losing support for is:

1) Two systems connected by a WAN with significant transmit latency
2) The secondary system runs a warm standby aimed at disaster recovery
3) Business requirements want the standby to never be more than (say) 5
minutes behind the primary, presuming the WAN is up
4) WAN traffic is "expensive" (money==bandwidth, one of the two is scarce)

This seems a pretty common scenario in my experience. Right now, this
case is served quite well like this:

-archive_timeout='5 minutes'
-[pglesslog|pg_clearxlogtail] | gzip | rsync

You've come pretty close to describing our environment, other than
having 72 primaries each using rsync to push the WAL files to another
server at the same site while a server at the central site uses rsync
to pull them back. We don't run warm standby on the backup server at
the site of origin, and don't want to have to do so.

It is critically important that the flow of xlog data never hold up
the primary databases, and that failure to copy xlog to either of the
targets not interfere with copying to the other. (We have WAN
failures surprisingly often, sometimes for days at a time, and the
backup server on-site is in the same rack of the same cabinet as the
database server.)

Compression of xlog data is important not only for WAN transmission,
but for storage space. We keep two weeks of WAL files to allow
recovery from either of the last two weekly backups, and we archive
the first weekly backup of each month, with the WAL files needed for
recovery, for one year.

So it appears we care about somewhat similar issues.

-Kevin

#64Greg Smith
gsmith@gregsmith.com
In reply to: Aidan Van Dyk (#51)
Re: Improving compressibility of WAL files

On Fri, 9 Jan 2009, Aidan Van Dyk wrote:

*I* didn't see an easy way to get at the "written" size later on in the
chain (i.e. in the actual archiving), so I took the path of least
resistance.

I was hoping it might fall out of the other work being done in that area,
given how much that code is still being poked at right now. As Hannu
pointed out, from a conceptual level you just need to carry along the same
information that pg_current_xlog_location() returns to the archiver on all
the paths where a segment might end early.

If I wrapped this zeroing in a GUC, so people could choose whether to pay
the penalty or not, would that satisfy anyone?

Rather than creating a whole new GUC, it might be possible to turn
archive_mode into an enum setting: off, on, and cleaned as the modes,
perhaps. That would avoid making a new setting, with the downside that a
bunch of critical code would look less clear than it does with a boolean.
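
Something like this, hypothetically:

archive_mode = cleaned        # off | on | cleaned (hypothetical enum)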

Again, *I* think that the force_switch case is going to happen when the
admin's quite happy to pay that penalty... But obviously not
everyone...

I understand the case you've made for why it doesn't matter, and for
almost every case you're right. The situation it may be vulnerable to is
where a burst of transactions comes in just as the archive timeout expires
after minimal WAL activity. There I think you can end up with a bunch of
clients waiting behind an almost-full zero-fill operation, which pushes up
the worst-case latency. I've been able to measure the impact of the
similar case where zero-filling a brand-new segment can impact things;
this would be much less likely to happen because the timing would have to
line up just wrong, but I think it's still possible.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#65Aidan Van Dyk
aidan@highrise.ca
In reply to: Greg Smith (#64)
Re: Improving compressibility of WAL files

* Greg Smith <gsmith@gregsmith.com> [090109 18:39]:

I was hoping it might fall out of the other work being done in that area,
given how much that code is still being poked at right now. As Hannu
pointed out, from a conceptual level you just need to carry along the
same information that pg_current_xlog_location() returns to the archiver
on all the paths where a segment might end early.

I was (am) also hoping that something falls out of sync-rep that gives me
better PITR backups (better than a small archive_timeout)... That hope
is what made me abandon this patch after the initial feedback.

Rather than creating a whole new GUC, it might be possible to turn
archive_mode into an enum setting: off, on, and cleaned as the modes,
perhaps. That would avoid making a new setting, with the downside that a
bunch of critical code would look less clear than it does with a boolean.

I'm content to wait and see what falls out of sync-rep stuff...

... for now ...

I understand the case you've made for why it doesn't matter, and for
almost every case you're right. The situation it may be vulnerable to is
where a burst of transactions comes in just as the archive timeout expires
after minimal WAL activity. There I think you can end up with a bunch of
clients waiting behind an almost-full zero-fill operation, which pushes
up the worst-case latency. I've been able to measure the impact of the
similar case where zero-filling a brand-new segment can impact things;
this would be much less likely to happen because the timing would have to
line up just wrong, but I think it's still possible.

Ya, and it's just one of many of the times PG hits these worst-case
latency spikes ;-) Generally, it's *very* good... and once in a while,
when all the stars line up correctly, you get these spikes....

But even with these spikes, it's plenty fast enough for the stuff I've
done...

a.
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.