Full page writes improvement, code update

Started by Koichi Suzukiabout 19 years ago69 messageshackers
Jump to latest
#1Koichi Suzuki
koichi.szk@gmail.com

Hi,

Here's an update of a code to improve full page writes as proposed in

http://archives.postgresql.org/pgsql-hackers/2007-01/msg01491.php
and
http://archives.postgresql.org/pgsql-patches/2007-01/msg00607.php

Update includes some modification for error handling in archiver and
restoration command.

In the previous threads, I posted several evaluation and shown that we
can keep all the full page writes needed for full XLOG crash recovery,
while compressing archive log size considerably better than gzip, with
less CPU consumption. I've found no further objection for this proposal
but still would like to hear comments/opinions/advices.

Regards;

--
Koichi Suzuki

Attachments:

pg_lesslog.tgzapplication/octet-stream; name=pg_lesslog.tgzDownload
#2Simon Riggs
simon@2ndQuadrant.com
In reply to: Koichi Suzuki (#1)
Re: Full page writes improvement, code update

On Tue, 2007-03-27 at 11:52 +0900, Koichi Suzuki wrote:

Here's an update of a code to improve full page writes as proposed in

http://archives.postgresql.org/pgsql-hackers/2007-01/msg01491.php
and
http://archives.postgresql.org/pgsql-patches/2007-01/msg00607.php

Update includes some modification for error handling in archiver and
restoration command.

In the previous threads, I posted several evaluation and shown that we
can keep all the full page writes needed for full XLOG crash recovery,
while compressing archive log size considerably better than gzip, with
less CPU consumption. I've found no further objection for this proposal
but still would like to hear comments/opinions/advices.

Koichi-san,

Looks interesting. I like the small amount of code to do this.

A few thoughts:

- Not sure why we need "full_page_compress", why not just mark them
always? That harms noone. (Did someone else ask for that? If so, keep
it)

- OTOH I'd like to see an explicit parameter set during recovery since
you're asking the main recovery path to act differently in case a single
bit is set/unset. If you are using that form of recovery, we should say
so explicitly, to keep everybody else safe.

- I'd rather mark just the nonremovable blocks. But no real difference

- We definitely don't want an normal elog in a XLogInsert critical
section, especially at DEBUG1 level

- diff -c format is easier and the standard

- pg_logarchive and pg_logrestore should be called by a name that
reflects what they actually do. Possibly pg_compresslog and
pg_decompresslog etc.. I've not reviewed those programs, but:

- Not sure why we have to compress away page headers. Touch as little as
you can has always been my thinking with recovery code.

- I'm very uncomfortable with touching the LSN. Maybe I misunderstand?

- Have you thought about how pg_standby would integrate with this
option? Can you please?

- I'll do some docs for this after Freeze, if you'd like. I have some
other changes to make there, so I can do this at the same time.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#3Koichi Suzuki
koichi.szk@gmail.com
In reply to: Simon Riggs (#2)
Re: [PATCHES] Full page writes improvement, code update

Simon;

Thanks a lot for your comments/advices. I'd like to write some feedback.

Simon Riggs wrote:

On Tue, 2007-03-27 at 11:52 +0900, Koichi Suzuki wrote:

Here's an update of a code to improve full page writes as proposed in

http://archives.postgresql.org/pgsql-hackers/2007-01/msg01491.php
and
http://archives.postgresql.org/pgsql-patches/2007-01/msg00607.php

Update includes some modification for error handling in archiver and
restoration command.

In the previous threads, I posted several evaluation and shown that we
can keep all the full page writes needed for full XLOG crash recovery,
while compressing archive log size considerably better than gzip, with
less CPU consumption. I've found no further objection for this proposal
but still would like to hear comments/opinions/advices.

Koichi-san,

Looks interesting. I like the small amount of code to do this.

A few thoughts:

- Not sure why we need "full_page_compress", why not just mark them
always? That harms noone. (Did someone else ask for that? If so, keep
it)

No, no one asked to have a separate option. There'll be no bad
influence to do so. As written below, full page write can be
categolized as follows:

1) Needed for crash recovery: first page update after each checkpoint.
This has to be kept in WAL.

2) Needed for archive recovery: page update between pg_start_backup and
pg_stop_backup. This has to be kept in archive log.

3) For log-shipping slave such as pg_standby: no full page writes will
be needed for this purpose.

My proposal deals with 2). So, if we mark each "full_page_write", I'd
rather mark when this is needed. Still need only one bit because the
case 3) does not need any mark.

- OTOH I'd like to see an explicit parameter set during recovery since
you're asking the main recovery path to act differently in case a single
bit is set/unset. If you are using that form of recovery, we should say
so explicitly, to keep everybody else safe.

Only one thing I had to do is to create "dummy" full page write to
maintain LSNs. Full page writes are omitted in archive log. We have to
LSNs same as those in the original WAL. In this case, recovery has to
read logical log, not "dummy" full page writes. On the other hand, if
both logical log and "real" full page writes are found in a log record,
the recovery has to use "real" full page writes.

- I'd rather mark just the nonremovable blocks. But no real difference

It sound nicer. According to the full page write categories above, we
can mark full page writes as needed in "crash recovery" or "archive
recovery". Please give some feedback to the above full page write
categories.

- We definitely don't want an normal elog in a XLogInsert critical
section, especially at DEBUG1 level

I agree. I'll remove elog calls from critical sections.

- diff -c format is easier and the standard

I'll change the diff option.

- pg_logarchive and pg_logrestore should be called by a name that
reflects what they actually do. Possibly pg_compresslog and
pg_decompresslog etc.. I've not reviewed those programs, but:

I wasn't careful to have command names more based on what to be done.
I'll change the command name.

- Not sure why we have to compress away page headers. Touch as little as
you can has always been my thinking with recovery code.

Eliminating page headers gives compression rate slightly better, a
couple of percents. To make code simpler, I'll drop this compression.

- I'm very uncomfortable with touching the LSN. Maybe I misunderstand?

The reason why we need pg_logarchive (or pg_decompresslog) is to
maintain LSN the same as those in the original WAL. For this purpose,
we need "dummy" full page write. I don't like to touch LSN either and
this is the reason why pg_decompresslog need some extra work,
eliminating many other changes in the recovery.

- Have you thought about how pg_standby would integrate with this
option? Can you please?

Yes I believe so. As pg_standby does not include any chance to meet
partial writes of pages, I believe you can omit all the full page
writes. Of course, as Tom Lange suggested in
http://archives.postgresql.org/pgsql-hackers/2007-02/msg00034.php
removing full page writes can lose a chance to recover from
partial/inconsisitent writes in the crash of pg_standby. In this case,
we have to import a backup and archive logs (with full page writes
during the backup) to recover. (We have to import them when the file
system crashes anyway). If it's okay, I believe
pg_compresslog/pg_decompresslog can be integrated with pg_standby.

Maybe we can work together to include pg_compresslog/pg_decompresslog in
pg_standby.

- I'll do some docs for this after Freeze, if you'd like. I have some
other changes to make there, so I can do this at the same time.

I'll be looking forward to your writings.

Best regards;

--
Koichi Suzuki

#4Simon Riggs
simon@2ndQuadrant.com
In reply to: Koichi Suzuki (#3)
Re: [PATCHES] Full page writes improvement, code update

On Wed, 2007-03-28 at 10:54 +0900, Koichi Suzuki wrote:

As written below, full page write can be
categolized as follows:

1) Needed for crash recovery: first page update after each checkpoint.
This has to be kept in WAL.

2) Needed for archive recovery: page update between pg_start_backup and
pg_stop_backup. This has to be kept in archive log.

3) For log-shipping slave such as pg_standby: no full page writes will
be needed for this purpose.

My proposal deals with 2). So, if we mark each "full_page_write", I'd
rather mark when this is needed. Still need only one bit because the
case 3) does not need any mark.

I'm very happy with this proposal, though I do still have some points in
detailed areas.

If you accept that 1 & 2 are valid goals, then 1 & 3 or 1, 2 & 3 are
also valid goals, ISTM. i.e. you might choose to use full_page_writes on
the primary and yet would like to see optimised data transfer to the
standby server. In that case, you would need the mark.

- Not sure why we need "full_page_compress", why not just mark them
always? That harms noone. (Did someone else ask for that? If so, keep
it)

No, no one asked to have a separate option. There'll be no bad
influence to do so. So, if we mark each "full_page_write", I'd
rather mark when this is needed. Still need only one bit because the
case 3) does not need any mark.

OK, different question:
Why would anyone ever set full_page_compress = off?

Why have a parameter that does so little? ISTM this is:

i) one more thing to get wrong

ii) cheaper to mark the block when appropriate than to perform the if()
test each time. That can be done only in the path where backup blocks
are present.

iii) If we mark the blocks every time, it allows us to do an offline WAL
compression. If the blocks aren't marked that option is lost. The bit is
useful information, so we should have it in all cases.

- OTOH I'd like to see an explicit parameter set during recovery since
you're asking the main recovery path to act differently in case a single
bit is set/unset. If you are using that form of recovery, we should say
so explicitly, to keep everybody else safe.

Only one thing I had to do is to create "dummy" full page write to
maintain LSNs. Full page writes are omitted in archive log. We have to
LSNs same as those in the original WAL. In this case, recovery has to
read logical log, not "dummy" full page writes. On the other hand, if
both logical log and "real" full page writes are found in a log record,
the recovery has to use "real" full page writes.

I apologise for not understanding your reply, perhaps my original
request was unclear.

In recovery.conf, I'd like to see a parameter such as

dummy_backup_blocks = off (default) | on

to explicitly indicate to the recovery process that backup blocks are
present, yet they are garbage and should be ignored. Having garbage data
within the system is potentially dangerous and I want to be told by the
user that they were expecting that and its OK to ignore that data.
Otherwise I want to throw informative errors. Maybe it seems OK now, but
the next change to the system may have unintended consequences and it
may not be us making the change. "It's OK the Alien will never escape
from the lab" is the starting premise for many good sci-fi horrors and I
want to watch them, not be in one myself. :-)

We can call it other things, of course. e.g.
ignore_dummy_blocks
decompressed_blocks
apply_backup_blocks

Yes I believe so. As pg_standby does not include any chance to meet
partial writes of pages, I believe you can omit all the full page
writes. Of course, as Tom Lange suggested in
http://archives.postgresql.org/pgsql-hackers/2007-02/msg00034.php
removing full page writes can lose a chance to recover from
partial/inconsisitent writes in the crash of pg_standby. In this case,
we have to import a backup and archive logs (with full page writes
during the backup) to recover. (We have to import them when the file
system crashes anyway). If it's okay, I believe
pg_compresslog/pg_decompresslog can be integrated with pg_standby.

Maybe we can work together to include pg_compresslog/pg_decompresslog in
pg_standby.

ISTM there are two options.

I think this option is already possible:

1. Allow pg_decompresslog to operate on a file, replacing it with the
expanded form, like gunzip, so we would do this:
restore_command = 'pg_standby %f decomp.tmp && pg_decompresslog
decomp.tmp %p'

though the decomp.tmp file would not get properly initialised or cleaned
up when we finish.

whereas this will take additional work

2. Allow pg_standby to write to stdin, so that we can do this:
restore_command = 'pg_standby %f | pg_decompresslog - %p'

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#5Koichi Suzuki
koichi.szk@gmail.com
In reply to: Simon Riggs (#4)
Re: [PATCHES] Full page writes improvement, code update

Hi, Here're some feedback to the comment:

Simon Riggs wrote:

On Wed, 2007-03-28 at 10:54 +0900, Koichi Suzuki wrote:

As written below, full page write can be
categolized as follows:

1) Needed for crash recovery: first page update after each checkpoint.
This has to be kept in WAL.

2) Needed for archive recovery: page update between pg_start_backup and
pg_stop_backup. This has to be kept in archive log.

3) For log-shipping slave such as pg_standby: no full page writes will
be needed for this purpose.

My proposal deals with 2). So, if we mark each "full_page_write", I'd
rather mark when this is needed. Still need only one bit because the
case 3) does not need any mark.

I'm very happy with this proposal, though I do still have some points in
detailed areas.

If you accept that 1 & 2 are valid goals, then 1 & 3 or 1, 2 & 3 are
also valid goals, ISTM. i.e. you might choose to use full_page_writes on
the primary and yet would like to see optimised data transfer to the
standby server. In that case, you would need the mark.

Yes, I need the mark. In my proposal, only unmarked full-page-writes,
which were written as the first update after a checkpoint, are to be
removed offline (pg_compresslog).

- Not sure why we need "full_page_compress", why not just mark them
always? That harms noone. (Did someone else ask for that? If so, keep
it)

No, no one asked to have a separate option. There'll be no bad
influence to do so. So, if we mark each "full_page_write", I'd
rather mark when this is needed. Still need only one bit because the
case 3) does not need any mark.

OK, different question:
Why would anyone ever set full_page_compress = off?

Why have a parameter that does so little? ISTM this is:

i) one more thing to get wrong

ii) cheaper to mark the block when appropriate than to perform the if()
test each time. That can be done only in the path where backup blocks
are present.

iii) If we mark the blocks every time, it allows us to do an offline WAL
compression. If the blocks aren't marked that option is lost. The bit is
useful information, so we should have it in all cases.

Not only full-page-writes are written as WAL record. In my proposal,
both full-page-writes and logical log are written in a WAL record, which
will make WAL size slightly bigger (five percent or so). If
full_page_compress = off, only a full-page-write will be written in a
WAL record. I thought someone will not be happy with this size growth.

I agree to make this mandatory if every body is happy with extra logical
log in WAL records with full page writes.

I'd like to have your opinion.

- OTOH I'd like to see an explicit parameter set during recovery since
you're asking the main recovery path to act differently in case a single
bit is set/unset. If you are using that form of recovery, we should say
so explicitly, to keep everybody else safe.

Only one thing I had to do is to create "dummy" full page write to
maintain LSNs. Full page writes are omitted in archive log. We have to
LSNs same as those in the original WAL. In this case, recovery has to
read logical log, not "dummy" full page writes. On the other hand, if
both logical log and "real" full page writes are found in a log record,
the recovery has to use "real" full page writes.

I apologise for not understanding your reply, perhaps my original
request was unclear.

In recovery.conf, I'd like to see a parameter such as

dummy_backup_blocks = off (default) | on

to explicitly indicate to the recovery process that backup blocks are
present, yet they are garbage and should be ignored. Having garbage data
within the system is potentially dangerous and I want to be told by the
user that they were expecting that and its OK to ignore that data.
Otherwise I want to throw informative errors. Maybe it seems OK now, but
the next change to the system may have unintended consequences and it
may not be us making the change. "It's OK the Alien will never escape
from the lab" is the starting premise for many good sci-fi horrors and I
want to watch them, not be in one myself. :-)

We can call it other things, of course. e.g.
ignore_dummy_blocks
decompressed_blocks
apply_backup_blocks

So far, we don't need any modification to the recovery and redo
functions. They ignore the dummy and apply logical logs. Also, if
there are both full page writes and logical log, current recovery
selects full page writes to apply.

I agree to introduce this option if 8.3 code introduces any conflict to
the current. Or, we could introduce this option for future safety. Do
you think we should introduce this option?

If this should be introduced now, what we should do is to check this
option when dummy full-page-write appears.

Yes I believe so. As pg_standby does not include any chance to meet
partial writes of pages, I believe you can omit all the full page
writes. Of course, as Tom Lange suggested in
http://archives.postgresql.org/pgsql-hackers/2007-02/msg00034.php
removing full page writes can lose a chance to recover from
partial/inconsisitent writes in the crash of pg_standby. In this case,
we have to import a backup and archive logs (with full page writes
during the backup) to recover. (We have to import them when the file
system crashes anyway). If it's okay, I believe
pg_compresslog/pg_decompresslog can be integrated with pg_standby.

Maybe we can work together to include pg_compresslog/pg_decompresslog in
pg_standby.

ISTM there are two options.

I think this option is already possible:

1. Allow pg_decompresslog to operate on a file, replacing it with the
expanded form, like gunzip, so we would do this:
restore_command = 'pg_standby %f decomp.tmp && pg_decompresslog
decomp.tmp %p'

though the decomp.tmp file would not get properly initialised or cleaned
up when we finish.

whereas this will take additional work

2. Allow pg_standby to write to stdin, so that we can do this:
restore_command = 'pg_standby %f | pg_decompresslog - %p'

Both seem to work fine. pg_decompresslog will read entire file at the
beginning and so both two will be equivallent. To make the second
option run quicker, pg_decompresslog needs some modification to read WAL
record one after another.

Anyway, could you try to run pg_standby with pg_compresslog and
pg_decompresslog?

----
Additional recomment on page header removal:

I found that it is not simple to keep page header in the compressed
archive log. Because we eliminate unmarked full page writes and shift
the rest of the WAL file data, it is not simple to keep page header as
the page header in the compressed archive log. It is much simpler to
remove page header as well and rebuild them. I'd like to keep current
implementation in this point.

Any suggestions are welcome.
-------------------

I'll modify the name of the commands and post it hopefully within 20hours.

With Best Regards;

--
Koichi Suzuki

#6Simon Riggs
simon@2ndQuadrant.com
In reply to: Koichi Suzuki (#5)
Re: [PATCHES] Full page writes improvement, code update

On Thu, 2007-03-29 at 17:50 +0900, Koichi Suzuki wrote:

Not only full-page-writes are written as WAL record. In my proposal,
both full-page-writes and logical log are written in a WAL record, which
will make WAL size slightly bigger (five percent or so). If
full_page_compress = off, only a full-page-write will be written in a
WAL record. I thought someone will not be happy with this size growth.

OK, I see what you're doing now and agree with you that we do need a
parameter. Not sure about the name you've chosen though - it certainly
confused me until you explained.

A parameter called ..._compress indicates to me that it would reduce
something in size whereas what it actually does is increase the size of
WAL slightly. We should have a parameter name that indicates what it
actually does, otherwise some people will choose to use this parameter
even when they are not using archive_command with pg_compresslog.

Some possible names...

additional_wal_info = 'COMPRESS'
add_wal_info
wal_additional_info
wal_auxiliary_info
wal_extra_data
attach_wal_info
...
others?

I've got some ideas for the future for adding additional WAL info for
various purposes, so it might be useful to have a parameter that can
cater for multiple types of additional WAL data. Or maybe we go for
something more specific like

wal_add_compress_info = on
wal_add_XXXX_info ...

In recovery.conf, I'd like to see a parameter such as

dummy_backup_blocks = off (default) | on

to explicitly indicate to the recovery process that backup blocks are
present, yet they are garbage and should be ignored. Having garbage data
within the system is potentially dangerous and I want to be told by the
user that they were expecting that and its OK to ignore that data.
Otherwise I want to throw informative errors. Maybe it seems OK now, but
the next change to the system may have unintended consequences and it
may not be us making the change. "It's OK the Alien will never escape
from the lab" is the starting premise for many good sci-fi horrors and I
want to watch them, not be in one myself. :-)

We can call it other things, of course. e.g.
ignore_dummy_blocks
decompressed_blocks
apply_backup_blocks

So far, we don't need any modification to the recovery and redo
functions. They ignore the dummy and apply logical logs. Also, if
there are both full page writes and logical log, current recovery
selects full page writes to apply.

I agree to introduce this option if 8.3 code introduces any conflict to
the current. Or, we could introduce this option for future safety. Do
you think we should introduce this option?

Yes. You are skipping a correctness test and that should be by explicit
command only. It's no problem to include that as well, since you are
already having to specify pg_... decompress... but the recovery process
doesn't know whether or not you've done that.

Anyway, could you try to run pg_standby with pg_compresslog and
pg_decompresslog?

After freeze, yes.

----
Additional recomment on page header removal:

I found that it is not simple to keep page header in the compressed
archive log. Because we eliminate unmarked full page writes and shift
the rest of the WAL file data, it is not simple to keep page header as
the page header in the compressed archive log. It is much simpler to
remove page header as well and rebuild them. I'd like to keep current
implementation in this point.

OK.

This is a good feature. Thanks for your patience with my comments.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#7Josh Berkus
josh@agliodbs.com
In reply to: Simon Riggs (#4)
Re: [PATCHES] Full page writes improvement, code update

Simon,

OK, different question:
Why would anyone ever set full_page_compress = off?

The only reason I can see is if compression costs us CPU but gains RAM &
I/O. I can think of a lot of applications ... benchmarks included ...
which are CPU-bound but not RAM or I/O bound. For those applications,
compression is a bad tradeoff.

If, however, CPU used for compression is made up elsewhere through smaller
file processing, then I'd agree that we don't need a switch.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

#8Simon Riggs
simon@2ndQuadrant.com
In reply to: Josh Berkus (#7)
Re: [PATCHES] Full page writes improvement, code update

On Thu, 2007-03-29 at 11:45 -0700, Josh Berkus wrote:

OK, different question:
Why would anyone ever set full_page_compress = off?

The only reason I can see is if compression costs us CPU but gains RAM &
I/O. I can think of a lot of applications ... benchmarks included ...
which are CPU-bound but not RAM or I/O bound. For those applications,
compression is a bad tradeoff.

If, however, CPU used for compression is made up elsewhere through smaller
file processing, then I'd agree that we don't need a switch.

Koichi-san has explained things for me now.

I misunderstood what the parameter did and reading your post, ISTM you
have as well. I do hope Koichi-san will alter the name to allow
everybody to understand what it does.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#9Koichi Suzuki
koichi.szk@gmail.com
In reply to: Simon Riggs (#8)
Re: [PATCHES] Full page writes improvement, code update

Josh;

I'd like to explain what the term "compression" in my proposal means
again and would like to show the resource consumption comparision with
cp and gzip.

My proposal is to remove unnecessary full page writes (they are needed
in crash recovery from inconsistent or partial writes) when we copy WAL
to archive log and rebuilt them as a dummy when we restore from archive
log. Dummy is needed to maintain LSN. So it is very very different
from general purpose compression such as gzip, although pg_compresslog
compresses archive log as a result.

As to CPU and I/O consumption, I've already evaluated as follows:

1) Collect all the WAL segment.
2) Copy them by different means, cp, pg_compresslog and gzip.

and compared the ellapsed time as well as other resource consumption.

Benchmark: DBT-2
Database size: 120WH (12.3GB)
Total WAL size: 4.2GB (after 60min. run)
Elapsed time:
cp: 120.6sec
gzip: 590.0sec
pg_compresslog: 79.4sec
Resultant archive log size:
cp: 4.2GB
gzip: 2.2GB
pg_compresslog: 0.3GB
Resource consumption:
cp: user: 0.5sec system: 15.8sec idle: 16.9sec I/O wait: 87.7sec
gzip: user: 286.2sec system: 8.6sec idle: 260.5sec I/O wait: 36.0sec
pg_compresslog:
user: 7.9sec system: 5.5sec idle: 37.8sec I/O wait: 28.4sec

Because the resultant log size is considerably smaller than cp or gzip,
pg_compresslog need much less I/O and because the logic is much simpler
than gzip, it does not consume CPU.

The term "compress" may not be appropriate. We may call this "log
optimization" instead.

So I don't see any reason why this (at least optimization "mark" in each
log record) can't be integrated.

Simon Riggs wrote:

On Thu, 2007-03-29 at 11:45 -0700, Josh Berkus wrote:

OK, different question:
Why would anyone ever set full_page_compress = off?

The only reason I can see is if compression costs us CPU but gains RAM &
I/O. I can think of a lot of applications ... benchmarks included ...
which are CPU-bound but not RAM or I/O bound. For those applications,
compression is a bad tradeoff.

If, however, CPU used for compression is made up elsewhere through smaller
file processing, then I'd agree that we don't need a switch.

As I wrote to Simon's comment, I concern only one thing.

Without a switch, because both full page writes and corresponding
logical log is included in WAL, this will increase WAL size slightly
(maybe about five percent or so). If everybody is happy with this, we
don't need a switch.

Koichi-san has explained things for me now.

I misunderstood what the parameter did and reading your post, ISTM you
have as well. I do hope Koichi-san will alter the name to allow
everybody to understand what it does.

Here're some candidates:
full_page_writes_optimize
full_page_writes_mark: means it marks full_page_write as "needed in
crash recovery", "needed in archive recovery" and so on.

I don't insist these names. It's very helpful if you have any
suggestion to reflect what it really means.

Regards;
--
Koichi Suzuki

#10Koichi Suzuki
koichi.szk@gmail.com
In reply to: Koichi Suzuki (#5)
Re: [PATCHES] Full page writes improvement, code update

Hi,

Here's a patch reflected some of Simon's comments.

1) Removed an elog call in a critical section.

2) Changed the name of the commands, pg_complesslog and pg_decompresslog.

3) Changed diff option to make a patch.

--
Koichi Suzuki

Attachments:

pg_lesslog.tgzapplication/octet-stream; name=pg_lesslog.tgzDownload
#11Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Koichi Suzuki (#9)
Re: [PATCHES] Full page writes improvement, code update

Without a switch, because both full page writes and
corresponding logical log is included in WAL, this will
increase WAL size slightly
(maybe about five percent or so). If everybody is happy
with this, we
don't need a switch.

Sorry, I still don't understand that. What is the "corresponding logical
log" ?
It seems to me, that a full page WAL record has enough info to produce a

dummy LSN WAL entry. So insead of just cutting the full page wal record
you
could replace it with a LSN WAL entry when archiving the log.

Then all that is needed is the one flag, no extra space ?

Andreas

#12Simon Riggs
simon@2ndQuadrant.com
In reply to: Zeugswetter Andreas SB SD (#11)
Re: [PATCHES] Full page writes improvement, code update

On Fri, 2007-03-30 at 10:22 +0200, Zeugswetter Andreas ADI SD wrote:

Without a switch, because both full page writes and
corresponding logical log is included in WAL, this will
increase WAL size slightly
(maybe about five percent or so). If everybody is happy
with this, we
don't need a switch.

Sorry, I still don't understand that. What is the "corresponding logical
log" ?
It seems to me, that a full page WAL record has enough info to produce a

dummy LSN WAL entry. So insead of just cutting the full page wal record
you
could replace it with a LSN WAL entry when archiving the log.

Then all that is needed is the one flag, no extra space ?

The full page write is required for crash recovery, but that isn't
required during archive recovery because the base backup provides the
safe base. Archive recovery needs the normal xlog record, which in some
cases has been optimised away because the backup block is present, since
the full block already contains the changes.

If you want to remove the backup blocks, you need to put back the
information that was optimised away, otherwise you won't be able to do
the archive recovery correctly. Hence a slight increase in WAL volume to
allow it to be compressed does make sense.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#13Richard Huxton
dev@archonet.com
In reply to: Simon Riggs (#12)
Re: [PATCHES] Full page writes improvement, code update

Simon Riggs wrote:

On Fri, 2007-03-30 at 10:22 +0200, Zeugswetter Andreas ADI SD wrote:

Without a switch, because both full page writes and
corresponding logical log is included in WAL, this will
increase WAL size slightly
(maybe about five percent or so). If everybody is happy
with this, we
don't need a switch.

Sorry, I still don't understand that. What is the "corresponding logical
log" ?
It seems to me, that a full page WAL record has enough info to produce a

dummy LSN WAL entry. So insead of just cutting the full page wal record
you
could replace it with a LSN WAL entry when archiving the log.

Then all that is needed is the one flag, no extra space ?

The full page write is required for crash recovery, but that isn't
required during archive recovery because the base backup provides the
safe base.

Is that always true? Could the backup not pick up a partially-written
page? Assuming it's being written to as the backup is in progress. (We
are talking about when disk blocks are smaller than PG blocks here, so
can't guarantee an atomic write for a PG block?)

--
Richard Huxton
Archonet Ltd

#14Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Simon Riggs (#12)
Re: [PATCHES] Full page writes improvement, code update

Archive recovery needs the
normal xlog record, which in some cases has been optimised
away because the backup block is present, since the full
block already contains the changes.

Aah, I didn't know that optimization exists.
I agree that removing that optimization is good/ok.

Andreas

#15Simon Riggs
simon@2ndQuadrant.com
In reply to: Richard Huxton (#13)
Re: [PATCHES] Full page writes improvement, code update

On Fri, 2007-03-30 at 11:27 +0100, Richard Huxton wrote:

Is that always true? Could the backup not pick up a partially-written
page? Assuming it's being written to as the backup is in progress. (We
are talking about when disk blocks are smaller than PG blocks here, so
can't guarantee an atomic write for a PG block?)

Any page written during a backup has a backup block that would not be
removable by Koichi's tool, so yes, you'd still be safe.

i.e. between pg_start_backup() and pg_stop_backup() we always use full
page writes, even if you are running in full_page_writes=off mode.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#16Richard Huxton
dev@archonet.com
In reply to: Simon Riggs (#15)
Re: [PATCHES] Full page writes improvement, code update

Simon Riggs wrote:

On Fri, 2007-03-30 at 11:27 +0100, Richard Huxton wrote:

Is that always true? Could the backup not pick up a partially-written
page? Assuming it's being written to as the backup is in progress. (We
are talking about when disk blocks are smaller than PG blocks here, so
can't guarantee an atomic write for a PG block?)

Any page written during a backup has a backup block that would not be
removable by Koichi's tool, so yes, you'd still be safe.

i.e. between pg_start_backup() and pg_stop_backup() we always use full
page writes, even if you are running in full_page_writes=off mode.

Ah, that's OK then.

--
Richard Huxton
Archonet Ltd

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#15)
Re: [PATCHES] Full page writes improvement, code update

"Simon Riggs" <simon@2ndquadrant.com> writes:

Any page written during a backup has a backup block that would not be
removable by Koichi's tool, so yes, you'd still be safe.

How does it know not to do that?

regards, tom lane

#18Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#17)
Re: [PATCHES] Full page writes improvement, code update

On Fri, 2007-03-30 at 16:35 -0400, Tom Lane wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

Any page written during a backup has a backup block that would not be
removable by Koichi's tool, so yes, you'd still be safe.

How does it know not to do that?

Not sure what you mean, but I'll take a stab...

I originally questioned Koichi-san's request for a full_page_compress
parameter, which is how it would tell whether/not. After explanation, I
accepted the need for a parameter, but I think we're looking for a new
name for it.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#19Koichi Suzuki
koichi.szk@gmail.com
In reply to: Simon Riggs (#18)
Re: [PATCHES] Full page writes improvement, code update

Simon;
Tom;

Koichi is writing.

Your question is how to determine WAL record generated between
pg_start_backup and pg_stop_backup and here's an answer.

XLogInsert( ) already has a logic to determine if inserting WAL record
is between pg_start_backup and pg_stop_backup. Currently it is used
to remove full_page_writes when full_page_writes=off. We can use
this to mark WAL records. We have one bit not used in WAL record
header, the last bit of xl_info, where upper four bits are used to
indicate the resource manager and three of the rest are used to
indicate number of full page writes included in the record.

So in my proposal, this unused bit is used to mark that full page
writes must not be removed at offline optimization by pg_complesslog.

Sorry I didn't have mailing list capability from home and have just
completed my subscription from
home. I had to create new thread to continue my post. Sorry for confusion.

Please refer to the original thread about this discussion.

Best Regards;

--
------
Koichi Suzuki

#20Koichi Suzuki
koichi.szk@gmail.com
In reply to: Tom Lane (#17)
Re: [HACKERS] Full page writes improvement, code update

Tom Lane wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

Any page written during a backup has a backup block that would not be
removable by Koichi's tool, so yes, you'd still be safe.

How does it know not to do that?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

XLogInsert( ) already has a logic to determine if inserting WAL record
is between pg_start_backup and pg_stop_backup. Currently it is used
to remove full_page_writes when full_page_writes=off. We can use
this to mark WAL records. We have one bit not used in WAL record
header, the last bit of xl_info, where upper four bits are used to
indicate the resource manager and three of the rest are used to
indicate number of full page writes included in the record.

In my proposal, this unused bit is used to mark that full page
writes must not be removed at offline optimization by pg_compresslog.

Regards;

--
------
Koichi Suzuki

--
Koichi Suzuki

#21Bruce Momjian
bruce@momjian.us
In reply to: Koichi Suzuki (#10)
#22Koichi Suzuki
koichi.szk@gmail.com
In reply to: Koichi Suzuki (#20)
#23Koichi Suzuki
koichi.szk@gmail.com
In reply to: Bruce Momjian (#21)
#24Bruce Momjian
bruce@momjian.us
In reply to: Koichi Suzuki (#23)
#25Simon Riggs
simon@2ndQuadrant.com
In reply to: Koichi Suzuki (#23)
#26Koichi Suzuki
koichi.szk@gmail.com
In reply to: Simon Riggs (#25)
#27Koichi Suzuki
koichi.szk@gmail.com
In reply to: Simon Riggs (#25)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Koichi Suzuki (#9)
#29Koichi Suzuki
koichi.szk@gmail.com
In reply to: Tom Lane (#28)
#30Joshua D. Drake
jd@commandprompt.com
In reply to: Koichi Suzuki (#29)
#31Hannu Krosing
hannu@tm.ee
In reply to: Joshua D. Drake (#30)
#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Koichi Suzuki (#29)
#33Koichi Suzuki
koichi.szk@gmail.com
In reply to: Tom Lane (#32)
#34Koichi Suzuki
koichi.szk@gmail.com
In reply to: Hannu Krosing (#31)
#35Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Koichi Suzuki (#34)
#36Koichi Suzuki
koichi.szk@gmail.com
In reply to: Zeugswetter Andreas SB SD (#35)
#37Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Koichi Suzuki (#36)
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB SD (#37)
#39Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#38)
#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#39)
#41Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#40)
#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#41)
#43Koichi Suzuki
koichi.szk@gmail.com
In reply to: Zeugswetter Andreas SB SD (#37)
#44Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Koichi Suzuki (#43)
#45Koichi Suzuki
koichi.szk@gmail.com
In reply to: Simon Riggs (#39)
#46Koichi Suzuki
koichi.szk@gmail.com
In reply to: Tom Lane (#42)
#47Simon Riggs
simon@2ndQuadrant.com
In reply to: Zeugswetter Andreas SB SD (#44)
#48Koichi Suzuki
koichi.szk@gmail.com
In reply to: Zeugswetter Andreas SB SD (#44)
#49Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Koichi Suzuki (#48)
#50Josh Berkus
josh@agliodbs.com
In reply to: Simon Riggs (#41)
#51Koichi Suzuki
koichi.szk@gmail.com
In reply to: Josh Berkus (#50)
#52Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Koichi Suzuki (#51)
#53Josh Berkus
josh@agliodbs.com
In reply to: Zeugswetter Andreas SB SD (#52)
#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#53)
#55Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Josh Berkus (#53)
In reply to: Zeugswetter Andreas SB SD (#55)
#57Josh Berkus
josh@agliodbs.com
In reply to: Zeugswetter Andreas SB SD (#55)
#58Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#57)
#59Koichi Suzuki
koichi.szk@gmail.com
In reply to: Zeugswetter Andreas SB SD (#49)
#60Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Josh Berkus (#57)
#61Koichi Suzuki
koichi.szk@gmail.com
In reply to: Josh Berkus (#53)
#62Greg Smith
gsmith@gregsmith.com
In reply to: Zeugswetter Andreas SB SD (#60)
#63Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Greg Smith (#62)
#64Greg Smith
gsmith@gregsmith.com
In reply to: Jim Nasby (#63)
#65Koichi Suzuki
koichi.szk@gmail.com
In reply to: Tom Lane (#58)
#66Bruce Momjian
bruce@momjian.us
In reply to: Koichi Suzuki (#65)
#67Tom Lane
tgl@sss.pgh.pa.us
In reply to: Koichi Suzuki (#65)
#68Koichi Suzuki
koichi.szk@gmail.com
In reply to: Tom Lane (#67)
#69Simon Riggs
simon@2ndQuadrant.com
In reply to: Koichi Suzuki (#27)