Slow PITR restore

Started by Joshua D. Drake about 18 years ago, 69 messages
#1 Joshua D. Drake
jd@commandprompt.com

Hello,

I have an interesting thing going on here...

2007-12-11 21:10:36 EST LOG: restored log file
"00000001000007CC00000012" from archive
2007-12-11 21:10:47 EST LOG: restored log file
"00000001000007CC00000013" from archive
2007-12-11 21:11:02 EST LOG: restored log file
"00000001000007CC00000014" from archive
2007-12-11 21:11:14 EST LOG: restored log file
"00000001000007CC00000015" from archive
2007-12-11 21:11:26 EST LOG: restored log file
"00000001000007CC00000016" from archive
2007-12-11 21:11:39 EST LOG: restored log file
"00000001000007CC00000017" from archive
2007-12-11 21:11:54 EST LOG: restored log file
"00000001000007CC00000018" from archive
2007-12-11 21:12:13 EST LOG: restored log file
"00000001000007CC00000019" from archive
2007-12-11 21:12:27 EST LOG: restored log file
"00000001000007CC0000001A" from archive
2007-12-11 21:12:42 EST LOG: restored log file
"00000001000007CC0000001B" from archive
2007-12-11 21:12:54 EST LOG: restored log file
"00000001000007CC0000001C" from archive
2007-12-11 21:13:08 EST LOG: restored log file
"00000001000007CC0000001D" from archive
2007-12-11 21:13:21 EST LOG: restored log file
"00000001000007CC0000001E" from archive
2007-12-11 21:13:36 EST LOG: restored log file
"00000001000007CC0000001F" from archive
2007-12-11 21:13:45 EST LOG: restored log file
"00000001000007CC00000020" from archive
2007-12-11 21:13:52 EST LOG: restored log file
"00000001000007CC00000021" from archive
2007-12-11 21:14:01 EST LOG: restored log file
"00000001000007CC00000022" from archive

Now I understand that restoring log files can be slow but this is a big
machine.

32Gig of Ram
28 Disks (two volumes each raid 10)
Using two different table spaces (over each volume)
8 Opterons.

The machine has a production configuration that can do several
hundred thousand transactions an hour without so much as having to
stretch first. It seems to me that this machine should be *cranking*
through these transactions. Am I just being my usual uptight, impatient
self?

Sincerely,

Joshua D. Drake

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joshua D. Drake (#1)
Re: Slow PITR restore

"Joshua D. Drake" <jd@commandprompt.com> writes:

... Now I understand that restoring log files can be slow but this is a big
machine.

Yeah, restoring is known to be less than speedy, because essentially
zero optimization work has been done on it.

Heikki has improved matters a bit in 8.3, but I suspect it's still
not great.

regards, tom lane

#3 Joshua D. Drake
jd@commandprompt.com
In reply to: Tom Lane (#2)
Re: Slow PITR restore

Tom Lane wrote:

"Joshua D. Drake" <jd@commandprompt.com> writes:

... Now I understand that restoring log files can be slow but this is a big
machine.

Yeah, restoring is known to be less than speedy, because essentially
zero optimization work has been done on it.

Heikki has improved matters a bit in 8.3, but I suspect it's still
not great.

Wow, o.k. well it is something we (the community) really should look at
for 8.4. I am surprised that it is slower than just walking through the
xlogs on recovery. I am sure there is a reason just surprised.

Thanks for answering.

Sincerely,

Joshua D. Drake


#4 Gregory Stark
stark@enterprisedb.com
In reply to: Joshua D. Drake (#3)
Re: Slow PITR restore

"Joshua D. Drake" <jd@commandprompt.com> writes:

Wow, o.k. well it is something we (the community) really should look at for
8.4. I am surprised that it is slower than just walking through the xlogs on
recovery. I am sure there is a reason just surprised.

Well in the worst case it has to do nearly as much work as the original
database did. And it only gets to use 1 cpu so it can only have one i/o
request pending.

bgwriter is started already when doing recovery, right? Perhaps things could
be helped by telling bgwriter to behave differently during recovery.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!

#5 Simon Riggs
simon@2ndquadrant.com
In reply to: Gregory Stark (#4)
Re: Slow PITR restore

On Wed, 2007-12-12 at 08:55 +0000, Gregory Stark wrote:

"Joshua D. Drake" <jd@commandprompt.com> writes:

Wow, o.k. well it is something we (the community) really should look at for
8.4. I am surprised that it is slower than just walking through the xlogs on
recovery. I am sure there is a reason just surprised.

It's the same speed because it is the same recovery code.

Well in the worst case it has to do nearly as much work as the original
database did. And it only gets to use 1 cpu so it can only have one i/o
request pending.

That need only be slow in certain circumstances.

bgwriter is started already when doing recovery, right? Perhaps things could
be helped by telling bgwriter to behave differently during recovery.

It would certainly help if bgwriter came up earlier. I've been looking
at that just recently. The main issue is how to structure the code to
tell bgwriter when it can start processing in recovery mode and then
move into normal mode.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#6 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Gregory Stark (#4)
Re: Slow PITR restore

Gregory Stark wrote:

bgwriter is started already when doing recovery, right? Perhaps things could
be helped by telling bgwriter to behave differently during recovery.

No.

--
Alvaro Herrera http://www.flickr.com/photos/alvherre/
"Amanece. (Ignacio Reyes)
El Cerro San Crist�bal me mira, c�nicamente, con ojos de virgen"

#7 Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: Slow PITR restore

On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote:

"Joshua D. Drake" <jd@commandprompt.com> writes:

... Now I understand that restoring log files can be slow but this is a big
machine.

Yeah, restoring is known to be less than speedy, because essentially
zero optimization work has been done on it.

If there was a patch to improve this, would it be applied to 8.3?

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#8 Joshua D. Drake
jd@commandprompt.com
In reply to: Simon Riggs (#7)
Re: Slow PITR restore

Simon Riggs wrote:

On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote:

"Joshua D. Drake" <jd@commandprompt.com> writes:

... Now I understand that restoring log files can be slow but this is a big
machine.

Yeah, restoring is known to be less than speedy, because essentially
zero optimization work has been done on it.

If there was a patch to improve this, would it be applied to 8.3?

Sheesh Simon you are really pushing this release :). I would love to see
a patch to resolve this, especially since it appears to be a fairly
glaring oversight. We can't really expect people to use PITR if they knew
it would take hours to recover even on the size of machine I was working on.

On the other hand... we are about to go to RC1 :)

Joshua D. Drake

#9 Simon Riggs
simon@2ndquadrant.com
In reply to: Joshua D. Drake (#8)
Re: Slow PITR restore

On Wed, 2007-12-12 at 07:36 -0800, Joshua D. Drake wrote:

We can't really expect people to use PITR if they knew
it would take hours to recover even on the size of machine I was working on.

That's not a true statement in all cases, and it can often be improved with
some monitoring and tuning. Just get your client to call me :-)

Are you doing replication, or a PITR for another reason?

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#10 Joshua D. Drake
jd@commandprompt.com
In reply to: Simon Riggs (#9)
Re: Slow PITR restore

Simon Riggs wrote:

On Wed, 2007-12-12 at 07:36 -0800, Joshua D. Drake wrote:

We can't really expect people to use PITR if they knew
it would take hours to recover even on the size of machine I was working on.

That's not a true statement in all cases, and it can often be improved with
some monitoring and tuning. Just get your client to call me :-)

Uhh.. right.

Are you doing replication, or a PITR for another reason?

Warm standby. Normally we pull every 5 minutes which is why we hadn't
noticed this before. However last night we pulled a full sync and
recover and that is when we noticed it.

8 seconds for a single archive recovery is very slow in consideration of
this machine. Even single threaded that seems slow.

Sincerely,

Joshua D. Drake

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#7)
Re: Slow PITR restore

Simon Riggs <simon@2ndquadrant.com> writes:

On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote:

Yeah, restoring is known to be less than speedy, because essentially
zero optimization work has been done on it.

If there was a patch to improve this, would it be applied to 8.3?

Good grief, no. We have not even done the research to find out where
the bottleneck(s) is/are. We're not holding up 8.3 while we go back
into development mode, especially not when this problem has existed
for seven or eight years (even if JD failed to notice before) and
there are already some improvements for it in 8.3.

regards, tom lane

#12 Joshua D. Drake
jd@commandprompt.com
In reply to: Tom Lane (#11)
Re: Slow PITR restore

Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote:

Yeah, restoring is known to be less than speedy, because essentially
zero optimization work has been done on it.

If there was a patch to improve this, would it be applied to 8.3?

Good grief, no. We have not even done the research to find out where
the bottleneck(s) is/are. We're not holding up 8.3 while we go back
into development mode, especially not when this problem has existed
for seven or eight years (even if JD failed to notice before) and
there are already some improvements for it in 8.3.

I would also note that this "problem" is only going to be noticeable on
the highest-velocity databases. This is certainly a 10%-of-the-users
issue. It would be great to get it fixed but there are ways around it
(namely making sure you are running pg_standby and pushing logs at
smaller intervals).

Sincerely,

Joshua D. Drake


#13 Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#11)
Re: Slow PITR restore

On Wed, 2007-12-12 at 12:13 -0500, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote:

Yeah, restoring is known to be less than speedy, because essentially
zero optimization work has been done on it.

If there was a patch to improve this, would it be applied to 8.3?

Good grief, no.

Understood.

Just checking to see if you were fishing for a patch from me, so I'm
relieved actually.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#14 Jeff Trout
threshar@torgo.978.org
In reply to: Joshua D. Drake (#10)
Re: Slow PITR restore

On Dec 12, 2007, at 11:29 AM, Joshua D. Drake wrote:

8 seconds for a single archive recovery is very slow in
consideration of this machine. Even single threaded that seems slow.

I've seen this on my PITR restores (thankfully, they were for
fetching some old data, not because we exploded). On a 2.4ghz opteron
it took 5-50 seconds per wal segment, and there were a LOT of
segments (replay took hours and hours). I asked a few folks and was
told it is the nature of the beast. Hopefully something in 8.4 can
be done.

--
Jeff Trout <jeff@jefftrout.com>
http://www.dellsmartexitin.com/
http://www.stuarthamm.net/

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Trout (#14)
Re: Slow PITR restore

Jeff Trout <threshar@threshar.is-a-geek.com> writes:

I've seen this on my PITR restores (thankfully, they were for
fetching some old data, not because we exploded). On a 2.4ghz opteron
it took 5-50 seconds per wal segment, and there were a LOT of
segments (replay took hours and hours). I asked a few folks and was
told it is the nature of the beast. Hopefully something in 8.4 can
be done.

Before we get all panicked about that, someone should try to measure the
restore speed on 8.3. It's possible that this patch already
alleviated the problem:
http://archives.postgresql.org/pgsql-committers/2007-05/msg00041.php

regards, tom lane

#16 Gregory Stark
stark@enterprisedb.com
In reply to: Jeff Trout (#14)
Re: Slow PITR restore

"Jeff Trout" <threshar@torgo.978.org> writes:

On Dec 12, 2007, at 11:29 AM, Joshua D. Drake wrote:

8 seconds for a single archive recovery is very slow in consideration of
this machine. Even single threaded that seems slow.

I've seen this on my PITR restores (thankfully, they were for fetching some
old data, not because we exploded). On a 2.4ghz opteron it took 5-50 seconds
per wal segment,

I'm not sure what you guys' expectations are, but if you're restoring 5
minutes worth of database traffic in 8 seconds I wouldn't be complaining.

Depending on your transaction mix and what percentage of it is read-only
select queries you might reasonably expect the restore to take as long as it
took to generate them...

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!

#17 Joshua D. Drake
jd@commandprompt.com
In reply to: Gregory Stark (#16)
Re: Slow PITR restore

On Wed, 12 Dec 2007 18:02:39 +0000
Gregory Stark <stark@enterprisedb.com> wrote:

I'm not sure what you guys' expectations are, but if you're restoring
5 minutes worth of database traffic in 8 seconds I wouldn't be
complaining.

I would be. This is a database that is doing nothing but restoring.
Zero concurrency. This thing should be flying.

Depending on your transaction mix and what percentage of it is
read-only select queries you might reasonably expect the restore to
take as long as it took to generate them

We archive selects?

Joshua D. Drake

--
The PostgreSQL Company: Since 1997, http://www.commandprompt.com/
Sales/Support: +1.503.667.4564 24x7/Emergency: +1.800.492.2240
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
SELECT 'Training', 'Consulting' FROM vendor WHERE name = 'CMD'


#18 Jeff Trout
threshar@torgo.978.org
In reply to: Gregory Stark (#16)
Re: Slow PITR restore

On Dec 12, 2007, at 1:02 PM, Gregory Stark wrote:

I'm not sure what you guys' expectations are, but if you're
restoring 5
minutes worth of database traffic in 8 seconds I wouldn't be
complaining.

Depending on your transaction mix and what percentage of it is read-
only
select queries you might reasonably expect the restore to take as
long as it
took to generate them...

in this case it was 24hrs of data - about 1500 wal segments. During
this time the machine was nearly completely idle and there wasn't very
much IO going on (a few megs/sec).

I'll have to artificially build up some db traffic on an 8.3 instance
to see if the patch Tom mentioned helps.

--
Jeff Trout <jeff@jefftrout.com>
http://www.dellsmartexitin.com/
http://www.stuarthamm.net/

#19 Simon Riggs
simon@2ndquadrant.com
In reply to: Joshua D. Drake (#17)
Re: Slow PITR restore

On Wed, 2007-12-12 at 10:08 -0800, Joshua D. Drake wrote:

Depending on your transaction mix and what percentage of it is
read-only select queries you might reasonably expect the restore to
take as long as it took to generate them

We archive selects?

No, but that is exactly his point.

Whatever proportion of the workload is selects will *not* have to be
reproduced on the client.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#20 Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#15)
Re: Slow PITR restore

On Wed, 2007-12-12 at 12:56 -0500, Tom Lane wrote:

Jeff Trout <threshar@threshar.is-a-geek.com> writes:

I've seen this on my PITR restores (thankfully, they were for
fetching some old data, not because we exploded). On a 2.4ghz opteron
it took 5-50 seconds per wal segment, and there were a LOT of
segments (replay took hours and hours). I asked a few folks and was
told it is the nature of the beast. Hopefully something in 8.4 can
be done.

Before we get all panicked about that, someone should try to measure the
restore speed on 8.3. It's possible that this patch already
alleviated the problem:
http://archives.postgresql.org/pgsql-committers/2007-05/msg00041.php

Twas this feature, amongst others, that I had asked to be put back onto
the release notes, so people knew it had improved. I don't really want
to replicate that discussion, but if we're talking about it here then it
probably should be in the notes.

It roughly doubles performance, but there's still more required in some
cases. My recommendation is always to measure the performance first.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#21 Joshua D. Drake
jd@commandprompt.com
In reply to: Jeff Trout (#18)
Re: Slow PITR restore

On Wed, 12 Dec 2007 13:13:35 -0500
Jeff Trout <threshar@threshar.is-a-geek.com> wrote:

On Dec 12, 2007, at 1:02 PM, Gregory Stark wrote:

I'm not sure what you guys' expectations are, but if you're
restoring 5
minutes worth of database traffic in 8 seconds I wouldn't be
complaining.

Depending on your transaction mix and what percentage of it is
read- only
select queries you might reasonably expect the restore to take as
long as it
took to generate them...

in this case it was 24hrs of data - about 1500 wal segments. During
this time the machine was nearly completely idle and there wasn't very
much IO going on (a few megs/sec).

Exactly. Which is the point I am making. Five minutes of transactions
is nothing (speaking generally). In short, if we are in recovery, and
we are not saturating the I/O and at least a single CPU, there is a huge
amount of optimization *somewhere* to be done.

Tom is also correct, we should test this on 8.3.

Joshua D. Drake

--
The PostgreSQL Company: Since 1997, http://www.commandprompt.com/
Sales/Support: +1.503.667.4564 24x7/Emergency: +1.800.492.2240
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
SELECT 'Training', 'Consulting' FROM vendor WHERE name = 'CMD'


#22 Gregory Stark
stark@enterprisedb.com
In reply to: Joshua D. Drake (#17)
Re: Slow PITR restore

"Joshua D. Drake" <jd@commandprompt.com> writes:

On Wed, 12 Dec 2007 18:02:39 +0000
Gregory Stark <stark@enterprisedb.com> wrote:

I'm not sure what you guys' expectations are, but if you're restoring
5 minutes worth of database traffic in 8 seconds I wouldn't be
complaining.

I would be. This is a database that is doing nothing but restoring.
Zero concurrency. This thing should be flying.

Well you say that like concurrency is a bad thing. The lack of concurrency is
the big handicap recovery has. It has to wait while it loads one buffer so it
can twiddle some bits before it reads the next buffer and twiddles bits there.
During normal operation those two buffers were twiddled by two different
transactions in two different processes. Even if they weren't on two different
processes they could have been context switched onto the same processor while
the i/o was in progress.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!

#23 Joshua D. Drake
jd@commandprompt.com
In reply to: Gregory Stark (#22)
Re: Slow PITR restore

Gregory Stark wrote:

"Joshua D. Drake" <jd@commandprompt.com> writes:

On Wed, 12 Dec 2007 18:02:39 +0000
Gregory Stark <stark@enterprisedb.com> wrote:

I'm not sure what you guys' expectations are, but if you're restoring
5 minutes worth of database traffic in 8 seconds I wouldn't be
complaining.

I would be. This is a database that is doing nothing but restoring.
Zero concurrency. This thing should be flying.

Well you say that like concurrency is a bad thing. The lack of concurrency is
the big handicap recovery has. It has to wait while it loads one buffer so it
can twiddle some bits before it reads the next buffer and twiddles bits there.
During normal operation those two buffers were twiddled by two different
transactions in two different processes. Even if they weren't on two different
processes they could have been context switched onto the same processor while
the i/o was in progress.

Please see my point about saturation in another post.

Joshua D. Drake

#24 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joshua D. Drake (#21)
Re: Slow PITR restore

"Joshua D. Drake" <jd@commandprompt.com> writes:

Jeff Trout <threshar@threshar.is-a-geek.com> wrote:

in this case it was 24hrs of data - about 1500 wal segments. During
this time the machine was nearly completely idle and there wasn't very
much IO going on (a few megs/sec).

Exactly. Which is the point I am making. Five minutes of transactions
is nothing (speaking generally). In short, if we are in recovery, and
we are not saturating the I/O and at least a single CPU, there is a huge
amount of optimization *somewhere* to be done.

You sure about that? I tested CVS HEAD just now, by setting the
checkpoint_ parameters really high, running pgbench for awhile, and
then killing the bgwriter to force a recovery cycle over all the WAL
generated by the pgbench run. What I saw was that the machine was 100%
disk write bound. Increasing shared_buffers helped, not in that the
write rate got less according to vmstat, but the completion time did.

                                shared_buffers
                                32MB      100MB

pgbench -c 5 -t 40000 bench     7m23s     2m20s
subsequent recovery             4m26s     2m21s

Typical "vmstat 1" lines during recovery:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 9 70024 29232 19876 824368 0 0 0 3152 1447 233 0 1 0 99 0
0 9 70024 29232 19876 824368 0 0 0 3660 1474 252 0 1 0 99 0
0 8 70024 28960 19876 824404 0 0 0 3176 1448 265 0 2 1 97 0

I don't see the machine sitting around doing nothing ...

regards, tom lane

#25 Joshua D. Drake
jd@commandprompt.com
In reply to: Tom Lane (#24)
Re: Slow PITR restore

Tom Lane wrote:

"Joshua D. Drake" <jd@commandprompt.com> writes:

Jeff Trout <threshar@threshar.is-a-geek.com> wrote:

in this case it was 24hrs of data - about 1500 wal segments. During
this time the machine was nearly completely idle and there wasn't very
much IO going on (a few megs/sec).

Exactly. Which is the point I am making. Five minutes of transactions
is nothing (speaking generally). In short, if we are in recovery, and
we are not saturating the I/O and at least a single CPU, there is a huge
amount of optimization *somewhere* to be done.

You sure about that? I tested CVS HEAD just now, by setting the
checkpoint_ parameters really high,

Well I haven't tested CVS HEAD yet, but on 8.2:

avg-cpu: %user %nice %sys %iowait %idle
0.12 0.00 0.30 12.22 87.35

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 172.26 2927.74 46.31 14668 232
sda1 182.63 2927.74 46.31 14668 232
sdb 1.40 0.00 41.52 0 208
sdb1 10.38 0.00 41.52 0 208
sdb2 0.00 0.00 0.00 0 0
sdc 96.21 946.11 14.37 4740 72
sdc1 99.20 946.11 14.37 4740 72
sdd 0.00 0.00 0.00 0 0
sdd1 0.00 0.00 0.00 0 0

And:

2007-12-13 00:55:20 EST LOG: restored log file "00000001000007E10000006B" from archive
2007-12-13 00:55:34 EST LOG: restored log file "00000001000007E10000006C" from archive
2007-12-13 00:55:48 EST LOG: restored log file "00000001000007E10000006D" from archive
2007-12-13 00:56:01 EST LOG: restored log file "00000001000007E10000006E" from archive
2007-12-13 00:56:14 EST LOG: restored log file "00000001000007E10000006F" from archive
2007-12-13 00:56:28 EST LOG: restored log file "00000001000007E100000070" from archive
2007-12-13 00:56:42 EST LOG: restored log file "00000001000007E100000071" from archive

This is 14 spindles in RAID 10 on sda and sdc. For comparison, the exact
same hardware (and OS) under an off-peak production load of 2.25 million
xacts per hour (at this data point), 3 million per hour at peak:

avg-cpu: %user %nice %sys %iowait %idle
23.30 0.00 15.28 33.07 28.35

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 2078.80 23693.60 10628.00 118468 53140
sda1 4139.80 23693.60 10628.00 118468 53140
sdb 24.00 94.40 318.40 472 1592
sdb1 87.40 94.40 318.40 472 1592
sdc 694.00 22207.20 4113.60 111036 20568
sdc1 1575.00 22207.20 4113.60 111036 20568
sdd 826.80 3.20 8691.20 16 43456
sdd1 828.60 3.20 8691.20 16 43456

I would expect that recovery would be faster than prod. Maybe I am off
my rocker but the warm standby isn't doing anything but restoring the logs.

This is going to be hard to test with 8.3 with this data set but I will
see if I can get some other numbers with 8.3.

Sincerely,

Joshua D. Drake

#26 Gregory Stark
stark@enterprisedb.com
In reply to: Tom Lane (#24)
Re: Slow PITR restore

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

"Joshua D. Drake" <jd@commandprompt.com> writes:

Exactly. Which is the point I am making. Five minutes of transactions
is nothing (speaking generally). In short, if we are in recovery, and
we are not saturating the I/O and at least a single CPU, there is a huge
amount of optimization *somewhere* to be done.

You sure about that? I tested CVS HEAD just now, by setting the
checkpoint_ parameters really high, running pgbench for awhile, and
then killing the bgwriter to force a recovery cycle over all the WAL
generated by the pgbench run. What I saw was that the machine was 100%
disk write bound. Increasing shared_buffers helped, not in that the
write rate got less according to vmstat, but the completion time did.

There are at least three definitions of "saturating the I/O" and it sounds
like you two are using two different ones.

1) The processor is waiting on I/O all the time
2) The hard drives are all always handling a request
3) The hard drives are all handling the full bandwidth they're capable of

You would expect (1) and (2) to be the same for a single drive -- though in
practice there seems to be a gap even between them. But for a raid array there
can be a large difference, and the wider the raid stripe the larger the
difference.

In Tom's results:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 9 70024 29232 19876 824368 0 0 0 3152 1447 233 0 1 0 99 0
0 9 70024 29232 19876 824368 0 0 0 3660 1474 252 0 1 0 99 0
0 8 70024 28960 19876 824404 0 0 0 3176 1448 265 0 2 1 97 0

I don't see the machine sitting around doing nothing ...

Note that even though the processor is 99% in wait state the drive is only
handling about 3 MB/s. That translates into a seek time of 2.2ms which is
actually pretty fast. So if this is a single drive (1) and (2) seem to be
pretty much the same here.

But note that if this were a raid array Postgres wouldn't be getting any
better results. A Raid array wouldn't improve i/o latency at all and since
it's already 99% waiting for i/o Postgres is not going to be able to issue any
more. But only one drive in the raid array will be busy at a time which would
be far less than the maximum random access i/o the raid array is capable of.

Heikki proposed a while back to use posix_fadvise() when processing logs to
read-ahead blocks which the recovery will need before actually attempting to
recover them. On a raid array that would bring the 3MB/s above up to the
maximum number of random accesses the raid array can handle (ie, definition
(2) above).
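
To make that concrete, here is a rough standalone sketch (invented helper
names, not actual PostgreSQL code): hint the kernel about all the blocks a
batch of upcoming WAL records will need before reading any of them, so a
RAID array can have several random reads in flight at once.

#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

#define BLCKSZ 8192

/* Tell the kernel we will soon need these blocks of a data file.  On a
 * RAID array this lets several random reads proceed in parallel instead
 * of one at a time. */
void
prefetch_blocks(int datafd, const uint32_t *blocks, int nblocks)
{
    for (int i = 0; i < nblocks; i++)
        (void) posix_fadvise(datafd,
                             (off_t) blocks[i] * BLCKSZ,
                             BLCKSZ,
                             POSIX_FADV_WILLNEED);
}

/* The normal one-record-at-a-time replay then reads pages that are already
 * in, or on their way into, the OS cache. */
ssize_t
read_block(int datafd, uint32_t blockno, char *buf)
{
    return pread(datafd, buf, BLCKSZ, (off_t) blockno * BLCKSZ);
}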

That's still going to be a far cry from the maximum bandwidth the hard drive
can handle. Even that single drive can probably handle 60MB/s sequential I/O.
That's probably the source of the unrealistic expectations people have. It's
easy to watch the bandwidth number as the headline number to measure i/o
utilization.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!

#27 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joshua D. Drake (#25)
Re: Slow PITR restore

"Joshua D. Drake" <jd@commandprompt.com> writes:

Tom Lane wrote:

You sure about that? I tested CVS HEAD just now, by setting the
checkpoint_ parameters really high,

... And:

2007-12-13 00:55:20 EST LOG: restored log file "00000001000007E10000006B" from archive

Hmm --- I was testing a straight crash-recovery scenario, not restoring
from archive. Are you sure your restore_command script isn't
responsible for a lot of the delay?

regards, tom lane

#28 Joshua D. Drake
jd@commandprompt.com
In reply to: Tom Lane (#27)
Re: Slow PITR restore

Tom Lane wrote:

"Joshua D. Drake" <jd@commandprompt.com> writes:

Tom Lane wrote:

You sure about that? I tested CVS HEAD just now, by setting the
checkpoint_ parameters really high,

... And:

2007-12-13 00:55:20 EST LOG: restored log file "00000001000007E10000006B" from archive

Hmm --- I was testing a straight crash-recovery scenario, not restoring
from archive. Are you sure your restore_command script isn't
responsible for a lot of the delay?

Now that's an interesting thought, I will review in the morning when I
have some more IQ points back.

Joshua D. Drake


#29 Simon Riggs
simon@2ndquadrant.com
In reply to: Gregory Stark (#26)
Re: Slow PITR restore

On Thu, 2007-12-13 at 06:27 +0000, Gregory Stark wrote:

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

"Joshua D. Drake" <jd@commandprompt.com> writes:

Exactly. Which is the point I am making. Five minutes of transactions
is nothing (speaking generally). In short, if we are in recovery, and
we are not saturating the I/O and at least a single CPU, there is a huge
amount of optimization *somewhere* to be done.

You sure about that? I tested CVS HEAD just now, by setting the
checkpoint_ parameters really high, running pgbench for awhile, and
then killing the bgwriter to force a recovery cycle over all the WAL
generated by the pgbench run. What I saw was that the machine was 100%
disk write bound. Increasing shared_buffers helped, not in that the
write rate got less according to vmstat, but the completion time did.

There are at least three definitions of "saturating the I/O" and it sounds
like you two are using two different ones.

1) The processor is waiting on I/O all the time
2) The hard drives are all always handling a request
3) The hard drives are all handling the full bandwidth they're capable of

You would expect (1) and (2) to be the same for a single drive -- though in
practice there seems to be a gap even between them. But for a raid array there
can be a large difference, and the wider the raid stripe the larger the
difference.

In Tom's results:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 9 70024 29232 19876 824368 0 0 0 3152 1447 233 0 1 0 99 0
0 9 70024 29232 19876 824368 0 0 0 3660 1474 252 0 1 0 99 0
0 8 70024 28960 19876 824404 0 0 0 3176 1448 265 0 2 1 97 0

I don't see the machine sitting around doing nothing ...

Note that even though the processor is 99% in wait state the drive is only
handling about 3 MB/s. That translates into a seek time of 2.2ms which is
actually pretty fast. So if this is a single drive (1) and (2) seem to be
pretty much the same here.

But note that if this were a raid array Postgres wouldn't be getting any
better results. A Raid array wouldn't improve i/o latency at all and since
it's already 99% waiting for i/o Postgres is not going to be able to issue any
more. But only one drive in the raid array will be busy at a time which would
be far less than the maximum random access i/o the raid array is capable of.

Agree with Greg's analysis here. Moving to -hackers now.

I've done performance profiling also. My results replicated Tom's, but I
hadn't performed them on a big enough system and so didn't realise the
I/O scalability issue could be such a large problem. Koichi showed me
some results on a much larger server that illustrated the I/O problem.

But let's remember it's only a problem on large servers with a heavy write
workload and a large random I/O requirement. That's an important set of
people, but much less than Josh's 10% of people even.

Heikki proposed a while back to use posix_fadvise() when processing logs to
read-ahead blocks which the recovery will need before actually attempting to
recover them. On a raid array that would bring the 3MB/s above up to the
maximum number of random accesses the raid array can handle (ie, definition
(2) above).

It's a good idea, but it will require more complex code. I prefer the
simpler solution of using more processes to solve the I/O problem.

Florian's code for Hot Standby introduces a separate recovery process,
similar to an autovacuum launcher. I propose a mechanism similar to the
AV solution where we have lots of recovery workers, with one recovery
master reading the WAL files. We can then distribute WAL records to
workers in some manner.

It's true that many WAL records depend upon each other, but it's also
true that the performance problems only occur in situations where the
WAL records don't depend upon each other. If they did, they would
touch the same blocks and it would be cached. So as long as we have a
safe mechanism for splitting up the work, everything is fine.

We can divide up the WAL records by looking at the rmgr field, plus
looking deeper into the records themselves so we can tell which
relations/blocks they touch.
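
As a purely hypothetical sketch of that kind of split (not code from any
patch): route each record to a worker based on the relation and block it
touches, so all changes to one block go to the same worker and keep their
LSN order, while changes to different blocks can proceed in parallel.

#include <stdint.h>

/* Pick a recovery worker for a WAL record from the relation and block it
 * touches.  Records for the same block always map to the same worker, so
 * they are still applied in LSN order; records for different blocks can
 * be applied concurrently by different workers. */
int
worker_for_block(uint32_t rel_oid, uint32_t blockno, int nworkers)
{
    uint32_t h = rel_oid * 2654435761u;   /* multiplicative hashing */
    h ^= blockno * 2246822519u;
    return (int) (h % (uint32_t) nworkers);
}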

So I'm planning to review this *after* Florian has introduced his patch,
so we can build upon it.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#30 Gregory Stark
stark@enterprisedb.com
In reply to: Simon Riggs (#29)
Re: Slow PITR restore

"Simon Riggs" <simon@2ndquadrant.com> writes:

On Thu, 2007-12-13 at 06:27 +0000, Gregory Stark wrote:

Heikki proposed a while back to use posix_fadvise() when processing logs to
read-ahead blocks which the recovery will need before actually attempting to
recover them. On a raid array that would bring the 3MB/s above up to the
maximum number of random accesses the raid array can handle (ie, definition
(2) above).

It's a good idea, but it will require more complex code. I prefer the
simpler solution of using more processes to solve the I/O problem.

Huh, I forgot about that idea. Ironically that was what I suggested when
Heikki described the problem.

I think it's more complex than using posix_fadvise. But it's also more
ambitious. It would allow us to use not only the full random access i/o
bandwidth but also allow us to use more cpu. In cases where the database fits
entirely in ram and we're recovering many many operations modifying the same
blocks over and over that might help a lot.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!

#31 Simon Riggs
simon@2ndquadrant.com
In reply to: Gregory Stark (#30)
Re: [GENERAL] Slow PITR restore

On Thu, 2007-12-13 at 09:45 +0000, Gregory Stark wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

On Thu, 2007-12-13 at 06:27 +0000, Gregory Stark wrote:

Heikki proposed a while back to use posix_fadvise() when processing logs to
read-ahead blocks which the recovery will need before actually attempting to
recover them. On a raid array that would bring the 3MB/s above up to the
maximum number of random accesses the raid array can handle (ie, definition
(2) above).

It's a good idea, but it will require more complex code. I prefer the
simpler solution of using more processes to solve the I/O problem.

Huh, I forgot about that idea. Ironically that was what I suggested when
Heikki described the problem.

I think it's more complex than using posix_fadvise.

Some handwaving...

ISTM it's just autovacuum launcher + Hot Standby mixed.

I guess I've added two new backends now (Tom groans...) so that part
seems very straightforward. The harder part is distributing the work and
the hardest part is doing that evenly enough to make a difference. It
will also require rmgr changes for state handling, but then I think that
needs work anyway. Maybe we don't even need a master.

We would have readbuffers in shared memory, like wal_buffers in reverse.
Each worker would read the next WAL record and check there is no
conflict with other concurrent WAL records. If not, it will apply the
record immediately, otherwise wait for the conflicting worker to
complete.

But it's also more ambitious.

Time is the issue, I think, so you may be right. That's always why I
grunt so much about freeze dates.

Anyway, I'll leave this now, since I think we need to do Florian's work
first either way and that is much more eagerly awaited I think.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#32 Gregory Stark
stark@enterprisedb.com
In reply to: Simon Riggs (#31)
Re: [GENERAL] Slow PITR restore

"Simon Riggs" <simon@2ndquadrant.com> writes:

We would have readbuffers in shared memory, like wal_buffers in reverse.
Each worker would read the next WAL record and check there is no
conflict with other concurrent WAL records. If not, it will apply the
record immediately, otherwise wait for the conflicting worker to
complete.

Well I guess you would have to bring up the locking infrastructure and lock
any blocks in the record you're applying (sorted first to avoid deadlocks). As
I understand it we don't use locks during recovery now but I'm not sure if
that's just because we don't have to or if there are practical problems which
would have to be solved to do so.
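
A tiny standalone sketch of that sorted-lock idea (hypothetical names): sort
the block numbers a record touches and always take the per-block locks in
ascending order, so two workers can never end up waiting on each other in a
cycle.

#include <stdlib.h>
#include <stdint.h>

static int
cmp_blockno(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *) a;
    uint32_t y = *(const uint32_t *) b;

    return (x > y) - (x < y);
}

/* Acquire per-block locks in ascending block order.  If every worker
 * follows the same global ordering, no deadlock cycle can form. */
void
lock_blocks_in_order(uint32_t *blocks, int nblocks,
                     void (*lock_block)(uint32_t blockno))
{
    qsort(blocks, nblocks, sizeof(uint32_t), cmp_blockno);
    for (int i = 0; i < nblocks; i++)
        lock_block(blocks[i]);
}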

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!

#33 Heikki Linnakangas
heikki@enterprisedb.com
In reply to: Gregory Stark (#32)
Re: [GENERAL] Slow PITR restore

Gregory Stark wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

We would have readbuffers in shared memory, like wal_buffers in reverse.
Each worker would read the next WAL record and check there is no
conflict with other concurrent WAL records. If not, it will apply the
record immediately, otherwise wait for the conflicting worker to
complete.

Well I guess you would have to bring up the locking infrastructure and lock
any blocks in the record you're applying (sorted first to avoid deadlocks). As
I understand it we don't use locks during recovery now but I'm not sure if
that's just because we don't have to or if there are practical problems which
would have to be solved to do so.

We do use locks during recovery: XLogReadBuffer takes an exclusive lock
on the buffer. According to the comments there, it wouldn't be strictly
necessary. But I believe we do actually need it to protect from the
bgwriter writing out a buffer while it's being modified. We only lock one
page at a time, which is good enough for WAL replay, but not to protect
things like a b-tree split from concurrent access.

I hacked together a quick & dirty prototype of using posix_fadvise in
recovery a while ago. First of all, there are the changes to the buffer
manager, which we'd need anyway if we wanted to use posix_fadvise for
speeding up other stuff like index scans. Then there's changes to
xlog.c, to buffer a number of WAL records, so that you can read ahead
the data pages needed by WAL records ahead of the WAL record you're
actually replaying.

I added a new function, readahead, to the rmgr API. It's similar to the
redo function, but instead of actually replaying the WAL record it just
issues the fadvise calls to the buffer manager for the pages that are
needed to replay the WAL record. This needs to be implemented for each
resource manager that we want to do readahead for. If we had the list of
blocks in the WAL record in an rmgr-independent format, we could do that
in a more generic way, like we do the backup block restoration.
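
In outline the split looks something like this (a standalone sketch with
invented names; the real rmgr table and record layout are different): each
resource manager gets a second callback that only works out which blocks a
record touches and issues prefetch hints, while the existing redo callback
still does the actual replay.

#include <stdint.h>

/* Simplified stand-ins for the real WAL record and rmgr structures. */
typedef struct WalRecord
{
    int       rmid;    /* resource manager id */
    uint32_t  rel;     /* relation the record touches */
    uint32_t  block;   /* block within that relation */
    /* ... record payload ... */
} WalRecord;

typedef struct RmgrEntry
{
    void (*redo)(const WalRecord *rec);        /* replay the change */
    void (*readahead)(const WalRecord *rec);   /* only hint the needed I/O */
} RmgrEntry;

extern RmgrEntry rmgr_table[];

/* Issue prefetch hints for a window of records ahead of the replay
 * position, then replay the current record exactly as before. */
void
replay_with_readahead(const WalRecord *ahead, int nahead,
                      const WalRecord *current)
{
    for (int i = 0; i < nahead; i++)
    {
        const RmgrEntry *rm = &rmgr_table[ahead[i].rmid];

        if (rm->readahead)
            rm->readahead(&ahead[i]);
    }
    rmgr_table[current->rmid].redo(current);
}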

The multiple-process approach seems a lot more complex to me. You need a
lot of bookkeeping to keep the processes from stepping on each other's
toes, and to choose the next WAL record to replay. I think you have the
same problem in that you need an rmgr-specific function to extract
the data block #s required to replay that WAL record, or add that list
to the WAL record header in a generic format. The multi-process approach is
nice because it allows you to parallelize the CPU work of replaying the
records as well, but I wonder how much that really scales given all the
locking required. Also, I don't think replaying WAL records is very
expensive CPU-wise. You'd need a pretty impressive RAID array to read
WAL from, to saturate a single CPU.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#34 Guillaume Smet
guillaume.smet@gmail.com
In reply to: Simon Riggs (#31)
Re: [GENERAL] Slow PITR restore

Simon,

On Dec 13, 2007 11:21 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Anyway, I'll leave this now, since I think we need to do Florian's work
first either way and that is much more eagerly awaited I think.

Speaking of that, is there any news about it and about Florian? It was
really promising work.

Thanks.

--
Guillaume

#35 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Gregory Stark (#30)
Re: Slow PITR restore

Gregory Stark wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

It's a good idea, but it will require more complex code. I prefer the
simpler solution of using more processes to solve the I/O problem.

Huh, I forgot about that idea. Ironically that was what I suggested when
Heikki described the problem.

I think it's more complex than using posix_fadvise. But it's also more
ambitious. It would allow us to use not only the full random access i/o
bandwidth but also allow us to use more cpu. In cases where the database fits
entirely in ram and we're recovering many many operations modifying the same
blocks over and over that might help a lot.

Actually, if you are modifying the same blocks over and over it will
help *less*, because applying each record needs to occur only after the
previous records that modify the same block have been applied.

So you have two possibilities: you skip that record and try to apply the
next one, hoping that that record applies to a block that's not locked,
(which means you have to remember the skipped record and apply it
sometime in the future), or you put the process to sleep until the lock
has been released.

--
Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J
"La conclusi�n que podemos sacar de esos estudios es que
no podemos sacar ninguna conclusi�n de ellos" (Tanenbaum)

#36 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Simon Riggs (#31)
Re: [GENERAL] Slow PITR restore

Simon Riggs wrote:

ISTM it's just autovacuum launcher + Hot Standby mixed.

I don't think you need a launcher at all. Just get the postmaster to
start a configurable number of wal-replay processes (currently the
number is hardcoded to 1).

--
Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least
an order of magnitude." (Brian Kernighan)

#37 Simon Riggs
simon@2ndquadrant.com
In reply to: Heikki Linnakangas (#33)
Re: [GENERAL] Slow PITR restore

On Thu, 2007-12-13 at 12:28 +0000, Heikki Linnakangas wrote:

Gregory Stark wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

We would have readbuffers in shared memory, like wal_buffers in reverse.
Each worker would read the next WAL record and check there is no
conflict with other concurrent WAL records. If not, it will apply the
record immediately, otherwise wait for the conflicting worker to
complete.

Well I guess you would have to bring up the locking infrastructure and lock
any blocks in the record you're applying (sorted first to avoid deadlocks). As
I understand it we don't use locks during recovery now but I'm not sure if
that's just because we don't have to or if there are practical problems which
would have to be solved to do so.

We do use locks during recovery: XLogReadBuffer takes an exclusive lock
on the buffer. According to the comments there, it wouldn't be strictly
necessary. But I believe we do actually need it to protect from the
bgwriter writing out a buffer while it's being modified. We only lock one
page at a time, which is good enough for WAL replay, but not to protect
things like a b-tree split from concurrent access.

I hacked together a quick & dirty prototype of using posix_fadvise in
recovery a while ago. First of all, there are the changes to the buffer
manager, which we'd need anyway if we wanted to use posix_fadvise for
speeding up other stuff like index scans. Then there's changes to
xlog.c, to buffer a number of WAL records, so that you can read ahead
the data pages needed by WAL records ahead of the WAL record you're
actually replaying.

I added a new function, readahead, to the rmgr API. It's similar to the
redo function, but instead of actually replaying the WAL record it just
issues the fadvise calls to the buffer manager for the pages that are
needed to replay the WAL record. This needs to be implemented for each
resource manager that we want to do readahead for. If we had the list of
blocks in the WAL record in an rmgr-independent format, we could do that
in a more generic way, like we do the backup block restoration.

The multiple-process approach seems a lot more complex to me. You need a
lot of bookkeeping to keep the processes from stepping on each other's
toes, and to choose the next WAL record to replay. I think you have the
same problem in that you need an rmgr-specific function to extract
the data block #s required to replay that WAL record, or add that list
to the WAL record header in a generic format. The multi-process approach is
nice because it allows you to parallelize the CPU work of replaying the
records as well, but I wonder how much that really scales given all the
locking required. Also, I don't think replaying WAL records is very
expensive CPU-wise. You'd need a pretty impressive RAID array to read
WAL from, to saturate a single CPU.

With all this talk, I thought of a much better way. We don't actually
need to apply the changes in the order they are received; we just need
to apply sufficient ordering to ensure that each block's changes are
applied in LSN order.

Allocate a recovery cache of size maintenance_work_mem that goes away
when recovery ends.

For every block mentioned in a WAL record that isn't an overwrite, first
check shared_buffers. If it's in shared_buffers apply immediately and
move on. If not in shared_buffers then put it in the recovery cache.

When cache fills, empty it. Qsort WAL records by rmgrid, rel,
blockid, lsn. Then we scan through the records applying them in
sequence. That way we will accumulate changes on each block so we only
need to request it once rather than thrashing the cache. We may get
lucky and pick up some OS readahead also. We would also use buffer
recycling when emptying the recovery cache, to ensure that we don't
trash the main cache and also gain from L2 cache efficiency.

When recovery ends, empty the cache.
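
For the sorting step, a standalone sketch of the comparator (invented
struct, not from any patch): ordering the cached records by rmgr, relation,
block and finally LSN groups every change to one block together while still
applying them in the order they were logged.

#include <stdlib.h>
#include <stdint.h>

typedef struct CachedRecord
{
    int       rmid;     /* resource manager */
    uint32_t  rel;      /* relation */
    uint32_t  block;    /* block number */
    uint64_t  lsn;      /* WAL position; keeps per-block apply order */
    /* ... pointer to the record data ... */
} CachedRecord;

static int
cmp_cached(const void *a, const void *b)
{
    const CachedRecord *x = a;
    const CachedRecord *y = b;

    if (x->rmid != y->rmid)   return (x->rmid > y->rmid) - (x->rmid < y->rmid);
    if (x->rel != y->rel)     return (x->rel > y->rel) - (x->rel < y->rel);
    if (x->block != y->block) return (x->block > y->block) - (x->block < y->block);
    return (x->lsn > y->lsn) - (x->lsn < y->lsn);
}

/* Empty the recovery cache: sort, then walk the records so each block is
 * read once and all of its pending changes are applied together. */
void
flush_recovery_cache(CachedRecord *cache, size_t n)
{
    qsort(cache, n, sizeof(CachedRecord), cmp_cached);
    /* for each record in order: read the block once, apply its changes */
}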

I think that is better than both methods mentioned, and definitely
simpler than my brute-force method. It also lends itself to using both
previously mentioned methods as additional techniques if we really
needed to. I suspect reordering the I/Os in this way is going to make a
huge difference to cache hit rates.

Looks like each rmgr_redo call would need to be split into two calls:
rmgr_redo_apply() returns bool and rmgr_redo_cache(). The first will
apply if possible, otherwise place in cache. The second gets called
repeatedly during cache emptying.

That sounds like it might not be I/O efficient, in that it would suffer
from producer/consumer flip/flopping. But with a large maintenance_work_mem
you'll get all the I/O from possibly hundreds of WAL files all
accumulated together before it is issued - assuming only a small % of
WAL records go into the cache, and many of those will then have their I/O
reduced to zero because of the sequential cache access.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#38 Simon Riggs
simon@2ndquadrant.com
In reply to: Alvaro Herrera (#35)
Re: [HACKERS] Slow PITR restore

On Thu, 2007-12-13 at 10:18 -0300, Alvaro Herrera wrote:

Gregory Stark wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

It's a good idea, but it will require more complex code. I prefer the
simpler solution of using more processes to solve the I/O problem.

Huh, I forgot about that idea. Ironically that was what I suggested when
Heikki described the problem.

I think it's more complex than using posix_fadvise. But it's also more
ambitious. It would allow us to use not only the full random access i/o
bandwidth but also allow us to use more cpu. In cases where the database fits
entirely in ram and we're recovering many many operations modifying the same
blocks over and over that might help a lot.

Actually, if you are modifying the same blocks over and over it will
help *less*, because applying each record needs to occur only after the
previous records that modify the same block have been applied.

So you have two possibilities: you skip that record and try to apply the
next one, hoping that that record applies to a block that's not locked,
(which means you have to remember the skipped record and apply it
sometime in the future), or you put the process to sleep until the lock
has been released.

Ah, OK, I can see we're on the same lines of thought there. Just posted
a reply to Heikki about this sort of idea.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#39 Joshua D. Drake
jd@commandprompt.com
In reply to: Joshua D. Drake (#28)
Re: Slow PITR restore

On Wed, 12 Dec 2007 23:08:35 -0800
"Joshua D. Drake" <jd@commandprompt.com> wrote:

Tom Lane wrote:

"Joshua D. Drake" <jd@commandprompt.com> writes:

Tom Lane wrote:

You sure about that? I tested CVS HEAD just now, by setting the
checkpoint_ parameters really high,

... And:

2007-12-13 00:55:20 EST LOG: restored log file
"00000001000007E10000006B" from archive

Hmm --- I was testing a straight crash-recovery scenario, not
restoring from archive. Are you sure your restore_command script
isn't responsible for a lot of the delay?

Now that's an interesting thought, I will review in the morning when
I have some more IQ points back.

As promised :)... I took a look at this today and I think I found a
couple of things. It appears that once the logs are archived, the
recovery command copies the archive file to a recovery location and
then restores the file.

If that is correct, that could explain some of the latency I am seeing
here. Even with the speed of these devices, it is still a 16 MB file.
That could take 1-2 seconds to copy.

There is also the execution of pg_standby each time as the recovery
command, which, although I haven't timed it, is going to add overhead.

Based on the logs I pasted we are showing a delay of 6, 14, 3, 13, 4
and then another 6 seconds.

When are fsyncs called on the recovery process?

At these kinds of delays, even speeding the process up by 2 seconds per
log is going to be significant.

Sincerely,

Joshua D. Drake

--
The PostgreSQL Company: Since 1997, http://www.commandprompt.com/
Sales/Support: +1.503.667.4564 24x7/Emergency: +1.800.492.2240
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
SELECT 'Training', 'Consulting' FROM vendor WHERE name = 'CMD'


#40 Heikki Linnakangas
heikki@enterprisedb.com
In reply to: Simon Riggs (#37)
Re: [GENERAL] Slow PITR restore

Simon Riggs wrote:

Allocate a recovery cache of size maintenance_work_mem that goes away
when recovery ends.

For every block mentioned in a WAL record that isn't an overwrite, first
check shared_buffers. If it's in shared_buffers apply immediately and
move on. If not in shared_buffers then put it in the recovery cache.

When cache fills, empty it. Qsort WAL records by rmgrid, rel,
blockid, lsn. Then we scan through the records applying them in
sequence. That way we will accumulate changes on each block so we only
need to request it once rather than thrashing the cache. We may get
lucky and pick up some OS readahead also. We would also use buffer
recycling when emptying the recovery cache, to ensure that we don't
trash the main cache and also gain from L2 cache efficiency.

When recovery ends, empty the cache.

Hmm. That assumes that nothing else than the WAL replay will read
pages into shared buffers. I guess that's true at the moment, but it
doesn't seem impossible that something like Florian's read-only queries
on a standby server would change that.

I think that is better than both methods mentioned, and definitely
simpler than my brute-force method. It also lends itself to using both
previously mentioned methods as additional techniques if we really
needed to. I suspect reordering the I/Os in this way is going to make a
huge difference to cache hit rates.

But it won't actually do anything to scale the I/O. You're still going
to be issuing only one read request at a time. The order of those
requests will be better from cache hit point of view, which is good, but
the problem remains that if the modified data blocks are scattered
around the database, you'll be doing random I/O, one request at a time.

It would be interesting to do something like that to speed up replay of
long PITR archives, though. You could scan all (or at least far ahead)
the WAL records, and make note of where there is full page writes for
each page. Whenever there's a full page write further ahead in the log,
you could ignore all changes to that page before that, because they're
going to be overwritten anyway. It won't help with normal recovery,
because there won't be more than one full page image of each page after
the last checkpoint, but with PITR it would help.
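
A rough sketch of that scan-ahead pass (standalone, hypothetical names):
remember, for each page, the LSN of the last full-page image seen in the
archive, and during replay drop any earlier record for that page, since the
full-page write will overwrite whatever it would have done.

#include <stdint.h>
#include <stdbool.h>

/* Filled in by a first pass over the WAL to be replayed: for a given page,
 * the LSN of the last full-page image found for it, or 0 if there is none. */
extern uint64_t last_fpw_lsn_for_page(uint32_t rel, uint32_t block);

/* A record older than the page's last full-page image can be skipped: the
 * full-page write will clobber the page anyway. */
bool
record_can_be_skipped(uint32_t rel, uint32_t block, uint64_t record_lsn)
{
    uint64_t fpw_lsn = last_fpw_lsn_for_page(rel, block);

    return fpw_lsn != 0 && record_lsn < fpw_lsn;
}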

Looks like each rmgr_redo call would need to be split into two calls:
rmgr_redo_apply() returns bool and rmgr_redo_cache(). The first will
apply if possible, otherwise place in cache. The second gets called
repeatedly during cache emptying.

Yeah, much like the split I had to do for the posix_fadvise.

It seems that in all the proposed schemes we need to know which blocks a
given WAL record will need to access. For multiple recovery processes,
you need that to figure out which WAL records you can safely replay. In
the posix_fadvise scheme, you need that to issue the posix_fadvises
without modifying anything.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#41 Simon Riggs
simon@2ndquadrant.com
In reply to: Heikki Linnakangas (#40)
Re: [GENERAL] Slow PITR restore

On Thu, 2007-12-13 at 20:25 +0000, Heikki Linnakangas wrote:

Simon Riggs wrote:

Allocate a recovery cache of size maintenance_work_mem that goes away
when recovery ends.

For every block mentioned in a WAL record that isn't an overwrite, first
check shared_buffers. If it's in shared_buffers apply immediately and
move on. If not in shared_buffers then put it in the recovery cache.

When cache fills, empty it. Qsort WAL records by rmgrid, rel,
blockid, lsn. Then we scan through the records applying them in
sequence. That way we will accumulate changes on each block so we only
need to request it once rather than thrashing the cache. We may get
lucky and pick up some OS readahead also. We would also use buffer
recycling when emptying the recovery cache, to ensure that we don't
trash the main cache and also gain from L2 cache efficiency.

When recovery ends, empty the cache.

Hmm. That assumes that nothing else than the WAL replay will read
pages into shared buffers. I guess that's true at the moment, but it
doesn't seem impossible that something like Florian's read-only queries
on a stand by server would change that.

Agreed, though I was imagining using the cache as a secondary hash
table. Not sure about that now that I write it. I think the accumulation idea
mostly makes sense for heaps. Indexes look much harder.

Whatever happens I think we should get Florian's work in there first,
then tune.

I think that is better than both methods mentioned, and definitely
simpler than my brute-force method. It also lends itself to using both
previously mentioned methods as additional techniques if we really
needed to. I suspect reordering the I/Os in this way is going to make a
huge difference to cache hit rates.

But it won't actually do anything to scale the I/O. You're still going
to be issuing only one read request at a time. The order of those
requests will be better from cache hit point of view, which is good, but
the problem remains that if the modified data blocks are scattered
around the database, you'll be doing random I/O, one request at a time.

Yeh, agreed. That's why I went for the parallelism approach originally:
you can't escape the basic physics.

I've re-read your post. If you think the buffer manager changes are
roughly the same as would be needed for other gains on index scans, then
your async I/O seems like the most profitable approach. I still don't
like it as much, but that aspect tips the balance, I think.

It would be interesting to do something like that to speed up replay of
long PITR archives, though. You could scan all (or at least far ahead)
the WAL records, and make note of where there are full page writes for
each page. Whenever there's a full page write further ahead in the log,
you could ignore all changes to that page before that, because they're
going to be overwritten anyway. It won't help with normal recovery,
because there won't be more than one full page image of each page after
the last checkpoint, but with PITR it would help.

Another good idea.

Of course if we scan that far ahead we can start removing aborted
transactions also, which is the more standard optimization of recovery.

I was imagining we would just memory map the files rather than buffer
them explicitly, BTW.

Looks like each rmgr_redo call would need to be split into two calls:
rmgr_redo_apply() returns bool and rmgr_redo_cache(). The first will
apply if possible, otherwise place in cache. The second gets called
repeatedly during cache emptying.

Yeah, much like the split I had to do for the posix_fadvise.

It seems that in all the proposed schemes we need to know which blocks a
given WAL record will need to access. For multiple recovery processes,
you need that to figure out which WAL records you can safely replay. In
the posix_fadvise scheme, you need that to issue the posix_fadvises
without modifying anything.

Yeh, I think it should be easy enough to group together the block-based
rmgrs so they all have the same basic structure. Heap and the indexes,
that is, but some parts are harder than others. My feeling is that heaps
will easily accumulate, though secondary indexes will often be random.

Incidentally, HOT will speed up recovery also, since there will be fewer
index operations to replay.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#42Simon Riggs
simon@2ndquadrant.com
In reply to: Simon Riggs (#41)
Re: [GENERAL] Slow PITR restore

On Thu, 2007-12-13 at 21:13 +0000, Simon Riggs wrote:

Of course if we scan that far ahead we can start removing aborted
transactions also, which is the more standard optimization of
recovery.

Recall that thought!

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#40)
Re: [GENERAL] Slow PITR restore

Heikki Linnakangas <heikki@enterprisedb.com> writes:

Hmm. That assumes that nothing other than the WAL replay will read
pages into shared buffers. I guess that's true at the moment, but it
doesn't seem impossible that something like Florian's read-only queries
on a standby server would change that.

A general comment on this thread: the idea of putting any sort of
asynchronous behavior into WAL recovery gives me the willies.

Recovery is inherently one of the least-exercised parts of the system,
and it gets more so with each robustness improvement we make elsewhere.
Moreover, because it is fairly dumb, anything that does go wrong will
likely result in silent data corruption that may not be noted until much
later. Any bugs we introduce into recovery will be very hard to find
... and timing-dependent ones will be damn near impossible.

So in my mind the watchword has got to be KISS. If that means that
recovery isn't terribly speedy, so be it. I'd far rather get the
right answer slower.

Also, I have not seen anyone provide a very credible argument why
we should spend a lot of effort on optimizing a part of the system
that is so little-exercised. Don't tell me about warm standby
systems --- they are fine as long as recovery is at least as fast
as the original transactions, and no evidence has been provided to
suggest that it's not.

regards, tom lane

#44Greg Smith
gsmith@gregsmith.com
In reply to: Gregory Stark (#26)
Re: [GENERAL] Slow PITR restore

On Thu, 13 Dec 2007, Gregory Stark wrote:

Note that even though the processor is 99% in wait state the drive is
only handling about 3 MB/s. That translates into a seek time of 2.2ms,
which is actually pretty fast... But note that if this were a RAID array
Postgres wouldn't be getting any better results. A RAID array wouldn't
improve i/o latency at all, and since it's already 99% waiting for i/o
Postgres is not going to be able to issue any more.

If it's a straight stupid RAID array, sure. But when you introduce a good
write caching controller into the mix, that can batch multiple writes,
take advantage of more elevator sorting, and get more writes/seek
accomplished. Combine that improvement with having multiple drives as
well and the PITR performance situation becomes very different; you really
can get more than one drive in the array busy at a time. It's also true
that you won't see everything that's happening with vmstat because the
controller is doing the low-level dispatching.

I'll try to find time to replicate the test Tom suggested, as I think my
system is about middle ground between his and Joshua's. In general I've
never been able to get any interesting write throughput testing at all
without at least a modest caching controller in there. Just like Tom's
results, with a regular 'ole drive everything gets seek bottlenecked, WIO
goes high, and it looks like I've got all the CPU in the world. I run a
small Areca controller with 3 drives on it (OS+DB+WAL) at home to at least
get close to a real server.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#45Joshua D. Drake
jd@commandprompt.com
In reply to: Joshua D. Drake (#39)
Re: [GENERAL] Slow PITR restore

On Thu, 13 Dec 2007 11:12:26 -0800
"Joshua D. Drake" <jd@commandprompt.com> wrote:

Hmm --- I was testing a straight crash-recovery scenario, not
restoring from archive. Are you sure your restore_command script
isn't responsible for a lot of the delay?

Now that's an interesting thought, I will review in the morning when
I have some more IQ points back.

As promised :)... I took a look at this today and I think I found a
couple of things. It appears that once the logs are archived, the
recovery command copies the archive file to a recovery location and
then restores the file.

If that is correct, that could explain some of the latency I am seeing
here. Even with the speed of these devices, it is still a 16 MB file.
That could take 1-2 seconds to copy.

There is also the execution of pg_standby each time as the recovery
command, which, although I haven't timed it, is going to add overhead.

Based on the logs I pasted we are showing a delay of 6, 14, 3, 13, 4
and then another 6 seconds.

When are fsyncs called during the recovery process?

With delays like these, even speeding the process up by 2 seconds per log
is going to be significant.

Sincerely,

Joshua D. Drake

--
The PostgreSQL Company: Since 1997, http://www.commandprompt.com/
Sales/Support: +1.503.667.4564 24x7/Emergency: +1.800.492.2240
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
SELECT 'Training', 'Consulting' FROM vendor WHERE name = 'CMD'

#46Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#43)
Re: [GENERAL] Slow PITR restore

On Thu, 2007-12-13 at 16:41 -0500, Tom Lane wrote:

Recovery is inherently one of the least-exercised parts of the system,
and it gets more so with each robustness improvement we make elsewhere.
Moreover, because it is fairly dumb, anything that does go wrong will
likely result in silent data corruption that may not be noted until much
later. Any bugs we introduce into recovery will be very hard to find
... and timing-dependent ones will be damn near impossible.

So in my mind the watchword has got to be KISS. If that means that
recovery isn't terribly speedy, so be it. I'd far rather get the
right answer slower.

Very much agreed, and that is really the reason the main recovery code has
been essentially untouched for so long. That thought was the #1 priority
when writing PITR. Thanks for reminding me/us.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#47Heikki Linnakangas
heikki@enterprisedb.com
In reply to: Tom Lane (#43)
Re: [GENERAL] Slow PITR restore

Tom Lane wrote:

Also, I have not seen anyone provide a very credible argument why
we should spend a lot of effort on optimizing a part of the system
that is so little-exercised. Don't tell me about warm standby
systems --- they are fine as long as recovery is at least as fast
as the original transactions, and no evidence has been provided to
suggest that it's not.

Koichi showed me & Simon graphs of DBT-2 runs in their test lab back in
May. They had set up two identical systems, one running the benchmark,
and another one as a warm standby. The standby couldn't keep up; it
couldn't replay the WAL as quickly as the primary server produced it.
IIRC, replaying WAL generated in a 1h benchmark run took 6 hours.

It sounds unbelievable at first, but the problem is that our WAL replay
doesn't scale. On the primary server, you can have (and they did) a huge
RAID array with dozens of disks, and a lot of concurrent activity
keeping it busy. On the standby, we do all the same work, but with a
single process. Every time we need to read in a page to modify it, we
block. No matter how many disks you have in the array, it won't help,
because we only issue one I/O request at a time.

That said, I think the change we made in Spring to not read in pages for
full page writes will help a lot with that. It would be nice to see some
new benchmark results to measure that. However, it didn't fix the
underlying scalability problem.

One KISS approach would be to just do full page writes more often. It
would obviously bloat the WAL, but it would make the replay faster.

Another reason you would care about fast recovery is PITR. If you do
base backups only once a week, for example, when you need to recover
using the archive, you might have to replay a week's worth of WAL in the
worst case. You don't want to wait a week for the replay to finish.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#48Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#47)
Re: [GENERAL] Slow PITR restore

Heikki Linnakangas <heikki@enterprisedb.com> writes:

Koichi showed me & Simon graphs of DBT-2 runs in their test lab back in
May. They had set up two identical systems, one running the benchmark,
and another one as a warm standby. The standby couldn't keep up; it
couldn't replay the WAL as quickly as the primary server produced it.
IIRC, replaying WAL generated in a 1h benchmark run took 6 hours.

[ shrug... ] This is not consistent with my experience. I can't help
suspecting misconfiguration; perhaps shared_buffers much smaller on the
backup, for example.

One KISS approach would be to just do full page writes more often. It
would obviously bloat the WAL, but it would make the replay faster.

... at the cost of making the primary lots slower.

regards, tom lane

#49Gregory Stark
stark@enterprisedb.com
In reply to: Heikki Linnakangas (#40)
Re: [GENERAL] Slow PITR restore

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

It would be interesting to do something like that to speed up replay of long
PITR archives, though. You could scan all (or at least far ahead) the WAL
records, and make note of where there are full page writes for each page.
Whenever there's a full page write further ahead in the log, you could ignore
all changes to that page before that, because they're going to be overwritten
anyway.

Hm, you could generate that data when you generate the logs. Keep a hash of
block number and when the last full page write was. Then whenever you switch
log files dump out that hash in a second file alongside.

PITR recovery could read that in before it starts reading any file and consult
it before applying any records.
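
A sketch of what such a sidecar file could look like; the format and the
naming are invented here for illustration, nothing below is an existing
on-disk format:

#include <stdio.h>
#include <stdint.h>

/* One entry per block that received a full page image in the segment. */
typedef struct
{
    uint32_t rel;            /* stand-in for a RelFileNode */
    uint32_t blk;            /* block number */
    uint64_t last_fpw_lsn;   /* position of the newest full-page image */
} FpwSummary;

/* Primary side, at log switch: dump the table next to the segment,
 * e.g. as "<segment name>.fpw". */
static int
dump_fpw_summary(const char *path, const FpwSummary *entries, size_t n)
{
    FILE *f = fopen(path, "wb");

    if (f == NULL)
        return -1;
    if (fwrite(entries, sizeof(FpwSummary), n, f) != n)
    {
        fclose(f);
        return -1;
    }
    return fclose(f);
}

/* Recovery side: read the summaries for every segment it is about to
 * replay before starting, then consult them to skip changes that a later
 * full-page image will overwrite anyway. */
static long
load_fpw_summary(const char *path, FpwSummary *entries, size_t max_entries)
{
    FILE *f = fopen(path, "rb");
    size_t n;

    if (f == NULL)
        return -1;
    n = fread(entries, sizeof(FpwSummary), max_entries, f);
    fclose(f);
    return (long) n;
}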

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!

#50Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#48)
Re: [GENERAL] Slow PITR restore

Tom,

[ shrug... ] This is not consistent with my experience. I can't help
suspecting misconfiguration; perhaps shared_buffers much smaller on the
backup, for example.

You're only going to see it on SMP systems that have a high degree of CPU
utilization. That is, when you have 16 cores processing flat-out, the
*single* core that replays the log could certainly have trouble keeping
up. And this wouldn't be an issue that would show up in testing on a
dual-core system.

I don't have extensive testing data on that myself (I depended on Koichi's
as well), but I do have another real-world case where our slow recovery
time is a serious problem: clustered filesystem failover configurations,
e.g. RHCFS, OpenHACluster, Veritas. For those configurations, when one
node fails, PostgreSQL is started on a 2nd node against the same data ...
and goes through recovery. On very high-volume systems, the recovery can
be quite slow, up to 15 minutes, which is a long time for a web site to be
down.

I completely agree that we don't want to risk the reliability of recovery
in attempts to speed it up, though, so maybe this isn't something we can
do right now. But I don't agree that it's not an issue for users.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

#51Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#50)
Re: [GENERAL] Slow PITR restore

Josh Berkus <josh@agliodbs.com> writes:

Tom,

[ shrug... ] This is not consistent with my experience. I can't help
suspecting misconfiguration; perhaps shared_buffers much smaller on the
backup, for example.

You're only going to see it on SMP systems which have a high degree of CPU
utilization. That is, when you have 16 cores processing flat-out, then
the *single* core which will replay that log could certainly have trouble
keeping up.

You are supposing that replay takes as much CPU as live query
processing, which is nonsense (at least as long as we don't load it
down with a bunch of added complexity ;-)).

The argument that Heikki actually made was that multiple parallel
queries could use more of the I/O bandwidth of a multi-disk array
than recovery could. Which I believe, but I question how much of a
real-world problem it is. For it to be an issue, you'd need a workload
that is almost all updates (else recovery wins by not having to
replicate reads of pages that don't get modified) and the updates have
to range over a working set significantly larger than physical RAM
(else I/O bandwidth won't be the bottleneck anyway). I think we're
talking about an extremely small population of real users.

regards, tom lane

#52Gregory Stark
stark@enterprisedb.com
In reply to: Tom Lane (#51)
Re: [GENERAL] Slow PITR restore

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

The argument that Heikki actually made was that multiple parallel
queries could use more of the I/O bandwidth of a multi-disk array
than recovery could. Which I believe, but I question how much of a
real-world problem it is. For it to be an issue, you'd need a workload
that is almost all updates (else recovery wins by not having to
replicate reads of pages that don't get modified) and the updates have
to range over a working set significantly larger than physical RAM
(else I/O bandwidth won't be the bottleneck anyway). I think we're
talking about an extremely small population of real users.

Of course that describes most benchmarks pretty well...

I think of this as a scalability problem, not so much a sheer speed problem.
If Postgres isn't fast enough for you, you should be able to buy a faster
processor or faster disk or faster something to run it faster. The problem
with this situation is that buying a faster raid controller doesn't help you.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!

#53Markus Schiltknecht
markus@bluegap.ch
In reply to: Alvaro Herrera (#36)
Re: [GENERAL] Slow PITR restore

Hi,

Alvaro Herrera wrote:

Simon Riggs wrote:

ISTM it's just autovacuum launcher + Hot Standby mixed.

I don't think you need a launcher at all. Just get the postmaster to
start a configurable number of wal-replay processes (currently the
number is hardcoded to 1).

I also see similarity to what I do for Postgres-R: a manager and helper
backends that can be started upon request. Such a scheme is currently
used for autovacuum; I'm using it for replication; it could help with
parallelizing recovery; and it certainly helps with parallelizing queries,
as discussed in another thread.

Maybe it's worth considering a general framework for such a manager or
auto launcher, as well as helper backends. It certainly depends on the
complexity of that manager, but it should probably be an external
process.

What all of the helper backends have in common, AFAICT:

- a connection to a database
- no client connection
- superuser privileges

(For parallelized queries, superuser privileges might appear wrong, but
I'm arguing that parallelizing the rights checking isn't worth the
trouble, so the initiating worker backend should do that and only
delegate safe jobs to helper backends. Or is that a serious limitation
in a way?)

Most code for that already exists, as we already have various helpers.
What's missing, IMO, is a communication channel between the worker and
helper backends, as well as between the backends and the manager. That's
needed, for example, for worker backends to be able to request helper
backends and feed them their requests.

Unix pipes can only be set up between the parent and the child of a
fork, they eat file descriptors, they need to copy data to the kernel and
back, and IIRC there were portability issues. That's why I've written
the internal message passing (IMessage) stuff; see -patches [1].
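
Purely as an illustration of the kind of channel meant here (this is not
the IMessage API from the referenced patch, just a made-up minimal shape):

#include <stdint.h>

typedef enum
{
    MSG_REQUEST_HELPER,     /* worker asks the manager for a helper */
    MSG_HELPER_READY,       /* manager hands back an idle helper */
    MSG_JOB,                /* worker sends a job to its helper */
    MSG_JOB_DONE            /* helper reports completion */
} MsgType;

/* Fixed header queued in shared memory; a variable-length body follows. */
typedef struct
{
    MsgType  type;
    uint32_t sender_pid;    /* backend that queued the message */
    uint32_t body_len;      /* bytes of payload following the header */
} MsgHeader;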

I'm all for unifying such a manager process and generalizing the
requesting and launching of helpers, as well as the management of their
state (handling dead helper processes, keeping a pool of idle helpers
which are already connected to a database, etc.). Most of that already
exists in my Postgres-R code; maybe I can derive a general-purpose patch
to start contributing code from Postgres-R?

Comments? Use cases I'm missing?

Regards

Markus

[1]: last time I published IMessage stuff on -patches (WIP):
http://archives.postgresql.org/pgsql-patches/2007-01/msg00578.php

#54Zeugswetter Andreas ADI SD
Andreas.Zeugswetter@s-itsolutions.at
In reply to: Greg Smith (#44)
Re: [GENERAL] Slow PITR restore

Note that even though the processor is 99% in wait state the drive is
only handling about 3 MB/s. That translates into a seek time of 2.2ms,
which is actually pretty fast... But note that if this were a RAID array
Postgres wouldn't be getting any better results. A RAID array wouldn't
improve i/o latency at all, and since it's already 99% waiting for i/o
Postgres is not going to be able to issue any more.

If it's a straight stupid RAID array, sure. But when you introduce a good
write caching controller into the mix, that can batch multiple writes,
take advantage of more elevator sorting, and get more writes/seek
accomplished. Combine that improvement with having multiple drives as
well and the PITR performance situation becomes very different; you really
can get more than one drive in the array busy at a time. It's also true
that you won't see everything that's happening with vmstat because the
controller is doing the low-level dispatching.

I don't follow. The problem is not writes but reads. And if the reads
are random enough, no cache controller will help.

The basic message is that for modern IO systems you need to make sure
that enough parallel read requests are outstanding. Write requests are
not an issue, because battery-backed controllers can take care of that.

Andreas

#55Simon Riggs
simon@2ndquadrant.com
In reply to: Zeugswetter Andreas ADI SD (#54)
Re: [GENERAL] Slow PITR restore

On Fri, 2007-12-14 at 10:51 +0100, Zeugswetter Andreas ADI SD wrote:

The problem is not writes but reads.

That's what I see.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#56Hannu Krosing
hannu@skype.net
In reply to: Markus Schiltknecht (#53)
Re: [GENERAL] Slow PITR restore

On Fri, 2007-12-14 at 10:39, Markus Schiltknecht wrote:

Hi,

(For parallelized queries, superuser privileges might appear wrong, but
I'm arguing that parallelizing the rights checking isn't worth the
trouble, so the initiating worker backend should do that and only
delegate safe jobs to helper backends. Or is that a serious limitation
in a way?)

at least functions defined with SECURITY DEFINER may be a problem

------------
Hannu

#57Hannu Krosing
hannu@skype.net
In reply to: Heikki Linnakangas (#40)
Re: [GENERAL] Slow PITR restore

On Thu, 2007-12-13 at 20:25, Heikki Linnakangas wrote:
...

Hmm. That assumes that nothing other than the WAL replay will read
pages into shared buffers. I guess that's true at the moment, but it
doesn't seem impossible that something like Florian's read-only queries
on a standby server would change that.

I think that is better than both methods mentioned, and definitely
simpler than my brute-force method. It also lends itself to using both
previously mentioned methods as additional techniques if we really
needed to. I suspect reordering the I/Os in this way is going to make a
huge difference to cache hit rates.

But it won't actually do anything to scale the I/O. You're still going
to be issuing only one read request at a time. The order of those
requests will be better from a cache hit point of view, which is good, but
the problem remains that if the modified data blocks are scattered
around the database, you'll be doing random I/O, one request at a time.

Why one at a time?

You could have a long list of pages that need to be read in, and ask for
them all at the same time.

Here's what I mean:

1) allocate buffers for N database pages, and a queue for N WAL records
2) read N WAL records into the WAL record queue, assign database page
numbers from these to buffer pages, and issue posix_fadvise() for all as
you go
2a) if there were repeated pages and thus there are free buffers, allocate
queue items and read some more WAL records and assign buffers and fadvise
until N fubbers used
3) process the WAL record queue against the buffers read in by 2)
4) write the buffers back to disk

repeat from 2), freeing LRU buffers

Here the reads in 2) will be optimised by the system via posix_fadvise,
and the caches can also be split between multiple workers by page-number
hash or some other random/uniform means, to use more than one CPU.
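
A rough sketch of the prefetch in step 2), assuming the data files the
records touch are already open; the structures are stand-ins, not backend
code:

#define _XOPEN_SOURCE 600        /* for posix_fadvise */

#include <sys/types.h>
#include <fcntl.h>
#include <stdint.h>
#include <stddef.h>

#define BLCKSZ 8192

typedef struct
{
    int      fd;         /* open data file the record will touch */
    uint32_t blkno;      /* block it will modify */
    /* ...the decoded WAL record itself would live here too... */
} PendingRedo;

/* Step 2: after queueing N records, hint all of their reads at once so
 * the kernel (and the disks underneath) can work on them in parallel. */
static void
prefetch_batch(const PendingRedo *queue, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        (void) posix_fadvise(queue[i].fd,
                             (off_t) queue[i].blkno * BLCKSZ,
                             BLCKSZ,
                             POSIX_FADV_WILLNEED);
}

/* Step 3 would then walk the same queue, read each block (hopefully now
 * in the OS cache already) and apply its records; step 4 writes back. */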

-------------
Hannu

#58Markus Schiltknecht
markus@bluegap.ch
In reply to: Hannu Krosing (#56)
Re: [GENERAL] Slow PITR restore

Hello Hannu,

Hannu Krosing wrote:

(For parallelized queries, superuser privileges might appear wrong, but
I'm arguing that parallelizing the rights checking isn't worth the
trouble, so the initiating worker backend should do that and only
delegate safe jobs to helper backends. Or is that a serious limitation
in a way?)

at least functions defined with SECURITY DEFINER; may be a problem

Uhm.. what I had in mind was parallelizing sequential scans, index scans,
joins and such - database internal stuff.

Parallelizing user defined functions (or what did you have in mind?) is
more difficult and sometimes impossible, because the planner cannot know
ahead of time what the function is going to do.

However, thinking about it, maybe one could and should try to
parallelize computationally intensive IMMUTABLE functions. But with
STABLE ones I'm already getting suspicious. It would require users to
write truly thread-safe (well, multi-process-safe) functions, which I
doubt somewhat. Granted, they theoretically *should* be safe, but...

Anyway, if that's the only show stopper, one could certainly tell helper
backends to substitute their superuser privileges with the invoker's
privileges. Not sure if that's worth the trouble, though.

Regards

Markus

#59Markus Schiltknecht
markus@bluegap.ch
In reply to: Hannu Krosing (#57)
Re: [GENERAL] Slow PITR restore

Hannu Krosing wrote:

until N fubbers used

..whatever a fubber is :-)

Nice typo!

Markus

#60Greg Smith
gsmith@gregsmith.com
In reply to: Zeugswetter Andreas ADI SD (#54)
Re: [GENERAL] Slow PITR restore

On Fri, 14 Dec 2007, Zeugswetter Andreas ADI SD wrote:

I don't follow. The problem is not writes but reads. And if the reads
are random enough no cache controller will help.

The specific example Tom was running was, in his words, "100% disk write
bound". I was commenting on why I thought that was on his system and why
it wasn't representative of the larger problem. You need at least a basic
amount of write caching for this situation before the problem moves to
being read seek bound.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#61Wes
wespvp@msg.bt.com
In reply to: Tom Lane (#24)
REINDEX on large DB vs. DROP INDEX/CREATE INDEX

Seems like it would be a common question, but I'm having problems finding an
answer in the archives on this...

I have a large database (now about 2 billion records), and about once a year
I have been dropping and recreating the indexes. Recreating the foreign key
constraints takes as long or longer than the index creation. Last year the
whole process took about 72 hours. This year the DB is bigger.

I'm running 8.1.4. Assume I have exclusive access to the DB.

1. Is there any advantage to doing the DROP/CREATE over just doing a REINDEX
DATABASE. I seem to remember DROP/CREATE had some advantages, but can't
find the information. Is there a performance hit with REINDEX during
creation because of locking issues?

2. I'm assuming REINDEX would avoid the time involved in recreating the
foreign key constraints?

3. With a REINDEX DATABASE, how can I monitor progress?

Wes

#62Tom Lane
tgl@sss.pgh.pa.us
In reply to: Wes (#61)
Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX

Wes <wespvp@msg.bt.com> writes:

I'm running 8.1.4. Assume I have exclusive access to the DB.

You really ought to update to 8.1.something-newer, but I digress.

1. Is there any advantage to doing the DROP/CREATE over just doing a REINDEX
DATABASE.

No, not if you don't mind exclusive locks. DROP together with CREATE
INDEX CONCURRENTLY might be nicer if you were trying to do this without
completely shutting down the DB, but if you aren't running normal
operations then just use REINDEX.

2. I'm assuming REINDEX would avoid the time involved in recreating the
foreign key constraints?

Right, that's one reason to do it that way.

3. With a REINDEX DATABASE, how can I monitor progress?

It should give you a NOTICE after each table.

BTW, what have you got maintenance_work_mem set to?

regards, tom lane

#63Wes
wespvp@msg.bt.com
In reply to: Tom Lane (#62)
Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX

On 1/24/08 12:48 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

Wes <wespvp@msg.bt.com> writes:

I'm running 8.1.4. Assume I have exclusive access to the DB.

You really ought to update to 8.1.something-newer, but I digress.

I was planning on upgrading to 8.x at the same time as this reindex - just
do a dump/reload, but...

I guess I've been asleep at the wheel and didn't realize 8.1.11 was out.
Since that wouldn't require a DB reload, I guess that would be highly
recommended?

1. Is there any advantage to doing the DROP/CREATE over just doing a REINDEX
DATABASE.

No, not if you don't mind exclusive locks. DROP together with CREATE
INDEX CONCURRENTLY might be nicer if you were trying to do this without
completely shutting down the DB, but if you aren't running normal
operations then just use REINDEX.

I was thinking there was some clean up that didn't happen with REINDEX,
related to disk space allocation maybe? Perhaps this was just on older 7.x
versions - something I vaguely recall back under 7.x when I was having some
corruption issues.

3. With a REINDEX DATABASE, how can I monitor progress?

It should give you a NOTICE after each table.

Is there anything that shows up in ps for each index it is working on?

BTW, what have you got maintenance_work_mem set to?

It is currently set to 983025. Not sure where I got that strange number
from. It's a 2 GB machine. I've been trying to get more, but when it runs
fine day to day, it's kind of hard to justify. Lots of disks, not so much
memory.

I guess I should also turn off fsync for the duration.

Wes

#64Gregory Stark
stark@enterprisedb.com
In reply to: Wes (#63)
Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX

"Wes" <wespvp@msg.bt.com> writes:

I guess I should also turn off fsync for the duration.

It shouldn't make a big difference. fsync only happens at the end of a
transaction or at a checkpoint.

Since you're concerned with very long operations the slowdown at the end of
the transaction won't make a big difference.

Checkpoints could be an issue, but you would be better off just raising
checkpoint_segments and/or checkpoint_timeout to make sure you don't get one
more often than once every few minutes when you're doing large operations like
this.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!

#65Wes
wespvp@msg.bt.com
In reply to: Gregory Stark (#64)
Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX

On 1/25/08 5:40 AM, "Gregory Stark" <stark@enterprisedb.com> wrote:

It shouldn't make a big difference. fsync only happens at the end of a
transaction or at a checkpoint.

Since you're concerned with very long operations the slowdown at the end of
the transaction won't make a big difference.

Ok, I didn't realize that.

Checkpoints could be an issue, but you would be better off just raising
checkpoint_segments and/or checkpoint_timeout to make sure you don't get one
more often than once every few minutes when you're doing large operations like
this.

I have checkpoint_segments set to 60, and no warnings showing up in the log.

Wes

#66Wes
wespvp@msg.bt.com
In reply to: Tom Lane (#62)
Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX

Just a follow-up on this... The REINDEX took about 2 1/2 days. I didn't
gain much disk space back - a full backup takes just as long as before, but
the vacuum time dropped from 30 hours to 3 hours.

Wes

1. Is there any advantage to doing the DROP/CREATE over just doing a REINDEX
DATABASE.

No, not if you don't mind exclusive locks. DROP together with CREATE
INDEX CONCURRENTLY might be nicer if you were trying to do this without
completely shutting down the DB, but if you aren't running normal
operations then just use REINDEX.

2. I'm assuming REINDEX would avoid the time involved in recreating the
foreign key constraints?

Right, that's one reason to do it that way.

#67Vivek Khera
vivek@khera.org
In reply to: Wes (#66)
Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX

On Feb 4, 2008, at 10:00 AM, Wes wrote:

Just a follow-up on this... The REINDEX took about 2 1/2 days. I didn't
gain much disk space back - a full backup takes just as long as before,
but the vacuum time dropped from 30 hours to 3 hours.

what you need to do is compare the relpages from the pg_class table
for that index before and after.

if you didn't get much disk space back, make sure you have no long
running transactions that may have kept some older files open.

#68Wes
wespvp@msg.bt.com
In reply to: Vivek Khera (#67)
Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX

On 2/4/08 9:53 AM, "Vivek Khera" <vivek@khera.org> wrote:

what you need to do is compare the relpages from the pg_class table
for that index before and after.

if you didn't get much disk space back, make sure you have no long
running transactions that may have kept some older files open.

I can check that next time, but I only reindex about once a year. There
definitely should be no outstanding transactions.

The reason for the huge change in the vacuum time is that the indexes are
scanned in index order instead of disk order. I understand that is fixed in
8.2 or 8.3 (don't recall which I saw it in), but have never gotten
confirmation from anyone on that.

Wes

#69Tom Lane
tgl@sss.pgh.pa.us
In reply to: Wes (#68)
Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX

Wes <wespvp@msg.bt.com> writes:

The reason for the huge change in the vacuum time is that the indexes are
scanned in index order instead of disk order. I understand that is fixed in
8.2 or 8.3 (don't recall which I saw it in), but have never gotten
confirmation from anyone on that.

Yeah, 8.2.

regards, tom lane