7.3.1 takes long time to vacuum table?

Started by Mark Cave-Ayland, about 23 years ago · 29 messages · general
#1 Mark Cave-Ayland
m.cave-ayland@webbased.co.uk

Hi everyone,

Does anyone know of any issues with 7.3.1 which would cause it to take a
long time to do a vacuum full? We have a resource table consisting of
about 70M records and we have recently performed an update query on the
entire table and now I'm trying to do a vacuum full to reclaim back the
disk space. So far on a dual PIII 1.4GHz machine with hardware RAID5 and
2GB RAM, the vacuum full verbose is still running after 30 hours!

The reason I am concerned is that before performing the update, the
vacuum full would normally take about 12 hours and that was with a
trigger and 5 indexes on the table. This time, before initiating the
vacuum, all the indexes were dropped, and a single update performed on 1
field over the entire table. I understand that postgres has to compact
the valid tuples down to the front of the file after removing the
previous ones, but should it really take this long on such a powerful
machine? Or have I made a gross error somewhere in the configuration?

When I first configured the database, vacuuming the table took nearly 20
hours, but thanks to a post in the archives, I found some suggestions to
increase vacuum mem & FSM size and reduce the number of buffers which
got it down to its normal time of about 12 hours :) Here are the
settings I changed from the defaults based on that post:

max_fsm_relations = 50000
max_fsm_pages = 5000000
vacuum_mem = 65535
fsync = false

I have also set shmmax to 800Mb just to give things some breathing
space. One thing I have noticed is that the postmaster process running
the vacuum has now reached 1Gb of memory and looks like it is beginning
to touch swap(!) which is going to slow things even more. Can anyone
help me out and reduce the time it takes to do this vacuum?

Cheers,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446

This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.

#2 Martijn van Oosterhout
kleptog@svana.org
In reply to: Mark Cave-Ayland (#1)
Re: 7.3.1 takes long time to vacuum table?

On Tue, Feb 18, 2003 at 03:20:36PM -0000, Mark Cave-Ayland wrote:

Hi everyone,

Does anyone know of any issues with 7.3.1 which would cause it to take a
long time to do a vacuum full? We have a resource table consisting of
about 70M records and we have recently performed an update query on the
entire table and now I'm trying to do a vacuum full to reclaim back the
disk space. So far on a dual PIII 1.4GHz machine with hardware RAID5 and
2GB RAM, the vacuum full verbose is still running after 30 hours!

Ok, this is a hard one. Doing a vacuum full is very hard on the disk cache.
It basically copies a lot of tuples around. In your case it's going to be
copying every tuple from somewhere near the end of the table to somewhere
near the beginning. This makes the pattern of disk access something like:

seek, read, seek, write, seek, read, seek, write, ...

Which, not surprisingly, sucks (especially from RAID5 I think, but I'm not
sure about that). Note this varies a bit between versions of Unix,
postgresql and your C library.

My only advice is that you can use strace to work out approximately where
it's up to. Use /proc/ to work out which file descriptor is the table you're
working with and then strace the backend (the -p option) to work out which
part it is reading from. It'll look like:

seek(<fd>,<offset>,SEEK_SET) = <offset>
read(<fd>,"lots of rubbish", 8192) = 8192

It's the offset you want; it may jump around a bit but it should be
increasing on the whole. If your table is split into multiple files because
it's over 1GB, take this into account when working out how far in it is.
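A minimal sketch of that procedure (the PID, fd number, segment suffix and offset below are hypothetical stand-ins; the commands that need a live backend are shown as comments):

```shell
# Find the backend and see which fd maps onto the table's data file:
#   ls -l /proc/<backend_pid>/fd      # e.g. 58 -> .../base/55792/58153.15
#   strace -p <backend_pid>           # watch the seek/read/write calls
#
# Tables over 1GB are split into 1GB segment files (58153, 58153.1, ...),
# so the overall byte position is: segment suffix * 1GiB + seek offset.
SEGMENT=15                 # suffix of the segment file being read
OFFSET=969719808           # offset reported by the seek call
SEGSIZE=$((1024 * 1024 * 1024))
echo "byte position in table: $((SEGMENT * SEGSIZE + OFFSET))"
# -> byte position in table: 17075847168
```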

This is probably a good argument to have VACUUM emit a notice every 10
minutes or so giving some indication of its progress. I don't know how hard
this would be.

I have also set shmmax to 800Mb just to give things some breathing
space. One thing I have noticed is that the postmaster process running
the vacuum has now reached 1Gb of memory and looks like it is beginning
to touch swap(!) which is going to slow things even more. Can anyone
help me out and reduce the time it takes to do this vacuum?

The only thing I can suggest is that SELECT * INTO newtables FROM table; may
have been faster, though it shouldn't be.

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Support bacteria! They're the only culture some people have.

#3 Peter Childs
blue.dragon@blueyonder.co.uk
In reply to: Martijn van Oosterhout (#2)
Re: 7.3.1 takes long time to vacuum table?

On Wednesday 19 February 2003 02:14, Martijn van Oosterhout wrote:

On Tue, Feb 18, 2003 at 03:20:36PM -0000, Mark Cave-Ayland wrote:

Hi everyone,

This is probably a good argument to have VACUUM emit a notice every 10
minutes or so giving some indication of its progress. I don't know how hard
this would be.

Try VACUUM VERBOSE; it does not print a message out every 10 minutes but
it should tell you that something is happening, and what is happening, if
you can make any sense of the messages.....

Peter Childs

#4 Mark Cave-Ayland
m.cave-ayland@webbased.co.uk
In reply to: Peter Childs (#3)
Re: 7.3.1 takes long time to vacuum table?

-----Original Message-----
From: Martijn van Oosterhout [mailto:kleptog@svana.org]
Sent: 19 February 2003 02:14
To: Mark Cave-Ayland
Cc: PostgreSQL General
Subject: Re: [GENERAL] 7.3.1 takes long time to vacuum table?

Hi Martijn,

Thanks for taking the time to look at this one.

Ok, this is a hard one. Doing a vacuum full is very hard on the disk cache.
It basically copies a lot of tuples around. In your case it's going to be
copying every tuple from somewhere near the end of the table to somewhere
near the beginning. This makes the pattern of disk access something like:

seek, read, seek, write, seek, read, seek, write, ...

Which, not surprisingly, sucks (especially from RAID5 I think, but I'm not
sure about that). Note this varies a bit between versions of Unix,
postgresql and your C library.

Hmmm.... so the news is not good then :( We timed an insert of about
800,000 records into the DB under 7.2.1 and that was just over 10 mins...
so extrapolating that out, it would take at least 15 hours just to write
the 70M records directly to disk, without the seeking.... *sigh*
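That extrapolation can be checked with a quick back-of-the-envelope calculation (numbers taken from the message above; integer arithmetic, so it truncates slightly):

```shell
ROWS=70000000       # records in the table
BATCH=800000        # records in the timed insert test
BATCH_MINS=10       # minutes the test insert took under 7.2.1
MINUTES=$((ROWS * BATCH_MINS / BATCH))
echo "$MINUTES minutes (~$((MINUTES / 60)) hours) just to rewrite the rows"
# -> 875 minutes (~14 hours) just to rewrite the rows
```

875 minutes is a little under 15 hours, consistent with the estimate quoted here.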

My only advice is that you can use strace to work out approximately
where it's up to. Use /proc/ to work out which file descriptor is the
table you're working with and then strace the backend (the -p option)
to work out which part it is reading from. It'll look like:

seek(<fd>,<offset>,SEEK_SET) = <offset>
read(<fd>,"lots of rubbish", 8192) = 8192

It's the offset you want; it may jump around a bit but it should be
increasing on the whole. If your table is split into multiple files
because it's over 1GB, take this into account when working out how far
in it is.

Ok, I've managed to do that.... the offset just seems to keep jumping
around though, not much of a specific pattern... although it does seem to
be reading from 2 separate files :( But what IS interesting is that
between each seek(), postgres does 2 reads of 8k (which I guess is a page)
and then 4 writes! This I don't understand? Surely given the memory
parameters it should read as many pages into memory as possible, sort
them, then seek back and write them? What appears to be happening is that
only one or two pages are being moved at a time, which seems really
inefficient.

Or is the assumption here that by limiting the pages being moved around,
more memory can be given to the OS so it can cache aggressively? I've
uploaded the strace file to
http://www.infomapper.com/strace/strace.log.txt if you (or anyone else)
would be interested in taking a look - I logged about 700KB's worth.

This is probably a good argument to have VACUUM emit a notice every 10
minutes or so giving some indication of its progress. I don't know how
hard this would be.

The vacuum has emitted a line of statistics within the first few hours
which reads the following:

INFO: --Relation public.res--
INFO: Pages 9167290: Changed 5366493, reaped 6000857, Empty 0, New 0;
Tup 72475389: Vac 105298234, Keep/VTL 0/0, UnUsed 221144, MinLen 72,
MaxLen 727; Re-using: Free/Avail. Space 48478844644/48455035400;
EndEmpty/Avail. Pages 0/8458521.
CPU 733.62s/151.22u sec elapsed 4656.51 sec.

But now it's just sat there churning away.... I guess the above was the
result of marking which tuples were to be kept and now I'm guessing its
in the process of moving data around. So yes, some additional
notification during this phase would be very useful for large tables
like this.

I have also set shmmax to 800Mb just to give things some breathing
space. One thing I have noticed is that the postmaster process running
the vacuum has now reached 1Gb of memory and looks like it is beginning
to touch swap(!) which is going to slow things even more. Can anyone
help me out and reduce the time it takes to do this vacuum?

The only thing I can suggest is that SELECT * INTO newtables FROM table;
may have been faster, though it shouldn't be.

My instinct would suggest this would be faster, given that the writing
would be guaranteed sequential, and a single sequential scan on the
table. The problem with this is that we don't have enough disk space to
maintain 2 copies of the table on one device at the moment as the table
takes about 40Gb space.

BTW the vacuum is of course still going as we enter the 50hr stage....

Cheers,

Mark.


#5 Mark Cave-Ayland
m.cave-ayland@webbased.co.uk
In reply to: Mark Cave-Ayland (#4)
Re: 7.3.1 takes long time to vacuum table?

-----Original Message-----
From: Peter Childs [mailto:Blue.Dragon@blueyonder.co.uk]
Sent: 19 February 2003 06:05
To: PostgreSQL General
Subject: Re: [GENERAL] 7.3.1 takes long time to vacuum table?

This is probably a good argument to have VACUUM emit a notice every 10
minutes or so giving some indication of its progress. I don't know how
hard this would be.

Try VACUUM VERBOSE; it does not print a message out every 10 minutes but
it should tell you that something is happening, and what is happening, if
you can make any sense of the messages.....

Peter Childs

Hi Peter,

I am running vacuum verbose, however after displaying an information
line a few hours in, it won't output any information now until the
vacuum is complete as I guess it's now just moving pages around.... :(

Cheers,

Mark.


#6 Martijn van Oosterhout
kleptog@svana.org
In reply to: Mark Cave-Ayland (#4)
Re: 7.3.1 takes long time to vacuum table?

On Wed, Feb 19, 2003 at 11:03:12AM -0000, Mark Cave-Ayland wrote:

Ok, I've managed to do that.... offset just seems to keep jumping around
though, not much of a specific pattern... although it does seem to be
reading from 2 separate files :( But what IS interesting is that between
each seek(), postgres does 2 reads of 8k (which I guess is a page) and
then 4 writes! This I don't understand? Surely given the memory
parameters then it should read as many pages into memory as possible,
sort them, then seek back and write them? What appears to be happening
is that it is only one or two pages are being moved at a time which
seems really inefficient.

Hmm, odd. The offset is going *backwards*. It's starting at 981934080 and
made it down to 969719808. It looks like it's reading each page. I don't
think I need to tell you that from a caching point of view, it's probably
not as good as going forward.

Or is the assumption here that by limiting the pages being moved around,
more memory can be given to the OS so it can cache aggressively? I've
uploaded the strace file to
http://www.infomapper.com/strace/strace.log.txt if you (or anyone else)
would be interested in taking a look - I logged about 700ks worth.

How long did it take to get that trace? Also, what are file descriptors 58,
97 and 114?

The vacuum has emitted a line of statistics within the first few hours
which reads the following:

INFO: --Relation public.res--
INFO: Pages 9167290: Changed 5366493, reaped 6000857, Empty 0, New 0;
Tup 72475389: Vac 105298234, Keep/VTL 0/0, UnUsed 221144, MinLen 72,
MaxLen 727; Re-using: Free/Avail. Space 48478844644/48455035400;
EndEmpty/Avail. Pages 0/8458521.
CPU 733.62s/151.22u sec elapsed 4656.51 sec.

Ok, that means it has finished the vacuum stage (since that's printed at the
end of scan_heap). It's now going through the heap compacting. From that
output it appears that your table is around 75Gig (9 million 8K pages)!
You're into the repair_frag stage where indeed it scans through the table
backwards.

While I can see the argument for going backwards, from a caching perspective
I think it's terrible. Especially in the case where the entire table has
been replaced, the entire exercise becomes a very expensive copy operation.

But now it's just sat there churning away.... I guess the above was the
result of marking which tuples were to be kept and now I'm guessing its
in the process of moving data around. So yes, some additional
notification during this phase would be very useful for large tables
like this.

Let me know what those file descriptors point to and we can probably work
out how far along it is.

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Support bacteria! They're the only culture some people have.

#7 Mark Cave-Ayland
m.cave-ayland@webbased.co.uk
In reply to: Martijn van Oosterhout (#6)
Re: 7.3.1 takes long time to vacuum table?

Hi Martijn,

Thanks again for your reply.

-----Original Message-----
From: Martijn van Oosterhout [mailto:kleptog@svana.org]
Sent: 19 February 2003 12:11
To: Mark Cave-Ayland
Cc: PostgreSQL General
Subject: Re: [GENERAL] 7.3.1 takes long time to vacuum table?

On Wed, Feb 19, 2003 at 11:03:12AM -0000, Mark Cave-Ayland wrote:

Ok, I've managed to do that.... the offset just seems to keep jumping
around though, not much of a specific pattern... although it does seem to
be reading from 2 separate files :( But what IS interesting is that
between each seek(), postgres does 2 reads of 8k (which I guess is a page)
and then 4 writes! This I don't understand? Surely given the memory
parameters it should read as many pages into memory as possible, sort
them, then seek back and write them? What appears to be happening is that
only one or two pages are being moved at a time, which seems really
inefficient.

Hmm, odd. The offset is going *backwards*. It's starting at 981934080 and
made it down to 969719808. It looks like it's reading each page. I don't
think I need to tell you that from a caching point of view, it's probably
not as good as going forward.

Backwards? Ouch, I guess that will break most caching strategies! But as
I said before, do you know why it only appears to be moving one or two
pages at a time before seeking again?

How long did it take to get that trace? Also, what are file descriptors
58, 97 and 114?

The trace lasted about a couple of minutes. I've listed the file
descriptors from /proc and uploaded them to
http://www.infomapper.com/strace/fd.log.txt for you to look at - I have a
feeling you would find it useful to see the total number of files open
in terms of their size and quantity....

The vacuum has emitted a line of statistics within the first few hours
which reads the following:

INFO: --Relation public.res--
INFO: Pages 9167290: Changed 5366493, reaped 6000857, Empty 0, New 0;
Tup 72475389: Vac 105298234, Keep/VTL 0/0, UnUsed 221144, MinLen 72,
MaxLen 727; Re-using: Free/Avail. Space 48478844644/48455035400;
EndEmpty/Avail. Pages 0/8458521.
CPU 733.62s/151.22u sec elapsed 4656.51 sec.

Ok, that means it has finished the vacuum stage (since that's printed at
the end of scan_heap). It's now going through the heap compacting. From
that output it appears that your table is around 75Gig (9 million 8K
pages)! You're into the repair_frag stage where indeed it scans through
the table backwards.

Hmmm, a little bit bigger than I was expecting, although I guess that
contains old and new pages. Here is the select result:

SELECT relname, relpages FROM pg_class ORDER BY relpages DESC;
relname | relpages
---------------------------------+----------
res | 5586167

...so at 8Kb/page then that's about 42Gb.
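The arithmetic behind the two size figures in this exchange (8KB pages; the pg_class count presumably predates the update, while the vacuum's count includes the dead space the update left behind):

```shell
PAGE=8192            # PostgreSQL's page size in bytes
RELPAGES=5586167     # relpages from pg_class
VACPAGES=9167290     # pages reported by VACUUM VERBOSE
echo "pg_class: $((RELPAGES * PAGE)) bytes (~$((RELPAGES * PAGE / 1073741824)) GiB)"
echo "vacuum:   $((VACPAGES * PAGE)) bytes (~$((VACPAGES * PAGE / 1073741824)) GiB)"
# -> pg_class: 45761880064 bytes (~42 GiB)
# -> vacuum:   75098439680 bytes (~69 GiB)
```

(Martijn's "75Gig" is the same vacuum figure counted in decimal gigabytes.)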

While I can see the argument for going backwards, from a caching
perspective I think it's terrible. Especially in the case where the
entire table has been replaced, the entire exercise becomes a very
expensive copy operation.

Agreed! If it ruins any caching then in my view it's something that has
to change in order to keep performance. While there may be a penalty to
pay on smaller tables, the benefits of caching would more than make up
for the cost of going forwards - imagine how slow CPUs would be if
everything was a cache miss....

Let me know what those file descriptors point to and we can probably
work out how far along it is.

Yes please, that would be really useful for us to know.

Cheers,

Mark.


#8 Shridhar Daithankar
shridhar_daithankar@persistent.co.in
In reply to: Mark Cave-Ayland (#7)
Re: 7.3.1 takes long time to vacuum table?

On Wednesday 19 Feb 2003 7:25 pm, you wrote:

Agreed! If it ruins any caching then in my view it's something that has
to change in order to keep performance. While there may be a penalty to
pay on smaller tables, the benefits of caching would more than make up
for the cost of going forwards - imagine how slow CPUs would be if
everything was a cache miss....

I have an idea. I don't know whether to call it better or stupid.

If you are doing a vacuum full, it means you are ready to lock the
database. I am just wondering if it would be faster to dump the database,
drop it and recreate it.

Good old defragmentation of ext2fs..;-) Of course, having a spare 80Gigs might
turn out to be a problem but I can almost bet that it would finish before 12
hours..

Shridhar

#9 Martijn van Oosterhout
kleptog@svana.org
In reply to: Mark Cave-Ayland (#7)
Re: 7.3.1 takes long time to vacuum table?

On Wed, Feb 19, 2003 at 01:55:01PM -0000, Mark Cave-Ayland wrote:

How long did it take to get that trace? Also, what are file descriptors
58, 97 and 114?

The trace lasted about a couple of minutes. I've listed the filesystems
from /proc and uploaded them to
http://www.infomapper.com/strace/fd.log.txt for you to look - I have a
feeling you would find it useful to see the total number of files open
in terms of their size and quantity....

58 = 55792/58153.15
97 = 55792/58153.54
114 = pg_xlog/000000A50000000E

So it's writing to WAL. If it's currently at file no. 54 and you need to
end up with 45 (each file is 1GB) and you started on file 69, that means
you're over halfway. But I'm really not sure.
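That "over halfway" estimate can be made concrete with the segment numbers from the message (each segment file is 1GB; the compaction scans backwards from the starting size towards the expected final size):

```shell
START=69       # segment count when the vacuum began (~70GB table)
CURRENT=54     # segment currently being worked on, per the fd listing
TARGET=45      # expected final size (~42GB of live tuples)
echo "$((100 * (START - CURRENT) / (START - TARGET)))% of the shrink done"
# -> 62% of the shrink done
```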

Agreed! If it ruins any caching then in my view it's something that has
to change in order to keep performance. While there may be a penalty to
pay on smaller tables, the benefits of caching would more than make up
for the cost of going forwards - imagine how slow CPUs would be if
everything was a cache miss....

I'd like to get one of the developers' views on this.

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Support bacteria! They're the only culture some people have.

#10 Mark Cave-Ayland
m.cave-ayland@webbased.co.uk
In reply to: Martijn van Oosterhout (#9)
Re: 7.3.1 takes long time to vacuum table?

Hi Martijn, Shridhar,

Agreed! If it ruins any caching then in my view it's something that has
to change in order to keep performance. While there may be a penalty to
pay on smaller tables, the benefits of caching would more than make up
for the cost of going forwards - imagine how slow CPUs would be if
everything was a cache miss....

I'd like to get one of the developers' views on this.

Just looking at Shridhar's email, and Martijn's earlier comment about
SELECT * INTO newtable FROM oldtable, I'm rapidly seeing that this would
be the better way to go: instead of seeking around millions of pages with
the vacuum, it would be less work because it is just a sequential read
and a sequential write.

Interestingly this could be used to create a speedy vacuum - that is,
create a new table with a temporary name that is invisible to the
database (similar to dropped columns), then taking into account the disk
space left on the device, pick the last X pages from the old table and
write to the new table. Then truncate the file containing the table at
point X and repeat until finished. Finally kill the old table and make
the new one visible. I appreciate there may be a couple of issues with
oids/foreign keys but it sounds like a great solution to me! Why does
vacuum bother with reordering rows? I thought that was what the CLUSTER
command was for? Any developers care to comment on this? I imagine there
must be a technical reason (prob to do with MVCC) as to why this hasn't
been done before?

In fact, my colleague has just done a test with SELECT..INTO on our dev
version and it compacted 600Mb -> 400Mb in just 40s(!). We then did a
vacuum full on the same original 600Mb table which is still going after
20 mins. Difficult choice! So even in a worst-case scenario we could have
a fully vacuumed table within a day.... we're looking at dropping some
indexes in the db to reclaim enough space to be able to fit another copy
of the table on the disk... this is looking very tempting at the
moment....

Cheers,

Mark.


#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Cave-Ayland (#10)
Re: 7.3.1 takes long time to vacuum table?

"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:

Interestingly this could be used to create a speedy vacuum - that is,
create a new table with a temporary name that is invisible to the
database (similar to dropped columns), then taking into account the disk
space left on the device, pick the last X pages from the old table and
write to the new table. Then truncate the file containing the table at
point X and repeat until finished. Finally kill the old table and make
the new one visible.

And if you crash midway through?

Why does vacuum bother with reordering rows?

It's designed to be fast when there's not very much data motion required
(ie, you only need to pull a relatively small number of rows off the end
to fill in the holes elsewhere).

I have not seen any actual evidence that doing it any other way would be
faster. Yes, it's reading the source tuples backwards instead of
forwards, but that's at most a third of the total I/O load (you've also
got tuple output and WAL writes to think about). It's not clear that
any kernel read-ahead optimization could get a chance to work anyhow.

In fact, my colleague has just done a test with SELECT..INTO on our dev
version and it compacted 600Mb -> 400Mb in just 40s(!). We then did a
vacuum full on the same original 600Mb table which is still going after
20mins.

Are there indexes on the original table? If so, this isn't a fair
comparison.

regards, tom lane

#12 Mark Cave-Ayland
m.cave-ayland@webbased.co.uk
In reply to: Tom Lane (#11)
Re: 7.3.1 takes long time to vacuum table?

Hi Tom,

Thanks for your reply.

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: 19 February 2003 16:28
To: Mark Cave-Ayland
Cc: Martijn van Oosterhout; shridhar_daithankar@persistent.co.in;
PostgreSQL General
Subject: Re: [GENERAL] 7.3.1 takes long time to vacuum table?

"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:

Interestingly this could be used to create a speedy vacuum - that is,
create a new table with a temporary name that is invisible to the
database (similar to dropped columns), then taking into account the disk
space left on the device, pick the last X pages from the old table and
write to the new table. Then truncate the file containing the table at
point X and repeat until finished. Finally kill the old table and make
the new one visible.

And if you crash midway through?

I don't know if I'm looking at this too simplistically but....

Each table could have an attribute to indicate that it is being vacuumed
(if one does not already exist). I imagine the rest could be based on
the existing transaction code i.e. mark the last X rows of the source
table as deleted and insert them (still invisible) into the destination
table. On transaction commit, the destination copies become visible and
source table is truncated at the file level. Or is truncation not a
transaction safe activity? If a crash occurs during the vacuum then
postgres can tell from the attribute that the table was in the process
of being vacuumed and then can use the WAL to carry on from where it
left off....

Why does vacuum bother with reordering rows?

It's designed to be fast when there's not very much data motion required
(ie, you only need to pull a relatively small number of rows off the end
to fill in the holes elsewhere).

I have not seen any actual evidence that doing it any other way would be
faster. Yes, it's reading the source tuples backwards instead of
forwards, but that's at most a third of the total I/O load (you've also
got tuple output and WAL writes to think about). It's not clear that
any kernel read-ahead optimization could get a chance to work anyhow.

I see, maybe I was a little premature in my 'vacuum bashing' :) So it's
optimized for the 'few holes' case while we are using it for a 'many
holes' case..... things make a bit more sense now.

In fact, my colleague has just done a test with SELECT..INTO on our dev
version and it compacted 600Mb -> 400Mb in just 40s(!). We then did a
vacuum full on the same original 600Mb table which is still going after
20mins.

Are there indexes on the original table? If so, this isn't a fair
comparison.

Fair point actually, I should have made it a better comparison. The
source table has 5 btree indexes, each on a bigint field. However, it
has taken just under a minute to recreate the first! The vacuum full on
the original 600Mb table has finished after 100 mins, so it looks as if,
had I used the SELECT..INTO method, I could have been up and done in
10 mins! I can continue recreating the other indexes to get a proper
final time comparison if you are interested?

Cheers,

Mark.


#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Cave-Ayland (#12)
Re: 7.3.1 takes long time to vacuum table?

"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:

In fact, my colleague has just done a test with SELECT..INTO on our dev
version and it compacted 600Mb -> 400Mb in just 40s(!). We then did a
vacuum full on the same original 600Mb table which is still going after
20mins.

Are there indexes on the original table? If so, this isn't a fair
comparison.

Fair point actually, I should have made it a better comparison. The
source table has 5 btree indexes, each on a bigint field. However, it
has taken just under a minute to recreate the first! The vacuum full on
the original 600Mb table has finished after 100 mins, so it looks as if,
had I used the SELECT..INTO method, I could have been up and done in
10 mins! I can continue recreating the other indexes to get a proper
final time comparison if you are interested?

Yeah. Also, I don't suppose you made that a VACUUM VERBOSE and kept the
output? It'd be interesting to see which stages took the most time.

regards, tom lane

#14 Martijn van Oosterhout
kleptog@svana.org
In reply to: Tom Lane (#11)
Re: 7.3.1 takes long time to vacuum table?

On Wed, Feb 19, 2003 at 11:28:13AM -0500, Tom Lane wrote:

"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:

Interestingly this could be used to create a speedy vacuum - that is,
create a new table with a temporary name that is invisible to the
database (similar to dropped columns), then taking into account the disk
space left on the device, pick the last X pages from the old table and
write to the new table. Then truncate the file containing the table at
point X and repeat until finished. Finally kill the old table and make
the new one visible.

And if you crash midway through?

Messy, definitely.

Why does vacuum bother with reordering rows?

It's designed to be fast when there's not very much data motion required
(ie, you only need to pull a relatively small number of rows off the end
to fill in the holes elsewhere).

I have not seen any actual evidence that doing it any other way would be
faster. Yes, it's reading the source tuples backwards instead of
forwards, but that's at most a third of the total I/O load (you've also
got tuple output and WAL writes to think about). It's not clear that
any kernel read-ahead optimization could get a chance to work anyhow.

Well, consider that it's reading every single page in the table from the end
down to halfway (since every tuple was updated). If you went back in chunks
of 128K then the kernel may get a chance to cache the following blocks. On a
disk, once you are reading a sector, reading the next 128 sectors is
essentially free. The marginal cost of more sectors is almost zero.

It could be argued that the kernel should notice that you're scanning
backwards and start its read a meg or two before where you asked. But
there are so many levels of cache (both hardware and software) that
would have to play along.

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


Support bacteria! They're the only culture some people have.

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#14)
Re: 7.3.1 takes long time to vacuum table?

Martijn van Oosterhout <kleptog@svana.org> writes:

Well, consider that it's reading every single page in the table from the end
down to halfway (since every tuple was updated). If you went back in chunks
of 128K then the kernel may get a chance to cache the following
blocks.

I fear this would be optimization with blinkers on :-(. The big reason
that VACUUM FULL scans backwards is that at the very first (last?) page
where it cannot push all the tuples down to lower-numbered pages, it
can abandon any attempt to move more tuples. The file can't be made
any shorter by internal shuffling, so we should stop. If you back up
multiple pages and then scan forward, you would usually find yourself
moving the wrong tuples, ie ones that cannot help you shrink the file.
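
The stopping rule described here can be illustrated with a toy model (in Python rather than the backend's C, with pages reduced to bare tuple counts): the backward scan gives up at the first tail page whose tuples no longer fit in the holes below it, because the file cannot be made any shorter past that point.

```python
def compact(pages, capacity):
    """Toy model of VACUUM FULL's backward scan.

    `pages` is a list of per-page tuple counts, `capacity` the number of
    tuples that fit on one page.  Tuples from the last page are moved
    into holes on earlier pages and the emptied page is truncated away;
    at the first tail page that cannot be fully emptied, the scan stops.
    """
    while len(pages) > 1:
        free_below = sum(capacity - n for n in pages[:-1])
        if free_below < pages[-1]:
            break  # this page can't be emptied -> no further shrink possible
        need = pages.pop()  # "truncate" the file by one page
        for i, n in enumerate(pages):
            take = min(capacity - n, need)
            pages[i] += take
            need -= take
            if need == 0:
                break
    return pages

print(compact([10, 2, 0, 5, 3], capacity=10))  # -> [10, 10]
```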

I suspect that what we really want here is a completely different
algorithm (viz copy into a new file, like CLUSTER) when the initial scan
reveals that there's more than X percent of free space in the file.

regards, tom lane

#16Martijn van Oosterhout
kleptog@svana.org
In reply to: Tom Lane (#15)
Re: 7.3.1 takes long time to vacuum table?

On Wed, Feb 19, 2003 at 08:53:42PM -0500, Tom Lane wrote:

Martijn van Oosterhout <kleptog@svana.org> writes:

Well, consider that it's reading every single page in the table from the end
down to halfway (since every tuple was updated). If you went back in chunks
of 128K then the kernel may get a chance to cache the following
blocks.

I fear this would be optimization with blinkers on :-(. The big reason
that VACUUM FULL scans backwards is that at the very first (last?) page
where it cannot push all the tuples down to lower-numbered pages, it
can abandon any attempt to move more tuples. The file can't be made
any shorter by internal shuffling, so we should stop. If you back up
multiple pages and then scan forward, you would usually find yourself
moving the wrong tuples, ie ones that cannot help you shrink the file.

I agree with the general idea. However, in this case there are 40GB+ of tuples
to move; if you moved backwards in steps of 2MB it would make no significant
difference to the resulting table. It would only become a problem near the end
of the compacting; then you can stop, and the remaining pages can surely be
kept track of in the FSM.

Next time you do a vacuum you can go back and do the compacting properly. On
tables of the size that matter here, I don't think anyone will care if the
last 2MB (=0.0044% of table) isn't optimally packed the first time round.

Does vacuum full have to produce the optimum result the first time?
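
As a quick sanity check on the percentage quoted above (the ~44GB table size is an assumption here, since the thread only says "40GB+"):

```python
table_bytes = 44 * 1024**3    # assumed table size, per the "40GB+" above
leftover_bytes = 2 * 1024**2  # the last 2MB left unpacked
pct = 100 * leftover_bytes / table_bytes
print(f"{pct:.4f}%")          # -> 0.0044%
```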

I suspect that what we really want here is a completely different
algorithm (viz copy into a new file, like CLUSTER) when the initial scan
reveals that there's more than X percent of free space in the file.

You could do the jump-back-in-blocks only if more than 30% of the table is
empty and table is over 1GB. For the example here, a simple defragging
algorithm would suffice; start at beginning and pack each tuple into the
beginning of the file. It will move *every* tuple but it's more cache
friendly. It's pretty extreme though.

It does preserve table order though whereas the current algorithm will
reverse the order of all the tuples in the table, possibly causing similar
backward-scan problems later with your index-scans.
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


Support bacteria! They're the only culture some people have.

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#16)
Re: 7.3.1 takes long time to vacuum table?

Martijn van Oosterhout <kleptog@svana.org> writes:

You could do the jump-back-in-blocks only if more than 30% of the table is
empty and table is over 1GB. For the example here, a simple defragging
algorithm would suffice; start at beginning and pack each tuple into the
beginning of the file. It will move *every* tuple but it's more cache
friendly. It's pretty extreme though.

And your evidence that it will actually be faster is ... ?

regards, tom lane

#18Martijn van Oosterhout
kleptog@svana.org
In reply to: Tom Lane (#17)
Re: 7.3.1 takes long time to vacuum table?

On Wed, Feb 19, 2003 at 10:37:45PM -0500, Tom Lane wrote:

Martijn van Oosterhout <kleptog@svana.org> writes:

You could do the jump-back-in-blocks only if more than 30% of the table is
empty and table is over 1GB. For the example here, a simple defragging
algorithm would suffice; start at beginning and pack each tuple into the
beginning of the file. It will move *every* tuple but it's more cache
friendly. It's pretty extreme though.

And your evidence that it will actually be faster is ... ?

Well, in this guy's example, the first stage of the vacuum (scan_heap)
finished in a few hours but the second stage is still going after 50 hours
and estimating a bit over half done. The major difference between the two is
that the latter goes backwards through the table and the other forwards.

Sure, it's anecdotal and a bit hand wavey but I think there's something to
it. Come to think of it, if that strace had used -tt it would have been
more interesting.

Anyway, the test would be to implement it and then time it. The vacuum code
looks scary though. I'm not sure I want to mess with it...
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


Support bacteria! They're the only culture some people have.

#19Shridhar Daithankar
shridhar_daithankar@persistent.co.in
In reply to: Mark Cave-Ayland (#10)
Re: 7.3.1 takes long time to vacuum table?

On Wednesday 19 Feb 2003 9:05 pm, you wrote:

Interestingly this could be used to create a speedy vacuum - that is,
create a new table with a temporary name that is invisible to the
database (similar to dropped columns), then taking into account the disk
space left on the device, pick the last X pages from the old table and
write to the new table. Then truncate the file containing the table at
point X and repeat until finished. Finally kill the old table and make
the new one visible. I appreciate there may be a couple of issues with
oids/foreign keys but it sounds like a great solution to me! Why does
vacuum bother with reordering rows? I thought that was what the CLUSTER
command was for? Any developers care to comment on this? I imagine there
must be a technical reason (prob to do with MVCC) as to why this hasn't
been done before?

Well, one thing I can think of is the extra space required. The algorithm
looks good, but it would be very difficult to make sure that it works all the
time, especially given that postgresql does not have sophisticated and/or
tunable storage handling (think of tablespaces).

It is always a space-time trade-off. On one hand we have vacuum, which uses a
constant and maybe negligible amount of space but takes time proportional to
the amount of work. On the other hand we have drop/recreate table, which takes
double the space but is extremely fast, i.e. proportional to data size at the
maximum available I/O bandwidth.

It would be good if there were something in between. Of course it would not be
easy to do, but it has to start somewhere, doesn't it?..:-)

In fact, my colleague has just done a test with SELECT..INTO on our dev
version and it compacted 600Mb -> 400Mb in just 40s(!). We then did a
vacuum full on the same original 600Mb table which is still going after
20mins. Difficult choice! So even in a worse case scenario we could have
a fully vacuumed table within a day.... we're looking at dropping some
indexes in the db to reclaim enough space to be able to fit another copy
of the table on the disk... this is looking very tempting at the
moment....

I recommend this strategy of "vacuuming" be documented in the standard
documentation and FAQ. Given that postgresql is routinely deployed for
databases >10GB, which is larger than small/medium by any definition today, I
think this would be a good move.

Furthermore, this strategy drastically reduces the downtime due to vacuum
full locks. I would say it is worth buying an 80GB IDE disk for this purpose
if you have a database this big.
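
For reference, the copy-and-swap strategy being recommended can be sketched as a toy model, here using Python's sqlite3 as a stand-in for PostgreSQL (in 7.3 the copy step would be `SELECT * INTO resource_new FROM resource`; the table and index names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE resource (id INTEGER, payload TEXT)")
cur.executemany("INSERT INTO resource VALUES (?, ?)",
                [(i, "data") for i in range(1000)])

# Copy the live rows into a fresh, compact table...
cur.execute("CREATE TABLE resource_new AS SELECT * FROM resource")
# ...drop the bloated original, reclaiming its disk space...
cur.execute("DROP TABLE resource")
# ...swap the new table into place and rebuild indexes last.
cur.execute("ALTER TABLE resource_new RENAME TO resource")
cur.execute("CREATE INDEX resource_id_idx ON resource (id)")

rows = cur.execute("SELECT count(*) FROM resource").fetchone()[0]
print(rows)  # -> 1000
```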

Nice to see that my idea helped somebody..:-)

Shridhar

#20Mark Cave-Ayland
m.cave-ayland@webbased.co.uk
In reply to: Shridhar Daithankar (#19)
Re: 7.3.1 takes long time to vacuum table?

Hi Tom, Martijn, Shridhar,

Yeah. Also, I don't suppose you made that a VACUUM VERBOSE and kept the
output? It'd be interesting to see which stages took the most time.

regards, tom lane

I've got the complete set of timings for the table on our dev box. I'm
afraid the vacuum wasn't run in verbose mode so I can't give you any
more information :(. This information is based on the 600Mb table on our
dev system.

Time to vacuum full:
Total: 100m

Time to restore by copying to another table:
SELECT INTO                               40s
CREATE INDEX 1 (bigint) using btree       59s
CREATE INDEX 2 (int) using btree          27s
CREATE INDEX 3 (bigint) using btree       32s
CREATE INDEX 4 (bigint) using btree    3m  2s  -- sorry: INDEX 4 is 24s
CREATE INDEX 5 (varchar) using btree   3m  2s
CREATE INDEX 6 (txtidx) using gist    12m 58s
CREATE INDEX 7 (txtidx) using gist    31m 20s

Total: 51m 22s

OK, so my initial estimate of rebuilding in 10mins was way out because I
forgot about the 2 massive gist fti indexes I had on the table - doh.
However, I feel that the times are still meaningful in that I now have a
replica of the table at 400Mb (down by 200Mb) in just over half the time
that the vacuum full took to do the same job.

We stopped our vacuum on our live system at the 55hr stage whilst it was
still going - it took about another hour from sending the cancel request
until the vacuum stopped. We then deleted several indexes to claim back
enough Gb to hold a second copy of the table and we've just completed
the SELECT INTO into a new table.

And the result? It has taken a total of 1h 45m to generate a copy! Given
that we are rebuilding the table *WITHOUT* the large gist indexes on our
dev version, I guess that it would only be a matter of several hours
before we can rebuild the indexes back up on the table and be using it
again.

I hope that this goes some way to showing that implementing a new type of
vacuum feature, perhaps working like the one suggested earlier in the
thread, would be greatly appreciated by people with databases upwards of
0.5GB in size. I would gladly support/help out anyone who felt they could
implement such a feature.

Cheers,

Sparks.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


#21Shridhar Daithankar
shridhar_daithankar@persistent.co.in
In reply to: Mark Cave-Ayland (#20)
#22Mark Cave-Ayland
m.cave-ayland@webbased.co.uk
In reply to: Shridhar Daithankar (#21)
#23Shridhar Daithankar
shridhar_daithankar@persistent.co.in
In reply to: Mark Cave-Ayland (#22)
#24Stephan Szabo
sszabo@megazone23.bigpanda.com
In reply to: Shridhar Daithankar (#21)
#25Jean-Luc Lachance
jllachan@nsd.ca
In reply to: Mark Cave-Ayland (#10)
#26Mark Cave-Ayland
m.cave-ayland@webbased.co.uk
In reply to: Jean-Luc Lachance (#25)
#27Mark Cave-Ayland
m.cave-ayland@webbased.co.uk
In reply to: Mark Cave-Ayland (#26)
#28Mark Cave-Ayland
m.cave-ayland@webbased.co.uk
In reply to: Mark Cave-Ayland (#27)
#29Shridhar Daithankar
shridhar_daithankar@persistent.co.in
In reply to: Jean-Luc Lachance (#25)