vacuum takes too long
Hi,
insert a few rows into a table (>50,000) and do
delete from mytable;
vacuum verbose analyze;
Why is this so slow?
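To reproduce, something like this works (the table layout is just an
example; 6.4 has no row generator, so the rows are built by repeated
self-insert):

    CREATE TABLE mytable (id int4, t text);
    INSERT INTO mytable VALUES (1, 'x');
    -- repeat the next line ~16 times to double past 50,000 rows
    INSERT INTO mytable SELECT * FROM mytable;
    DELETE FROM mytable;
    VACUUM VERBOSE ANALYZE;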
[I am not on hackers@postgreSQL.org]
---
David Wetzel, Turbocat's Development
Buchhorster Strasse, D-16567 Muehlenbeck/Berlin, FRG
Fax +49 33056 82835, NeXTmail dave@turbocat.de
http://www.turbocat.de/
> Hi,
> insert a few rows into a table (>50,000) and do
> delete from mytable;
> vacuum verbose analyze;
> Why is this so slow?
Analyze checks every column in every row.
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
From: Bruce Momjian <maillist@candle.pha.pa.us>
> > insert a few rows into a table (>50,000) and do
> > delete from mytable;
> > vacuum verbose analyze;
> > Why is this so slow?
> Analyze checks every column in every row.
Even if you only type "vacuum verbose" it takes _very_ long.
I deleted _all_ records with "delete from mytable;" before.
A drop and a new create is faster. But what about when you delete (maybe)
100,000 rows but keep 100 in the table?
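Roughly what the drop-and-recreate workaround looks like when a few rows
have to survive (the 'keep' condition is made up, just to mark the
survivors):

    SELECT * INTO TABLE mytable_tmp FROM mytable WHERE keep = 't';
    DROP TABLE mytable;
    ALTER TABLE mytable_tmp RENAME TO mytable;
    -- indexes, grants and triggers have to be recreated by hand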
I use 6.4.2 on NetBSD/i486 (that box gets 12 MB/sec out of the drives
through the filesystem).
---
David Wetzel, Turbocat's Development
> Even if you only type "vacuum verbose" it takes _very_ long.
> I deleted _all_ records with "delete from mytable;" before.
> A drop and a new create is faster. But what about when you delete
> (maybe) 100,000 rows but keep 100 in the table?
> I use 6.4.2 on NetBSD/i486 (that box gets 12 MB/sec out of the drives
> through the filesystem).
Not sure what to say. Vacuum does take a while, and it is often faster
to drop and recreate.
--
Bruce Momjian
On Wed, 6 Jan 1999, Bruce Momjian wrote:
> > A drop and a new create is faster. But what about when you delete
> > (maybe) 100,000 rows but keep 100 in the table?
> Not sure what to say. Vacuum does take a while, and it is often faster
> to drop and recreate.
Let's ignore the 'analyze' part first...take a simple 'vacuum'
command...what takes the longest? My understanding is that a vacuum,
simplistically, takes and moves all rows "up" in the file to fill in any
blanks resulting from updates and deletes, then truncates the end of the
file...
If so, is there no way of having vacuum run on its own?
Basically, if my understanding is remotely correct, vacuum is
defragmenting the table...so why can't the defragmenting be performed
during idle time...or at least some of it?
Start at the top of the table, go to the first 'blank' section (a deleted
record)...find the next good record that will fit in the space, move it
there...clear out the old space, etc...
If the DBA issues a 'vacuum', lock the table and do all records at once,
but otherwise try to vacuum the table live...
With the new MVCC serialization, this concept should be less intrusive on
readers, no?
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
> If the DBA issues a 'vacuum', lock the table and do all records at once,
> but otherwise try to vacuum the table live...
> With the new MVCC serialization, this concept should be less intrusive
> on readers, no?
Wish I knew the answer. I can guess, but that isn't going to help.
--
Bruce Momjian
On Wed, 6 Jan 1999, Bruce Momjian wrote:
> > With the new MVCC serialization, this concept should be less intrusive
> > on readers, no?
> Wish I knew the answer. I can guess, but that isn't going to help.
Guess == throwing in ideas, even if they are incorrect...the way I
figure it, I throw out a lot of guesses...some of them spark ideas in
others and we see some really neat ideas come out of it :)
Marc G. Fournier
> Guess == throwing in ideas, even if they are incorrect...the way I
> figure it, I throw out a lot of guesses...some of them spark ideas in
> others and we see some really neat ideas come out of it :)
Yes. Good. I have been bugging Vadim about possible row reuse, but I
don't know enough to understand the options.
--
Bruce Momjian
On Wed, 6 Jan 1999, Bruce Momjian wrote:
> Yes. Good. I have been bugging Vadim about possible row reuse, but I
> don't know enough to understand the options.
I'm not sure about the row-reuse thing. What sort of performance hit will
it have? As it is now, you add a row by zipping down to the end, add the
row...bang, finished. With row-reuse, you have to search for a good fit,
which could take time...
Hrmmm...let's look at Oracle's "model"...bear in mind that I haven't dived
very deep into it, so I could be totally off base here, but, with Oracle,
you have a separate "group" of processes started up for each 'instance',
where, if I'm correct, an instance is the same as our database(?)...
How hard would it be for us to implement something similar? When you
start up the postmaster, it starts one postgres "master process" for each
database that it knows about. The master process effectively acts as the
garbage collector for the database, as well as the central 'traffic
cop'...
So, for example, I have 4 databases on my server...when you start up the
system with your normal 'postmaster' process, it forks off 4 processes,
one for each database. When you connect to port #### for database XXXX,
the listening process (main postmaster) shunts the connection over to the
appropriate 'traffic cop' for handling...
The 'traffic cop' would keep track of how many connections to the
database are currently open, and when zero, which would indicate idle
time, work through a table in the database to clean it up. As soon as
a new connection comes in, it would "finish" its cleanup by making sure
the table is in a 'sane state' (i.e. finish up with its current record)
and then fork off the process, to wait quietly until it's idle again...
Then each database could effectively have its own shared memory pool
that could be adjusted on a per-database basis. Maybe even add a 'change
threshold', where after X transactions (update, insert or delete), the
table gets auto-vacuum'd (no analyze, just vacuum)...the threshold could
be set on a per-table basis...the 'traffic cop' should be able to easily
keep track of that sort of stats internally...no?
Hell, the 'traffic cop' *should* be able to keep reasonably accurate stats
to update the same tables that a 'vacuum analyze' maintains, adjusting
those values periodically to give a semi-accurate picture. Periodically,
a normal 'analyze' would have to be run...
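A crude manual version of that check is possible even now: compare the
planner's idea of the table with reality, and vacuum when they drift too
far apart (the threshold is whatever you pick):

    -- what the planner currently believes
    SELECT relname, reltuples FROM pg_class WHERE relname = 'mytable';
    -- what is actually there
    SELECT count(*) FROM mytable;
    -- if the two differ by more than your threshold:
    VACUUM ANALYZE mytable;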
It's a thought...haven't got a clue as to the complexity of implementing
it, but...*shrug*
Marc G. Fournier
> Maybe even add a 'change threshold', where after X transactions (update,
> insert or delete), the table gets auto-vacuum'd (no analyze, just
> vacuum)...the threshold could be set on a per-table basis...the 'traffic
> cop' should be able to easily keep track of that sort of stats
> internally...no?
With MVCC an occasional 'vacuum analyze' should only be noticed from the
performance improvements. As far as I can tell most of the work done by
an analyze is in reading the table data. If you make sure to write the
new information at the end of the transaction you only lock the indexes
for the amount of time it takes to write them.
I see a 'vacuum analyze' being less of a problem than 'vacuum'.
Any of you experts, feel free to contradict my assumptions.
Just my guess,
-DEJ
On Thu, 7 Jan 1999, Jackson, DeJuan wrote:
> With MVCC an occasional 'vacuum analyze' should only be noticed from the
> performance improvements. As far as I can tell most of the work done by
> an analyze is in reading the table data.
> I see a 'vacuum analyze' being less of a problem than 'vacuum'.
Good point...I seem to recall that at one point, there was a lock imposed
on one of the pg_ tables when a vacuum is started, since it has to update
a couple of the rows in that table...has that lock been removed with MVCC?
Vadim?
Marc G. Fournier
> Good point...I seem to recall that at one point, there was a lock imposed
> on one of the pg_ tables when a vacuum is started, since it has to update
> a couple of the rows in that table...has that lock been removed with
> MVCC? Vadim?
Well, even if a vacuum locks whatever pg_ table row holds the indexing
statistics for the table in question, MVCC won't block the optimizer's
reads. As long as no more vacuum analyzes are run, there shouldn't even
be a waiting transaction.
-DEJ
> With MVCC an occasional 'vacuum analyze' should only be noticed from the
> performance improvements.
> I see a 'vacuum analyze' being less of a problem than 'vacuum'.
The problem is that vacuum analyze does both vacuum and analyze.
Analyze takes so long, we figured we might as well vacuum too. Maybe we
need to change that.
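A sketch of what the split might look like from the user's side (the
standalone ANALYZE statement is hypothetical; right now analyze exists
only as a vacuum option):

    VACUUM mytable;          -- reclaim space only
    VACUUM ANALYZE mytable;  -- reclaim space and gather statistics
    ANALYZE mytable;         -- proposed: gather statistics without vacuuming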
--
Bruce Momjian
> The problem is that vacuum analyze does both vacuum and analyze.
> Analyze takes so long, we figured we might as well vacuum too. Maybe we
> need to change that.
It seems that VACUUM would have to lock most of the rows in a table even
with MVCC, whereas ANALYZE can benefit directly from MVCC never blocking
a reader. I for one agree with the separation (hey, I always thought they
were separate).
How hard would it be to have VACUUM cheat on the table write? Have
VACUUM take advantage of the fact that we are actually using a filesystem
file as much as possible. The actual moving of the rows could then be
accomplished in the time it takes to select all the rows into a new table
file and then swap in that file. There might be some issues with
file-node trouble in the system catalogs, but those could be taken care
of quickly as well. The only thing you'd have to watch for is a write to
the table in the middle of your reading of the rows. CLUSTER could also
use the same system with an order by on an index. Let me know what you
think.
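The same effect, sketched at the SQL level rather than by swapping file
nodes underneath (table names are placeholders; the real proposal would
keep the table's identity and just point it at the new file):

    SELECT * INTO TABLE mytable_compact FROM mytable;
    DROP TABLE mytable;
    ALTER TABLE mytable_compact RENAME TO mytable;
    -- indexes have to be rebuilt; with an ORDER BY on the first SELECT
    -- this doubles as a poor man's CLUSTER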
You know what else...MVCC would allow us to skip updating indexes on
a COPY if combined with the new quicker ANALYZE after the COPY.
Keeping the ideas coming,
-DEJ
On Thu, 7 Jan 1999, Bruce Momjian wrote:
> > I see a 'vacuum analyze' being less of a problem than 'vacuum'.
> The problem is that vacuum analyze does both vacuum and analyze.
> Analyze takes so long, we figured we might as well vacuum too. Maybe we
> need to change that.
There is, IMHO, no problem with them being combined...in the past, the
problem was that the whole system was effectively locked up while a vacuum
analyze was being run, because one of the 'statistics' tables was locked
during the whole thing instead of only when required...
As DeJuan points out, though, this should no longer be a problem with
MVCC...
Marc G. Fournier
Added to TODO:
* Create a background process for each database that runs while the
  database is idle, finding superseded rows, gathering stats and vacuuming
--
Bruce Momjian