vacuum output question

Started by Dan Armbrust over 17 years ago · 8 messages · general
#1 Dan Armbrust
daniel.armbrust.list@gmail.com

I have a system backed by a PostgreSQL DB at a customer site that
mysteriously slowed way down - and couldn't keep up with the load for
no apparent reason.

I had them run a vacuum analyze verbose on my database, and had these
lines come back which made me suspicious:

INFO: index "ix_cpe_ispid" now contains 41626 row versions in 13727 pages
DETAIL: 5224 index row versions were removed.
1543 index pages have been deleted, 1373 are currently reusable.
CPU 13.09s/3.51u sec elapsed 157.85 sec.

INFO: index "ix_cpe_enable" now contains 41628 row versions in 29417 pages
DETAIL: 5224 index row versions were removed.
3706 index pages have been deleted, 3291 are currently reusable.
CPU 31.27s/8.22u sec elapsed 687.60 sec.

INFO: "cpe": found 5224 removable, 41626 nonremovable row versions in
1303 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 22416 unused item pointers.
0 pages are entirely empty.
CPU 44.46s/11.82u sec elapsed 852.85 sec.

Why did those particular tables and indexes take _so_ long to vacuum?
Perhaps we have a disk level IO problem on this system?

Can someone tell me what 'CPU 44.46s/11.82u sec' means? I have a
guess, but I'm not sure.

Thanks,

Dan

#2 Scott Marlowe
scott.marlowe@gmail.com
In reply to: Dan Armbrust (#1)
Re: vacuum output question

On Thu, Nov 13, 2008 at 4:08 PM, Dan Armbrust
<daniel.armbrust.list@gmail.com> wrote:

> I have a system backed by a PostgreSQL DB at a customer site that
> mysteriously slowed way down - and couldn't keep up with the load for
> no apparent reason.
>
> I had them run a vacuum analyze verbose on my database, and had these
> lines come back which made me suspicious:
>
> INFO: index "ix_cpe_ispid" now contains 41626 row versions in 13727 pages
> DETAIL: 5224 index row versions were removed.
> 1543 index pages have been deleted, 1373 are currently reusable.
> CPU 13.09s/3.51u sec elapsed 157.85 sec.
>
> INFO: index "ix_cpe_enable" now contains 41628 row versions in 29417 pages
> DETAIL: 5224 index row versions were removed.
> 3706 index pages have been deleted, 3291 are currently reusable.
> CPU 31.27s/8.22u sec elapsed 687.60 sec.
>
> INFO: "cpe": found 5224 removable, 41626 nonremovable row versions in
> 1303 pages
> DETAIL: 0 dead row versions cannot be removed yet.
> There were 22416 unused item pointers.
> 0 pages are entirely empty.
> CPU 44.46s/11.82u sec elapsed 852.85 sec.

That's a fair bit of dead space, but 60k rows isn't really that many.

> Why did those particular tables and indexes take _so_ long to vacuum?
> Perhaps we have a disk level IO problem on this system?

Assuming pagesize is 8k, then we're talking about scanning 1303*8192
bytes or 10 Megabytes. My laptop can scan that in less than a second.

So, either the hard drive is incredibly fragmented, or there's
something wrong with that machine.
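Scott's back-of-the-envelope numbers can be checked directly. A minimal sketch, assuming PostgreSQL's default 8 KB block size and taking the page counts and elapsed times from the VACUUM VERBOSE output above:

```python
# Effective scan throughput implied by the VACUUM VERBOSE output,
# assuming PostgreSQL's default 8 KB block size.
PAGE_SIZE = 8192  # bytes

def scan_stats(pages, elapsed_sec):
    """Return (megabytes scanned, effective throughput in KB/s)."""
    mb = pages * PAGE_SIZE / (1024 * 1024)
    return mb, mb * 1024 / elapsed_sec

# "cpe" table: 1303 pages, 852.85 s elapsed
table_mb, table_kbs = scan_stats(1303, 852.85)
# ix_cpe_enable index: 29417 pages, 687.60 s elapsed
index_mb, index_kbs = scan_stats(29417, 687.60)

print(f"table: {table_mb:.1f} MB at ~{table_kbs:.0f} KB/s")
print(f"index: {index_mb:.1f} MB at ~{index_kbs:.0f} KB/s")
```

Tens to a few hundred KB/s is orders of magnitude below what even a single slow disk can deliver sequentially, which is why the numbers point at the I/O subsystem rather than at VACUUM itself.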

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dan Armbrust (#1)
Re: vacuum output question

"Dan Armbrust" <daniel.armbrust.list@gmail.com> writes:

> Why did those particular tables and indexes take _so_ long to vacuum?
> Perhaps we have a disk level IO problem on this system?

FWIW, I agree with Scott that you seem to have an overstressed I/O
system. It's hard to tell why from here.

> Can someone tell me what 'CPU 44.46s/11.82u sec' means? I have a
> guess, but I'm not sure.

That's the vacuum process's system and user CPU-time consumption as
reported by getrusage(2). It's evidently only a minor component of the
elapsed runtime, though, so you need to be looking at I/O costs.

regards, tom lane
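To make Tom's point concrete, compare the CPU figures with the elapsed time from the table pass above:

```python
# "CPU 44.46s/11.82u sec elapsed 852.85 sec" means 44.46 s of system
# CPU time and 11.82 s of user CPU time over 852.85 s of wall time.
sys_cpu, user_cpu, elapsed = 44.46, 11.82, 852.85

cpu_fraction = (sys_cpu + user_cpu) / elapsed
print(f"CPU busy: {cpu_fraction:.1%}")  # the remaining ~93% is mostly I/O wait
```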

#4 Craig Ringer
craig@2ndquadrant.com
In reply to: Scott Marlowe (#2)
Re: vacuum output question

Scott Marlowe wrote:

> On Thu, Nov 13, 2008 at 4:08 PM, Dan Armbrust
> <daniel.armbrust.list@gmail.com> wrote:
>
>> Why did those particular tables and indexes take _so_ long to vacuum?
>> Perhaps we have a disk level IO problem on this system?
>
> Assuming pagesize is 8k, then we're talking about scanning 1303*8192
> bytes or 10 Megabytes. My laptop can scan that in less than a second.
>
> So, either the hard drive is incredibly fragmented, or there's
> something wrong with that machine.

If you're using a system with RAID storage, I'd check to make sure
there's no high priority background rebuild or RAID scrubbing going on.
I've had issues in the past where RAID issues have crippled a server to
the point where it's barely able to handle ssh logins.

[Assuming a Linux server:] Naturally it's also worth checking for
processes doing absurd amounts of I/O. See pidstat -d, iostat, vmstat.
Also look for processes constantly in the `D' state in `ps aux' or `top',
which usually indicates heavy I/O. dmesg may also contain useful
information if it's a low-level issue.

--
Craig Ringer
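Craig's D-state check can be scripted. A rough sketch that filters the text produced by `ps -eo stat,pid,comm` (the column layout is assumed; this is illustrative, not a substitute for iostat):

```python
# Rough sketch: flag processes stuck in uninterruptible sleep ('D'),
# which on Linux usually means they are blocked on I/O.

def d_state_processes(ps_output: str) -> list:
    """Return ps lines whose STAT column starts with 'D'."""
    lines = ps_output.strip().splitlines()[1:]  # skip the header row
    return [ln for ln in lines if ln.split(None, 1)[0].startswith("D")]

# Example against a captured snippet of ps output:
sample = """STAT    PID COMMAND
Ss        1 init
D      4242 pdflush
R+     9001 ps"""
print(d_state_processes(sample))  # only the 'D'-state pdflush line survives
```

In a live check you would feed it the output of `ps -eo stat,pid,comm` and run it a few times in a row; a process that stays in `D` across samples is the one to investigate.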

#5 Simon Riggs
simon@2ndQuadrant.com
In reply to: Scott Marlowe (#2)
Re: vacuum output question

On Thu, 2008-11-13 at 17:30 -0700, Scott Marlowe wrote:

>> I had them run a vacuum analyze verbose on my database, and had these
>> lines come back which made me suspicious:
>>
>> INFO: index "ix_cpe_ispid" now contains 41626 row versions in 13727 pages
>> DETAIL: 5224 index row versions were removed.
>> 1543 index pages have been deleted, 1373 are currently reusable.
>> CPU 13.09s/3.51u sec elapsed 157.85 sec.
>>
>> INFO: index "ix_cpe_enable" now contains 41628 row versions in 29417 pages
>> DETAIL: 5224 index row versions were removed.
>> 3706 index pages have been deleted, 3291 are currently reusable.
>> CPU 31.27s/8.22u sec elapsed 687.60 sec.
>>
>> INFO: "cpe": found 5224 removable, 41626 nonremovable row versions in
>> 1303 pages
>> DETAIL: 0 dead row versions cannot be removed yet.
>> There were 22416 unused item pointers.
>> 0 pages are entirely empty.
>> CPU 44.46s/11.82u sec elapsed 852.85 sec.
>
> That's a fair bit of dead space, but 60k rows isn't really that many.
>
>> Why did those particular tables and indexes take _so_ long to vacuum?
>> Perhaps we have a disk level IO problem on this system?
>
> Assuming pagesize is 8k, then we're talking about scanning 1303*8192
> bytes or 10 Megabytes. My laptop can scan that in less than a second.
>
> So, either the hard drive is incredibly fragmented, or there's
> something wrong with that machine.

There was concurrent access to the table during VACUUMing, so the long
delay is explainable as long waits for cleanup lock, plus probably
thrashing the cache with bloated indexes. The CPU overhead per row seems
OK. We should instrument the wait time during a VACUUM and report that
also.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#6 Dan Armbrust
daniel.armbrust.list@gmail.com
In reply to: Craig Ringer (#4)
Re: vacuum output question

Thanks everyone. You have helped back up my suspicions. It is indeed
a Linux system, and it has a RAID IO system, but I don't yet know the
details of that IO system. Time to put them back to work looking at
their hardware, rather than blaming our software :)

Thanks for the extra tips on hunting for excess IO from other processes.

Dan

#7 Dan Armbrust
daniel.armbrust.list@gmail.com
In reply to: Simon Riggs (#5)
Re: vacuum output question

> There was concurrent access to the table during VACUUMing, so the long
> delay is explainable as long waits for cleanup lock, plus probably
> thrashing the cache with bloated indexes. The CPU overhead per row seems
> OK. We should instrument the wait time during a VACUUM and report that
> also.
>
> --
> Simon Riggs www.2ndQuadrant.com
> PostgreSQL Training, Services and Support

Is that a guess? Or something you can tell from the log above?
Because there shouldn't have been any concurrent access while the
VACUUM was run - the customers had failed over to a different system,
so while I can't be sure, I expect that there was no other database
activity at the time the command was run.

Thanks,

Dan

#8 Simon Riggs
simon@2ndQuadrant.com
In reply to: Dan Armbrust (#7)
Re: vacuum output question

On Fri, 2008-11-14 at 09:00 -0600, Dan Armbrust wrote:

>> There was concurrent access to the table during VACUUMing, so the long
>> delay is explainable as long waits for cleanup lock, plus probably
>> thrashing the cache with bloated indexes. The CPU overhead per row seems
>> OK. We should instrument the wait time during a VACUUM and report that
>> also.
>
> Is that a guess? Or something you can tell from the log above?

The number of row versions differed between the two indexes after
vacuuming (41626 vs. 41628). That tells me some writes occurred while the
VACUUM ran, and I inferred that other, read-only activity occurred as
well. Both reads and writes will slow down a VACUUM.

Perhaps you have vacuum_cost_delay set also?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
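
Simon's last question is easy to check from psql. If cost-based vacuum delay is enabled, VACUUM deliberately sleeps once it has accrued enough page-cost points, which stretches elapsed time without consuming CPU:

```sql
-- A non-zero vacuum_cost_delay throttles VACUUM deliberately.
SHOW vacuum_cost_delay;   -- 0 means cost-based throttling is off
SHOW vacuum_cost_limit;
```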