9.3 feature proposal: vacuumdb -j #

Started by Josh Berkus in January 2012 · 13 messages
#1Josh Berkus
josh@agliodbs.com

Hackers,

It occurs to me that I would find it quite personally useful if the
vacuumdb utility was multiprocess capable.

For example, just today I needed to manually analyze a database with
over 500 tables, on a server with 24 cores. And I needed to know when
the analyze was done, because it was part of a downtime. I had to
resort to a python script.

I'm picturing doing this in the simplest way possible: get the list of
tables and indexes, divide them by the number of processes, and give
each child process its own list.
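
For illustration, a minimal sketch of that static-partitioning approach
in Python (the psycopg2 driver, DSN, worker count, and the naive
pg_class query are all placeholder assumptions, not a proposal):

import multiprocessing

import psycopg2

DSN = "dbname=mydb"         # placeholder connection string
N_WORKERS = 24

def vacuum_tables(tables):
    # each worker opens its own connection and works through its chunk
    conn = psycopg2.connect(DSN)
    conn.autocommit = True  # VACUUM cannot run inside a transaction block
    cur = conn.cursor()
    for t in tables:
        cur.execute('VACUUM ANALYZE "%s"' % t)
    conn.close()

def main():
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    cur.execute("SELECT relname FROM pg_class WHERE relkind = 'r'")
    tables = [row[0] for row in cur.fetchall()]
    conn.close()

    # round-robin split: worker i gets every N_WORKERS-th table
    chunks = [tables[i::N_WORKERS] for i in range(N_WORKERS)]
    workers = [multiprocessing.Process(target=vacuum_tables, args=(c,))
               for c in chunks if c]
    for w in workers:
        w.start()
    for w in workers:
        w.join()            # returns only once every chunk is finished

if __name__ == "__main__":
    main()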

Any reason not to hack on this for 9.3?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#2Jan Lentfer
Jan.Lentfer@web.de
In reply to: Josh Berkus (#1)
Re: 9.3 feature proposal: vacuumdb -j #

On 13.01.2012 22:50, Josh Berkus wrote:

It occurs to me that I would find it quite personally useful if the
vacuumdb utility was multiprocess capable.

For example, just today I needed to manually analyze a database with
over 500 tables, on a server with 24 cores. And I needed to know when
the analyze was done, because it was part of a downtime. I had to
resort to a python script.

I'm picturing doing this in the simplest way possible: get the list of
tables and indexes, divide them by the number of processes, and give
each child process its own list.

Any reason not to hack on this for 9.3?

I don't see any reason not to do it, but plenty of reasons to do it.
Right now I have systems hosting many databases that I need to vacuum
full from time to time. I have wrapped vacuumdb in a shell script to
actually use all the capacity that is available. A plain vacuumdb -faz
just isn't that useful on large machines anymore.

Jan

#3Andres Freund
andres@anarazel.de
In reply to: Josh Berkus (#1)
Re: 9.3 feature proposal: vacuumdb -j #

On Friday, January 13, 2012 10:50:32 PM Josh Berkus wrote:

Hackers,

It occurs to me that I would find it quite personally useful if the
vacuumdb utility was multiprocess capable.

For example, just today I needed to manually analyze a database with
over 500 tables, on a server with 24 cores. And I needed to know when
the analyze was done, because it was part of a downtime. I had to
resort to a python script.

I'm picturing doing this in the simplest way possible: get the list of
tables and indexes, divide them by the number of processes, and give
each child process its own list.

That doesn't sound like a good idea. It's way too likely that you will end up
with one backend doing all the work because it got some big tables.

I don't think this task deserves using threads or subprocesses. Multiple
connections from one process seems way more sensible and mostly avoids the
above problem.
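
To make that concrete, a rough sketch of the one-process,
many-connections pattern, using psycopg2's asynchronous mode as an
illustrative stand-in for what vacuumdb would do with libpq's
PQsendQuery (the async_ keyword spelling varies between psycopg2
releases; DSN and connection count are placeholders):

import select

import psycopg2
import psycopg2.extensions

DSN = "dbname=mydb"        # placeholder
N_CONNS = 8

def wait(conn):
    # drive a nonblocking connection until its pending operation completes
    while True:
        state = conn.poll()
        if state == psycopg2.extensions.POLL_OK:
            return
        elif state == psycopg2.extensions.POLL_READ:
            select.select([conn.fileno()], [], [])
        elif state == psycopg2.extensions.POLL_WRITE:
            select.select([], [conn.fileno()], [])

def vacuum_all(tables):
    conns = [psycopg2.connect(DSN, async_=1) for _ in range(N_CONNS)]
    for c in conns:
        wait(c)                       # finish the nonblocking connect
    queue = list(tables)
    busy = {}                         # fileno -> connection
    for c in conns:
        if queue:
            c.cursor().execute('VACUUM ANALYZE "%s"' % queue.pop(0))
            busy[c.fileno()] = c
    while busy:
        # whichever connection finishes first gets the next table, so a
        # few huge tables cannot starve the rest of the list
        ready, _, _ = select.select(list(busy), [], [])
        for fd in ready:
            c = busy[fd]
            if c.poll() == psycopg2.extensions.POLL_OK:
                del busy[fd]
                if queue:
                    c.cursor().execute('VACUUM ANALYZE "%s"' % queue.pop(0))
                    busy[c.fileno()] = c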

Andres

#4Euler Taveira de Oliveira
euler@timbira.com
In reply to: Josh Berkus (#1)
Re: 9.3 feature proposal: vacuumdb -j #

On 13-01-2012 18:50, Josh Berkus wrote:

It occurs to me that I would find it quite personally useful if the
vacuumdb utility was multiprocess capable.

It is in the middle of my TODO list. reindexdb is in the plans too.

--
Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/
PostgreSQL: Consulting, Development, 24x7 Support and Training

#5Christopher Browne
cbbrowne@gmail.com
In reply to: Josh Berkus (#1)
Re: 9.3 feature proposal: vacuumdb -j #

On Fri, Jan 13, 2012 at 4:50 PM, Josh Berkus <josh@agliodbs.com> wrote:

It occurs to me that I would find it quite personally useful if the
vacuumdb utility was multiprocess capable.

For example, just today I needed to manually analyze a database with
over 500 tables, on a server with 24 cores.   And I needed to know when
the analyze was done, because it was part of a downtime.  I had to
resort to a python script.

I'm picturing doing this in the simplest way possible: get the list of
tables and indexes, divide them by the number of processes, and give
each child process its own list.

I think "simplest" isn't *quite* best...

There's the risk that all the big tables get tied to one child, so
that one child ends up doing them serially.

Better:

Have two logical tasks:
a) A process that manages the list, and
b) Child processes doing vacuums.

Each time a child completes a table, it asks the parent for another one.

That way, if there are 8 big tables and 12 child
processes, it's *certain* that the 8 big tables will be spread across
the children.

It guarantees that the child processes will all be busy until there
are fewer tables left than there are child processes.
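
A sketch of that parent/worker design (Python again, purely
illustrative: a shared queue stands in for the managing process, and a
None sentinel tells each child there is no work left):

import multiprocessing

import psycopg2

DSN = "dbname=mydb"        # placeholder
N_WORKERS = 12

def worker(queue):
    conn = psycopg2.connect(DSN)
    conn.autocommit = True            # VACUUM must run outside a transaction
    cur = conn.cursor()
    while True:
        table = queue.get()           # ask the parent for the next table
        if table is None:             # sentinel: nothing left to do
            break
        cur.execute('VACUUM ANALYZE "%s"' % table)
    conn.close()

def main(tables):
    queue = multiprocessing.Queue()
    for t in tables:                  # ideally sorted biggest-first
        queue.put(t)
    for _ in range(N_WORKERS):
        queue.put(None)               # one sentinel per worker
    workers = [multiprocessing.Process(target=worker, args=(queue,))
               for _ in range(N_WORKERS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()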

--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

#6Josh Berkus
josh@agliodbs.com
In reply to: Euler Taveira de Oliveira (#4)
Re: 9.3 feature proposal: vacuumdb -j #

On 1/13/12 2:12 PM, Euler Taveira de Oliveira wrote:

On 13-01-2012 18:50, Josh Berkus wrote:

It occurs to me that I would find it quite personally useful if the
vacuumdb utility was multiprocess capable.

It is in the mid of my TODO list. reindexdb is in the plans too.

I'm even happier to have someone else do it. ;-)

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#7Susanne Ebrecht
susanne@2ndquadrant.com
In reply to: Josh Berkus (#1)
Re: 9.3 feature proposal: vacuumdb -j #

On 13.01.2012 22:50, Josh Berkus wrote:

Hackers,

It occurs to me that I would find it quite personally useful if the
vacuumdb utility was multiprocess capable.

For example, just today I needed to manually analyze a database with
over 500 tables, on a server with 24 cores. And I needed to know when
the analyze was done, because it was part of a downtime. I had to
resort to a python script.

I'm picturing doing this in the simplest way possible: get the list of
tables and indexes, divide them by the number of processes, and give
each child process its own list.

Any reason not to hack on this for 9.3?

Hello,

I like the idea - but ...
I would prefer an option that lets the user say how many cores the
work should be spread across. Something like --share-cores=N.

The default would be the machine's total core count, but users should
be able to say: OK, my machine has 24 cores, but I want vacuumdb to
use just 12 of them.

Especially at startups you find machines that aren't database-only
machines. Often database and web server share a single machine.

You could also run several clusters on the same machine to offer your
business in different languages (one cluster per language). I have
already seen such a setup.

Other workloads may be competing for those cores - so it should be
possible for the user to decide how many cores vacuumdb gets to share.
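
For what it's worth, in wrapper-script terms the closest mechanism is
CPU pinning; a hypothetical, Linux-only sketch (os.sched_setaffinity
is real Python 3.3+ API, but the --share-cores behaviour itself is
only a proposal here):

import os

def pin_to_cores(n):
    # restrict this process, and the workers it forks afterwards, to
    # cores 0..n-1, so e.g. 12 workers can share 6 cores (Linux-only)
    os.sched_setaffinity(0, set(range(n)))

# a wrapper implementing "--share-cores=6" would call pin_to_cores(6)
# before starting its vacuum workers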

Susanne

--
Dipl. Inf. Susanne Ebrecht - 2ndQuadrant
PostgreSQL Development, 24x7 Support, Training and Services
www.2ndQuadrant.com

#8Andres Freund
andres@anarazel.de
In reply to: Susanne Ebrecht (#7)
Re: 9.3 feature proposal: vacuumdb -j #

On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote:

I would prefer an option that lets the user say how many cores the
work should be spread across. Something like --share-cores=N.

Uhm. -j # does exactly that or am I missing your point?

Andres

#9Jaime Casanova
jaime@2ndquadrant.com
In reply to: Andres Freund (#8)
Re: 9.3 feature proposal: vacuumdb -j #

On Tue, Jan 17, 2012 at 7:23 AM, Andres Freund <andres@anarazel.de> wrote:

On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote:

I would prefer an option that lets the user say how many cores the
work should be spread across. Something like --share-cores=N.

Uhm. -j # does exactly that or am I missing your point?

Not really.

If you have 12 cores and you say -j 12, you get one process per
core. With Susanne's suggestion, AFAIUI, you could say -j 12
--share-cores=6, so you would only use 6 of the 12 cores and have 2
processes per core.

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: 24x7 Support and Training

#10Andrew Dunstan
andrew@dunslane.net
In reply to: Jaime Casanova (#9)
Re: 9.3 feature proposal: vacuumdb -j #

On 01/17/2012 07:33 AM, Jaime Casanova wrote:

On Tue, Jan 17, 2012 at 7:23 AM, Andres Freund <andres@anarazel.de> wrote:

On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote:

I would prefer an option that lets the user say how many cores the
work should be spread across. Something like --share-cores=N.

Uhm. -j # does exactly that or am I missing your point?

Not really.

If you have 12 cores and you say -j 12, you get one process per
core. With Susanne's suggestion, AFAIUI, you could say -j 12
--share-cores=6, so you would only use 6 of the 12 cores and have 2
processes per core.

That looks messy. IMNSHO it should work just like pg_restore's -j.

cheers

andrew

#11Andres Freund
andres@anarazel.de
In reply to: Jaime Casanova (#9)
Re: 9.3 feature proposal: vacuumdb -j #

On Tuesday, January 17, 2012 01:33:06 PM Jaime Casanova wrote:

On Tue, Jan 17, 2012 at 7:23 AM, Andres Freund <andres@anarazel.de> wrote:

On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote:

I would prefer an option that lets the user say how many cores the
work should be spread across. Something like --share-cores=N.

Uhm. -j # does exactly that or am I missing your point?

Not really.

If you have 12 cores and you say -j 12, you get one process per
core. With Susanne's suggestion, AFAIUI, you could say -j 12
--share-cores=6, so you would only use 6 of the 12 cores and have 2
processes per core.

I don't really get what that should do. If vacuumdb itself is a bottleneck in
any form here, we did something *very* wrong. (In my opinion, using processes
for this is pointless anyway; using async queries seems much easier for this
special case, especially for distributing individual commands.)
I don't really see how you could enforce sharing cores on the server side
(well, there are cpusets, but we're surely not going to introduce usage of
those just for vacuumdb).

Andres

#12Jim Nasby
jim@nasby.net
In reply to: Christopher Browne (#5)
Re: 9.3 feature proposal: vacuumdb -j #

On Jan 13, 2012, at 4:15 PM, Christopher Browne wrote:

Have two logical tasks:
a) A process that manages the list, and
b) Child processes doing vacuums.

Each time a child completes a table, it asks the parent for another one.

There is also a middle ground, because having a scheduling process sounds like a lot more work than what Josh was proposing.

CREATE TEMP SEQUENCE s;
SELECT relname, nextval('s') % <number of backends> AS backend_number
FROM ( SELECT relname
       FROM pg_class
       ORDER BY relpages
     ) t;
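
Each backend k would then vacuum only its own bucket; hypothetically,
in illustrative Python (psycopg2 assumed; the temp sequence is
per-session, and the ORDER BY keeps the assignment stable enough for
a sketch):

def tables_for(cur, k, n):
    # recreate the bucket assignment in this session and keep bucket k
    cur.execute("CREATE TEMP SEQUENCE s")
    cur.execute(
        "SELECT relname, nextval('s') % {0} AS backend_number "
        "FROM ( SELECT relname FROM pg_class "
        "       ORDER BY relpages ) t".format(n))
    return [name for name, bucket in cur.fetchall() if bucket == k]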

Of course, an actual scheduling process is most likely the most efficient approach.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

#13Andrew Dunstan
andrew@dunslane.net
In reply to: Jim Nasby (#12)
Re: 9.3 feature proposal: vacuumdb -j #

On 01/17/2012 07:09 PM, Jim Nasby wrote:

On Jan 13, 2012, at 4:15 PM, Christopher Browne wrote:

Have two logical tasks:
a) A process that manages the list, and
b) Child processes doing vacuums.

Each time a child completes a table, it asks the parent for another one.

There is also a middle ground, because having a scheduling process sounds like a lot more work than what Josh was proposing.

CREATE TEMP SEQUENCE s;
SELECT relname, nextval('s') % <number of backends> AS backend_number
FROM ( SELECT relname
       FROM pg_class
       ORDER BY relpages
     ) t;

Of course, an actual scheduling process is most likely the most efficient approach.

We already have a model for this in parallel pg_restore. It would
probably not be terribly hard to adapt to parallel vacuum.

cheers

andrew