how to monitor the progress of really large bulk operations?

Started by Mike Sofen · over 9 years ago · 6 messages · general
#1 Mike Sofen
msofen@runbox.com

Hi gang,

On PG 9.5.1 on Linux, I'm running some large ETL operations, migrating data
from a legacy MySQL system into PG, upwards of 250m rows in a transaction
(it's on a big box). It's always a two-step operation: first, extract the raw
MySQL data and pull it to the target big box into staging tables that match
the source; second, read the landed dataset and transform it into the final
formats, linking to newly generated ids, compressing big subsets into jsonb
documents, etc.

While I could break it into smaller chunks, it hasn't been necessary, and it
doesn't eliminate my need: a way to view the state of a transaction in
flight - how many rows have been read or inserted so far (is that even
possible for an open transaction?), memory allocations across the various PG
processes, etc.

Possible or a hallucination?

Mike Sofen (Synthetic Genomics)
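
For context, a sketch of what is observable on 9.5 from a second session (none
of this is from the thread, and the staging table name is invented): the
statistics views don't expose per-row progress for an open transaction, but
you can watch what the loading backend is doing, and you can watch the staging
table's physical growth, since rows inserted by an uncommitted transaction
already occupy pages on disk.

-- what the loading backend is doing right now
-- (on 9.5 "waiting" is a boolean; the wait_event columns arrive in 9.6)
SELECT pid, state, waiting, now() - xact_start AS xact_age,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE state <> 'idle';

-- rows inserted by the open transaction are not yet counted in
-- pg_stat_user_tables, but they already take up pages, so physical growth
-- works as a crude progress proxy (staging.raw_sample is hypothetical)
SELECT pg_size_pretty(pg_total_relation_size('staging.raw_sample'));

Per-backend memory is easier to read from the OS (top, ps, /proc/<pid>/status)
than from inside Postgres on 9.5.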

#2 Melvin Davidson
melvin6925@gmail.com
In reply to: Mike Sofen (#1)
Re: how to monitor the progress of really large bulk operations?

On Tue, Sep 27, 2016 at 5:03 PM, Mike Sofen <msofen@runbox.com> wrote:

Hi gang,

On PG 9.5.1, linux, I’m running some large ETL operations, migrate data
from a legacy mysql system into PG, upwards of 250m rows in a transaction
(it’s on a big box). It’s always a 2 step operation – extract raw mysql
data and pull it to the target big box into staging tables that match the
source, the second step being read the landed dataset and transform it into
the final formats, linking to newly generated ids, compressing big subsets
into jsonb documents, etc.

While I could break it into smaller chunks, it hasn’t been necessary, and
it doesn’t eliminate my need: how to view the state of a transaction in
flight, seeing how many rows have been read or inserted (possible for a
transaction in flight?), memory allocations across the various PG
processes, etc.

Possible or a hallucination?

Mike Sofen (Synthetic Genomics)

AFAIK, it is not currently possible to monitor the progress/status of a
query. However, I do see that this is planned for "sometime in the future".
https://postgresql.uservoice.com/forums/21853-general/suggestions/249745-query-progress-reported-by-percentage-in-pg-stat

--
Melvin Davidson
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.

#3 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Mike Sofen (#1)
Re: how to monitor the progress of really large bulk operations?

Hi

2016-09-27 23:03 GMT+02:00 Mike Sofen <msofen@runbox.com>:

Hi gang,

On PG 9.5.1, linux, I’m running some large ETL operations, migrate data
from a legacy mysql system into PG, upwards of 250m rows in a transaction
(it’s on a big box). It’s always a 2 step operation – extract raw mysql
data and pull it to the target big box into staging tables that match the
source, the second step being read the landed dataset and transform it into
the final formats, linking to newly generated ids, compressing big subsets
into jsonb documents, etc.

While I could break it into smaller chunks, it hasn’t been necessary, and
it doesn’t eliminate my need: how to view the state of a transaction in
flight, seeing how many rows have been read or inserted (possible for a
transaction in flight?), memory allocations across the various PG
processes, etc.

Possible or a hallucination?

Mike Sofen (Synthetic Genomics)

some years ago I used a trick
http://okbob.blogspot.cz/2014/09/nice-unix-filter-pv.html#links

Regards

Pavel
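
The linked post is about using pv(1) as a filter in the pipeline, so its
running byte count doubles as a progress display. Roughly the same idea can be
driven from psql with a client-side \copy, something like the sketch below
(table, expected size and path are all invented; -s just lets pv show a
percentage):

-- stream the extract through pv on the client; pv prints bytes moved,
-- throughput and (with -s) a percentage to the terminal while COPY runs
\copy staging.raw_orders to program 'pv -s 50G > /data/raw_orders.tsv'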

#4 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Pavel Stehule (#3)
Re: how to monitor the progress of really large bulk operations?

2016-09-28 6:13 GMT+02:00 Pavel Stehule <pavel.stehule@gmail.com>:

Hi

2016-09-27 23:03 GMT+02:00 Mike Sofen <msofen@runbox.com>:

Hi gang,

On PG 9.5.1, linux, I’m running some large ETL operations, migrate data
from a legacy mysql system into PG, upwards of 250m rows in a transaction
(it’s on a big box). It’s always a 2 step operation – extract raw mysql
data and pull it to the target big box into staging tables that match the
source, the second step being read the landed dataset and transform it into
the final formats, linking to newly generated ids, compressing big subsets
into jsonb documents, etc.

While I could break it into smaller chunks, it hasn’t been necessary, and
it doesn’t eliminate my need: how to view the state of a transaction in
flight, seeing how many rows have been read or inserted (possible for a
transaction in flight?), memory allocations across the various PG
processes, etc.

Possible or a hallucination?

Mike Sofen (Synthetic Genomics)

some years ago I used a trick http://okbob.blogspot.cz/2014/09/nice-unix-filter-pv.html#links

pltoolbox has counter function
https://github.com/okbob/pltoolbox/blob/master/utils.c

pavel=# insert into omega2 select (x.xx).*
from (select pst.counter(omega,200000, true) xx
from omega
) x;
NOTICE: processed 200000 rows, current value is '(5,8)'
NOTICE: processed 400000 rows, current value is '(6,8)'
NOTICE: processed 600000 rows, current value is '(7,8)'
NOTICE: processed 800000 rows, current value is '(1,8)'
NOTICE: processed 1000000 rows, current value is '(5,8)'


Regards

Pavel

#5 Mike Sofen
msofen@runbox.com
In reply to: Pavel Stehule (#4)
Re: how to monitor the progress of really large bulk operations?

From: Pavel Stehule Sent: Tuesday, September 27, 2016 9:18 PM
2016-09-28 6:13 GMT+02:00 Pavel Stehule <pavel.stehule@gmail.com>:

Hi

2016-09-27 23:03 GMT+02:00 Mike Sofen <msofen@runbox.com>:

Hi gang,

how to view the state of a transaction in flight, seeing how many rows have been read or inserted (possible for a transaction in flight?), memory allocations across the various PG processes, etc.

some years ago I used a trick http://okbob.blogspot.cz/2014/09/nice-unix-filter-pv.html#links

pltoolbox has counter function https://github.com/okbob/pltoolbox/blob/master/utils.c

pavel=# insert into omega2 select (x.xx).*
from (select pst.counter(omega,200000, true) xx
from omega
) x;
NOTICE: processed 200000 rows, current value is '(5,8)'

Regards

Pavel

Pavel - That's a very interesting function, and thanks for sharing your toolbox. The big question, of course, is what the impact is on performance, scalability and stability. Would it work inside of a stored function that would allow me to write out the progress to a tracking table?

Mike

#6 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Mike Sofen (#5)
Re: how to monitor the progress of really large bulk operations?

2016-09-28 14:34 GMT+02:00 Mike Sofen <msofen@runbox.com>:

From: Pavel Stehule Sent: Tuesday, September 27, 2016 9:18 PM
2016-09-28 6:13 GMT+02:00 Pavel Stehule <pavel.stehule@gmail.com>:

Hi

2016-09-27 23:03 GMT+02:00 Mike Sofen <msofen@runbox.com>:

Hi gang,

how to view the state of a transaction in flight, seeing how many rows
have been read or inserted (possible for a transaction in flight?), memory
allocations across the various PG processes, etc.

some years ago I used a trick http://okbob.blogspot.cz/2014/09/nice-unix-filter-pv.html#links

pltoolbox has counter function https://github.com/okbob/pltoolbox/blob/master/utils.c

pavel=# insert into omega2 select (x.xx).*
from (select pst.counter(omega,200000, true) xx
from omega
) x;
NOTICE: processed 200000 rows, current value is '(5,8)'

Regards

Pavel

Pavel - That’s a very interesting function and thanks for sharing your
toolbox. The big question of course, is what is the impact on performance,
scalability and stability? Would it work inside of a stored function that
would allow me write out the progress to a tracking table?

When IO is the bottleneck, the counter has zero overhead. The usage is the
same as for any PostgreSQL set-returning function. The function should be
stable; it is pretty simple.

Regards

Pavel

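On Mike's tracking-table idea, one caveat worth noting: rows a function
inserts into a tracking table are only visible to other sessions after the
surrounding transaction commits, so for live progress the usual options are
RAISE NOTICE (which is what pst.counter emits) or an out-of-transaction write
through dblink. A minimal sketch of the NOTICE approach, with all table and
function names invented:

CREATE OR REPLACE FUNCTION transform_raw_sample() RETURNS bigint AS $$
DECLARE
    rows_done bigint := 0;
    rec       record;
BEGIN
    FOR rec IN SELECT * FROM staging.raw_sample LOOP
        -- ... transform rec and insert it into the final table here ...
        rows_done := rows_done + 1;
        IF rows_done % 200000 = 0 THEN
            -- clock_timestamp() keeps advancing inside the transaction,
            -- unlike now(), so the timings are meaningful
            RAISE NOTICE '% rows processed at %', rows_done, clock_timestamp();
        END IF;
    END LOOP;
    RETURN rows_done;
END;
$$ LANGUAGE plpgsql;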