Long running INSERT+SELECT query

Started by Vitaliy Garnashevich · about 8 years ago · 16 messages · general
#1Vitaliy Garnashevich
vgarnashevich@gmail.com

Hi,

We're trying to populate a table with aggregated data from other tables.
For that we're running a huge INSERT+SELECT query which joins several
tables, aggregates values, and then inserts the results into another
table. The problem we're facing is that while the query is running,
some records in tables referenced by the results table may be deleted,
which causes the following error:

ERROR:  insert or update on table "..." violates foreign key constraint
"..."
DETAIL:  Key (...)=(...) is not present in table "...".

How do we make sure that such an aggregating query does not fail?

Regards,
Vitaliy

#2Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Vitaliy Garnashevich (#1)
Re: Long running INSERT+SELECT query

On 04/26/2018 09:13 AM, Vitaliy Garnashevich wrote:

Hi,

We're trying to populate a table with aggregated data from other tables.
For that we're running a huge INSERT+SELECT query which joins several
tables, aggregates values, and then inserts the results into another
table. The problem we're facing is that while the query is running,
some records in tables referenced by the results table may be deleted,
which causes the following error:

ERROR:  insert or update on table "..." violates foreign key constraint
"..."
DETAIL:  Key (...)=(...) is not present in table "...".

SELECT .. FOR UPDATE?:

https://www.postgresql.org/docs/10/static/sql-select.html#SQL-FOR-UPDATE-SHARE

https://www.postgresql.org/docs/10/static/explicit-locking.html#LOCKING-ROWS
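
As a rough sketch of that idea (the table and column names here are made up, not from the original query):

```sql
BEGIN;

-- Lock the parent rows that the inserted results will reference, so
-- concurrent DELETEs of those rows block until we commit.
-- Note: FOR UPDATE cannot be combined with GROUP BY, so the locking
-- SELECT has to be kept separate from the aggregation itself.
SELECT p.id
FROM parent p
WHERE p.id IN (SELECT c.parent_id FROM child c)
FOR UPDATE;

-- ... the INSERT + SELECT aggregation runs here ...

COMMIT;
```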

How do we make sure that such an aggregating query does not fail?

Regards,
Vitaliy

--
Adrian Klaver
adrian.klaver@aklaver.com

#3Vitaliy Garnashevich
vgarnashevich@gmail.com
In reply to: Adrian Klaver (#2)
Re: Long running INSERT+SELECT query

SELECT .. FOR UPDATE?:

https://www.postgresql.org/docs/10/static/sql-select.html#SQL-FOR-UPDATE-SHARE

https://www.postgresql.org/docs/10/static/explicit-locking.html#LOCKING-ROWS

It says: "Currently, FOR NO KEY UPDATE, FOR UPDATE, FOR SHARE and FOR
KEY SHARE cannot be specified with GROUP BY."

Regards,
Vitaliy

#4Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Vitaliy Garnashevich (#3)
Re: Long running INSERT+SELECT query

On 04/26/2018 10:45 AM, Vitaliy Garnashevich wrote:

SELECT .. FOR UPDATE?:

https://www.postgresql.org/docs/10/static/sql-select.html#SQL-FOR-UPDATE-SHARE

https://www.postgresql.org/docs/10/static/explicit-locking.html#LOCKING-ROWS

It says: "Currently, FOR NO KEY UPDATE, FOR UPDATE, FOR SHARE and FOR
KEY SHARE cannot be specified with GROUP BY."

Without the query we are flying blind, so any suggestions will come with a question mark.

Regards,
Vitaliy

--
Adrian Klaver
adrian.klaver@aklaver.com

#5Vitaliy Garnashevich
vgarnashevich@gmail.com
In reply to: Adrian Klaver (#4)
Re: Long running INSERT+SELECT query

Without the query we are flying blind, so any suggestions will come with a question mark.

Here is one such query:

    INSERT INTO cmdb_sp_usage_history
      (created_by, updated_by, created_on, updated_on, mod_count,
      summary_on, quarter, product, used_from, "user",
      keystrokes, minutes_in_use, times_started, avg_keystrokes,
max_keystrokes, spkg_operational)
    SELECT
       2, 2, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, 1,
       CURRENT_TIMESTAMP, quarter.id, "spv"."product",
"usage"."used_from", "usage"."user",
       coalesce(sum("usage"."keystrokes"), 0),
       coalesce(sum("usage"."minutes_in_use"), 0),
       coalesce(sum("usage"."times_started"), 0),
       coalesce(avg("usage"."keystrokes"), 0),
       coalesce(max("usage"."keystrokes"), 0),
       bool_or("cmdb_ci"."operational")
    FROM
      "cmdb_program_daily_usage" "usage"
    LEFT OUTER JOIN
      "cmdb_program_instance" "p" ON "p"."id" = "usage"."program_instance"
    LEFT OUTER JOIN
      "cmdb_ci_spkg" "s" ON "s"."id" = "p"."spkg"
    LEFT OUTER JOIN
      "cmdb_ci" "cmdb_ci" ON "s"."id" = "cmdb_ci"."id"
    LEFT OUTER JOIN
      "cmdb_software_product_version" "spv" ON "spv"."id" = "s"."software"
    WHERE ("usage"."minutes_in_use" > 0)
      AND ((NOT ("s"."software" IS NULL))
           AND ((NOT ("s"."os" = TRUE))
                OR ("s"."os" IS NULL)))
      AND ("usage"."usage_date" >= quarter.start_date)
      AND ("usage"."usage_date" < quarter.end_date)
    GROUP BY "spv"."product", "usage"."used_from", "usage"."user"
    HAVING (coalesce(sum("usage"."keystrokes"), 0) > 0) OR
(coalesce(sum("usage"."minutes_in_use"), 0) > 0) OR
(coalesce(sum("usage"."times_started"), 0) > 0)
    ORDER BY "spv"."product", "usage"."used_from", "usage"."user";

Regards,
Vitaliy

#6Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Vitaliy Garnashevich (#5)
Re: Long running INSERT+SELECT query

On 04/26/2018 11:08 AM, Vitaliy Garnashevich wrote:

Without the query we are flying blind, so any suggestions will come with a question mark.

Here is one such query:

    INSERT INTO cmdb_sp_usage_history
      (created_by, updated_by, created_on, updated_on, mod_count,
      summary_on, quarter, product, used_from, "user",
      keystrokes, minutes_in_use, times_started, avg_keystrokes,
max_keystrokes, spkg_operational)
    SELECT
       2, 2, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, 1,
       CURRENT_TIMESTAMP, quarter.id, "spv"."product",
"usage"."used_from", "usage"."user",
       coalesce(sum("usage"."keystrokes"), 0),
       coalesce(sum("usage"."minutes_in_use"), 0),
       coalesce(sum("usage"."times_started"), 0),
       coalesce(avg("usage"."keystrokes"), 0),
       coalesce(max("usage"."keystrokes"), 0),
       bool_or("cmdb_ci"."operational")
    FROM
      "cmdb_program_daily_usage" "usage"
    LEFT OUTER JOIN
      "cmdb_program_instance" "p" ON "p"."id" = "usage"."program_instance"
    LEFT OUTER JOIN
      "cmdb_ci_spkg" "s" ON "s"."id" = "p"."spkg"
    LEFT OUTER JOIN
      "cmdb_ci" "cmdb_ci" ON "s"."id" = "cmdb_ci"."id"
    LEFT OUTER JOIN
      "cmdb_software_product_version" "spv" ON "spv"."id" = "s"."software"
    WHERE ("usage"."minutes_in_use" > 0)
      AND ((NOT ("s"."software" IS NULL))
           AND ((NOT ("s"."os" = TRUE))
                OR ("s"."os" IS NULL)))
      AND ("usage"."usage_date" >= quarter.start_date)
      AND ("usage"."usage_date" < quarter.end_date)
    GROUP BY "spv"."product", "usage"."used_from", "usage"."user"
    HAVING (coalesce(sum("usage"."keystrokes"), 0) > 0) OR
(coalesce(sum("usage"."minutes_in_use"), 0) > 0) OR
(coalesce(sum("usage"."times_started"), 0) > 0)
    ORDER BY "spv"."product", "usage"."used_from", "usage"."user";

Have not worked through all of the above, but a first draft suggestion:

Move the SELECT minus the aggregation functions into a sub-query that
uses FOR UPDATE. Then do the aggregation on the results of the sub-query.
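
Applied to the query above, the shape would be roughly this (columns abbreviated; `quarter` still refers to the variable from the surrounding DO block; note that FOR UPDATE cannot be applied to the nullable side of an outer join, so only the base "usage" rows are locked here):

```sql
INSERT INTO cmdb_sp_usage_history (quarter, product, used_from, "user", keystrokes)
SELECT quarter.id, locked.product, locked.used_from, locked."user",
       coalesce(sum(locked.keystrokes), 0)
FROM (
    SELECT "usage"."keystrokes", "usage"."used_from", "usage"."user",
           "spv"."product"
    FROM "cmdb_program_daily_usage" "usage"
    LEFT OUTER JOIN "cmdb_program_instance" "p"
        ON "p"."id" = "usage"."program_instance"
    LEFT OUTER JOIN "cmdb_ci_spkg" "s" ON "s"."id" = "p"."spkg"
    LEFT OUTER JOIN "cmdb_software_product_version" "spv"
        ON "spv"."id" = "s"."software"
    WHERE "usage"."minutes_in_use" > 0
    FOR UPDATE OF "usage"  -- row locks taken here, outside the GROUP BY
) AS locked
GROUP BY locked.product, locked.used_from, locked."user";
```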


--
Adrian Klaver
adrian.klaver@aklaver.com

#7Vitaliy Garnashevich
vgarnashevich@gmail.com
In reply to: Adrian Klaver (#6)
Re: Long running INSERT+SELECT query

Have not worked through all of the above, but a first draft suggestion:

Move the SELECT minus the aggregation functions into a sub-query that
uses FOR UPDATE. Then do the aggregation on the results of the sub-query.

The aggregated table has hundreds of millions of rows, and the query
runs for many hours (which is one of the reasons why it's better not to
fail). I really doubt that row level locking would work. That would be a
lot of RAM just to hold all the locks.

On the other hand, I don't see something like FOR KEY SHARE kind of
locks at table level, so that the query would try not to block most of
other existing activity (e.g. SELECTs, UPDATEs).

Maybe this could be solved by calculating results into a temporary
table, which would not check foreign key constraints, and then copy the
data into the actual results table, while checking each row for FK
consistency and skipping if necessary. But then I don't think it would
be possible for my transaction to see row deletions which other
transactions have done, and to check row existence (the transaction is
there, because the whole thing is implemented as a DO statement with
some local variables).

Thoughts?

Regards,
Vitaliy

#8David G. Johnston
david.g.johnston@gmail.com
In reply to: Vitaliy Garnashevich (#7)
Re: Long running INSERT+SELECT query

On Thu, Apr 26, 2018 at 1:32 PM, Vitaliy Garnashevich <
vgarnashevich@gmail.com> wrote:

The aggregated table has hundreds of millions of rows, and the query runs
for many hours (which is one of the reasons why it's better not to fail)

​[...]​

Maybe this could be solved by calculating results into a temporary table,
which would not check foreign key constraints, and then copy the data into
the actual results table, while checking each row for FK consistency and
skipping if necessary.

Me, I'd try very hard to design things so the final calculation goes into
an actual results table that omits FK constraints​ and wouldn't try to
"skip if necessary". You are already running an hours-long query - the
users of said information need to understand that what they are seeing
does not reflect changes in the subsequent hour(s) since it started and
that certain related records being no longer present doesn't detract from
the fact that they were present "back then" and thus represent valid data
at that point in time.

David J.

#9Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Vitaliy Garnashevich (#7)
Re: Long running INSERT+SELECT query

Vitaliy Garnashevich wrote:

Have not worked through all of the above, but a first draft suggestion:

Move the SELECT minus the aggregation functions into a sub-query that
uses FOR UPDATE. Then do the aggregation on the results of the
sub-query.

The aggregated table has hundreds of millions of rows, and the query runs
for many hours (which is one of the reasons why it's better not to fail). I
really doubt that row level locking would work. That would be a lot of RAM
just to hold all the locks.

Row locks are not stored in memory.

Of course, a FOR KEY SHARE lock would block DELETEs that try to remove
the locked row.

I think your proposed strategy of trying to merge what other processes
did while you were copying is very problematic.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Vitaliy Garnashevich (#7)
Re: Long running INSERT+SELECT query

On 04/26/2018 01:32 PM, Vitaliy Garnashevich wrote:

Have not worked through all of the above, but a first draft suggestion:

Move the SELECT minus the aggregation functions into a sub-query that
uses FOR UPDATE. Then do the aggregation on the results of the sub-query.

The aggregated table has hundreds of millions of rows, and the query
runs for many hours (which is one of the reasons why it's better not to
fail). I really doubt that row level locking would work. That would be a
lot of RAM just to hold all the locks.

On the other hand, I don't see something like FOR KEY SHARE kind of
locks at table level, so that the query would try not to block most of
other existing activity (e.g. SELECTs, UPDATEs).

Maybe this could be solved by calculating results into a temporary
table, which would not check foreign key constraints, and then copy the
data into the actual results table, while checking each row for FK
consistency and skipping if necessary. But then I don't think it would
be possible for my transaction to see row deletions which other
transactions have done, and to check row existence (the transaction is
there, because the whole thing is implemented as a DO statement with
some local variables).

Thoughts?

The procedure seems to be fighting itself. There is an inherent conflict
between trying to keep up with data changes and presenting a consistent
result. Keeping up means constantly updating the aggregation
calculations, which in turn means the result will be continually changing.
As David and Alvaro have also suggested your best bet is to pick a point
in time and work off that.


--
Adrian Klaver
adrian.klaver@aklaver.com

#11Steven Lembark
lembark@wrkhors.com
In reply to: Vitaliy Garnashevich (#1)
Re: Long running INSERT+SELECT query

On Thu, 26 Apr 2018 19:13:17 +0300
Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:

We're trying to populate a table with aggregated data from other
tables. For that we're running a huge INSERT+SELECT query which joins
several tables, aggregates values, and then inserts the results into
another table. The problem we're facing is that while the query is
running, some records in tables referenced by the results table may
be deleted, which causes the following error:

ERROR:  insert or update on table "..." violates foreign key
constraint "..."
DETAIL:  Key (...)=(...) is not present in table "...".

How do we make sure that such an aggregating query does not fail?

Create a temporary table with a useful subset of the data.

You can select the minimum number of columns needed for the joins and release
the locks. This can also help large queries by giving you a stable
snapshot of the data for repeated queries.

I usually find that pre-joining the tables is easier because
temp tables have restrictions on re-use w/in the query, and also
usually don't have indexes to speed up the joins.

If you are going to run this, say, daily it's easy enough to create
a view and just "create temporary table foo as select * from bar"
for some collection of views and go from there. This makes it easier
to tune the queries on the back end without having to hack the front
end code.
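
The view-plus-temp-table pattern might look like this (the view and snapshot names are hypothetical; the columns are a reduced subset of the query above):

```sql
-- Keep the heavy join behind a view so it can be tuned on the back
-- end without changing the front-end code.
CREATE VIEW usage_report_base AS
SELECT u.used_from, u."user", u.keystrokes, u.minutes_in_use, spv.product
FROM cmdb_program_daily_usage u
JOIN cmdb_program_instance p ON p.id = u.program_instance
JOIN cmdb_ci_spkg s ON s.id = p.spkg
JOIN cmdb_software_product_version spv ON spv.id = s.software;

-- Snapshot it once per run; the temp table is private to this session
-- and disappears when the session ends.
CREATE TEMPORARY TABLE usage_snapshot AS
SELECT * FROM usage_report_base;

-- Temp tables start with no indexes; add what the aggregation needs.
CREATE INDEX ON usage_snapshot (product, used_from, "user");
```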

--
Steven Lembark 1505 National Ave
Workhorse Computing Rockford, IL 61103
lembark@wrkhors.com +1 888 359 3508

#12Steven Lembark
lembark@wrkhors.com
In reply to: Vitaliy Garnashevich (#5)
Re: Long running INSERT+SELECT query

On Thu, 26 Apr 2018 21:08:01 +0300
Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:

    INSERT INTO cmdb_sp_usage_history
      (created_by, updated_by, created_on, updated_on, mod_count,
      summary_on, quarter, product, used_from, "user",
      keystrokes, minutes_in_use, times_started, avg_keystrokes,
max_keystrokes, spkg_operational)
    SELECT
       2, 2, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, 1,
       CURRENT_TIMESTAMP, quarter.id, "spv"."product",
"usage"."used_from", "usage"."user",
       coalesce(sum("usage"."keystrokes"), 0),
       coalesce(sum("usage"."minutes_in_use"), 0),
       coalesce(sum("usage"."times_started"), 0),
       coalesce(avg("usage"."keystrokes"), 0),
       coalesce(max("usage"."keystrokes"), 0),
       bool_or("cmdb_ci"."operational")
    FROM
      "cmdb_program_daily_usage" "usage"
    LEFT OUTER JOIN
      "cmdb_program_instance" "p" ON "p"."id" = "usage"."program_instance"
    LEFT OUTER JOIN
      "cmdb_ci_spkg" "s" ON "s"."id" = "p"."spkg"
    LEFT OUTER JOIN
      "cmdb_ci" "cmdb_ci" ON "s"."id" = "cmdb_ci"."id"
    LEFT OUTER JOIN
      "cmdb_software_product_version" "spv" ON "spv"."id" = "s"."software"
    WHERE ("usage"."minutes_in_use" > 0)
      AND ((NOT ("s"."software" IS NULL))
           AND ((NOT ("s"."os" = TRUE))
                OR ("s"."os" IS NULL)))
      AND ("usage"."usage_date" >= quarter.start_date)
      AND ("usage"."usage_date" < quarter.end_date)
    GROUP BY "spv"."product", "usage"."used_from", "usage"."user"
    HAVING (coalesce(sum("usage"."keystrokes"), 0) > 0) OR
(coalesce(sum("usage"."minutes_in_use"), 0) > 0) OR
(coalesce(sum("usage"."times_started"), 0) > 0)
    ORDER BY "spv"."product", "usage"."used_from", "usage"."user";

create temporary table foobar as
select
<unaggregated, un-coalesced data>
from
<join from hell, above, sans group by>
;

This isolates the lock time to performing the bare select, after
which you can coalesce and sum to your heart's content without
locking any of it.

The point is performing the absolute minimum of processing to
generate the temp table so as to release any locks quickly and
avoid "group by" in the main join.

Yes, this might end up creating a large-ish temp table :-)

One other approach would be selecting only incremental data
(e.g., daily), which locks a much smaller subset of the rows,
and then aggregating the daily totals into quarterly ones, or
whatever you need. Call it daily usage: select where usage_date
equals today's date, or where the timestamp falls within a
tstzrange of ( 0000, 2400, [) ). That might also simplify your
query logic: all the coalesce ops end up in your
daily/weekly/monthly/whatever summary, and the quarterly values
in the reporting are just a sum with a group by.
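
A sketch of that incremental idea (the summary table `cmdb_daily_summary` is hypothetical; the source tables are from the thread's query):

```sql
-- Nightly: fold only yesterday's rows into a daily summary, which
-- locks and scans a far smaller slice of the data.
INSERT INTO cmdb_daily_summary (usage_date, product, keystrokes, minutes_in_use)
SELECT u.usage_date, spv.product,
       coalesce(sum(u.keystrokes), 0),
       coalesce(sum(u.minutes_in_use), 0)
FROM cmdb_program_daily_usage u
JOIN cmdb_program_instance p ON p.id = u.program_instance
JOIN cmdb_ci_spkg s ON s.id = p.spkg
JOIN cmdb_software_product_version spv ON spv.id = s.software
WHERE u.usage_date = current_date - 1
GROUP BY u.usage_date, spv.product;

-- Quarterly reporting then reduces to a simple sum over the summaries.
SELECT product, sum(keystrokes) AS keystrokes
FROM cmdb_daily_summary
WHERE usage_date >= date '2018-01-01' AND usage_date < date '2018-04-01'
GROUP BY product;
```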

--
Steven Lembark 1505 National Ave
Workhorse Computing Rockford, IL 61103
lembark@wrkhors.com +1 888 359 3508

#13Steven Lembark
lembark@wrkhors.com
In reply to: Vitaliy Garnashevich (#7)
Re: Long running INSERT+SELECT query

On Thu, 26 Apr 2018 23:32:33 +0300
Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:

The aggregated table has hundreds of millions of rows, and the query
runs for many hours (which is one of the reasons why it's better not
to fail). I really doubt that row level locking would work. That
would be a lot of RAM just to hold all the locks.

All the more reason to use temp tables.

--
Steven Lembark 1505 National Ave
Workhorse Computing Rockford, IL 61103
lembark@wrkhors.com +1 888 359 3508

#14Vitaliy Garnashevich
vgarnashevich@gmail.com
In reply to: Steven Lembark (#13)
Re: Long running INSERT+SELECT query

Everybody thanks for the suggestions!

We're going to try using "SELECT 1 FROM table FOR KEY SHARE" for each of
the tables that are referenced by the results, before running the big
query. That should be up to a million rows in total. It will probably
not cover the case when a record is INSERT'ed and then DELETE'd after
the calculation has begun, but such cases should be even more rare than
the DELETE's we're currently facing.

If this will not help, we'll also try to use triggers, as mentioned
here: https://stackoverflow.com/a/39828119
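
That plan might be sketched as follows (which tables need locking is guesswork here; lock every table the results reference):

```sql
BEGIN;

-- FOR KEY SHARE is the weakest row-level lock: it blocks DELETEs and
-- key-column UPDATEs on these rows, but not ordinary UPDATEs,
-- SELECTs, or INSERTs by other sessions.
SELECT 1 FROM cmdb_software_product_version FOR KEY SHARE;
SELECT 1 FROM cmdb_ci FOR KEY SHARE;

-- ... the long INSERT + SELECT aggregation runs here ...

COMMIT;  -- the row locks are released
```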

Regards,
Vitaliy

#15Steven Lembark
lembark@wrkhors.com
In reply to: Vitaliy Garnashevich (#14)
Re: Long running INSERT+SELECT query

On Fri, 27 Apr 2018 19:38:15 +0300
Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:

We're going to try using "SELECT 1 FROM table FOR KEY SHARE" for each of
the tables that are referenced by the results, before running the big
query. That should be up to a million rows in total. It will probably
not cover the case when a record is INSERT'ed and then DELETE'd after
the calculation has begun, but such cases should be even more rare than
the DELETE's we're currently facing.

Think about using a couple of materialized views for the worst
part of it.

--
Steven Lembark 1505 National Ave
Workhorse Computing Rockford, IL 61103
lembark@wrkhors.com +1 888 359 3508

#16Tim Cross
theophilusx@gmail.com
In reply to: Steven Lembark (#15)
Re: Long running INSERT+SELECT query

Steven Lembark <lembark@wrkhors.com> writes:

On Fri, 27 Apr 2018 19:38:15 +0300
Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:

We're going to try using "SELECT 1 FROM table FOR KEY SHARE" for each of
the tables that are referenced by the results, before running the big
query. That should be up to a million rows in total. It will probably
not cover the case when a record is INSERT'ed and then DELETE'd after
the calculation has begun, but such cases should be even more rare than
the DELETE's we're currently facing.

Think about using a couple of materialized views for the worst
part of it.

+1 re: materialised views - I have found them to be extremely useful for
situations where you want a snapshot of data and need to present it in a
way which is easier to process, especially when the underlying data is
changing faster than your reporting process can generate the report.
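
A minimal sketch of the materialized-view approach (the view name is hypothetical; the columns are a reduced subset of the thread's query):

```sql
-- Materialize the expensive join once; readers see a stable snapshot.
CREATE MATERIALIZED VIEW usage_base AS
SELECT u.id, u.used_from, u."user", u.keystrokes, spv.product
FROM cmdb_program_daily_usage u
JOIN cmdb_program_instance p ON p.id = u.program_instance
JOIN cmdb_ci_spkg s ON s.id = p.spkg
JOIN cmdb_software_product_version spv ON spv.id = s.software;

-- A unique index is required for REFRESH ... CONCURRENTLY.
CREATE UNIQUE INDEX ON usage_base (id);

-- Rebuild periodically; CONCURRENTLY lets readers keep querying the
-- old contents while the refresh runs.
REFRESH MATERIALIZED VIEW CONCURRENTLY usage_base;
```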

--
Tim Cross