Long running INSERT+SELECT query
Hi,
We're trying to populate a table with aggregated data from other tables.
For that we're running a huge INSERT+SELECT query which joins several
tables, aggregates values, and then inserts the results into another
table. The problem we're facing is that while the query is running,
some records in tables referenced by the results table may be deleted,
which causes the following error:
ERROR: insert or update on table "..." violates foreign key constraint
"..."
DETAIL: Key (...)=(...) is not present in table "...".
How do we make sure that such an aggregating query does not fail?
Regards,
Vitaliy
On 04/26/2018 09:13 AM, Vitaliy Garnashevich wrote:
Hi,
We're trying to populate a table with aggregated data from other tables.
For that we're running a huge INSERT+SELECT query which joins several
tables, aggregates values, and then inserts the results into another
table. The problem we're facing is that while the query is running,
some records in tables referenced by the results table may be deleted,
which causes the following error:
ERROR: insert or update on table "..." violates foreign key constraint
"..."
DETAIL: Key (...)=(...) is not present in table "...".
SELECT .. FOR UPDATE?:
https://www.postgresql.org/docs/10/static/sql-select.html#SQL-FOR-UPDATE-SHARE
https://www.postgresql.org/docs/10/static/explicit-locking.html#LOCKING-ROWS
How do we make sure that such an aggregating query does not fail?
Regards,
Vitaliy
--
Adrian Klaver
adrian.klaver@aklaver.com
SELECT .. FOR UPDATE?:
https://www.postgresql.org/docs/10/static/sql-select.html#SQL-FOR-UPDATE-SHARE
https://www.postgresql.org/docs/10/static/explicit-locking.html#LOCKING-ROWS
It says: "Currently, FOR NO KEY UPDATE, FOR UPDATE, FOR SHARE and FOR
KEY SHARE cannot be specified with GROUP BY."
Regards,
Vitaliy
On 04/26/2018 10:45 AM, Vitaliy Garnashevich wrote:
SELECT .. FOR UPDATE?:
https://www.postgresql.org/docs/10/static/sql-select.html#SQL-FOR-UPDATE-SHARE
https://www.postgresql.org/docs/10/static/explicit-locking.html#LOCKING-ROWS
It says: "Currently, FOR NO KEY UPDATE, FOR UPDATE, FOR SHARE and FOR
KEY SHARE cannot be specified with GROUP BY."
Without the query we are flying blind, so suggestions will have a ?
Regards,
Vitaliy
--
Adrian Klaver
adrian.klaver@aklaver.com
Without the query we are flying blind, so suggestions will have a ?
Here is one such query:
INSERT INTO cmdb_sp_usage_history
(created_by, updated_by, created_on, updated_on, mod_count,
summary_on, quarter, product, used_from, "user",
keystrokes, minutes_in_use, times_started, avg_keystrokes,
max_keystrokes, spkg_operational)
SELECT
2, 2, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, 1,
CURRENT_TIMESTAMP, quarter.id, "spv"."product",
"usage"."used_from", "usage"."user",
coalesce(sum("usage"."keystrokes"), 0),
coalesce(sum("usage"."minutes_in_use"), 0),
coalesce(sum("usage"."times_started"), 0),
coalesce(avg("usage"."keystrokes"), 0),
coalesce(max("usage"."keystrokes"), 0),
bool_or("cmdb_ci"."operational")
FROM
"cmdb_program_daily_usage" "usage"
LEFT OUTER JOIN
"cmdb_program_instance" "p" ON "p"."id" = "usage"."program_instance"
LEFT OUTER JOIN
"cmdb_ci_spkg" "s" ON "s"."id" = "p"."spkg"
LEFT OUTER JOIN
"cmdb_ci" "cmdb_ci" ON "s"."id" = "cmdb_ci"."id"
LEFT OUTER JOIN
"cmdb_software_product_version" "spv" ON "spv"."id" = "s"."software"
WHERE ("usage"."minutes_in_use" > 0)
AND ((NOT ("s"."software" IS NULL))
AND ((NOT ("s"."os" = TRUE))
OR ("s"."os" IS NULL)))
AND ("usage"."usage_date" >= quarter.start_date)
AND ("usage"."usage_date" < quarter.end_date)
GROUP BY "spv"."product", "usage"."used_from", "usage"."user"
HAVING (coalesce(sum("usage"."keystrokes"), 0) > 0) OR
(coalesce(sum("usage"."minutes_in_use"), 0) > 0) OR
(coalesce(sum("usage"."times_started"), 0) > 0)
ORDER BY "spv"."product", "usage"."used_from", "usage"."user";
Regards,
Vitaliy
On 04/26/2018 11:08 AM, Vitaliy Garnashevich wrote:
Without the query we are flying blind, so suggestions will have a ?
Here is one such query:
INSERT INTO cmdb_sp_usage_history
(created_by, updated_by, created_on, updated_on, mod_count,
summary_on, quarter, product, used_from, "user",
keystrokes, minutes_in_use, times_started, avg_keystrokes,
max_keystrokes, spkg_operational)
SELECT
2, 2, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, 1,
CURRENT_TIMESTAMP, quarter.id, "spv"."product",
"usage"."used_from", "usage"."user",
coalesce(sum("usage"."keystrokes"), 0),
coalesce(sum("usage"."minutes_in_use"), 0),
coalesce(sum("usage"."times_started"), 0),
coalesce(avg("usage"."keystrokes"), 0),
coalesce(max("usage"."keystrokes"), 0),
bool_or("cmdb_ci"."operational")
FROM
"cmdb_program_daily_usage" "usage"
LEFT OUTER JOIN
"cmdb_program_instance" "p" ON "p"."id" = "usage"."program_instance"
LEFT OUTER JOIN
"cmdb_ci_spkg" "s" ON "s"."id" = "p"."spkg"
LEFT OUTER JOIN
"cmdb_ci" "cmdb_ci" ON "s"."id" = "cmdb_ci"."id"
LEFT OUTER JOIN
"cmdb_software_product_version" "spv" ON "spv"."id" = "s"."software"
WHERE ("usage"."minutes_in_use" > 0)
AND ((NOT ("s"."software" IS NULL))
AND ((NOT ("s"."os" = TRUE))
OR ("s"."os" IS NULL)))
AND ("usage"."usage_date" >= quarter.start_date)
AND ("usage"."usage_date" < quarter.end_date)
GROUP BY "spv"."product", "usage"."used_from", "usage"."user"
HAVING (coalesce(sum("usage"."keystrokes"), 0) > 0) OR
(coalesce(sum("usage"."minutes_in_use"), 0) > 0) OR
(coalesce(sum("usage"."times_started"), 0) > 0)
ORDER BY "spv"."product", "usage"."used_from", "usage"."user";
Have not worked through all of the above, but a first draft suggestion:
Move the SELECT minus the aggregation functions into a sub-query that
uses FOR UPDATE. Then do the aggregation on the results of the sub-query.
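A minimal sketch of that shape, using simplified hypothetical tables ("fact"
referencing "dim") in place of the real joins; the row locks are taken in the
inner query, and the GROUP BY lives only in the outer one, which sidesteps the
restriction quoted above:

INSERT INTO report (product, total_minutes)
SELECT locked.product, sum(locked.minutes_in_use)
FROM (
    SELECT d.product, f.minutes_in_use
    FROM fact f
    JOIN dim d ON d.id = f.dim_id
    FOR UPDATE OF d   -- or the weaker FOR KEY SHARE OF d; locks taken here, no GROUP BY at this level
) AS locked
GROUP BY locked.product;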
Regards,
Vitaliy
--
Adrian Klaver
adrian.klaver@aklaver.com
Have not worked through all of the above, but a first draft suggestion:
Move the SELECT minus the aggregation functions into a sub-query that
uses FOR UPDATE. Then do the aggregation on the results of the sub-query.
The aggregated table has hundreds of millions of rows, and the query
runs for many hours (which is one of the reasons why it's better not to
fail). I really doubt that row level locking would work. That would be a
lot of RAM just to hold all the locks.
On the other hand, I don't see something like FOR KEY SHARE kind of
locks at table level, so that the query would try not to block most of
other existing activity (e.g. SELECTs, UPDATEs).
Maybe this could be solved by calculating results into a temporary
table, which would not check foreign key constraints, and then copy the
data into the actual results table, while checking each row for FK
consistency and skipping if necessary. But then I don't think it would
be possible for my transaction to see row deletions which other
transactions have done, and to check row existence (the transaction is
there, because the whole thing is implemented as a DO statement with
some local variables).
Thoughts?
Regards,
Vitaliy
On Thu, Apr 26, 2018 at 1:32 PM, Vitaliy Garnashevich <
vgarnashevich@gmail.com> wrote:
The aggregated table has hundreds of millions of rows, and the query runs
for many hours (which is one of the reasons why it's better not to fail)
[...]
Maybe this could be solved by calculating results into a temporary table,
which would not check foreign key constraints, and then copy the data into
the actual results table, while checking each row for FK consistency and
skipping if necessary.
Me, I'd try very hard to design things so the final calculation goes into
an actual results table that omits FK constraints and wouldn't try to
"skip if necessary". You are already running an hours-long query - the
users of said information need to understand that what they are seeing
does not reflect changes in the subsequent hour(s) since it started and
that certain related records being no longer present doesn't detract from
the fact that they were present "back then" and thus represent valid data
at that point in time.
David J.
Vitaliy Garnashevich wrote:
Have not worked through all of the above, but a first draft suggestion:
Move the SELECT minus the aggregation functions into a sub-query that
uses FOR UPDATE. Then do the aggregation on the results of the
sub-query.
The aggregated table has hundreds of millions of rows, and the query runs
for many hours (which is one of the reasons why it's better not to fail). I
really doubt that row level locking would work. That would be a lot of RAM
just to hold all the locks.
Row locks are not stored in memory.
Of course, a FOR KEY SHARE lock would block DELETEs that try to remove
the locked row.
I think your proposed strategy of trying to merge what other processes
did while you were copying is very problematic.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 04/26/2018 01:32 PM, Vitaliy Garnashevich wrote:
Have not worked through all of the above, but a first draft suggestion:
Move the SELECT minus the aggregation functions into a sub-query that
uses FOR UPDATE. Then do the aggregation on the results of the sub-query.
The aggregated table has hundreds of millions of rows, and the query
runs for many hours (which is one of the reasons why it's better not to
fail). I really doubt that row level locking would work. That would be a
lot of RAM just to hold all the locks.
On the other hand, I don't see something like FOR KEY SHARE kind of
locks at table level, so that the query would try not to block most of
other existing activity (e.g. SELECTs, UPDATEs).
Maybe this could be solved by calculating results into a temporary
table, which would not check foreign key constraints, and then copy the
data into the actual results table, while checking each row for FK
consistency and skipping if necessary. But then I don't think it would
be possible for my transaction to see row deletions which other
transactions have done, and to check row existence (the transaction is
there, because the whole thing is implemented as a DO statement with
some local variables).
Thoughts?
The procedure seems to be fighting itself. There is an inherent conflict
between trying to keep up with data changes and presenting a consistent
result. Keeping up means constantly updating the aggregation
calculations, which in turn means the result will be continually changing.
As David and Alvaro have also suggested, your best bet is to pick a point
in time and work off that.
Regards,
Vitaliy
--
Adrian Klaver
adrian.klaver@aklaver.com
On Thu, 26 Apr 2018 19:13:17 +0300
Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:
We're trying to populate a table with aggregated data from other
tables. For that we're running a huge INSERT+SELECT query which joins
several tables, aggregates values, and then inserts the results into
another table. The problem we're facing is that while the query is
running, some records in tables referenced by the results table may
be deleted, which causes the following error:
ERROR: insert or update on table "..." violates foreign key
constraint "..."
DETAIL: Key (...)=(...) is not present in table "...".
How do we make sure that such an aggregating query does not fail?
Create a temporary table with a useful subset of the data.
You can select the minimum number of columns joined and release
the locks. This can also help large queries by giving you a stable
snapshot of the data for repeated queries.
I usually find that pre-joining the tables is easier because
temp tables have restrictions on re-use w/in the query, and also
usually don't have indexes to speed up the joins.
If you are going to run this, say, daily it's easy enough to create
a view and just "create temporary table foo as select * from bar"
for some collection of views and go from there. This makes it easier
to tune the queries on the back end without having to hack the front
end code.
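A sketch of that pattern using the thread's tables (the view and temp table
names are made up, and the pre-join is simplified to inner joins for brevity):

CREATE VIEW usage_prejoin AS
SELECT u.usage_date, u.used_from, u."user", u.keystrokes,
       u.minutes_in_use, u.times_started, spv.product
FROM cmdb_program_daily_usage u
JOIN cmdb_program_instance p ON p.id = u.program_instance
JOIN cmdb_ci_spkg s          ON s.id = p.spkg
JOIN cmdb_software_product_version spv ON spv.id = s.software;

-- One stable snapshot per run; the aggregation then works off the temp table.
CREATE TEMPORARY TABLE usage_snapshot AS
SELECT * FROM usage_prejoin;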
--
Steven Lembark 1505 National Ave
Workhorse Computing Rockford, IL 61103
lembark@wrkhors.com +1 888 359 3508
On Thu, 26 Apr 2018 21:08:01 +0300
Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:
INSERT INTO cmdb_sp_usage_history
(created_by, updated_by, created_on, updated_on, mod_count,
summary_on, quarter, product, used_from, "user",
keystrokes, minutes_in_use, times_started, avg_keystrokes,
max_keystrokes, spkg_operational)
SELECT
2, 2, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, 1,
CURRENT_TIMESTAMP, quarter.id, "spv"."product",
"usage"."used_from", "usage"."user",
coalesce(sum("usage"."keystrokes"), 0),
coalesce(sum("usage"."minutes_in_use"), 0),
coalesce(sum("usage"."times_started"), 0),
coalesce(avg("usage"."keystrokes"), 0),
coalesce(max("usage"."keystrokes"), 0),
bool_or("cmdb_ci"."operational")
FROM
"cmdb_program_daily_usage" "usage"
LEFT OUTER JOIN
"cmdb_program_instance" "p" ON "p"."id" =
"usage"."program_instance" LEFT OUTER JOIN
"cmdb_ci_spkg" "s" ON "s"."id" = "p"."spkg"
LEFT OUTER JOIN
"cmdb_ci" "cmdb_ci" ON "s"."id" = "cmdb_ci"."id"
LEFT OUTER JOIN
"cmdb_software_product_version" "spv" ON "spv"."id" =
"s"."software" WHERE ("usage"."minutes_in_use" > 0)
AND ((NOT ("s"."software" IS NULL))
AND ((NOT ("s"."os" = TRUE))
OR ("s"."os" IS NULL)))
AND ("usage"."usage_date" >= quarter.start_date)
AND ("usage"."usage_date" < quarter.end_date)
GROUP BY "spv"."product", "usage"."used_from", "usage"."user"
HAVING (coalesce(sum("usage"."keystrokes"), 0) > 0) OR
(coalesce(sum("usage"."minutes_in_use"), 0) > 0) OR
(coalesce(sum("usage"."times_started"), 0) > 0)
ORDER BY "spv"."product", "usage"."used_from", "usage"."user";
create temporary table foobar as
select
    <unaggregated, un-coalesced data>
from
    <join from hell, above, sans group by>
;
This isolates the lock time to performing the bare select, after
which you can coalesce and sum to your heart's content without
locking any of it.
The point is performing the absolute minimum of processing to
generate the temp table so as to release any locks quickly and
avoid "group by" in the main join.
Yes, this might end up creating a large-ish temp table :-)
One other approach would be selecting only incremental data
(e.g., daily) which locks a much smaller subset of the rows
and aggregating the daily totals into quarterly, whatever.
Call it daily usage: select where usage_date = today's date, or
timestamp && a tstzrange of (00:00, 24:00, '[)'). That might
also simplify your query logic: all the coalesce op's end up
in your daily/weekly/monthly/whatever summary, the quarterly
values in the reporting are just sum X group by.
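For example, the incremental version might look roughly like this
(daily_usage_summary is a hypothetical pre-aggregated table, the joins are
simplified, and the quarter boundaries are only illustrative):

-- Once per day: aggregate just that day's rows.
INSERT INTO daily_usage_summary (usage_date, product, keystrokes, minutes_in_use)
SELECT u.usage_date, spv.product,
       coalesce(sum(u.keystrokes), 0),
       coalesce(sum(u.minutes_in_use), 0)
FROM cmdb_program_daily_usage u
JOIN cmdb_program_instance p ON p.id = u.program_instance
JOIN cmdb_ci_spkg s          ON s.id = p.spkg
JOIN cmdb_software_product_version spv ON spv.id = s.software
WHERE u.usage_date = current_date
GROUP BY u.usage_date, spv.product;

-- Quarterly figures are then a cheap rollup of the daily rows.
SELECT product, sum(keystrokes), sum(minutes_in_use)
FROM daily_usage_summary
WHERE usage_date >= DATE '2018-01-01' AND usage_date < DATE '2018-04-01'
GROUP BY product;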
--
Steven Lembark 1505 National Ave
Workhorse Computing Rockford, IL 61103
lembark@wrkhors.com +1 888 359 3508
On Thu, 26 Apr 2018 23:32:33 +0300
Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:
The aggregated table has hundreds of millions of rows, and the query
runs for many hours (which is one of the reasons why it's better not
to fail). I really doubt that row level locking would work. That
would be a lot of RAM just to hold all the locks.
All the more reason to use temp tables.
--
Steven Lembark 1505 National Ave
Workhorse Computing Rockford, IL 61103
lembark@wrkhors.com +1 888 359 3508
Everybody thanks for the suggestions!
We're going to try using "SELECT 1 FROM table FOR KEY SHARE" for each of
the tables that are referenced by the results, before running the big
query. That should be up to a million rows in total. It will probably
not cover the case when a record is INSERT'ed and then DELETE'd after
the calculation has begun, but such cases should be even rarer than
the DELETEs we're currently facing.
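Roughly like this (which parent tables actually hold the referenced keys is an
assumption here):

BEGIN;
-- Take weak row locks that still block DELETEs of the referenced keys.
SELECT 1 FROM cmdb_software_product_version FOR KEY SHARE;
SELECT 1 FROM cmdb_ci FOR KEY SHARE;
-- ... then the long INSERT INTO cmdb_sp_usage_history ... SELECT ... runs here ...
COMMIT;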
If this will not help, we'll also try to use triggers, as mentioned
here: https://stackoverflow.com/a/39828119
Regards,
Vitaliy
On Fri, 27 Apr 2018 19:38:15 +0300
Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:
We're going to try using "SELECT 1 FROM table FOR KEY SHARE" for each of
the tables that are referenced by the results, before running the big
query. That should be up to a million rows in total. It will probably
not cover the case when a record is INSERT'ed and then DELETE'd after
the calculation has begun, but such cases should be even rarer than
the DELETEs we're currently facing.
Think about using a couple of Materialized Views for the worst
part of it.
--
Steven Lembark 1505 National Ave
Workhorse Computing Rockford, IL 61103
lembark@wrkhors.com +1 888 359 3508
Steven Lembark <lembark@wrkhors.com> writes:
On Fri, 27 Apr 2018 19:38:15 +0300
Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:
We're going to try using "SELECT 1 FROM table FOR KEY SHARE" for each of
the tables that are referenced by the results, before running the big
query. That should be up to a million rows in total. It will probably
not cover the case when a record is INSERT'ed and then DELETE'd after
the calculation has begun, but such cases should be even rarer than
the DELETEs we're currently facing.
Think about using a couple of Materialized Views for the worst
part of it.
+1 re: materialised views - I have found them to be extremely useful for
situations where you want a snapshot of data and need to present it in a
way which is easier to process, especially when the underlying data is
changing faster than your reporting process can generate the report.
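A minimal sketch of the materialised-view idea (names hypothetical; the
defining query would be the expensive pre-join from the thread):

CREATE MATERIALIZED VIEW report_snapshot AS
SELECT d.product, f.minutes_in_use, f.keystrokes
FROM fact f
JOIN dim d ON d.id = f.dim_id;

-- Refresh before each reporting run; the view then stays stable while
-- the long aggregation reads from it.
REFRESH MATERIALIZED VIEW report_snapshot;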
--
Tim Cross