Running update in chunks?
I have seen a lot of slow-update questions asked both here and on
Stack Overflow, but they usually involve large tables. In my case the
dataset is fairly small.
I have an app in which I import data and then merge the records with
an existing table. Currently I do most of the heavy lifting in
application code, and it works well enough, but as the imports get
bigger I thought I would rewrite the code to speed it up using
Postgres. Basically I get the data, which I consider to be dirty, and
put it into a table, using hstore to store the fields. I then run a
series of update queries to locate the "real" records in the various
tables. The import data looks like this: https://gist.github.com/4584366
It has about 98K records. The lookup table is very small, with only a
couple of hundred records in it.
This is the query I am running
update cars.imports i
set make_id = md.make_id
from cars.models md where i.model_id = md.id;
Here is the EXPLAIN ANALYSE output
"Update on imports i (cost=2.46..49720.34 rows=138858 width=526)
(actual time=51968.553..51968.553 rows=0 loops=1)"
" -> Hash Join (cost=2.46..49720.34 rows=138858 width=526) (actual
time=0.044..408.170 rows=98834 loops=1)"
" Hash Cond: (i.model_id = md.id)"
" -> Seq Scan on imports i (cost=0.00..47808.58 rows=138858
width=516) (actual time=0.010..323.616 rows=98834 loops=1)"
" -> Hash (cost=1.65..1.65 rows=65 width=14) (actual
time=0.026..0.026 rows=65 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 4kB"
" -> Seq Scan on models md (cost=0.00..1.65 rows=65
width=14) (actual time=0.002..0.012 rows=65 loops=1)"
"Total runtime: 51968.602 ms"
This query takes fifty seconds on a MacBook Air with an i7 processor,
eight gigs of RAM and an SSD. I am using Postgres 9.2 installed with
Homebrew, using the standard conf file.
So it seems to me that this query is running as fast as it can, but
it's still much slower than doing things in code one record at a
time (using some memoization).
Anyway... presuming I can't really do anything to speed up this query,
does it make sense to try to do this in chunks, and if so, what is
the best technique for doing that?
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 21/01/13 08:04, Tim Uckun wrote:
This is the query I am running
update cars.imports i
set make_id = md.make_id
from cars.models md where i.model_id = md.id;
Here is the analyse
Looks like it's the actual update that's taking all the time.
This query takes fifty seconds on a macbook air with i7 processor and
eight gigs of RAM and SSD hard drive. I am using postgres 9.2
installed with homebrew using the standard conf file.
Can you try a couple of things just to check timings? It's probably
worth running them under EXPLAIN ANALYSE.
SELECT count(*) FROM cars.imports i JOIN cars.models md ON i.model_id =
md.id;
CREATE TEMP TABLE tt AS SELECT i.* FROM cars.imports i JOIN cars.models
md ON i.model_id = md.id;
Now the first one should take half a second judging by your previous
explain. If the second one takes 50 seconds too then that's just the
limit of your SSD's write speed. If it's much faster then something
else is happening.
--
Richard Huxton
Archonet Ltd
Can you try a couple of things just to check timings. Probably worth
EXPLAIN ANALYSE.
SELECT count(*) FROM cars.imports i JOIN cars.models md ON i.model_id =
md.id;
Takes about 300 ms
CREATE TEMP TABLE tt AS SELECT i.* FROM cars.imports i JOIN cars.models md
ON i.model_id = md.id;
Takes about 300 ms
Now the first one should take half a second judging by your previous
explain. If the second one takes 50 seconds too then that's just the limit
of your SSD's write. If it's much faster then something else is happening.
On 21/01/13 10:30, Tim Uckun wrote:
Can you try a couple of things just to check timings. Probably worth
EXPLAIN ANALYSE.
SELECT count(*) FROM cars.imports i JOIN cars.models md ON i.model_id =
md.id;
Takes about 300 ms
CREATE TEMP TABLE tt AS SELECT i.* FROM cars.imports i JOIN cars.models md
ON i.model_id = md.id;
Takes about 300 ms
OK - so writing all the data takes well under one second, but updating
the same amount takes 50 seconds.
The only differences I can think of are WAL logging (transaction log)
and index updates (the temp table has no indexes).
1. Try CREATE TABLE rather than CREATE TEMP TABLE - if that's still
quick then it's not the time taken to write WAL.
2. Run the update query against your new tt table and see how long that
takes.
3. Add indexes and repeat (in particular I'd be suspicious of the gin
index on "data")
My guess is that it's the time taken to update the "data" index - gin
indexes can be slow to rebuild (although 50 seconds seems *very* slow).
If so there are a few options:
1. Split the table and put whatever this "data" is into an import_data
table - assuming it doesn't change often.
2. Try a fill-factor of 50% or less - keeping the updates on the same
data page as the original might help
3. Drop the gin index before doing your bulk update and rebuild it at
the end. This is a common approach with bulk-loading / updates.
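A minimal sketch of option 3, assuming the index name from the schema
posted in this thread (adjust to your actual names; note the DROP also
blocks concurrent access to the index for the duration):

```sql
-- Sketch of drop-then-rebuild around a bulk update.
-- Index and table names assume the schema posted in this thread.
BEGIN;
DROP INDEX cars.index_imports_on_data;   -- the gin index on "data"

UPDATE cars.imports i
SET    make_id = md.make_id
FROM   cars.models md
WHERE  i.model_id = md.id;

CREATE INDEX index_imports_on_data ON cars.imports USING gin (data);
COMMIT;
```

The same pattern applies to any other index you can afford to rebuild
after the load.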
Oh - I'm assuming you're only updating those rows whose id has changed -
that seemed to be the suggestion in your first message. If not, simply
adding "AND i.make_id IS DISTINCT FROM md.make_id" (null-safe, unlike
"<>") should help. Also (and you may well have considered this) - for a
normalised setup you'd just have the model-id in "imports" and look up
the make-id through the "models" table.
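That normalised variant could look something like this (a sketch; the
view name is invented for illustration):

```sql
-- Sketch: stop storing make_id in "imports" and derive it at read time.
-- The view name is hypothetical.
CREATE VIEW cars.imports_with_make AS
SELECT i.id,
       i.model_id,
       i.data,
       md.make_id            -- looked up through "models", not stored
FROM cars.imports i
LEFT JOIN cars.models md ON md.id = i.model_id;
```

With that in place there is no bulk make_id update to run at all.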
--
Richard Huxton
Archonet Ltd
Richard Huxton wrote:
The only differences I can think of are WAL logging (transaction
log) and index updates (the temp table has no indexes).
What about foreign keys? Are there any tables which reference the
updated column in a foreign key declaration? Do they have indexes
on that column?
-Kevin
Just to close this up and give some guidance to future googlers...
There are no foreign keys. The table is empty when I start. I copy the
data into it from a text file.
Removing indexes didn't help much (made a very slight difference).
Running a query CREATE TEMP TABLE tt AS SELECT .... using a massive
join takes about 8 seconds. I presume that's the baseline for the disk
and RAM given my current postgres configuration. Note that this is
not a satisfactory option for me because I can't do what I want in one
step (the update I specified is one of many).
Running a very simple update, "UPDATE imports set make_id = null",
takes over 50 seconds, so that's the minimum amount of time any update
is going to take.
Running a complex update where I join all the tables together and
update all the fields takes about 106 seconds.
Just running a complex select with the joins takes no time at all.
I tried chunking the updates using chunks of 100 records and 1000
records (where ID between X and Y repeatedly) and it was even slower.
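The chunked attempt described above can be expressed server-side
roughly like this (a sketch; the chunk size and the cars.imports
columns are assumed from earlier in the thread):

```sql
-- Sketch of chunked updates in a PL/pgSQL DO block.
-- Note: the whole block still runs in one transaction, so this mainly
-- bounds per-statement work; it does not reduce total WAL written.
DO $$
DECLARE
  lo integer := 0;
  hi integer;
BEGIN
  SELECT max(id) INTO hi FROM cars.imports;
  WHILE lo <= hi LOOP
    UPDATE cars.imports
    SET    make_id = NULL
    WHERE  id > lo AND id <= lo + 1000;
    lo := lo + 1000;
  END LOOP;
END $$;
```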
Conclusion. Updates on postgres are slow (given the default
postgresql.conf). I presume this is due to MVCC or the WAL or
something and there are probably some things I can do to tweak the
conf file to make them go faster but out of the box running an update
on a table with lots of rows is going to cost you a lot. Removing the
indexes doesn't help that much.
Suggestion for the PG team. Deliver a more realistic postgres.conf by
default. The default one seems to be aimed at ten year old PCs with
very little RAM and disk space. At least deliver additional conf files
for small, medium, large, huge setups.
On 21/01/13 20:09, Tim Uckun wrote:
Just to close this up and give some guidance to future googlers...
Careful, future googlers.
Conclusion. Updates on postgres are slow
Nope.
(given the default
postgresql.conf). I presume this is due to MVCC or the WAL or
something and there are probably some things I can do to tweak the
conf file to make them go faster but out of the box running an update
on a table with lots of rows is going to cost you a lot.
Unlikely. Do you really think that a PostgreSQL installation typically
runs 100 times slower on updates than inserts and every other user has
just said "oh, that's ok then"? Or is it more likely that something
peculiar is broken in your setup?
Removing the indexes doesn't help that much.
Suggestion for the PG team. Deliver a more realistic postgres.conf by
default. The default one seems to be aimed at ten year old PCs with
very little RAM and disk space. At least deliver additional conf files
for small, medium, large, huge setups.
--
Richard Huxton
On Mon, Jan 21, 2013 at 1:23 PM, Richard Huxton <dev@archonet.com> wrote:
On 21/01/13 20:09, Tim Uckun wrote:
Just to close this up and give some guidance to future googlers...
Careful, future googlers.
Conclusion. Updates on postgres are slow
Nope.
(given the default
postgresql.conf). I presume this is due to MVCC or the WAL or
something and there are probably some things I can do to tweak the
conf file to make them go faster but out of the box running an update
on a table with lots of rows is going to cost you a lot.
Unlikely. Do you really think that a PostgreSQL installation typically runs
100 times slower on updates than inserts and every other user has just said
"oh, that's ok then"? Or is it more likely that something peculiar is broken
on your setup.
Removing the indexes doesn't help that much.
Suggestion for the PG team. Deliver a more realistic postgres.conf by
default. The default one seems to be aimed at ten year old PCs with
very little RAM and disk space. At least deliver additional conf files
for small, medium, large, huge setups.
I'd be curious to see results of the same "update" on a standard HDD
vs the SSD, and maybe on a more typical database deployment hardware
vs a macbook air.
--patrick
Nope.
If you have any suggestions I am all ears. For the purposes of this
discussion we can narrow the problem down to this update statement:
Update imports set make_id = null.
There are 98K records in the table. There is no index on the make_id
field. Standard untouched postgresql.conf from the brew install of
postgres.
Unlikely. Do you really think that a PostgreSQL installation typically runs
100 times slower on updates than inserts and every other user has just said
"oh, that's ok then"? Or is it more likely that something peculiar is broken
on your setup.
I really don't know. That's why I am here asking. I don't think
anything in particular is broken with my system. As mentioned above,
the setup is really simple: standard postgres install, the default conf
file, update one field on one table. It takes fifty-plus seconds.
I concede that if I was to go into the postgres.conf and make some
changes it will probably run faster (maybe much faster) but I wanted
to exhaust other factors before I went messing with the default
install.
I'd be curious to see results of the same "update" on a standard HDD
vs the SSD, and maybe on a more typical database deployment hardware
vs a macbook air.
I haven't tried it on any other machine yet. CREATE TEMP TABLE tt as
SELECT ... takes eight seconds so presumably the disk is not the
choke point.
Richard Huxton wrote:
On 21/01/13 20:09, Tim Uckun wrote:
Just to close this up and give some guidance to future
googlers...
Careful, future googlers.
+1
Conclusion. Updates on postgres are slow
Nope.
Agreed.
(given the default postgresql.conf). I presume this is due to
MVCC or the WAL or something and there are probably some things
I can do to tweak the conf file to make them go faster but out
of the box running an update on a table with lots of rows is
going to cost you a lot.
Unlikely. Do you really think that a PostgreSQL installation
typically runs 100 times slower on updates than inserts and every
other user has just said "oh, that's ok then"? Or is it more
likely that something peculiar is broken on your setup.
As someone who has managed hundreds of databases ranging up to over
3TB without seeing this without some very specific cause, I agree
that there is something wonky on Tim's setup which he hasn't told
us about. Then again, I'm not sure we've pushed hard enough for the
relevant details.
Tim, if you're still interested in resolving this, please post the
results from running the SQL code on this page:
http://wiki.postgresql.org/wiki/Server_Configuration
It might be worthwhile to read through this page:
http://wiki.postgresql.org/wiki/Slow_Query_Questions
... and try some of the ideas there. Baseline disk performance numbers
would help put the results in perspective.
The cause could be anything from table bloat due to inadequate
vacuuming to hardware problems.
-Kevin
Tim Uckun wrote:
If you have any suggestions I am all ears. For the purposes of this
discussion we can narrow the problem down to this update statement:
Update imports set make_id = null.
Well, that simplifies things.
First off, what does it say for rows affected? (Hint, if you really
are using a default configuration and it doesn't say 0 rows
affected, please show us the actual query used.)
Second, try connecting to the database as a superuser and running:
VACUUM ANALYZE imports;
-- (show us the results)
VACUUM FULL imports;
VACUUM FREEZE ANALYZE; -- (don't specify a table)
Then try your query and see whether performance is any different.
-Kevin
On Mon, Jan 21, 2013 at 1:48 PM, Tim Uckun <timuckun@gmail.com> wrote:
I'd be curious to see results of the same "update" on a standard HDD
vs the SSD, and maybe on a more typical database deployment hardware
vs a macbook air.
I haven't tried it on any other machine yet. CREATE TEMP TABLE tt as
SELECT ... takes eight seconds so presumably the disk is not the
choke point.
You are assuming that a fresh write costs the same as a re-write. Try
the test.
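One way to run that comparison (a sketch; "imports_copy" is an invented
scratch-table name) is to time a fresh write and a re-write of the same
rows back to back in psql:

```sql
-- Sketch: fresh write vs re-write of the same data.
-- "imports_copy" is a hypothetical scratch table.
\timing on
CREATE TABLE imports_copy AS SELECT * FROM cars.imports;  -- fresh write
UPDATE imports_copy SET make_id = 0;                      -- re-write
DROP TABLE imports_copy;
```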
--patrick
Kevin Grittner wrote:
First off, what does it say for rows affected? (Hint, if you really
are using a default configuration and it doesn't say 0 rows
affected, please show us the actual query used.)
Never mind that bit -- I got myself confused. Sorry for the noise.
-Kevin
First off, what does it say for rows affected? (Hint, if you really
are using a default configuration and it doesn't say 0 rows
affected, please show us the actual query used.)
update imports set make_id = null
Query returned successfully: 98834 rows affected, 49673 ms execution time.
vacuum analyze imports
Query returned successfully with no result in 4138 ms.
VACUUM FULL imports;
Query returned successfully with no result in 38106 ms.
VACUUM FREEZE ANALYZE;
Query returned successfully with no result in 184635 ms
update imports set make_id = 0
Query returned successfully: 98834 rows affected, 45860 ms execution time.
So all the vacuuming saved about four seconds of execution time.
here is the postgresql.conf completely untouched from the default
install https://gist.github.com/4590590
Oh I forgot
SELECT version();
"PostgreSQL 9.2.2 on x86_64-apple-darwin12.2.1, compiled by Apple
clang version 4.1 (tags/Apple/clang-421.11.65) (based on LLVM 3.1svn),
64-bit"
SELECT name, current_setting(name), source
FROM pg_settings
WHERE source NOT IN ('default', 'override');
"application_name";"pgAdmin III - Query Tool";"client"
"bytea_output";"escape";"session"
"client_encoding";"UNICODE";"session"
"client_min_messages";"notice";"session"
"DateStyle";"ISO, DMY";"session"
"default_text_search_config";"pg_catalog.english";"configuration file"
"lc_messages";"en_NZ.UTF-8";"configuration file"
"lc_monetary";"en_NZ.UTF-8";"configuration file"
"lc_numeric";"en_NZ.UTF-8";"configuration file"
"lc_time";"en_NZ.UTF-8";"configuration file"
"log_timezone";"NZ";"configuration file"
"max_connections";"20";"configuration file"
"max_stack_depth";"2MB";"environment variable"
"search_path";"chrysler, public";"session"
"shared_buffers";"1600kB";"configuration file"
"TimeZone";"NZ";"configuration file"
On 01/21/2013 03:45 PM, Tim Uckun wrote:
Oh I forgot
...
"shared_buffers";"1600kB";"configuration file"
You *reduced* shared buffers to 1.6MB? IIRC the typical default is 32MB
and the most common adjustment is to *increase* shared buffers. Most of
my servers are set to 2GB.
Try bumping that up to a reasonable value
(http://www.postgresql.org/docs/9.2/static/runtime-config-resource.html)
and share the results. Don't forget to restart PG after changing that
setting.
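For example (illustrative values only, not a recommendation for every
machine):

```ini
# postgresql.conf - illustrative sketch; size to your workload.
# On 9.2 a change to shared_buffers requires a server restart.
shared_buffers = 256MB          # e.g. a few hundred MB on an 8GB laptop
```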
Cheers,
Steve
On 01/21/2013 04:15 PM, Steve Crawford wrote:
On 01/21/2013 03:45 PM, Tim Uckun wrote:
Oh I forgot
...
Me, too. I forgot to ask for the table definition. If there are
variable-length fields like "text" or "varchar", what is the typical
size of the data?
Also, what is the physical size of the table (\dt+ yourtable)?
Perhaps even the output of
select * from pg_stat_user_tables where relname='yourtable';
might be useful.
Cheers,
Steve
I already posted the schema earlier. It's a handful of integer fields
with one hstore field.
On Tue, Jan 22, 2013 at 1:23 PM, Steve Crawford
<scrawford@pinpointresearch.com> wrote:
select * from pg_stat_user_tables where relname='yourtable';
Messy output (one column per line):
relid: 26710
schemaname: chrysler
relname: imports
seq_scan: 274
seq_tup_read: 25280539
idx_scan: 4
idx_tup_fetch: 2749
n_tup_ins: 98926
n_tup_upd: 5757462
n_tup_del: 92
n_tup_hot_upd: 327542
n_live_tup: 98834
n_dead_tup: 0
last_vacuum: 2013-01-22 12:28:29.01505+13
last_autovacuum: 2013-01-22 12:32:29.249588+13
last_analyze: 2013-01-22 12:28:29.173772+13
last_autoanalyze: 2013-01-22 12:32:44.123493+13
vacuum_count: 3
autovacuum_count: 30
analyze_count: 3
autoanalyze_count: 24
Kevin Grittner wrote:
update imports set make_id = 0
Query returned successfully: 98834 rows affected, 45860 ms execution time.
For difficult problems, there is nothing like a self-contained test
case that someone else can run to see the issue. Here's a starting
point:
create extension if not exists hstore;
create schema cars;
drop table if exists cars.imports;
CREATE TABLE cars.imports
(
id serial NOT NULL,
target_id integer,
batch_id integer,
make_id integer,
model_id integer,
date timestamp without time zone,
division_id integer,
dealer_id integer,
data hstore,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
CONSTRAINT imports_pkey PRIMARY KEY (id)
);
CREATE INDEX index_imports_on_data ON cars.imports USING gin (data);
CREATE INDEX index_imports_on_dealer_id ON cars.imports USING btree (dealer_id);
CREATE INDEX index_imports_on_division_id ON cars.imports USING btree (division_id);
CREATE INDEX index_imports_on_make_id ON cars.imports USING btree (make_id);
CREATE INDEX index_imports_on_model_id ON cars.imports USING btree (model_id);
insert into cars.imports (created_at, updated_at) select now(), now() from (select generate_series(1, 100000)) x;
vacuum freeze analyze;
\timing on
update cars.imports set make_id = 0;
vacuum analyze;
update cars.imports set make_id = 0;
... and here's what I get when I run it on my desktop computer with
ordinary disk drives and a completely default configuration:
test=# create extension if not exists hstore;
CREATE EXTENSION
Time: 48.032 ms
test=# create schema cars;
CREATE SCHEMA
Time: 8.150 ms
test=# drop table if exists cars.imports;
NOTICE: table "imports" does not exist, skipping
DROP TABLE
Time: 0.205 ms
test=# CREATE TABLE cars.imports
test-# (
test(# id serial NOT NULL,
test(# target_id integer,
test(# batch_id integer,
test(# make_id integer,
test(# model_id integer,
test(# date timestamp without time zone,
test(# division_id integer,
test(# dealer_id integer,
test(# data hstore,
test(# created_at timestamp without time zone NOT NULL,
test(# updated_at timestamp without time zone NOT NULL,
test(# CONSTRAINT imports_pkey PRIMARY KEY (id)
test(# );
NOTICE: CREATE TABLE will create implicit sequence "imports_id_seq" for serial column "imports.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "imports_pkey" for table "imports"
CREATE TABLE
Time: 152.677 ms
test=# CREATE INDEX index_imports_on_data ON cars.imports USING gin (data);
CREATE INDEX
Time: 6.391 ms
test=# CREATE INDEX index_imports_on_dealer_id ON cars.imports USING btree (dealer_id);
CREATE INDEX
Time: 64.668 ms
test=# CREATE INDEX index_imports_on_division_id ON cars.imports USING btree (division_id);
CREATE INDEX
Time: 65.573 ms
test=# CREATE INDEX index_imports_on_make_id ON cars.imports USING btree (make_id);
CREATE INDEX
Time: 64.959 ms
test=# CREATE INDEX index_imports_on_model_id ON cars.imports USING btree (model_id);
CREATE INDEX
Time: 64.906 ms
test=# insert into cars.imports (created_at, updated_at) select now(), now() from (select generate_series(1, 100000)) x;
INSERT 0 100000
Time: 2516.559 ms
test=# vacuum freeze analyze;
VACUUM
Time: 3357.778 ms
test=# \timing on
Timing is on.
test=# update cars.imports set make_id = 0;
UPDATE 100000
Time: 2937.241 ms
test=# vacuum analyze;
VACUUM
Time: 2097.426 ms
test=# update cars.imports set make_id = 0;
UPDATE 100000
Time: 3935.939 ms
Ubuntu 12.10
i7-3770 CPU @ 3.40GHz with 16GB RAM
Linux Kevin-Desktop 3.5.0-22-generic #34-Ubuntu SMP Tue Jan 8 21:47:00 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
A pair of WDC WD10EALX-009BA0 (15.01H51) 7200 RPM drives in RAID 1.
PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2, 64-bit
test=# SELECT name, current_setting(name), source
test-# FROM pg_settings
test-# WHERE source NOT IN ('default', 'override');
name | current_setting | source
----------------------------+--------------------+----------------------
application_name | psql | client
client_encoding | UTF8 | client
DateStyle | ISO, MDY | configuration file
default_text_search_config | pg_catalog.english | configuration file
lc_messages | en_US.UTF-8 | configuration file
lc_monetary | en_US.UTF-8 | configuration file
lc_numeric | en_US.UTF-8 | configuration file
lc_time | en_US.UTF-8 | configuration file
log_timezone | US/Central | configuration file
max_connections | 100 | configuration file
max_stack_depth | 2MB | environment variable
shared_buffers | 32MB | configuration file
TimeZone | US/Central | configuration file
(13 rows)
I did absolutely no tuning from the default configuration.
So, what timings do you get if you run the identical script? Is
there something you can do to the above script (maybe in terms of
populating data) which will cause the performance you see?
-Kevin