COPY and indices?
Hi all,
When using COPY FROM STDIN to stream thousands of rows (20k and more hourly), what happens with indices? Are they updated only once after the operation, or are they updated once per row? Note that I'm not replacing the table's data: I'm appending to what's already there. I suspect batching writes will be faster than writing each individual row using an INSERT statement.
Currently, I can sustain 30-40 writes per second on a Rackspace VPS. I know it's not the ideal solution, but that's what I'm working with. Watching vmstat, the server is spending 30 to 40% of its time in iowait. I don't have measurements as to what files are touched, and I'd welcome suggestions on how to measure the time PostgreSQL actually spends writing indices vs data.
Thanks!
François
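For anyone curious what the batched approach can look like, here is a minimal sketch. The table name and the psycopg2 `copy_from` call are assumptions; the serialization helper follows COPY's text format (tab-delimited, backslash escapes, `\N` for NULL):

```python
import io

def copy_escape(value):
    """Escape one value for PostgreSQL's COPY text format."""
    if value is None:
        return r"\N"  # COPY's default NULL marker
    return (str(value)
            .replace("\\", "\\\\")   # backslash first, so later escapes survive
            .replace("\t", "\\t")
            .replace("\n", "\\n")
            .replace("\r", "\\r"))

def rows_to_copy_stream(rows):
    """Serialize an iterable of row tuples into an in-memory,
    tab-delimited stream suitable for COPY FROM STDIN."""
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(copy_escape(v) for v in row) + "\n")
    buf.seek(0)
    return buf

# With psycopg2 (an assumption -- any driver exposing COPY works):
#   stream = rows_to_copy_stream(batch)
#   cur.copy_from(stream, "events", columns=("user_id", "body"))
#   conn.commit()
```

One COPY per batch of a few thousand rows amortizes the per-statement and per-commit overhead over the whole batch.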
Hi,
On 13 March 2012 15:11, François Beausoleil <francois@teksol.info> wrote:
> When using COPY FROM STDIN to stream thousands of rows (20k and more hourly), what happens with indices? Are they updated only once after the operation, or are they updated once per row? Note that I'm not replacing the table's data: I'm appending to what's already there. I suspect batching writes will be faster than writing each individual row using an INSERT statement.
Yes, it will be faster to use COPY than plain INSERTs. We have a similar
situation -- up to 10k rows every 3 minutes and around 15 indexes on
the table. The table is partitioned and we do not update data. The check-bloat
query reports some bloat, but it's growing very slowly and there is a new
partition every month.
> Currently, I can sustain 30-40 writes per second on a Rackspace VPS. I know it's not the ideal solution, but that's what I'm working with. Watching vmstat, the server is spending 30 to 40% of its time in iowait. I don't have measurements as to what files are touched, and I'd welcome suggestions on how to measure the time PostgreSQL actually spends writing indices vs data.
Drop all indexes, measure the time to insert, and collect iostat output.
Then create the indexes, repeat the process, and compare the results.
--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)
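That experiment can be scripted with a small timing harness. The psql commands and index/table names below are placeholders for your schema; the pattern (time the load with and without indexes while capturing iostat) is the point:

```python
import subprocess
import time

def timed(label, argv):
    """Run a command and report its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f}s")
    return elapsed

# Hypothetical experiment (index/table names are placeholders):
#   timed("load with indexes", ["psql", "-c", r"\copy events from 'batch.tsv'"])
#   timed("drop index", ["psql", "-c", "DROP INDEX events_user_id_idx"])
#   timed("load w/o indexes", ["psql", "-c", r"\copy events from 'batch.tsv'"])
#   timed("recreate", ["psql", "-c",
#         "CREATE INDEX events_user_id_idx ON events (user_id)"])
# Run `iostat -x 5` in another terminal during each step and compare.
```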
2012/3/12 François Beausoleil <francois@teksol.info>:
> Hi all,
> When using COPY FROM STDIN to stream thousands of rows (20k and more hourly), what happens with indices? Are they updated only once after the operation, or are they updated once per row? Note that I'm not replacing the table's data: I'm appending to what's already there. I suspect batching writes will be faster than writing each individual row using an INSERT statement.
COPY is faster, but if a single row fails (dup key, etc.), the whole COPY fails.
Scott Marlowe wrote:
> 2012/3/12 François Beausoleil <francois@teksol.info>:
>> Hi all,
>> When using COPY FROM STDIN to stream thousands of rows (20k and more hourly), what happens with indices? Are they updated only once after the operation, or are they updated once per row? Note that I'm not replacing the table's data: I'm appending to what's already there. I suspect batching writes will be faster than writing each individual row using an INSERT statement.
> Copy is faster, but if a single row fails (dup key etc) the whole copy fails.
Hi Scott,
My impression: it would be even faster to drop the indices, do the bulk copy,
and rebuild the indices after the fact.
Regards
Wolfgang Hamann
2012/3/12 François Beausoleil <francois@teksol.info>:
> Hi all,
> When using COPY FROM STDIN to stream thousands of rows (20k and more hourly), what happens with indices? Are they updated only once after the operation, or are they updated once per row? Note that I'm not replacing the table's data: I'm appending to what's already there. I suspect batching writes will be faster than writing each individual row using an INSERT statement.
> Currently, I can sustain 30-40 writes per second on a Rackspace VPS. I know it's not the ideal solution, but that's what I'm working with. Watching vmstat, the server is spending 30 to 40% of its time in iowait. I don't have measurements as to what files are touched, and I'd welcome suggestions on how to measure the time PostgreSQL actually spends writing indices vs data.
You're almost certainly blocking on fsync. A real quick'n'dirty way
to confirm this (although it won't be as fast as COPY) would be to wrap
your inserts in a transaction. VMs tend to have really horrible
storage latency, which can hurt Postgres performance. Another option
would be to relax your commit policy (for example, by flipping
synchronous_commit) if that fits within your safety requirements.
merlin
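A sketch of that batching idea: accumulate rows and commit them one transaction per batch instead of one per row, so the per-commit fsync is amortized. The driver calls in the comment are assumptions; the batching logic itself is plain Python, written against a pluggable flush function so it runs standalone:

```python
class BatchWriter:
    """Buffer rows and flush them in one transaction per batch,
    amortizing the per-commit fsync over many rows."""

    def __init__(self, flush_fn, batch_size=500):
        self.flush_fn = flush_fn      # e.g. executemany + commit (see below)
        self.batch_size = batch_size
        self.buffer = []
        self.flushes = 0

    def add(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)  # one BEGIN ... COMMIT per batch
            self.flushes += 1
            self.buffer = []

# With psycopg2, the flush function might be (an assumption, names are
# placeholders):
#   def flush(rows):
#       cur.executemany("INSERT INTO events VALUES (%s, %s)", rows)
#       conn.commit()
```

Call `flush()` once more at shutdown so a partial batch isn't lost.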
On Tuesday, 13 March 2012 at 10:48, Merlin Moncure wrote:
> 2012/3/12 François Beausoleil <francois@teksol.info>:
>> Currently, I can sustain 30-40 writes per second on a Rackspace VPS. I know it's not the ideal solution, but that's what I'm working with. Watching vmstat, the server is spending 30 to 40% of its time in iowait. I don't have measurements as to what files are touched, and I'd welcome suggestions on how to measure the time PostgreSQL actually spends writing indices vs data.
> You're almost certainly blocking on fsync. A real quick'n'dirty way
> to confirm this (although it won't be as fast as COPY) would be to wrap
> your inserts in a transaction. VMs tend to have really horrible
> storage latency, which can hurt Postgres performance. Another option
> would be to relax your commit policy (for example, by flipping
> synchronous_commit) if that fits within your safety requirements.
I already applied the tricks you mention here: I have a transaction, and synchronous_commit is off. I also have checkpoint_segments set to 96, and checkpoint_timeout at 10 minutes.
I'll go with the COPY, since I can live with the batched requirements just fine.
Bye!
François
On Tue, Mar 13, 2012 at 12:51 AM, <hamann.w@t-online.de> wrote:
> Scott Marlowe wrote:
>> 2012/3/12 François Beausoleil <francois@teksol.info>:
>>> Hi all,
>>> When using COPY FROM STDIN to stream thousands of rows (20k and more hourly), what happens with indices? Are they updated only once after the operation, or are they updated once per row? Note that I'm not replacing the table's data: I'm appending to what's already there. I suspect batching writes will be faster than writing each individual row using an INSERT statement.
>> Copy is faster, but if a single row fails (dup key etc) the whole copy fails.
> Hi Scott,
> my impression: it would be even faster to drop the indices, do the bulk copy,
> and rebuild the indices after the fact.
That depends. If you've got 100M rows already in place and you're
adding 1,000 rows, it's likely cheaper to leave the indexes in place.
OTOH, if you've got 1M rows already in place and are adding 100M, it's
almost certainly cheaper to drop the indexes and recreate them. So
dropping and recreating the indexes isn't necessarily faster; it
depends on how big the table already is.
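That trade-off can be captured in a rough rule of thumb. The threshold below is an illustration, not a measured constant; the right ratio depends on hardware, index count, and row width:

```python
def should_drop_indexes(existing_rows, new_rows, threshold=1.0):
    """Rough heuristic: rebuilding indexes from scratch tends to win
    once the incoming batch rivals the table's current size.
    The default threshold (batch >= existing) is an illustrative guess."""
    if existing_rows == 0:
        return True  # empty table: always cheaper to build indexes afterwards
    return new_rows / existing_rows >= threshold

# The examples above:
#   100M existing + 1k new  -> keep the indexes in place
#   1M existing + 100M new  -> drop and recreate
```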
2012/3/13 François Beausoleil <francois@teksol.info>:
> On Tuesday, 13 March 2012 at 10:48, Merlin Moncure wrote:
>> 2012/3/12 François Beausoleil <francois@teksol.info>:
>>> Currently, I can sustain 30-40 writes per second on a Rackspace VPS. I know it's not the ideal solution, but that's what I'm working with. Watching vmstat, the server is spending 30 to 40% of its time in iowait. I don't have measurements as to what files are touched, and I'd welcome suggestions on how to measure the time PostgreSQL actually spends writing indices vs data.
>> You're almost certainly blocking on fsync. A real quick'n'dirty way
>> to confirm this (although it won't be as fast as COPY) would be to wrap
>> your inserts in a transaction. VMs tend to have really horrible
>> storage latency, which can hurt Postgres performance. Another option
>> would be to relax your commit policy (for example, by flipping
>> synchronous_commit) if that fits within your safety requirements.
> I already applied the tricks you mention here: I have a transaction, and synchronous_commit is off. I also have checkpoint_segments set to 96, and checkpoint_timeout at 10 minutes.
> I'll go with the COPY, since I can live with the batched requirements just fine.
30-40 'in transaction' I/O-bound inserts is so slow as to not really
be believable unless each record is around 1 megabyte, because being in
a transaction removes storage latency from the equation. Even on a
crappy VM. As a point of comparison, my SATA workstation drive can do
tens of thousands. How many records are you inserting per
transaction?
merlin
On Tuesday, 13 March 2012 at 11:15, Merlin Moncure wrote:
> 2012/3/13 François Beausoleil <francois@teksol.info>:
>> I'll go with the COPY, since I can live with the batched requirements just fine.
> 30-40 'in transaction' I/O-bound inserts is so slow as to not really
> be believable unless each record is around 1 megabyte, because being in
> a transaction removes storage latency from the equation. Even on a
> crappy VM. As a point of comparison, my SATA workstation drive can do
> tens of thousands. How many records are you inserting per
> transaction?
I took the time to gather statistics about the database server: https://gist.github.com/07bbf8a5b05b1c37a7f2
The files are a series of roughly 30-second samples, taken while the system is under production usage. When I quoted 30-40 transactions per second, I was actually referring to the number of messages processed from my message queue. Going by the PostgreSQL numbers, xact_commit tells me I manage 288 commits per second. That's much better than I anticipated.
Anyways, if anybody has comments on how I could increase throughput, I'd appreciate it. My message queues are almost always backed up by 1M messages, and that's at least partially related to PostgreSQL: if the DB can write faster, I can manage my backlog better.
I'm still planning on going with batch processing, but I need to do something ASAP to give me just a bit more throughput.
Thanks!
François
2012/3/14 François Beausoleil <francois@teksol.info>:
> On Tuesday, 13 March 2012 at 11:15, Merlin Moncure wrote:
>> 2012/3/13 François Beausoleil <francois@teksol.info>:
>>> I'll go with the COPY, since I can live with the batched requirements just fine.
>> 30-40 'in transaction' I/O-bound inserts is so slow as to not really
>> be believable unless each record is around 1 megabyte, because being in
>> a transaction removes storage latency from the equation. Even on a
>> crappy VM. As a point of comparison, my SATA workstation drive can do
>> tens of thousands. How many records are you inserting per
>> transaction?
> I took the time to gather statistics about the database server: https://gist.github.com/07bbf8a5b05b1c37a7f2
> The files are a series of roughly 30-second samples, taken while the system is under production usage. When I quoted 30-40 transactions per second, I was actually referring to the number of messages processed from my message queue. Going by the PostgreSQL numbers, xact_commit tells me I manage 288 commits per second. That's much better than I anticipated.
> Anyways, if anybody has comments on how I could increase throughput, I'd appreciate it. My message queues are almost always backed up by 1M messages, and that's at least partially related to PostgreSQL: if the DB can write faster, I can manage my backlog better.
> I'm still planning on going with batch processing, but I need to do something ASAP to give me just a bit more throughput.
Well, your iowait numbers are through the roof, which makes things
pretty simple from a diagnosis point of view: your storage is
overloaded. The only remedies are to try to make your queries more
efficient so that you are doing less writing, make better use of
transactions, etc., but looking at the log it appears the low-hanging
fruit is already grabbed (synchronous_commit=off, etc.). So you have
to choose from a list of not very pleasant options:
*) fsync=off
*) tune the application
*) bring more/faster storage online. A single SSD would probably make
your problem disappear. In the VM world, hopefully you can at least
bring another volume online and move your WAL to that.
*) HARDWARE.
In the entirety of my career, I have never found anything more
perplexing than the general reluctance to upgrade hardware to solve
hardware-related performance bottlenecks. Virtualization is great
technology, but in my experience it is nowhere near good enough to
handle high-transaction-rate database servers. A $5,000 server will
solve your issue, and you'll spend that much in two days scratching
your head trying to figure out the issue (irritating your customers
all the while).
merlin
On Wednesday, 14 March 2012 at 15:06, Merlin Moncure wrote:
> 2012/3/14 François Beausoleil <francois@teksol.info>:
> In the entirety of my career, I have never found anything more
> perplexing than the general reluctance to upgrade hardware to solve
> hardware-related performance bottlenecks. Virtualization is great
> technology, but in my experience it is nowhere near good enough to
> handle high-transaction-rate database servers. A $5,000 server will
> solve your issue, and you'll spend that much in two days scratching
> your head trying to figure out the issue (irritating your customers
> all the while).
Thank you for your analysis, Merlin. I already suspected as much.
Have a great day!
François