Trigger Performance

Started by Randall Smith · about 15 years ago · 9 messages · general
#1 Randall Smith
randall@tnr.cc

Hi,

I've created a trigger that checks the uniqueness of two columns in a
table. Traditionally, one would use a unique constraint, but in my
case, the size of the unique index would be too large and some
performance loss is acceptable. However, the trigger performance seems
to be far below what's indicated by an explain analyze of the query used
in the trigger.

The unique fields consist of an indexed int8 (volume_id) and a text
field (name). The average ratio of volume_id to name is 1 to 10,000.
The query I'm using to check uniqueness in the trigger is:

...
IF (SELECT EXISTS (SELECT 1 FROM t1 WHERE
volume_id = NEW.volume_id AND name = NEW.name)) THEN
RAISE EXCEPTION '% already exists on volume', NEW.name;
END IF;
...

This trigger is called only BEFORE INSERT.
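In full, the trigger setup looks roughly like this (function and trigger
names changed for the list; only the IF block is verbatim):

```sql
-- Sketch of the BEFORE INSERT trigger around the uniqueness check.
CREATE OR REPLACE FUNCTION check_name_unique() RETURNS trigger AS $$
BEGIN
    IF (SELECT EXISTS (SELECT 1 FROM t1 WHERE
        volume_id = NEW.volume_id AND name = NEW.name)) THEN
        RAISE EXCEPTION '% already exists on volume', NEW.name;
    END IF;
    RETURN NEW;  -- allow the insert to proceed
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER t1_name_unique
    BEFORE INSERT ON t1
    FOR EACH ROW EXECUTE PROCEDURE check_name_unique();
```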

In testing, without the trigger, inserts are at 10,000 every 2 seconds.
With the trigger, they are 10,000 every 40 seconds. The output of
explain analyze suggests that this shouldn't be taking so long.

EXPLAIN ANALYZE SELECT EXISTS (
SELECT 1 FROM t1 WHERE volume_id = 300 AND name = 'whodat');

Result (cost=8.35..8.36 rows=1 width=0) (actual time=0.019..0.020
rows=1 loops=1)
InitPlan 1 (returns $0)
-> Index Scan using volume_id_idx on t1 (cost=0.00..8.35 rows=1
width=0) (actual time=0.016..0.016 rows=0 loops=1)
Index Cond: (volume_id = 300)
Filter: (name = 'whodat'::text)
Total runtime: 0.053 ms
(6 rows)

0.053 ms/record / 1,000 ms/sec * 10,000 records = .53 seconds

According to that stat, this lookup should be adding about 0.5 seconds
to 10,000 records, far from the actual 38 seconds it is adding. I've
tried to change up the query in the trigger to see if I could get
different results with not much luck. Any idea what might be taking up
the extra time or what I can do to troubleshoot?

Thanks.

-Randall

#2 Alban Hertroys
dalroi@solfertje.student.utwente.nl
In reply to: Randall Smith (#1)
Re: Trigger Performance

On 15 Jan 2011, at 23:52, Randall Smith wrote:

Hi,

I've created a trigger that checks the uniqueness of two columns in a
table. Traditionally, one would use a unique constraint, but in my
case, the size of the unique index would be too large and some
performance loss is acceptable.

But you already do have an index on that id-field, so what's the problem with using a unique constraint? Its unique index could just replace the existing one.

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.


#3 Jasen Betts
jasen@xnet.co.nz
In reply to: Randall Smith (#1)
Re: Trigger Performance

On 2011-01-15, Randall Smith <randall@tnr.cc> wrote:

Hi,

I've created a trigger that checks the uniqueness of two columns in a
table. Traditionally, one would use a unique constraint, but in my
case, the size of the unique index would be too large and some
performance loss is acceptable. However, the trigger performance seems
to be far below what's indicated by an explain analyze of the query used
in the trigger.

The unique fields consist of an indexed int8 (volume_id) and a text
field (name). The average ratio of volume_id to name is 1 to 10,000.
The query I'm using to check uniqueness in the trigger is:

...
IF (SELECT EXISTS (SELECT 1 FROM t1 WHERE
volume_id = NEW.volume_id AND name = NEW.name)) THEN
RAISE EXCEPTION '% already exists on volume', NEW.name;
END IF;
...

In plpgsql IF is an implicit select.
<http://www.postgresql.org/docs/8.4/interactive/plpgsql-expressions.html>

IF EXISTS (SELECT 1 FROM t1 WHERE
volume_id = NEW.volume_id AND name = NEW.name) THEN
RAISE EXCEPTION '% already exists on volume', NEW.name;
END IF;

i don't know if that will help performance, though

Result (cost=8.35..8.36 rows=1 width=0) (actual time=0.019..0.020
rows=1 loops=1)
InitPlan 1 (returns $0)
-> Index Scan using volume_id_idx on t1 (cost=0.00..8.35 rows=1
width=0) (actual time=0.016..0.016 rows=0 loops=1)
Index Cond: (volume_id = 300)
Filter: (name = 'whodat'::text)
Total runtime: 0.053 ms
(6 rows)

I got 0.4 ms the first time I tried this.

That actual runtime is helped by cache locality; reconnect and try it
again and you'll see a worse figure (I got a factor of three
difference).

At 53 µs it probably didn't hit the hard disk; when that starts
happening, things will get much worse.

0.053 ms/record / 1,000 ms/sec * 10,000 records = .53 seconds

huh?

According to that stat, this lookup should be adding about 0.5 seconds
to 10,000 records,

why? what are you doing to 10,000 records?

far from the actual 38 seconds it is adding. I've
tried to change up the query in the trigger to see if I could get
different results with not much luck. Any idea what might be taking up
the extra time or what I can do to troubleshoot?

Try it without that check (on sample data) and see how much faster it
goes (just comment out that part of the trigger). When I tested that
here I could not see a definite difference (timings were all over the
place, some slower, some faster).

how much disk (in bytes, and dollars) are you hoping to save by not
using the index?

--
⚂⚃ 100% natural

#4 Randall Smith
randall@tnr.cc
In reply to: Jasen Betts (#3)
Re: Trigger Performance

Before reading: this is solved. It was an error on my part.

On Sun, 2011-01-16 at 03:46 +0000, Jasen Betts wrote:

In plpgsql IF is an implicit select.
<http://www.postgresql.org/docs/8.4/interactive/plpgsql-expressions.html>

IF EXISTS (SELECT 1 FROM t1 WHERE
volume_id = NEW.volume_id AND name = NEW.name) THEN
RAISE EXCEPTION '% already exists on volume', NEW.name;
END IF;

Thanks. Certainly more concise.

0.053 ms/record / 1,000 ms/sec * 10,000 records = .53 seconds

huh?

According to that stat, this lookup should be adding about 0.5 seconds
to 10,000 records,

why? what are you doing to 10,000 records?

Inserting them. Sorry, that was a critical omission on my part. The
trigger check is for inserts and I'm testing its performance by
inserting 10,000 records.

Turns out my EXPLAIN ANALYZE times were so low because I was querying
for a volume_id that had only a few rows. When I query for a volume_id
with 10,000 rows, it changes to 7 ms, which matches the performance I'm
seeing. That's acceptable to me because that's probably at the upper
end of what I'll see. 7 ms to check 10,000 text fields is actually
impressive to me.

how much disk (in bytes, and dollars) are you hoping to save by not
using the index.

I need to be able to fit my indexes in RAM. This table will have a few
billion records and I have several other indexes with billions of
records and I'd like my DB to run well on a machine with 20G (preferred)
60G (max) RAM and not have to resort to sharding. These text fields can
be up to 1k each. A 1 billion row int8 index comes in around 2G.
Adding the text field to the index would probably put it at over 20G per
billion records.

Thanks.

-Randall

#5 Alban Hertroys
dalroi@solfertje.student.utwente.nl
In reply to: Randall Smith (#1)
Re: Trigger Performance

On 16 Jan 2011, at 5:38, Randall Smith wrote:

But you already do have an index on that id-field, so what's the problem with using a unique constraint? Its unique index could just replace the existing one.

I'm OK with indexing the 8 byte integer, but I don't want to index the
text field that could possibly be up to 1k. In some simple tests,
indexing both fields, the index was larger than the table.

Ah, it was about a different field - that makes much more sense :)

You probably don't need the full width of that field to be fairly certain that it's going to be unique from that point on anyway. The chance that the unique constraint kicks in when it shouldn't gets lower as the index on that field gets wider, and it's probably already sufficiently low with relatively few characters; it's also probably not that big a deal if it kicks in early in a few cases.

For example, if you only index the first 100 bytes of each of these fields, you are very unlikely to run into a situation where your constraint claims the field is not unique while it actually is.

That shrinks your index down by a factor of 10, and IMO 100 bytes is still on the wide side for a unique text column. I don't think many people successfully enter 100 characters in a row without ever making an error.
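Concretely, that could look like the following (the index name is
invented; substr rather than left keeps it compatible with 8.4):

```sql
-- Sketch of the prefix idea: enforce uniqueness on volume_id plus only
-- the first 100 characters of name. Two rows differing only beyond
-- character 100 would be (deliberately) rejected as duplicates.
CREATE UNIQUE INDEX t1_volume_name_prefix_key
    ON t1 (volume_id, substr(name, 1, 100));
```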

If instead the field contains computer-generated data, then there surely is an easier way to detect that this text is not going to be unique. You could store the values of the parameters used to generate that data for example, and put a unique constraint on those.

I'd also wonder what the value is of two distinct records with the same 1000 characters of text up to the last one. Are those distinct on purpose or is it more likely that someone made an error entering that field and accidentally managed to create a unique entry while in fact the unique constraint was supposed to kick in?

Besides that, you say there will be billions of records, but that text field needs to be unique? I really do hope the contents of that field will be computer-generated, or your users will get quite desperate coming up with new values to use after a while, especially if the contents need to make any sense.

But of course much depends on the nature of the data in your field. You haven't said anything about that.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Randall Smith (#1)
Re: Trigger Performance

Randall Smith <randall@tnr.cc> writes:

I've created a trigger that checks the uniqueness of two columns in a
table. Traditionally, one would use a unique constraint, but in my
case, the size of the unique index would be too large and some
performance loss is acceptable. However, the trigger performance seems
to be far below what's indicated by an explain analyze of the query used
in the trigger.

You realize of course that this is fundamentally broken and cannot be
trusted? Worrying about the performance seems rather pointless.

The reason it can't be trusted is that neither of two concurrent
insertions will see the other's uncommitted row.
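A trigger-based check can only be made safe by serializing concurrent
inserts of the same key, for example with a transaction-scoped advisory
lock. A sketch (not from this thread; assumes PostgreSQL 9.1+ for
pg_advisory_xact_lock):

```sql
-- Inside the BEFORE INSERT trigger function: take a transaction-level
-- advisory lock derived from the key being checked, so two concurrent
-- inserts of the same (volume_id, name) queue up and the second one
-- sees the first one's committed row. hashtext() collisions are
-- harmless here; they only cause unnecessary serialization.
PERFORM pg_advisory_xact_lock(hashtext(NEW.volume_id::text),
                              hashtext(NEW.name));
IF EXISTS (SELECT 1 FROM t1 WHERE
           volume_id = NEW.volume_id AND name = NEW.name) THEN
    RAISE EXCEPTION '% already exists on volume', NEW.name;
END IF;
```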

regards, tom lane

#7 Fredric Fredricson
Fredric.Fredricson@bonetmail.com
In reply to: Randall Smith (#1)
Re: Trigger Performance

On 01/15/2011 11:52 PM, Randall Smith wrote:

Hi,

I've created a trigger that checks the uniqueness of two columns in a
table. Traditionally, one would use a unique constraint, but in my
case, the size of the unique index would be too large and some
performance loss is acceptable. However, the trigger performance seems
to be far below what's indicated by an explain analyze of the query used
in the trigger.

The unique fields consist of an indexed int8 (volume_id) and a text
field (name). The average ratio of volume_id to name is 1 to 10,000.
The query I'm using to check uniqueness in the trigger is:

A possible work around could be to store an extra field that contains
the md5 hash of the text field and use that field in the index instead
of the text itself. The md5 sum could be calculated by your trigger.

/Fredric


#8 Jasen Betts
jasen@xnet.co.nz
In reply to: Randall Smith (#1)
Re: Trigger Performance

On 2011-01-16, Randall Smith <randall@tnr.cc> wrote:

Before reading: this is solved. It was an error on my part.

On Sun, 2011-01-16 at 03:46 +0000, Jasen Betts wrote:

In plpgsql IF is an implicit select.
<http://www.postgresql.org/docs/8.4/interactive/plpgsql-expressions.html>

IF EXISTS (SELECT 1 FROM t1 WHERE
volume_id = NEW.volume_id AND name = NEW.name) THEN
RAISE EXCEPTION '% already exists on volume', NEW.name;
END IF;

Thanks. Certainly more concise.

0.053 ms/record / 1,000 ms/sec * 10,000 records = .53 seconds

huh?

According to that stat, this lookup should be adding about 0.5 seconds
to 10,000 records,

why? what are you doing to 10,000 records?

Inserting them. Sorry, that was a critical omission on my part. The
trigger check is for inserts and I'm testing its performance by
inserting 10,000 records.

that'll have O(n log n) complexity (or worse); you can't simply multiply
by 10,000, because each row added to the index slows the average index
lookup time a little.

Turns out my EXPLAIN ANALYZE times were so low because I was querying
for a volume_id that had only a few rows. When I query for a volume_id
with 10,000 rows, it changes to 7 ms, which matches the performance I'm
seeing. That's acceptable to me because that's probably at the upper
end of what I'll see. 7 ms to check 10,000 text fields is actually
impressive to me.

how much disk (in bytes, and dollars) are you hoping to save by not
using the index.

I need to be able to fit my indexes in RAM.

why? Indices that don't fit are still useful. Doing 20 or so record
retrievals to confirm the absence of a record is likely to be slower
than doing 5 reads and a write or two to check and update a disk-based
index.

postgres isn't going to blindly load all the indices into RAM.

I need to be able to fit my indexes in RAM. This table will have a few
billion records [...] These text fields can be up to 1k each.

this query will create about 300,000 sample records with 1K of text.
Repeat it with different numbers in the first generate_series:
10001,20000
20001,30000
etc.
until you get 2 billion records.

then create your indices and do your testing.

insert into t1 select a::integer as volume_id,
md5((a*b)::text) || md5((a*b+1)::text) || md5((a*b+3)::text)
|| md5((a*b+4)::text) || md5((a*b+5)::text) || md5((a*b+6)::text)
|| md5((a*b+7)::text) || md5((a*b+8)::text) || md5((a*b+9)::text)
|| md5((a*b+11)::text) || md5((a*b+12)::text) || md5((a*b+13)::text)
|| md5((a*b+14)::text) || md5((a*b+15)::text) as name from
(select generate_series( 1 ,10000) as a ,generate_series(1,29) as b) as foo;
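If regenerating the batches by hand gets tedious, a loop can drive it.
A sketch (loop bound and shortened name are illustrative; DO blocks need
PostgreSQL 9.0+, and a cross join replaces the two generate_series calls
in the SELECT list so every version yields the same ~290,000 rows per
pass):

```sql
-- Automate the repeated batch inserts; raise the loop bound to
-- approach the target table size.
DO $$
BEGIN
    FOR batch IN 0..9 LOOP
        INSERT INTO t1 (volume_id, name)
        SELECT a::integer,
               md5((a*b)::text) || md5((a*b+1)::text)
        FROM generate_series(batch*10000 + 1, (batch+1)*10000) AS a
        CROSS JOIN generate_series(1, 29) AS b;
    END LOOP;
END;
$$;
```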

--
⚂⚃ 100% natural

#9 Jasen Betts
jasen@xnet.co.nz
In reply to: Randall Smith (#1)
Re: Trigger Performance

On 2011-01-17, Fredric Fredricson <Fredric.Fredricson@bonetmail.com> wrote:


On 01/15/2011 11:52 PM, Randall Smith wrote:

Hi,

I've created a trigger that checks the uniqueness of two columns in a
table. Traditionally, one would use a unique constraint, but in my
case, the size of the unique index would be too large and some
performance loss is acceptable. However, the trigger performance seems
to be far below what's indicated by an explain analyze of the query used
in the trigger.

The unique fields consist of an indexed int8 (volume_id) and a text
field (name). The average ratio of volume_id to name is 1 to 10,000.
The query I'm using to check uniqueness in the trigger is:

A possible work around could be to store an extra field that contains
the md5 hash of the text field and use that field in the index instead
of the text itself. The md5 sum could be calculated by your trigger.

you don't have to store it, only index on it.

create unique index t1_nodups on t1 (volume_id, md5(name));

--
⚂⚃ 100% natural