bytea encode performance issues

Started by Sim Zacksover 17 years ago30 messagesgeneral

sim@compulab.co.il

over 17 years ago

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I am using postgresql 8.2.7 on gentoo for my dbmail backend.
I am also testing it on mysql 5.

I am trying to figure out if I need to tune my database configuration or
if querying a bytea field is just not practical in postgresql.

Searching with the mysql database takes under a minute and with the
postgresql database it takes approximately 10. It gets better when I fix
up the query a little, such as removing the group by and having and
including the clause as part of the where, but not anywhere close to the
mysql level.

This is the query that is used (I know it is not as efficient as it
could be, but this is the query it comes with):

SELECT m.message_idnr,k.messageblk FROM dbmail_messageblks k JOIN
dbmail_physmessage p ON k.physmessage_id = p.id JOIN dbmail_messages m
ON p.id = m.physmessage_id WHERE mailbox_idnr = 8 AND status IN (0,1 )
AND k.is_header = '0' GROUP BY m.message_idnr,k.messageblk HAVING
ENCODE(k.messageblk::bytea,'escape') LIKE '%John%'

The messageblk field is a bytea in postgresql and a longblob in mysql.
The only difference in the query is the MySQL does not need the encode
function.

I have plugged the query into the psql and mysql command line
applications, so I could evaluate the query without the application.

The database is using autovacuum and the estimated rows and the actual
rows are almost the same, so I assume it is working. There are 310266
rows in the dbmail_messageblks table.

Can someone make a suggestion for tuning the database?

The explain of the query is:
"HashAggregate (cost=43648.11..43648.85 rows=74 width=753)"
" -> Nested Loop (cost=42999.83..43647.74 rows=74 width=753)"
" -> Merge Join (cost=42999.83..43043.35 rows=74 width=769)"
" Merge Cond: (k.physmessage_id = m.physmessage_id)"
" -> Sort (cost=39264.12..39267.59 rows=1388 width=753)"
" Sort Key: k.physmessage_id"
" -> Seq Scan on dbmail_messageblks k
(cost=0.00..39191.68 rows=1388 width=753)"
" Filter: ((is_header = 0::smallint) AND
(encode(messageblk, 'escape'::text) ~~ '%John%'::text))"
" -> Sort (cost=3735.71..3754.59 rows=7552 width=16)"
" Sort Key: m.physmessage_id"
" -> Bitmap Heap Scan on dbmail_messages m
(cost=385.98..3249.26 rows=7552 width=16)"
" Recheck Cond: ((mailbox_idnr = 8) AND (status
= ANY ('{0,1}'::integer[])))"
" -> Bitmap Index Scan on dbmail_messages_8
(cost=0.00..384.10 rows=7552 width=0)"
" Index Cond: ((mailbox_idnr = 8) AND
(status = ANY ('{0,1}'::integer[])))"
" -> Index Scan using dbmail_physmessage_pkey on
dbmail_physmessage p (cost=0.00..8.15 rows=1 width=8)"
" Index Cond: (k.physmessage_id = p.id)"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkiVhHwACgkQjDX6szCBa+o6wACgwa05ZbUBL4Ef18N4JJHQ2SP1
gfwAnjIA14QktV/Qs1TrPiY+Ma+rmJht
=WOQM
-----END PGP SIGNATURE-----

Alban Hertroys

dalroi@solfertje.student.utwente.nl

over 17 years ago

In reply to: Sim Zacks (#1)

Re: bytea encode performance issues

On Aug 3, 2008, at 12:12 PM, Sim Zacks wrote:

This is the query that is used (I know it is not as efficient as it
could be, but this is the query it comes with):

SELECT m.message_idnr,k.messageblk FROM dbmail_messageblks k JOIN
dbmail_physmessage p ON k.physmessage_id = p.id JOIN dbmail_messages m
ON p.id = m.physmessage_id WHERE mailbox_idnr = 8 AND status IN (0,1 )
AND k.is_header = '0' GROUP BY m.message_idnr,k.messageblk HAVING
ENCODE(k.messageblk::bytea,'escape') LIKE '%John%'

That LIKE operator is probably your problem. An unbounded LIKE like
that (with a wildcard on both sides) means no index can be used,
hence you get a sequential scan.

There are apparently some possibilities with the new GIN indexes (or
maybe even using GIST), but I haven't had an opportunity to try those
yet. There were some messages about just that on this list recently.

If you create an index, make sure you create a _functional_ index
over ENCODE(messageblk, 'escape').

The messageblk field is a bytea in postgresql and a longblob in
mysql.
The only difference in the query is the MySQL does not need the encode
function.

Since when is e-mail binary data? I don't quite see why you'd use a
bytea field instead of text. If your problem is character encoding,
then just don't store that ("encode" the DB using SQLASCII).

Can someone make a suggestion for tuning the database?

An explain analyze would have been a bit more useful, but the biggest
issue is probably the seqscan.

The explain of the query is:
"HashAggregate (cost=43648.11..43648.85 rows=74 width=753)"
" -> Nested Loop (cost=42999.83..43647.74 rows=74 width=753)"
" -> Merge Join (cost=42999.83..43043.35 rows=74 width=769)"
" Merge Cond: (k.physmessage_id = m.physmessage_id)"
" -> Sort (cost=39264.12..39267.59 rows=1388
width=753)"
" Sort Key: k.physmessage_id"
" -> Seq Scan on dbmail_messageblks k
(cost=0.00..39191.68 rows=1388 width=753)"
" Filter: ((is_header = 0::smallint) AND
(encode(messageblk, 'escape'::text) ~~ '%John%'::text))"

Here is your problem, a sequential scan over a presumably large
table. It's either caused by the LIKE expression or by the lack of a
functional index on messageblk, or both.

If you change the type of the messageblk field to text you won't need
a functional index anymore (although that only saves time on index
creation and inserts/updates).

" -> Sort (cost=3735.71..3754.59 rows=7552 width=16)"
" Sort Key: m.physmessage_id"
" -> Bitmap Heap Scan on dbmail_messages m
(cost=385.98..3249.26 rows=7552 width=16)"
" Recheck Cond: ((mailbox_idnr = 8) AND
(status
= ANY ('{0,1}'::integer[])))"
" -> Bitmap Index Scan on dbmail_messages_8
(cost=0.00..384.10 rows=7552 width=0)"
" Index Cond: ((mailbox_idnr = 8) AND
(status = ANY ('{0,1}'::integer[])))"
" -> Index Scan using dbmail_physmessage_pkey on
dbmail_physmessage p (cost=0.00..8.15 rows=1 width=8)"
" Index Cond: (k.physmessage_id = p.id)"

I notice some other oddities in that query/design. Why is is_header a
smallint instead of a boolean? I'm assuming this is for compatibility
with an other database?

Why use status IN (0, 1) instead of more descriptive keys? Is it even
constrained this way, or could arbitrary numbers end up as status
(say 99) and if so, what happens to those messages?

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

!DSPAM:737,48958f34243483105918576!

Sim Zacks

sim@compulab.co.il

over 17 years ago

In reply to: Alban Hertroys (#2)

Re: bytea encode performance issues

The LIKE operator is likely the problem, but it is a critical part of an
email application. Searches are done by, "Show me all emails containing
the following word."

I've tried using TSearch2's full text index. It made the query 50%
faster, taking 5 minutes. This is still not even close to the less then
a minute in MySQL.

That LIKE operator is probably your problem. An unbounded LIKE like that
(with a wildcard on both sides) means no index can be used, hence you
get a sequential scan.

There are apparently some possibilities with the new GIN indexes (or
maybe even using GIST), but I haven't had an opportunity to try those
yet. There were some messages about just that on this list recently.

I don't think a functional index (or anything other then a FTI) would
accomplish anything, being that I am doing unbounded Likes.

If you create an index, make sure you create a _functional_ index over
ENCODE(messageblk, 'escape').

Email is binary when it contains attachments. I actually planned on
using an ASCII encoding, but the dbmail people specifically said not to.
I don't know if they were speaking from experience, or because ASCII
sounds bad.

Since when is e-mail binary data? I don't quite see why you'd use a
bytea field instead of text. If your problem is character encoding, then
just don't store that ("encode" the DB using SQLASCII).

As I mentioned, this is the system that came with dbmail. It runs on
both PostGresql and MySQL, so they may have done some compatibility
things. There are 4 statuses possible, 0,1,2,3 if you use the database
through the software then a 99 could never appear there.

Show quoted text

I notice some other oddities in that query/design. Why is is_header a
smallint instead of a boolean? I'm assuming this is for compatibility
with an other database?

Why use status IN (0, 1) instead of more descriptive keys? Is it even
constrained this way, or could arbitrary numbers end up as status (say
99) and if so, what happens to those messages?

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

!DSPAM:824,48958f30243481673380013!

Alban Hertroys

dalroi@solfertje.student.utwente.nl

over 17 years ago

In reply to: Sim Zacks (#3)

Re: bytea encode performance issues

On Aug 3, 2008, at 2:36 PM, Sim Zacks wrote:

The LIKE operator is likely the problem, but it is a critical part
of an
email application. Searches are done by, "Show me all emails
containing
the following word."

I've tried using TSearch2's full text index. It made the query 50%
faster, taking 5 minutes. This is still not even close to the less
then
a minute in MySQL.

That LIKE operator is probably your problem. An unbounded LIKE
like that
(with a wildcard on both sides) means no index can be used, hence you
get a sequential scan.

There are apparently some possibilities with the new GIN indexes (or
maybe even using GIST), but I haven't had an opportunity to try those
yet. There were some messages about just that on this list recently.

I don't think a functional index (or anything other then a FTI) would
accomplish anything, being that I am doing unbounded Likes.

That's why I suggested to use a text field instead of bytea.
IIRC, You can have an index on word triplets and use tsearch. I don't
have intimate knowledge on how that works though, hopefully other
people will chime in here.

Without the need to convert each row before comparing it, and with an
appropriate index, that should significantly speed up your queries.

If you create an index, make sure you create a _functional_ index
over
ENCODE(messageblk, 'escape').

Email is binary when it contains attachments. I actually planned on
using an ASCII encoding, but the dbmail people specifically said
not to.
I don't know if they were speaking from experience, or because ASCII
sounds bad.

It shouldn't be; those attachments are MIME or UU encoded, are they not?
Don't confuse ASCII and SQLASCII. The latter accepts characters from
any encoding, which is probably what you want.

Since when is e-mail binary data? I don't quite see why you'd use a
bytea field instead of text. If your problem is character
encoding, then
just don't store that ("encode" the DB using SQLASCII).

As I mentioned, this is the system that came with dbmail. It runs on
both PostGresql and MySQL, so they may have done some compatibility
things. There are 4 statuses possible, 0,1,2,3 if you use the database
through the software then a 99 could never appear there.

The software isn't the only client that might connect to the database.
It is usually bad practice to put data constraint logic in the client
instead of in the database. Especially since in client code there are
usually multiple sections of code that have to deal with those
constraints, which tends to result in small differences in their
handling.

Next to that, if those statuses would have a proper foreign key
constraint, it would be very easy to add labels to each status in a
way they would make a bit more sense than 0, 1, 2, 3.
I expect the label would be a sufficient foreign key by itself
though, no need for those silly numbers.

Well, there's probably not much you can do about that, being just a
user of dbMail.

I notice some other oddities in that query/design. Why is is_header a
smallint instead of a boolean? I'm assuming this is for compatibility
with an other database?

Why use status IN (0, 1) instead of more descriptive keys? Is it even
constrained this way, or could arbitrary numbers end up as status
(say
99) and if so, what happens to those messages?

You have a very odd way of quoting. You don't top post as such, but
you top post in context... Haven't seen that one before. Usually
people reply _below_ a section, not above it ;) It made my reply a
bit harder to write.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

!DSPAM:737,4895b34b243488085013917!

Lew

com.lewscanon@lew

over 17 years ago

In reply to: Sim Zacks (#3)

[OT] newsreader issue? (Was: bytea encode performance issues)

Sim Zacks wrote:

(quoting someone:)

That LIKE operator is probably your problem. An unbounded LIKE like that
(with a wildcard on both sides) means no index can be used, hence you
get a sequential scan.

Was the message to which you responded posted to the newsgroup? It isn't
appearing in my newsreader.

Who wrote the message you quoted (you failed to cite the source)?

--
Lew

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: Sim Zacks (#1)

Re: bytea encode performance issues

Sim Zacks <sim@compulab.co.il> writes:

The explain of the query is:

Could we see EXPLAIN ANALYZE, not EXPLAIN? Without actual facts
to work from, any suggestions would be mere guesswork.

Also, what can you tell us about the sizes of the messageblk
strings (max and avg would be interesting)?

regards, tom lane

Scott Marlowe

scott.marlowe@gmail.com

over 17 years ago

In reply to: Lew (#5)

Re: [OT] newsreader issue? (Was: bytea encode performance issues)

The com.lewscanon@lew email address is invalid. I tried to send it this email:

If I remember correctly, the news feed is gatewayed off the mailing
list, so it's possible for a message to the list to not appear in the
group if it gets dropped at the gateway. Sorry if this is redundant
info for you.

If you don't wish to receive spam on your regular email address
(understandable) then either switch to reading and posting with a
gmail account (got lots of invites, just ask) or set up a separate
account to receive emails from pgsql and set up spam assassin and (/
or) a couple of whitelists and use that. It's kinda rude to ask me a
question on a mailing list with an email address I can't respond to.
I spend time writing up an answer that only you needed to see, and
then can't send it to YOU, but only the whole list.

Show quoted text

On Sun, Aug 3, 2008 at 8:03 AM, Lew <com.lewscanon@lew> wrote:

Sim Zacks wrote:

(quoting someone:)

That LIKE operator is probably your problem. An unbounded LIKE like that
(with a wildcard on both sides) means no index can be used, hence you
get a sequential scan.

Was the message to which you responded posted to the newsgroup? It isn't
appearing in my newsreader.

Who wrote the message you quoted (you failed to cite the source)?

--
Lew

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Andrej Ricnik-Bay

andrej.groups@gmail.com

over 17 years ago

In reply to: Lew (#5)

Re: [OT] newsreader issue? (Was: bytea encode performance issues)

On 04/08/2008, Lew <com.lewscanon@lew> wrote:

Was the message to which you responded posted to the newsgroup? It isn't
appearing in my newsreader.

Who wrote the message you quoted (you failed to cite the source)?

He was quoting Alban Hertroys, and it appeared on the general mailing list
(I didn't even know there was a Newsgroup).

There seems to be a problem with your mail address, however ... ;}

Scott Marlowe

scott.marlowe@gmail.com

over 17 years ago

In reply to: Scott Marlowe (#7)

Re: [OT] newsreader issue? (Was: bytea encode performance issues)

One last thing. I'd rather my emails just get dropped silently if
that's the minimum someone can do. Use a valid email address that
goes to /dev/null and I'll be happy. You may miss a few things sent
directly to you, but since that's not what you want anyway, it's no
big loss, right?

Show quoted text

On Sun, Aug 3, 2008 at 7:59 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:

The com.lewscanon@lew email address is invalid. I tried to send it this email:

If I remember correctly, the news feed is gatewayed off the mailing
list, so it's possible for a message to the list to not appear in the
group if it gets dropped at the gateway. Sorry if this is redundant
info for you.

If you don't wish to receive spam on your regular email address
(understandable) then either switch to reading and posting with a
gmail account (got lots of invites, just ask) or set up a separate
account to receive emails from pgsql and set up spam assassin and (/
or) a couple of whitelists and use that. It's kinda rude to ask me a
question on a mailing list with an email address I can't respond to.
I spend time writing up an answer that only you needed to see, and
then can't send it to YOU, but only the whole list.

On Sun, Aug 3, 2008 at 8:03 AM, Lew <com.lewscanon@lew> wrote:

Sim Zacks wrote:

(quoting someone:)

That LIKE operator is probably your problem. An unbounded LIKE like that
(with a wildcard on both sides) means no index can be used, hence you
get a sequential scan.

Was the message to which you responded posted to the newsgroup? It isn't
appearing in my newsreader.

Who wrote the message you quoted (you failed to cite the source)?

--
Lew

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#10

Scott Marlowe

scott.marlowe@gmail.com

over 17 years ago

In reply to: Andrej Ricnik-Bay (#8)

Re: [OT] newsreader issue? (Was: bytea encode performance issues)

On Sun, Aug 3, 2008 at 8:01 PM, Andrej Ricnik-Bay
<andrej.groups@gmail.com> wrote:

On 04/08/2008, Lew <com.lewscanon@lew> wrote:

Was the message to which you responded posted to the newsgroup? It isn't
appearing in my newsreader.

Who wrote the message you quoted (you failed to cite the source)?

He was quoting Alban Hertroys, and it appeared on the general mailing list
(I didn't even know there was a Newsgroup).

Wow! So, maybe everybody did need to see then. Still...

Show quoted text

There seems to be a problem with your mail address, however ... ;}

#11

Andrej Ricnik-Bay

andrej.groups@gmail.com

over 17 years ago

In reply to: Scott Marlowe (#9)

Re: [OT] newsreader issue? (Was: bytea encode performance issues)

On 04/08/2008, Scott Marlowe <scott.marlowe@gmail.com> wrote:

One last thing. I'd rather my emails just get dropped silently if
that's the minimum someone can do. Use a valid email address that
goes to /dev/null and I'll be happy. You may miss a few things sent
directly to you, but since that's not what you want anyway, it's no
big loss, right?

Aye ... dodgy spam-protection methods like that really suck.

#12

Sim Zacks

sim@compulab.co.il

over 17 years ago

In reply to: Lew (#5)

Re: [OT] newsreader issue? (Was: bytea encode performance issues)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I got the response by email, but it was also addressed to the mailing
list. My response was also an email sent to the mailing list, not to the
newsgroup. I got one other response, that I do not see in the newsgroup.
And I will reply to it also using the mailing list.

Sim

Lew wrote:

Sim Zacks wrote:

(quoting someone:)

That LIKE operator is probably your problem. An unbounded LIKE like that
(with a wildcard on both sides) means no index can be used, hence you
get a sequential scan.

Was the message to which you responded posted to the newsgroup? It
isn't appearing in my newsreader.

Who wrote the message you quoted (you failed to cite the source)?

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkiWk20ACgkQjDX6szCBa+pwzwCffdE3KZAg0f2TjUiq+gFCOrML
HM4An2wcV9G9aAE+94DH6Vwc6deMIHB4
=N0v/
-----END PGP SIGNATURE-----

#13

Sim Zacks

sim@compulab.co.il

over 17 years ago

In reply to: Tom Lane (#6)

Re: bytea encode performance issues

Tom Lane wrote:

Could we see EXPLAIN ANALYZE, not EXPLAIN? Without actual facts
to work from, any suggestions would be mere guesswork.

This was taken immediately after a vacuum analyze on the database.

"HashAggregate (cost=41596.68..41596.84 rows=16 width=764) (actual
time=488263.802..488263.837 rows=40 loops=1)"
" -> Nested Loop (cost=0.00..41596.60 rows=16 width=764) (actual
time=23375.445..488260.311 rows=40 loops=1)"
" -> Nested Loop (cost=0.00..41463.32 rows=16 width=780)
(actual time=23375.344..488231.994 rows=40 loops=1)"
" -> Seq Scan on dbmail_messageblks k
(cost=0.00..39193.21 rows=259 width=764) (actual time=30.662..486585.126
rows=2107 loops=1)"
" Filter: ((is_header = 0::smallint) AND
(encode(messageblk, 'escape'::text) ~~ '%Yossi%'::text))"
" -> Index Scan using dbmail_messages_2 on dbmail_messages
m (cost=0.00..8.75 rows=1 width=16) (actual time=0.777..0.777 rows=0
loops=2107)"
" Index Cond: (m.physmessage_id = k.physmessage_id)"
" Filter: ((mailbox_idnr = 8) AND (status = ANY
('{0,1}'::integer[])))"
" -> Index Scan using dbmail_physmessage_pkey on
dbmail_physmessage p (cost=0.00..8.32 rows=1 width=8) (actual
time=0.701..0.703 rows=1 loops=40)"
" Index Cond: (k.physmessage_id = p.id)"
"Total runtime: 488264.192 ms"

Also, what can you tell us about the sizes of the messageblk
strings (max and avg would be interesting)?

select max(length(messageblk)),avg(length(messageblk)) from
dbmail_messageblks
MAX AVG
532259; 48115.630147120314

#14

Tomasz Ostrowski

tometzky@batory.org.pl

over 17 years ago

In reply to: Sim Zacks (#1)

Re: bytea encode performance issues

On 2008-08-03 12:12, Sim Zacks wrote:

SELECT m.message_idnr,k.messageblk
FROM dbmail_messageblks k
JOIN dbmail_physmessage p ON k.physmessage_id = p.id
JOIN dbmail_messages m ON p.id = m.physmessage_id
WHERE
mailbox_idnr = 8
AND status IN (0,1 )
AND k.is_header = '0'
GROUP BY m.message_idnr,k.messageblk
HAVING ENCODE(k.messageblk::bytea,'escape') LIKE '%John%'

What is this encode() for? I think it is not needed and kills
performance, as it needs to copy every message body in memory, possibly
several times.

Why not just "HAVING k.messageblk LIKE '%John%'"?

Try this:

=> \timing

=> create temporary table test as
select
decode(
repeat(
'lorem ipsum dolor sit amet '
||s::text||E'\n'
,1000
),
'escape'
) as a
from generate_series(1,10000) as s;
SELECT
Time: 10063.807 ms

=> select count(*) from test where a like '%John%';
count
-------
0
(1 row)

Time: 1280.973 ms

=> select count(*) from test where encode(a,'escape') like '%John%';
count
-------
0
(1 row)

Time: 5690.097 ms

Without encode search is 5 times faster. And for bigger bytea a
difference is even worse.

Even better:

=> select count(*) from test where position('John' in a) != 0;
select count(*) from test where position('John' in a) != 0;
count
-------
0
(1 row)

Time: 1098.768 ms

Regards
Tometzky
--
...although Eating Honey was a very good thing to do, there was a
moment just before you began to eat it which was better than when you
were...
Winnie the Pooh

#15

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: Sim Zacks (#13)

Re: bytea encode performance issues

Sim Zacks <sim@compulab.co.il> writes:

Tom Lane wrote:

Could we see EXPLAIN ANALYZE, not EXPLAIN? Without actual facts
to work from, any suggestions would be mere guesswork.

" -> Seq Scan on dbmail_messageblks k
(cost=0.00..39193.21 rows=259 width=764) (actual time=30.662..486585.126
rows=2107 loops=1)"
" Filter: ((is_header = 0::smallint) AND
(encode(messageblk, 'escape'::text) ~~ '%Yossi%'::text))"

okay, the time really is being spent in the seqscan ...

Also, what can you tell us about the sizes of the messageblk
strings (max and avg would be interesting)?

select max(length(messageblk)),avg(length(messageblk)) from
dbmail_messageblks
MAX AVG
532259; 48115.630147120314

... but given that, I wonder whether the cost isn't from fetching
the toasted messageblk data, and nothing directly to do with either
the encode() call or the ~~ test. It would be interesting to compare
the results of

explain analyze select encode(messageblk, 'escape') ~~ '%Yossi%'
from dbmail_messageblks where is_header = 0;

explain analyze select encode(messageblk, 'escape')
from dbmail_messageblks where is_header = 0;

explain analyze select messageblk = 'X'
from dbmail_messageblks where is_header = 0;

explain analyze select length(messageblk)
from dbmail_messageblks where is_header = 0;

(length is chosen with malice aforethought: unlike the other cases,
it doesn't require detoasting a toasted input)

regards, tom lane

#16

Sim Zacks

sim@compulab.co.il

over 17 years ago

In reply to: Tom Lane (#15)

Re: bytea encode performance issues

Results below:

... but given that, I wonder whether the cost isn't from fetching
the toasted messageblk data, and nothing directly to do with either
the encode() call or the ~~ test. It would be interesting to compare
the results of

explain analyze select encode(messageblk, 'escape') ~~ '%Yossi%'
from dbmail_messageblks where is_header = 0;

"Seq Scan on dbmail_messageblks (cost=0.00..38449.06 rows=162096
width=756) (actual time=0.071..492776.008 rows=166748 loops=1)"
" Filter: (is_header = 0)"
"Total runtime: 492988.410 ms"

explain analyze select encode(messageblk, 'escape')
from dbmail_messageblks where is_header = 0;

"Seq Scan on dbmail_messageblks (cost=0.00..38043.81 rows=162096
width=756) (actual time=16.008..306408.633 rows=166750 loops=1)"
" Filter: (is_header = 0)"
"Total runtime: 306585.369 ms"

explain analyze select messageblk = 'X'
from dbmail_messageblks where is_header = 0;

"Seq Scan on dbmail_messageblks (cost=0.00..38043.81 rows=162096
width=756) (actual time=18.169..251212.223 rows=166754 loops=1)"
" Filter: (is_header = 0)"
"Total runtime: 251384.900 ms"

explain analyze select length(messageblk)
from dbmail_messageblks where is_header = 0;

"Seq Scan on dbmail_messageblks (cost=0.00..38043.81 rows=162096
width=756) (actual time=20.436..2585.098 rows=166757 loops=1)"
" Filter: (is_header = 0)"
"Total runtime: 2673.840 ms"

Show quoted text

(length is chosen with malice aforethought: unlike the other cases,
it doesn't require detoasting a toasted input)

regards, tom lane

#17

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: Sim Zacks (#16)

Re: bytea encode performance issues

Sim Zacks <sim@compulab.co.il> writes:

Results below:

... but given that, I wonder whether the cost isn't from fetching
the toasted messageblk data, and nothing directly to do with either
the encode() call or the ~~ test. It would be interesting to compare
the results of

Okay, so subtracting the totals we've got:

2.7 sec to scan the table proper

248.7 sec to fetch the toasted datums (well, this test also includes
an equality comparison, but since the text lengths are generally
going to be different, that really should be negligible)

55.2 sec to do the encode() calls

186.4 sec to do the LIKE comparisons

So getting rid of the encode() would help a bit, but it's hardly the
main source of your problem.

We've seen complaints about toast fetch time before. I don't think
there's any really simple solution. You could experiment with disabling
compression (SET STORAGE external) but I'd bet on that being a net loss
unless the data is only poorly compressible.

If the table is not updated very often, it's possible that doing a
CLUSTER every so often would help. I'm not 100% sure but I think that
would result in the toast table being rewritten in the same order as the
newly-built main table, which ought to cut down on the cost of fetching.

Also, what database encoding are you using? I note from the CVS logs
that some post-8.2 work was done to make LIKE faster in multibyte
encodings. (Though if you were doing the LIKE directly in bytea, that
wouldn't matter ... what was the reason for the encode() call again?)

regards, tom lane

#18

Sim Zacks

sim@compulab.co.il

over 17 years ago

In reply to: Tom Lane (#17)

Re: bytea encode performance issues

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Tom Lane wrote:

We've seen complaints about toast fetch time before. I don't think
there's any really simple solution. You could experiment with disabling
compression (SET STORAGE external) but I'd bet on that being a net loss
unless the data is only poorly compressible.

I am trying it with External and then I'll try it with Plain.
However, this is a hugely long process.
After the alter table, then I have to update each row with an
update dbmail_messageblks set messageblk=messageblk;
so that it uses the new storage.
After that I have to vacuum analyze (which is taking over an hour so
far, most of it on toast).
After this is complete, I'll test the examples again and go through the
process using Plain.

If the table is not updated very often, it's possible that doing a
CLUSTER every so often would help. I'm not 100% sure but I think that
would result in the toast table being rewritten in the same order as the
newly-built main table, which ought to cut down on the cost of fetching.

This is an email table, it is never updated, but constant inserts.
Deletes only happen once a week. I'll try the cluster after I try the
storage changes.

Also, what database encoding are you using? I note from the CVS logs
that some post-8.2 work was done to make LIKE faster in multibyte
encodings. (Though if you were doing the LIKE directly in bytea, that
wouldn't matter ... what was the reason for the encode() call again?)

We are using UTF-8, and I am testing SQL-ASCII at the moment. DBMail is
a pre-built application, so until I am ready to start playing with its
internals I don't really have a choice about a number of its features.
The reason for the bytea is because of the multiple encodings, I have
suggested using SQL-ASCII to them and then it will be possible to use a
text datatype.
I don't know the reason for using the encode, I assumed that it was
because bytea wouldn't take a LIKE, but I see that I was mistaken. It
could be that in an earlier release LIKE was not supported against
bytea, but I don't know that for sure.

Thanks for your help
Sim
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkiZpDoACgkQjDX6szCBa+pJVACfbkAQuvsOqCCFdlMzpC1rx5yp
KpAAoIV17U+gKjXcDYhlOjRIE1PHUbaK
=A+Ru
-----END PGP SIGNATURE-----

#19

Tom Lane

tgl@sss.pgh.pa.us

over 17 years ago

In reply to: Sim Zacks (#18)

Re: bytea encode performance issues

Sim Zacks <sim@compulab.co.il> writes:

After the alter table, then I have to update each row with an
update dbmail_messageblks set messageblk=messageblk;
so that it uses the new storage.

I wouldn't actually bet on that changing anything at all ...
I'd try something like messageblk = messageblk || '' to make
completely sure the value gets detoasted.

regards, tom lane

#20

Richard Huxton

dev@archonet.com

over 17 years ago

In reply to: Sim Zacks (#18)

Re: bytea encode performance issues

Sim Zacks wrote:

DBMail is
a pre-built application, so until I am ready to start playing with its
internals I don't really have a choice about a number of its features.

Have you heard of this? Might be worth a quick look:

http://www.archiveopteryx.org/overview

--
Richard Huxton
Archonet Ltd

#21