BUG #5021: ts_parse doesn't recognize email addresses with underscores
The following bug has been logged online:
Bug reference: 5021
Logged by: Dan O'Hara
Email address: danarasoftware@gmail.com
PostgreSQL version: 8.3.7
Operating system: win32
Description: ts_parse doesn't recognize email addresses with
underscores
Details:
In the following example,
select distinct token as email
On Fri, Aug 28, 2009 at 9:59 AM, Dan O'Hara <danarasoftware@gmail.com> wrote:
The following bug has been logged online:
Bug reference: 5021
Logged by: Dan O'Hara
Email address: danarasoftware@gmail.com
PostgreSQL version: 8.3.7
Operating system: win32
Description: ts_parse doesn't recognize email addresses with
underscores
Details:In the following example,
select distinct token as email
from ts_parse('default', ' first_last@yahoo.com ' )
where tokid = 4ts_parse returns last@yahoo.com rather than first_last@yahoo.com It seems
that any text prior to the underscore is truncated. If the portion
following the underscore is only numeric, such as this example,select distinct token as email
from ts_parse('default', ' bill_2000@yahoo.com ' )
where tokid = 4then ts_parse returns nothing at all.
section 3.2.3 of RFC 5322 indicates that underscores are valid characters in
an email address.
I don't think this has much to do with email addresses. If you do:
select token from ts_parse('a_b');
...you get three tokens. In your case you're pulling out the fourth
token, but some of your examples don't have four tokens, so then you
get nothing at all.
I'm not real familiar with ts_parse(), but I'm thinking that it
doesn't have any special casing for email addresses and is just
intended to parse text for full-text-search - in which case splitting
on _ is a pretty good algorithm.
...Robert
Thanks for having a look at this bug.
According to section 12.8.2 of the postgres manual, ts_parse is
supposed to recognize different types of data, one of which (#4) is an
email address.
The list of recognized data formats for parse can be selected via this query:
SELECT * FROM ts_token_type('default');
The example in the bug I reported is valid email address, according to
the RFC, but isn't recognized as such by the full text search in
postgres. This bug will have a real impact on anybody using ts
functions to locate email addresses, as only some of them are found in
the query.
Regards
Dan
On Thu, Oct 22, 2009 at 12:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 28, 2009 at 9:59 AM, Dan O'Hara <danarasoftware@gmail.com> wrote:
The following bug has been logged online:
Bug reference: 5021
Logged by: Dan O'Hara
Email address: danarasoftware@gmail.com
PostgreSQL version: 8.3.7
Operating system: win32
Description: ts_parse doesn't recognize email addresses with
underscores
Details:In the following example,
select distinct token as email
from ts_parse('default', ' first_last@yahoo.com ' )
where tokid = 4ts_parse returns last@yahoo.com rather than first_last@yahoo.com It seems
that any text prior to the underscore is truncated. If the portion
following the underscore is only numeric, such as this example,select distinct token as email
from ts_parse('default', ' bill_2000@yahoo.com ' )
where tokid = 4then ts_parse returns nothing at all.
section 3.2.3 of RFC 5322 indicates that underscores are valid characters in
an email address.I don't think this has much to do with email addresses. If you do:
select token from ts_parse('a_b');
...you get three tokens. In your case you're pulling out the fourth
token, but some of your examples don't have four tokens, so then you
get nothing at all.I'm not real familiar with ts_parse(), but I'm thinking that it
doesn't have any special casing for email addresses and is just
intended to parse text for full-text-search - in which case splitting
on _ is a pretty good algorithm....Robert
--
-------------------------------------------------------------------
Dan O'Hara
Danara Software Systems, Inc.
danarasoftware@gmail.com
613 288-8733
Robert Haas escreveu:
I'm not real familiar with ts_parse(), but I'm thinking that it
doesn't have any special casing for email addresses and is just
intended to parse text for full-text-search - in which case splitting
on _ is a pretty good algorithm.
It is a bug. The tsearch claims to identify types of tokens but it doesn't
correctly identify any valid e-mail addresses. As Dan stated ts_parse() fails
to recognize an e-mail address. For example, foo+bar@baz.com is a valid e-mail
but the function fails to report that.
It is not that simple to identify an e-mail address that agrees with RFC. As
that code is a state machine, IMHO it decides too early (when it finds _) that
that string is not an e-mail address. AFAIR, that's not an one-line fix.
euler=# select distinct token as email from ts_parse('default',
'foo.bar@baz.com');
email
─────────────────
foo.bar@baz.com
(1 row)
euler=# select distinct token as email from ts_parse('default',
'foo+bar@baz.com');
email
─────────────
foo
+
bar@baz.com
(3 rows)
euler=# select distinct token as email from ts_parse('default',
'foo_bar@baz.com');
email
─────────────
foo
bar@baz.com
_
(3 rows)
--
Euler Taveira de Oliveira
http://www.timbira.com/
I agree that it isn't easy to determine if given text is a valid email
address. As I couldn't use ts_parse, I ended up using a regex, which
worked substantially better at pulling out the emails from the text
stream. I haven't looked at the code, but perhaps it is possible to
do the same thing here? Even a regex that is 99% correct would be
better than the current tokenizer which is only right about 80-85% of
the time.
My workaround looked something like this:
select regexp_matches(resumetext,E'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}','gi')
as email
from "Resume"
cheers
Dan
On Thu, Oct 22, 2009 at 3:39 PM, Euler Taveira de Oliveira
<euler@timbira.com> wrote:
Robert Haas escreveu:
I'm not real familiar with ts_parse(), but I'm thinking that it
doesn't have any special casing for email addresses and is just
intended to parse text for full-text-search - in which case splitting
on _ is a pretty good algorithm.It is a bug. The tsearch claims to identify types of tokens but it doesn't
correctly identify any valid e-mail addresses. As Dan stated ts_parse() fails
to recognize an e-mail address. For example, foo+bar@baz.com is a valid e-mail
but the function fails to report that.It is not that simple to identify an e-mail address that agrees with RFC. As
that code is a state machine, IMHO it decides too early (when it finds _) that
that string is not an e-mail address. AFAIR, that's not an one-line fix.euler=# select distinct token as email from ts_parse('default',
'foo.bar@baz.com');
─────────────────
foo.bar@baz.com
(1 row)euler=# select distinct token as email from ts_parse('default',
'foo+bar@baz.com');
─────────────
foo
+
bar@baz.com
(3 rows)euler=# select distinct token as email from ts_parse('default',
'foo_bar@baz.com');
─────────────
foo
bar@baz.com
_
(3 rows)--
Euler Taveira de Oliveira
http://www.timbira.com/
--
-------------------------------------------------------------------
Dan O'Hara
Danara Software Systems, Inc.
danarasoftware@gmail.com
613 288-8733
Euler Taveira de Oliveira escribi�:
Robert Haas escreveu:
I'm not real familiar with ts_parse(), but I'm thinking that it
doesn't have any special casing for email addresses and is just
intended to parse text for full-text-search - in which case splitting
on _ is a pretty good algorithm.It is a bug. The tsearch claims to identify types of tokens but it doesn't
correctly identify any valid e-mail addresses. As Dan stated ts_parse() fails
to recognize an e-mail address. For example, foo+bar@baz.com is a valid e-mail
but the function fails to report that.
It is similarly too-simplistic for other cases too, like file names
(particularly where Windows filenames are concerned).
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Oleg, Teodor, can you look at this? I tried to fix it in wparser_def.c,
but couldn't figure out how. Thanks.
---------------------------------------------------------------------------
Dan O'Hara wrote:
The following bug has been logged online:
Bug reference: 5021
Logged by: Dan O'Hara
Email address: danarasoftware@gmail.com
PostgreSQL version: 8.3.7
Operating system: win32
Description: ts_parse doesn't recognize email addresses with
underscores
Details:In the following example,
select distinct token as email
from ts_parse('default', ' first_last@yahoo.com ' )
where tokid = 4ts_parse returns last@yahoo.com rather than first_last@yahoo.com It seems
that any text prior to the underscore is truncated. If the portion
following the underscore is only numeric, such as this example,select distinct token as email
from ts_parse('default', ' bill_2000@yahoo.com ' )
where tokid = 4then ts_parse returns nothing at all.
section 3.2.3 of RFC 5322 indicates that underscores are valid characters in
an email address.http://tools.ietf.org/html/rfc5322
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
+ If your life is a hard drive, Christ can be your backup. +
Oleg, Teodor, can you look at this? I tried to fix it in wparser_def.c,
but couldn't figure out how. Thanks.select distinct token as email
from ts_parse('default', ' first_last@yahoo.com ' )
where tokid = 4
Patch in attachment, it allows underscore in the middle of local part of email
in in host name (similarly to '-' character).
I'm not sure about backpatching, because it could break existing search
configuration.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachments:
Teodor Sigaev wrote:
Oleg, Teodor, can you look at this? I tried to fix it in wparser_def.c,
but couldn't figure out how. Thanks.select distinct token as email
from ts_parse('default', ' first_last@yahoo.com ' )
where tokid = 4Patch in attachment, it allows underscore in the middle of local part of email
in in host name (similarly to '-' character).
Thanks, patch applied.
I'm not sure about backpatching, because it could break existing search
configuration.
Agreed. I don't think this warrants backpatching.
Here is the before behavior:
test=> select ts_parse('default', ' first_last@yahoo.com ' );
ts_parse
--------------------
(12," ")
(1,first)
(12,_)
--> (4,last@yahoo.com)
(12," ")
(5 rows)
and the after-patch, fixed behavior:
test=> select ts_parse('default', ' first_last@yahoo.com ' );
ts_parse
--------------------------
(12," ")
--> (4,first_last@yahoo.com)
(12," ")
(3 rows)
I assume because this only expands the pattern space for email addresses
that there is no affect on binary upgrades with this patch. Is that
correct? Would an email address check on a binary-upgraded tsvector
index not match an email address with underscores? Do we need a warning
in the release notes about this?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
Upon seeing this patch I considered that I use addresses such as
alvherre+stuff@something.org and wondered how could this thing support
that. I don't think we want extra parser stuff just to add whatever
random junk we want to support in email addresses ...
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote:
Upon seeing this patch I considered that I use addresses such as
alvherre+stuff@something.org and wondered how could this thing support
that. I don't think we want extra parser stuff just to add whatever
random junk we want to support in email addresses ...
Well, I think the big question is whether we need to honor RFC 5322
(http://www.rfc-editor.org/rfc/rfc5322.txt). Wikipedia says these are
all valid characters:
http://en.wikipedia.org/wiki/E-mail_address
* Uppercase and lowercase English letters (a-z, A-Z)
* Digits 0 to 9
* Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
* Character . (dot, period, full stop) provided that it is not the
first or last character, and provided also that it does not appear two
or more times consecutively.
And we don't currently honor most of the special characters, including
plus:
test=> select ts_parse('default', ' first+last@yahoo.com ' );
ts_parse
--------------------
(12," ")
(1,first)
(12,+)
(4,last@yahoo.com)
(12," ")
(5 rows)
Where does this leave us? Do we add the other characters? Do we
document that we only allow a limited number of characters for email
addresses? What is the logic in that? Do any of these characters
conflict with our tsquery operators?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
Bruce Momjian <bruce@momjian.us> writes:
Well, I think the big question is whether we need to honor RFC 5322
(http://www.rfc-editor.org/rfc/rfc5322.txt). Wikipedia says these are
all valid characters:
* Uppercase and lowercase English letters (a-z, A-Z)
* Digits 0 to 9
* Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
* Character . (dot, period, full stop) provided that it is not the
first or last character, and provided also that it does not appear two
or more times consecutively.
That's an awful lot of special characters. For the RFC's purposes,
it's not hard to be flexible because in an email message there is
external context telling where to expect an address. I think if we
tried to allow all of those in email addresses in tsearch, we'd have
"email addresses" gobbling up a whole lot of adjacent text, to nobody's
benefit.
I can see the case for adding "+" because that's fairly common as Alvaro
notes, but I think we should be very circumspect about going farther.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Well, I think the big question is whether we need to honor RFC 5322
(http://www.rfc-editor.org/rfc/rfc5322.txt). Wikipedia says these are
all valid characters:* Uppercase and lowercase English letters (a-z, A-Z)
* Digits 0 to 9
* Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
* Character . (dot, period, full stop) provided that it is not the
first or last character, and provided also that it does not appear two
or more times consecutively.That's an awful lot of special characters. For the RFC's purposes,
it's not hard to be flexible because in an email message there is
external context telling where to expect an address. I think if we
tried to allow all of those in email addresses in tsearch, we'd have
"email addresses" gobbling up a whole lot of adjacent text, to nobody's
benefit.I can see the case for adding "+" because that's fairly common as Alvaro
notes, but I think we should be very circumspect about going farther.
OK, I can add '+' using Teodor's patch as a guide, and document which
characters we support, and that we don't support all of them. What
about the binary upgrade issue? I am now worried that maybe we should
back out the patch and just document our restrictions.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
On Mar 12, 2010, at 5:18 PM, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Well, I think the big question is whether we need to honor RFC 5322
(http://www.rfc-editor.org/rfc/rfc5322.txt). Wikipedia says these are
all valid characters:* Uppercase and lowercase English letters (a-z, A-Z)
* Digits 0 to 9
* Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
* Character . (dot, period, full stop) provided that it is not the
first or last character, and provided also that it does not appear two
or more times consecutively.That's an awful lot of special characters. For the RFC's purposes,
it's not hard to be flexible because in an email message there is
external context telling where to expect an address. I think if we
tried to allow all of those in email addresses in tsearch, we'd have
"email addresses" gobbling up a whole lot of adjacent text, to nobody's
benefit.I can see the case for adding "+" because that's fairly common as Alvaro
notes, but I think we should be very circumspect about going farther.
I've been working with recognizing email addresses in text for
years, with many millions of documents processed. Recognizing
them in text is a very different problem to validating them or sanitizing
them. Using the RFC spec to match things that "might be an email
address" isn't a great idea in the wild, so +1 on the circumspect.
I've found that /[a-z0-9_][^<\"@\\s]{0,80})@/ is good at finding local parts
of "real" email addresses in free text in the wild, without getting being
too prone to grab things that just look vaguely like email addresses. Obviously
there are some things it'll match that aren't email addresses, and some
email addresses it won't match, but for indexing it's been really pretty
good when combined with a good regex for domain parts[1]([a-z0-9_][^<\"@\\s]{0,80})@([a-z0-9._-]{0,252}\\.(?:[a-z]{2}|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|travel|jobs|mobi|tel|cat).
Cheers,
Steve
[1]: ([a-z0-9_][^<\"@\\s]{0,80})@([a-z0-9._-]{0,252}\\.(?:[a-z]{2}|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|travel|jobs|mobi|tel|cat)
(Before you point out all the ways that differs from the RFC specs for
an email address, yes, I know, but that's the point. Real world usage
is not the same as RFC spec.) [2]This is the simplified version - the full version is marginally more selective, at the expense of being much more complex.
[2]: This is the simplified version - the full version is marginally more selective, at the expense of being much more complex.
selective, at the expense of being much more complex.
Bruce Momjian <bruce@momjian.us> writes:
OK, I can add '+' using Teodor's patch as a guide, and document which
characters we support, and that we don't support all of them. What
about the binary upgrade issue? I am now worried that maybe we should
back out the patch and just document our restrictions.
The tsearch stuff is supposedly designed to be flexible about that.
It's not really any different from changing your tsearch configuration
midstream. You might not get matches that you would have liked to get,
but there's not any internal inconsistency.
regards, tom lane
Steve Atkins wrote:
On Mar 12, 2010, at 5:18 PM, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Well, I think the big question is whether we need to honor RFC 5322
(http://www.rfc-editor.org/rfc/rfc5322.txt). Wikipedia says these are
all valid characters:* Uppercase and lowercase English letters (a-z, A-Z)
* Digits 0 to 9
* Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
* Character . (dot, period, full stop) provided that it is not the
first or last character, and provided also that it does not appear two
or more times consecutively.That's an awful lot of special characters. For the RFC's purposes,
it's not hard to be flexible because in an email message there is
external context telling where to expect an address. I think if we
tried to allow all of those in email addresses in tsearch, we'd have
"email addresses" gobbling up a whole lot of adjacent text, to nobody's
benefit.I can see the case for adding "+" because that's fairly common as Alvaro
notes, but I think we should be very circumspect about going farther.I've been working with recognizing email addresses in text for
years, with many millions of documents processed. Recognizing
them in text is a very different problem to validating them or sanitizing
them. Using the RFC spec to match things that "might be an email
address" isn't a great idea in the wild, so +1 on the circumspect.I've found that /[a-z0-9_][^<\"@\\s]{0,80})@/ is good at finding local parts
of "real" email addresses in free text in the wild, without getting being
too prone to grab things that just look vaguely like email addresses. Obviously
there are some things it'll match that aren't email addresses, and some
email addresses it won't match, but for indexing it's been really pretty
good when combined with a good regex for domain parts[1].
OK, based on your experience, I think we have gone far enough by
allowing underscores. I have applied the attached patch to document
what symbols we do allow.
Just for thrills, I want to point out that even the description is not
accurate. Look what happens when a dash follows an underscore:
test=> select ts_parse('default', ' a-b_c@yahoo.com ' );
ts_parse
---------------------
(12," ")
(4,a-b_c@yahoo.com)
(12," ")
(3 rows)
test=> select ts_parse('default', ' a-b-_c@yahoo.com ' );
ts_parse
-----------------
(12," ")
(16,a-b)
(11,a)
(12,-)
(11,b)
(12,-_)
(4,c@yahoo.com)
(12," ")
(8 rows)
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do
Attachments:
/rtmp/difftext/x-diffDownload
Index: doc/src/sgml/textsearch.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/textsearch.sgml,v
retrieving revision 1.53
diff -c -c -r1.53 textsearch.sgml
*** doc/src/sgml/textsearch.sgml 14 Aug 2009 14:53:20 -0000 1.53
--- doc/src/sgml/textsearch.sgml 13 Mar 2010 03:03:24 -0000
***************
*** 1943,1948 ****
--- 1943,1955 ----
languages, token types <literal>word</> and <literal>asciiword</>
should be treated alike.
</para>
+
+ <para>
+ <literal>email</> does not support all valid email characters as
+ defined by RFC 5322. Specifically, the only non-alphanumeric
+ characters supported for email user names are period, dash, and
+ underscore.
+ </para>
</note>
<para>
Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
OK, I can add '+' using Teodor's patch as a guide, and document which
characters we support, and that we don't support all of them. What
about the binary upgrade issue? I am now worried that maybe we should
back out the patch and just document our restrictions.The tsearch stuff is supposedly designed to be flexible about that.
It's not really any different from changing your tsearch configuration
midstream. You might not get matches that you would have liked to get,
but there's not any internal inconsistency.
Glad you are not worried. ;-)
What concerns me is if someone created a tsvector index or stored
tsvector output in a table using the old rules, a new query might not
match the binary-upgraded tsvector stored values.
You are right that internally it is fine.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do