Select all invalid e-mail addresses
I have a database of e-mail addresses.
I want to select the email addresses which are not valid:
do not contain exactly one @ character,
contain ; > < " ' , characters or spaces etc.
What is the WHERE clause for this ?
Andrus.
This might be handy:
http://www.databasejournal.com/img/email_val.sql
-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
owner@postgresql.org] On Behalf Of Andrus
Sent: Wednesday, October 19, 2005 11:12 AM
To: pgsql-general@postgresql.org
Subject: [GENERAL] Select all invalid e-mail addressesI have a database of e-mail addresses.
I want to select the email addresses which are not valid:
do not contain exactly one @ character,
contain ; > < " ' , characters or spaces etc.What is the WHERE clause for this ?
Andrus.
---------------------------(end of
broadcast)---------------------------
Show quoted text
TIP 6: explain analyze is your friend
Import Notes
Resolved by subject fallback
"Andrus" <eetasoft@online.ee> writes:
I have a database of e-mail addresses.
I want to select the email addresses which are not valid:
do not contain exactly one @ character,
contain ; > < " ' , characters or spaces etc.What is the WHERE clause for this ?
There was a thread here not so long ago about matching valid email addresses.
It's not so simple. You probably want to do a regex match - e.g.
select ... where email ~ '<regex>'
However the regex for all valid email possibilities is *VERY* complex.
see: http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html
You should probably search the maillist archives. ISTR that there were
some suggestions on how one might simplify the search space.
Andrus wrote:
I have a database of e-mail addresses.
I want to select the email addresses which are not valid:
do not contain exactly one @ character,
contain ; > < " ' , characters or spaces etc.What is the WHERE clause for this ?
Please see a long, detailed thread in the archives titled "Email
Verification Regular Expression" on Sept 7, 2005.
--
Guy Rouillier
Import Notes
Resolved by subject fallback
On Wed, Oct 19, 2005 at 09:12:16PM +0300, Andrus wrote:
I want to select the email addresses which are not valid:
do not contain exactly one @ character,
contain ; > < " ' , characters or spaces etc.
The rules that define a valid email address are more complex than
most people realize, and even if an address is syntactically valid
that doesn't mean it's valid in the sense that you can deliver mail
to it. Whatever method you end up using, be sure to understand its
limitations.
One possibility would be to write a plperlu function that uses the
Email::Valid module. Here's a trivial example; see the Email::Valid
documentation to learn about its full capabilities:
CREATE FUNCTION is_valid_email(text) RETURNS boolean AS $$
use Email::Valid;
return Email::Valid->address($_[0]) ? "true" : "false";
$$ LANGUAGE plperlu IMMUTABLE STRICT;
You could then do something like:
SELECT * FROM foo WHERE NOT is_valid_email(email_address);
Again, be aware that passing this or any other test doesn't necessarily
mean that an address is truly valid -- it's just an attempt to identify
addresses that are obviously bogus.
--
Michael Fuhr
On 19.10.2005 21:18, Michael Fuhr wrote:
One possibility would be to write a plperlu function that uses the
Email::Valid module. Here's a trivial example; see the Email::Valid
documentation to learn about its full capabilities:
..and if you don't mind installing pl/php, you can use this function:
http://hannes.imos.net/validmail.html
It performs a MX-lookup, which IMHO is the best way to check for validity.
--
Regards,
Hannes Dorbath
Hannes Dorbath <light@theendofthetunnel.de> writes:
On 19.10.2005 21:18, Michael Fuhr wrote:
One possibility would be to write a plperlu function that uses the
Email::Valid module. Here's a trivial example; see the Email::Valid
documentation to learn about its full capabilities:..and if you don't mind installing pl/php, you can use this function:
http://hannes.imos.net/validmail.html
It performs a MX-lookup, which IMHO is the best way to check for validity.
But that's expensive and slow, and doesn't tell you whether the user
part of the address is valid (and in general, there's no way to
determine that short of actually sending a message). So what's the
point?
-Doug
On 20.10.2005 14:00, Douglas McNaught wrote:
But that's expensive and slow
Sure, that isn't meant to be used in a WHERE condition on a 100k row
table.. more to be bound via check constraint on a user table, so
incomming data is validated.
and doesn't tell you whether the user part of the address is valid (and in general, there's no way to
determine that short of actually sending a message). So what's the
point?
The point is to validate as good as possible, and as you said yourself,
the user part can't be validated further.
--
Regards,
Hannes Dorbath
""Guy Rouillier"" <guyr@masergy.com> wrote in message
news:CC1CF380F4D70844B01D45982E671B239E8BE9@mtxexch01.add0.masergy.com...
Andrus wrote:
I have a database of e-mail addresses.
I want to select the email addresses which are not valid:
do not contain exactly one @ character,
contain ; > < " ' , characters or spaces etc.What is the WHERE clause for this ?
Please see a long, detailed thread in the archives titled "Email
Verification Regular Expression" on Sept 7, 2005.
Guy Rouillier,
thank you.
I have emails in CHARACTER(60) type columns in database, total 3000 emails.
I need to check email addresses for most frequent typos before send.
I have only plpgsql language installed, no perl.
From this thread I got the regular expression
/^[^@]*@(?:[^@]*\.)?[a-z0-9-_]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro)$/How I can use this in where clause ? I havent never used regular expressionsin Postgres.How to exclude top-level domain names from this regex ?Andrus.
On 10/20/05, Douglas McNaught <doug@mcnaught.org> wrote:
It performs a MX-lookup, which IMHO is the best way to check for
validity.
But that's expensive and slow, and doesn't tell you whether the user
part of the address is valid (and in general, there's no way to
determine that short of actually sending a message). So what's the
point?-Doug
That's why I think the better term for this is "well formed". "Validity" can
only be determined by sending to it, but you can tell if an address at least
conforms to the specs programmatically. In the end, when talking about a
"valid" address in this context, that is what most people are talking about.
The point is to weed out malformed email addresses, just like you would
enforce any other data formatting standards in other types of data.
On Thu, Oct 20, 2005 at 06:10:40PM +0300, Andrus wrote:
From this thread I got the regular expression
[snipped]
Note that that regular expression, which appears to be validating
TLDs as well, is incredibly fragile. John Klensin has actually
written an RFC about this very problem. Among other problems, what
do you do when a country code ceases to be? (There's a similar
problem that the naming bodies struggke with from time to time.)
I suggest that if you want to validate TLDs, you pull them off when
you write the data in your database, and use a lookup table to make
sure they're valid (you can keep the table up to date regularly by
checking the official IANA registry for them). At least that way you
don't have to change a regex every time ICANN decides to add another
TLD. (The regex is wrong anyway, I think: it doesn't have .mobi,
which has been announced although isn't taking registrations yet, and
it doesn't appear to have arpa, either.)
A
--
Andrew Sullivan | ajs@crankycanuck.ca
Information security isn't a technological problem. It's an economics
problem.
--Bruce Schneier
On Thu, Oct 20, 2005 at 11:22:25AM -0400, Brian Mathis wrote:
That's why I think the better term for this is "well formed". "Validity" can
only be determined by sending to it, but you can tell if an address at least
In fact, it can only be determined by sending to it over and over
again, because whether a mail address works may change over time (and
may have nothing to do with the poor schmuck whose email
administrators don't know how to spell "MX record").
A
--
Andrew Sullivan | ajs@crankycanuck.ca
This work was visionary and imaginative, and goes to show that visionary
and imaginative work need not end up well.
--Dennis Ritchie
"Andrew Sullivan" <ajs@crankycanuck.ca> wrote in message
I suggest that if you want to validate TLDs, you pull them off when
you write the data in your database, and use a lookup table to make
sure they're valid (you can keep the table up to date regularly by
checking the official IANA registry for them). At least that way you
don't have to change a regex every time ICANN decides to add another
TLD. (The regex is wrong anyway, I think: it doesn't have .mobi,
which has been announced although isn't taking registrations yet, and
it doesn't appear to have arpa, either.)
Andrew, thank you.
I understand now that I do'nt want to validate TLDs at all.
I have an existing database of e-mail addresses. Those addesses are copied
from letters so they contain < > chars, points, commas etc. stupid
characters.
Sometimes two email addresses are copied to this field (contains two @
sings, spaces or commas). Sometimes web addresses starting with www. and
without @ are present in email column.
I want simply to allow user to view those addresses and make manual
corrections before starting large mailing session in night.
How to write a WHERE clause which selects e-mail addresses which
are surely wrong ?
Andrus.
Show quoted text
---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match
On Thu, Oct 20, 2005 at 11:52:40AM -0400, Andrew Sullivan wrote:
On Thu, Oct 20, 2005 at 06:10:40PM +0300, Andrus wrote:
From this thread I got the regular expression
[snipped]
Note that that regular expression, which appears to be validating
TLDs as well, is incredibly fragile. John Klensin has actually
written an RFC about this very problem. Among other problems, what
do you do when a country code ceases to be? (There's a similar
problem that the naming bodies struggke with from time to time.)I suggest that if you want to validate TLDs, you pull them off when
you write the data in your database, and use a lookup table to make
sure they're valid (you can keep the table up to date regularly by
checking the official IANA registry for them). At least that way you
don't have to change a regex every time ICANN decides to add another
TLD.
You need to maintain the data, certainly. To argue that it must
be in a table to be maintained is, well, wrong. My preference would
be to keep it in a table and regenerate the regex periodically, and
in the application layer I do exactly that, but to try and do that
in a check constraint would be painful. A cleaner approach would
be to have a regex that checks for general syntax and extracts the
TLD, which is then compared to a lookup table, perhaps, but that
adds a lot of complexity for no real benefit.
(The regex is wrong anyway, I think: it doesn't have .mobi,
which has been announced although isn't taking registrations yet, and
it doesn't appear to have arpa, either.)
While there are valid deliverable email addresses in .arpa, you really
don't want to be accepting them from end users...
Cheers,
Steve
On Thu, Oct 20, 2005 at 09:54:13PM +0300, Andrus wrote:
"Andrew Sullivan" <ajs@crankycanuck.ca> wrote in message
I suggest that if you want to validate TLDs, you pull them off when
you write the data in your database, and use a lookup table to make
sure they're valid (you can keep the table up to date regularly by
checking the official IANA registry for them). At least that way you
don't have to change a regex every time ICANN decides to add another
TLD. (The regex is wrong anyway, I think: it doesn't have .mobi,
which has been announced although isn't taking registrations yet, and
it doesn't appear to have arpa, either.)Andrew, thank you.
I understand now that I do'nt want to validate TLDs at all.
I have an existing database of e-mail addresses. Those addesses are copied
from letters so they contain < > chars, points, commas etc. stupid
characters.
Sometimes two email addresses are copied to this field (contains two @
sings, spaces or commas). Sometimes web addresses starting with www. and
without @ are present in email column.
I want simply to allow user to view those addresses and make manual
corrections before starting large mailing session in night.How to write a WHERE clause which selects e-mail addresses which
are surely wrong ?
... WHERE email !~ '...insert previously mentioned regex here...';
Cheers,
Steve
On Thu, Oct 20, 2005 at 09:54:13PM +0300, Andrus wrote:
How to write a WHERE clause which selects e-mail addresses which
are surely wrong ?
Then I think the validating function someone else sent here
(<http://www.databasejournal.com/img/email_val.sql>) is a good start.
You probably want the opposite behaviour -- emailinvalidate(), I
guess -- but that seems like a good "obviously wrong" tester. It
might not be fast, though -- that loop at the special character
check looks pretty painful.
A
--
Andrew Sullivan | ajs@crankycanuck.ca
The fact that technology doesn't work is no bar to success in the marketplace.
--Philip Greenspun
Interesting article:
http://coveryourasp.com/ValidateEmail.asp
See also:
http://search.cpan.org/~cwest/Email-Address-1.80/lib/Email/Address.pm
http://www.faqs.org/rfcs/rfc2822.html
http://docs.python.org/lib/module-rfc822.html
-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
owner@postgresql.org] On Behalf Of Steve Atkins
Sent: Thursday, October 20, 2005 12:35 PM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Select all invalid e-mail addressesOn Thu, Oct 20, 2005 at 11:52:40AM -0400, Andrew Sullivan wrote:
On Thu, Oct 20, 2005 at 06:10:40PM +0300, Andrus wrote:
From this thread I got the regular expression
[snipped]
Note that that regular expression, which appears to be validating
TLDs as well, is incredibly fragile. John Klensin has actually
written an RFC about this very problem. Among other problems, what
do you do when a country code ceases to be? (There's a similar
problem that the naming bodies struggke with from time to time.)I suggest that if you want to validate TLDs, you pull them off when
you write the data in your database, and use a lookup table to make
sure they're valid (you can keep the table up to date regularly by
checking the official IANA registry for them). At least that way
you
don't have to change a regex every time ICANN decides to add another
TLD.You need to maintain the data, certainly. To argue that it must
be in a table to be maintained is, well, wrong. My preference would
be to keep it in a table and regenerate the regex periodically, and
in the application layer I do exactly that, but to try and do that
in a check constraint would be painful. A cleaner approach would
be to have a regex that checks for general syntax and extracts the
TLD, which is then compared to a lookup table, perhaps, but that
adds a lot of complexity for no real benefit.(The regex is wrong anyway, I think: it doesn't have .mobi,
which has been announced although isn't taking registrations yet,
and
it doesn't appear to have arpa, either.)
While there are valid deliverable email addresses in .arpa, you really
don't want to be accepting them from end users...Cheers,
Steve---------------------------(end of
broadcast)---------------------------
Show quoted text
TIP 4: Have you searched our list archives?
Import Notes
Resolved by subject fallback
On Thu, Oct 20, 2005 at 12:34:39PM -0700, Steve Atkins wrote:
While there are valid deliverable email addresses in .arpa, you really
don't want to be accepting them from end users...
You know, as someone who has been bitten hundreds of times by the
decision of some application designer who thought s/he knew better
than I what my email address could possibly be, I respectfully submit
that you're mistaken. We call it a bug when other databases accept
dates like '0000-00-00'; but we'd just as surely call it a bug if
PostgreSQL refused to accept valid leap year dates or leap seconds.
It's one thing to say you should not accept known-bad data; it's
quite another to refuse data that is improbable but nevertheless
perfectly good.
A
--
Andrew Sullivan | ajs@crankycanuck.ca
I remember when computers were frustrating because they *did* exactly what
you told them to. That actually seems sort of quaint now.
--J.D. Baldwin
How to write a WHERE clause which selects e-mail addresses which
are surely wrong ?... WHERE email !~ '...insert previously mentioned regex here...';
Steve,
thank you.
I tried
SELECT email FROM customer
WHERE email !~
'/^[^@]*@(?:[^@]*\.)?[a-z0-9-_]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro)$/'
but got an error
ERROR: invalid regular expression: invalid character range
I'm using
"PostgreSQL 8.0.2 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) 3.4.2
(mingw-special)"
Andrus.
On Fri, Oct 21, 2005 at 08:15:23PM +0300, Andrus wrote:
I tried
SELECT email FROM customer
WHERE email !~
'/^[^@]*@(?:[^@]*\.)?[a-z0-9-_]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro)$/'but got an error
ERROR: invalid regular expression: invalid character range
Aside from the fact that this regular expression is semantically wrong,
it has a few other problems:
* A hyphen (-) must come first or last in a character class if you want
it interpreted literally instead of as part of a range specification.
test=> SELECT 'abc' ~ '[a-z0-9-_]'; -- WRONG
ERROR: invalid regular expression: invalid character range
test=> SELECT 'abc' ~ '[a-z0-9_-]';
?column?
----------
t
(1 row)
* Regular expressions in PostgreSQL don't use delimiters like / at
the beginning and end of the expression.
test=> SELECT 'abc' ~ '/abc/'; -- WRONG
?column?
----------
f
(1 row)
test=> SELECT 'abc' ~ 'abc';
?column?
----------
t
(1 row)
* If you use single quotes around the regular expression then you
need to escape backslashes that should be part of the regular
expression; otherwise the backslash will be parsed by the string
parser before the string is used as a regular expression and you'll
get unexpected results. In other words, there's an extra layer of
string parsing that you have to allow for. In 8.0 and later you
can avoid this by using dollar quotes.
test=> SELECT 'abc' ~ 'a\.c'; -- WRONG
?column?
----------
t
(1 row)
test=> SELECT 'abc' ~ 'a\\.c';
?column?
----------
f
(1 row)
test=> SELECT 'a.c' ~ 'a\\.c';
?column?
----------
t
(1 row)
test=> SELECT 'abc' ~ $$a\.c$$;
?column?
----------
f
(1 row)
test=> SELECT 'a.c' ~ $$a\.c$$;
?column?
----------
t
(1 row)
--
Michael Fuhr