Select all invalid e-mail addresses

Started by Andrusover 20 years ago29 messagesgeneral
Jump to latest
#1Andrus
eetasoft@online.ee

I have a database of e-mail addresses.

I want to select the email addresses which are not valid:

do not contain exactly one @ character,
contain ; > < " ' , characters or spaces etc.

What is the WHERE clause for this ?

Andrus.

#2Dann Corbit
DCorbit@connx.com
In reply to: Andrus (#1)
Re: Select all invalid e-mail addresses

This might be handy:
http://www.databasejournal.com/img/email_val.sql

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
owner@postgresql.org] On Behalf Of Andrus
Sent: Wednesday, October 19, 2005 11:12 AM
To: pgsql-general@postgresql.org
Subject: [GENERAL] Select all invalid e-mail addresses

I have a database of e-mail addresses.

I want to select the email addresses which are not valid:

do not contain exactly one @ character,
contain ; > < " ' , characters or spaces etc.

What is the WHERE clause for this ?

Andrus.

---------------------------(end of

broadcast)---------------------------

Show quoted text

TIP 6: explain analyze is your friend

#3Edmund Bacon
ebacon-xlii@onesystem.com
In reply to: Andrus (#1)
Re: Select all invalid e-mail addresses

"Andrus" <eetasoft@online.ee> writes:

I have a database of e-mail addresses.

I want to select the email addresses which are not valid:

do not contain exactly one @ character,
contain ; > < " ' , characters or spaces etc.

What is the WHERE clause for this ?

There was a thread here not so long ago about matching valid email addresses.
It's not so simple. You probably want to do a regex match - e.g.

select ... where email ~ '<regex>'

However the regex for all valid email possibilities is *VERY* complex.
see: http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

You should probably search the maillist archives. ISTR that there were
some suggestions on how one might simplify the search space.

#4Guy Rouillier
guyr@masergy.com
In reply to: Edmund Bacon (#3)
Re: Select all invalid e-mail addresses

Andrus wrote:

I have a database of e-mail addresses.

I want to select the email addresses which are not valid:

do not contain exactly one @ character,
contain ; > < " ' , characters or spaces etc.

What is the WHERE clause for this ?

Please see a long, detailed thread in the archives titled "Email
Verification Regular Expression" on Sept 7, 2005.

--
Guy Rouillier

#5Michael Fuhr
mike@fuhr.org
In reply to: Andrus (#1)
Re: Select all invalid e-mail addresses

On Wed, Oct 19, 2005 at 09:12:16PM +0300, Andrus wrote:

I want to select the email addresses which are not valid:

do not contain exactly one @ character,
contain ; > < " ' , characters or spaces etc.

The rules that define a valid email address are more complex than
most people realize, and even if an address is syntactically valid
that doesn't mean it's valid in the sense that you can deliver mail
to it. Whatever method you end up using, be sure to understand its
limitations.

One possibility would be to write a plperlu function that uses the
Email::Valid module. Here's a trivial example; see the Email::Valid
documentation to learn about its full capabilities:

CREATE FUNCTION is_valid_email(text) RETURNS boolean AS $$
use Email::Valid;
return Email::Valid->address($_[0]) ? "true" : "false";
$$ LANGUAGE plperlu IMMUTABLE STRICT;

You could then do something like:

SELECT * FROM foo WHERE NOT is_valid_email(email_address);

Again, be aware that passing this or any other test doesn't necessarily
mean that an address is truly valid -- it's just an attempt to identify
addresses that are obviously bogus.

--
Michael Fuhr

#6Hannes Dorbath
light@theendofthetunnel.de
In reply to: Michael Fuhr (#5)
Re: Select all invalid e-mail addresses

On 19.10.2005 21:18, Michael Fuhr wrote:

One possibility would be to write a plperlu function that uses the
Email::Valid module. Here's a trivial example; see the Email::Valid
documentation to learn about its full capabilities:

..and if you don't mind installing pl/php, you can use this function:

http://hannes.imos.net/validmail.html

It performs a MX-lookup, which IMHO is the best way to check for validity.

--
Regards,
Hannes Dorbath

#7Doug McNaught
doug@mcnaught.org
In reply to: Hannes Dorbath (#6)
Re: Select all invalid e-mail addresses

Hannes Dorbath <light@theendofthetunnel.de> writes:

On 19.10.2005 21:18, Michael Fuhr wrote:

One possibility would be to write a plperlu function that uses the
Email::Valid module. Here's a trivial example; see the Email::Valid
documentation to learn about its full capabilities:

..and if you don't mind installing pl/php, you can use this function:

http://hannes.imos.net/validmail.html

It performs a MX-lookup, which IMHO is the best way to check for validity.

But that's expensive and slow, and doesn't tell you whether the user
part of the address is valid (and in general, there's no way to
determine that short of actually sending a message). So what's the
point?

-Doug

#8Hannes Dorbath
light@theendofthetunnel.de
In reply to: Hannes Dorbath (#6)
Re: Select all invalid e-mail addresses

On 20.10.2005 14:00, Douglas McNaught wrote:

But that's expensive and slow

Sure, that isn't meant to be used in a WHERE condition on a 100k row
table.. more to be bound via check constraint on a user table, so
incomming data is validated.

and doesn't tell you whether the user part of the address is valid (and in general, there's no way to
determine that short of actually sending a message). So what's the
point?

The point is to validate as good as possible, and as you said yourself,
the user part can't be validated further.

--
Regards,
Hannes Dorbath

#9Andrus
eetasoft@online.ee
In reply to: Guy Rouillier (#4)
Re: Select all invalid e-mail addresses

""Guy Rouillier"" <guyr@masergy.com> wrote in message
news:CC1CF380F4D70844B01D45982E671B239E8BE9@mtxexch01.add0.masergy.com...

Andrus wrote:

I have a database of e-mail addresses.

I want to select the email addresses which are not valid:

do not contain exactly one @ character,
contain ; > < " ' , characters or spaces etc.

What is the WHERE clause for this ?

Please see a long, detailed thread in the archives titled "Email
Verification Regular Expression" on Sept 7, 2005.

Guy Rouillier,

thank you.

I have emails in CHARACTER(60) type columns in database, total 3000 emails.
I need to check email addresses for most frequent typos before send.
I have only plpgsql language installed, no perl.

From this thread I got the regular expression

/^[^@]*@(?:[^@]*\.)?[a-z0-9-_]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro)$/How I can use this in where clause ? I havent never used regular expressionsin Postgres.How to exclude top-level domain names from this regex ?Andrus.

#10Brian Mathis
brian.mathis@gmail.com
In reply to: Doug McNaught (#7)
Re: Select all invalid e-mail addresses

On 10/20/05, Douglas McNaught <doug@mcnaught.org> wrote:

It performs a MX-lookup, which IMHO is the best way to check for

validity.

But that's expensive and slow, and doesn't tell you whether the user
part of the address is valid (and in general, there's no way to
determine that short of actually sending a message). So what's the
point?

-Doug

That's why I think the better term for this is "well formed". "Validity" can
only be determined by sending to it, but you can tell if an address at least
conforms to the specs programmatically. In the end, when talking about a
"valid" address in this context, that is what most people are talking about.

The point is to weed out malformed email addresses, just like you would
enforce any other data formatting standards in other types of data.

#11Andrew Sullivan
ajs@crankycanuck.ca
In reply to: Andrus (#9)
Re: Select all invalid e-mail addresses

On Thu, Oct 20, 2005 at 06:10:40PM +0300, Andrus wrote:

From this thread I got the regular expression

[snipped]

Note that that regular expression, which appears to be validating
TLDs as well, is incredibly fragile. John Klensin has actually
written an RFC about this very problem. Among other problems, what
do you do when a country code ceases to be? (There's a similar
problem that the naming bodies struggke with from time to time.)

I suggest that if you want to validate TLDs, you pull them off when
you write the data in your database, and use a lookup table to make
sure they're valid (you can keep the table up to date regularly by
checking the official IANA registry for them). At least that way you
don't have to change a regex every time ICANN decides to add another
TLD. (The regex is wrong anyway, I think: it doesn't have .mobi,
which has been announced although isn't taking registrations yet, and
it doesn't appear to have arpa, either.)

A

--
Andrew Sullivan | ajs@crankycanuck.ca
Information security isn't a technological problem. It's an economics
problem.
--Bruce Schneier

#12Andrew Sullivan
ajs@crankycanuck.ca
In reply to: Brian Mathis (#10)
Re: Select all invalid e-mail addresses

On Thu, Oct 20, 2005 at 11:22:25AM -0400, Brian Mathis wrote:

That's why I think the better term for this is "well formed". "Validity" can
only be determined by sending to it, but you can tell if an address at least

In fact, it can only be determined by sending to it over and over
again, because whether a mail address works may change over time (and
may have nothing to do with the poor schmuck whose email
administrators don't know how to spell "MX record").

A

--
Andrew Sullivan | ajs@crankycanuck.ca
This work was visionary and imaginative, and goes to show that visionary
and imaginative work need not end up well.
--Dennis Ritchie

#13Andrus
eetasoft@online.ee
In reply to: Guy Rouillier (#4)
Re: Select all invalid e-mail addresses

"Andrew Sullivan" <ajs@crankycanuck.ca> wrote in message

I suggest that if you want to validate TLDs, you pull them off when
you write the data in your database, and use a lookup table to make
sure they're valid (you can keep the table up to date regularly by
checking the official IANA registry for them). At least that way you
don't have to change a regex every time ICANN decides to add another
TLD. (The regex is wrong anyway, I think: it doesn't have .mobi,
which has been announced although isn't taking registrations yet, and
it doesn't appear to have arpa, either.)

Andrew, thank you.

I understand now that I do'nt want to validate TLDs at all.

I have an existing database of e-mail addresses. Those addesses are copied
from letters so they contain < > chars, points, commas etc. stupid
characters.
Sometimes two email addresses are copied to this field (contains two @
sings, spaces or commas). Sometimes web addresses starting with www. and
without @ are present in email column.
I want simply to allow user to view those addresses and make manual
corrections before starting large mailing session in night.

How to write a WHERE clause which selects e-mail addresses which
are surely wrong ?

Andrus.

Show quoted text

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

#14Steve Atkins
steve@blighty.com
In reply to: Andrew Sullivan (#11)
Re: Select all invalid e-mail addresses

On Thu, Oct 20, 2005 at 11:52:40AM -0400, Andrew Sullivan wrote:

On Thu, Oct 20, 2005 at 06:10:40PM +0300, Andrus wrote:

From this thread I got the regular expression

[snipped]

Note that that regular expression, which appears to be validating
TLDs as well, is incredibly fragile. John Klensin has actually
written an RFC about this very problem. Among other problems, what
do you do when a country code ceases to be? (There's a similar
problem that the naming bodies struggke with from time to time.)

I suggest that if you want to validate TLDs, you pull them off when
you write the data in your database, and use a lookup table to make
sure they're valid (you can keep the table up to date regularly by
checking the official IANA registry for them). At least that way you
don't have to change a regex every time ICANN decides to add another
TLD.

You need to maintain the data, certainly. To argue that it must
be in a table to be maintained is, well, wrong. My preference would
be to keep it in a table and regenerate the regex periodically, and
in the application layer I do exactly that, but to try and do that
in a check constraint would be painful. A cleaner approach would
be to have a regex that checks for general syntax and extracts the
TLD, which is then compared to a lookup table, perhaps, but that
adds a lot of complexity for no real benefit.

(The regex is wrong anyway, I think: it doesn't have .mobi,
which has been announced although isn't taking registrations yet, and
it doesn't appear to have arpa, either.)

While there are valid deliverable email addresses in .arpa, you really
don't want to be accepting them from end users...

Cheers,
Steve

#15Steve Atkins
steve@blighty.com
In reply to: Andrus (#13)
Re: Select all invalid e-mail addresses

On Thu, Oct 20, 2005 at 09:54:13PM +0300, Andrus wrote:

"Andrew Sullivan" <ajs@crankycanuck.ca> wrote in message

I suggest that if you want to validate TLDs, you pull them off when
you write the data in your database, and use a lookup table to make
sure they're valid (you can keep the table up to date regularly by
checking the official IANA registry for them). At least that way you
don't have to change a regex every time ICANN decides to add another
TLD. (The regex is wrong anyway, I think: it doesn't have .mobi,
which has been announced although isn't taking registrations yet, and
it doesn't appear to have arpa, either.)

Andrew, thank you.

I understand now that I do'nt want to validate TLDs at all.

I have an existing database of e-mail addresses. Those addesses are copied
from letters so they contain < > chars, points, commas etc. stupid
characters.
Sometimes two email addresses are copied to this field (contains two @
sings, spaces or commas). Sometimes web addresses starting with www. and
without @ are present in email column.
I want simply to allow user to view those addresses and make manual
corrections before starting large mailing session in night.

How to write a WHERE clause which selects e-mail addresses which
are surely wrong ?

... WHERE email !~ '...insert previously mentioned regex here...';

Cheers,
Steve

#16Andrew Sullivan
ajs@crankycanuck.ca
In reply to: Andrus (#13)
Re: Select all invalid e-mail addresses

On Thu, Oct 20, 2005 at 09:54:13PM +0300, Andrus wrote:

How to write a WHERE clause which selects e-mail addresses which
are surely wrong ?

Then I think the validating function someone else sent here
(<http://www.databasejournal.com/img/email_val.sql&gt;) is a good start.
You probably want the opposite behaviour -- emailinvalidate(), I
guess -- but that seems like a good "obviously wrong" tester. It
might not be fast, though -- that loop at the special character
check looks pretty painful.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
The fact that technology doesn't work is no bar to success in the marketplace.
--Philip Greenspun

#17Dann Corbit
DCorbit@connx.com
In reply to: Andrew Sullivan (#16)
Re: Select all invalid e-mail addresses

Interesting article:
http://coveryourasp.com/ValidateEmail.asp

See also:
http://search.cpan.org/~cwest/Email-Address-1.80/lib/Email/Address.pm
http://www.faqs.org/rfcs/rfc2822.html
http://docs.python.org/lib/module-rfc822.html

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
owner@postgresql.org] On Behalf Of Steve Atkins
Sent: Thursday, October 20, 2005 12:35 PM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Select all invalid e-mail addresses

On Thu, Oct 20, 2005 at 11:52:40AM -0400, Andrew Sullivan wrote:

On Thu, Oct 20, 2005 at 06:10:40PM +0300, Andrus wrote:

From this thread I got the regular expression

[snipped]

Note that that regular expression, which appears to be validating
TLDs as well, is incredibly fragile. John Klensin has actually
written an RFC about this very problem. Among other problems, what
do you do when a country code ceases to be? (There's a similar
problem that the naming bodies struggke with from time to time.)

I suggest that if you want to validate TLDs, you pull them off when
you write the data in your database, and use a lookup table to make
sure they're valid (you can keep the table up to date regularly by
checking the official IANA registry for them). At least that way

you

don't have to change a regex every time ICANN decides to add another
TLD.

You need to maintain the data, certainly. To argue that it must
be in a table to be maintained is, well, wrong. My preference would
be to keep it in a table and regenerate the regex periodically, and
in the application layer I do exactly that, but to try and do that
in a check constraint would be painful. A cleaner approach would
be to have a regex that checks for general syntax and extracts the
TLD, which is then compared to a lookup table, perhaps, but that
adds a lot of complexity for no real benefit.

(The regex is wrong anyway, I think: it doesn't have .mobi,
which has been announced although isn't taking registrations yet,

and

it doesn't appear to have arpa, either.)

While there are valid deliverable email addresses in .arpa, you really
don't want to be accepting them from end users...

Cheers,
Steve

---------------------------(end of

broadcast)---------------------------

Show quoted text

TIP 4: Have you searched our list archives?

http://archives.postgresql.org

#18Andrew Sullivan
ajs@crankycanuck.ca
In reply to: Steve Atkins (#14)
Re: Select all invalid e-mail addresses

On Thu, Oct 20, 2005 at 12:34:39PM -0700, Steve Atkins wrote:

While there are valid deliverable email addresses in .arpa, you really
don't want to be accepting them from end users...

You know, as someone who has been bitten hundreds of times by the
decision of some application designer who thought s/he knew better
than I what my email address could possibly be, I respectfully submit
that you're mistaken. We call it a bug when other databases accept
dates like '0000-00-00'; but we'd just as surely call it a bug if
PostgreSQL refused to accept valid leap year dates or leap seconds.
It's one thing to say you should not accept known-bad data; it's
quite another to refuse data that is improbable but nevertheless
perfectly good.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
I remember when computers were frustrating because they *did* exactly what
you told them to. That actually seems sort of quaint now.
--J.D. Baldwin

#19Andrus
eetasoft@online.ee
In reply to: Guy Rouillier (#4)
Re: Select all invalid e-mail addresses

How to write a WHERE clause which selects e-mail addresses which
are surely wrong ?

... WHERE email !~ '...insert previously mentioned regex here...';

Steve,

thank you.

I tried

SELECT email FROM customer
WHERE email !~
'/^[^@]*@(?:[^@]*\.)?[a-z0-9-_]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro)$/'

but got an error

ERROR: invalid regular expression: invalid character range

I'm using

"PostgreSQL 8.0.2 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) 3.4.2
(mingw-special)"

Andrus.

#20Michael Fuhr
mike@fuhr.org
In reply to: Andrus (#19)
Re: Select all invalid e-mail addresses

On Fri, Oct 21, 2005 at 08:15:23PM +0300, Andrus wrote:

I tried

SELECT email FROM customer
WHERE email !~
'/^[^@]*@(?:[^@]*\.)?[a-z0-9-_]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro)$/'

but got an error

ERROR: invalid regular expression: invalid character range

Aside from the fact that this regular expression is semantically wrong,
it has a few other problems:

* A hyphen (-) must come first or last in a character class if you want
it interpreted literally instead of as part of a range specification.

test=> SELECT 'abc' ~ '[a-z0-9-_]'; -- WRONG
ERROR: invalid regular expression: invalid character range

test=> SELECT 'abc' ~ '[a-z0-9_-]';
?column?
----------
t
(1 row)

* Regular expressions in PostgreSQL don't use delimiters like / at
the beginning and end of the expression.

test=> SELECT 'abc' ~ '/abc/'; -- WRONG
?column?
----------
f
(1 row)

test=> SELECT 'abc' ~ 'abc';
?column?
----------
t
(1 row)

* If you use single quotes around the regular expression then you
need to escape backslashes that should be part of the regular
expression; otherwise the backslash will be parsed by the string
parser before the string is used as a regular expression and you'll
get unexpected results. In other words, there's an extra layer of
string parsing that you have to allow for. In 8.0 and later you
can avoid this by using dollar quotes.

test=> SELECT 'abc' ~ 'a\.c'; -- WRONG
?column?
----------
t
(1 row)

test=> SELECT 'abc' ~ 'a\\.c';
?column?
----------
f
(1 row)

test=> SELECT 'a.c' ~ 'a\\.c';
?column?
----------
t
(1 row)

test=> SELECT 'abc' ~ $$a\.c$$;
?column?
----------
f
(1 row)

test=> SELECT 'a.c' ~ $$a\.c$$;
?column?
----------
t
(1 row)

--
Michael Fuhr

#21Steve Atkins
steve@blighty.com
In reply to: Andrus (#19)
#22Michael Fuhr
mike@fuhr.org
In reply to: Steve Atkins (#21)
#23Andrus
eetasoft@online.ee
In reply to: Guy Rouillier (#4)
#24Michael Fuhr
mike@fuhr.org
In reply to: Andrus (#23)
#25Andrus
eetasoft@online.ee
In reply to: Guy Rouillier (#4)
#26Michael Fuhr
mike@fuhr.org
In reply to: Andrus (#25)
#27Steve Atkins
steve@blighty.com
In reply to: Michael Fuhr (#26)
#28Andrus
eetasoft@online.ee
In reply to: Guy Rouillier (#4)
#29Scott Marlowe
smarlowe@g2switchworks.com
In reply to: Andrus (#28)