Finding bogus dates

Started by Scott Ribeover 19 years ago17 messagesgeneral
Jump to latest
#1Scott Ribe
scott_ribe@killerbytes.com

Suppose that I have a varchar column which contains dates, and some of them
are bogus, such as '1968-02-31'. Is it possible to specify a query condition
"cannot be cast to a valid date". (It's legacy data, and I'm trying to clean
it up before importing into the final schema.) There probably aren't that
many, and I can just keep hitting pg errors one row at a time until I work
through this, but...

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice

#2Adam Rich
adam.r@sbcglobal.net
In reply to: Scott Ribe (#1)
Re: Finding bogus dates

I suppose you could create a boolean function that does a cast, and catches the
execption, returning NULL.

If that doesn't work, a perl stored procedure using Date::Calc and check_date() ?

----- Original Message ----
From: Scott Ribe <scott_ribe@killerbytes.com>
To: pgsql-general postgresql.org <pgsql-general@postgresql.org>
Sent: Thursday, January 18, 2007 12:26:03 PM
Subject: [GENERAL] Finding bogus dates

Suppose that I have a varchar column which contains dates, and some of them
are bogus, such as '1968-02-31'. Is it possible to specify a query condition
"cannot be cast to a valid date". (It's legacy data, and I'm trying to clean
it up before importing into the final schema.) There probably aren't that
many, and I can just keep hitting pg errors one row at a time until I work
through this, but...

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

#3Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Scott Ribe (#1)
Re: Finding bogus dates

Suppose that I have a varchar column which contains dates, and some of them
are bogus, such as '1968-02-31'. Is it possible to specify a query condition
"cannot be cast to a valid date". (It's legacy data, and I'm trying to clean
it up before importing into the final schema.) There probably aren't that
many, and I can just keep hitting pg errors one row at a time until I work
through this, but...

I guess you can use something like this procedure:

create or replace function is_date(d varchar) returns boolean as $$
declare
tmp date;
begin
tmp := d::date;
return true;
exception
when others then
return false;
end;
$$ language plpgsql;

#4Scott Ribe
scott_ribe@killerbytes.com
In reply to: Adam Rich (#2)
Re: Finding bogus dates

I suppose you could create a boolean function that does a cast, and catches
the
execption, returning NULL.

Yes, I was puzzling over query syntax and didn't think about a function.
That would be fine.

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice

#5A. Kretschmer
andreas.kretschmer@schollglas.com
In reply to: Scott Ribe (#1)
Re: Finding bogus dates

am Thu, dem 18.01.2007, um 11:26:03 -0700 mailte Scott Ribe folgendes:

Suppose that I have a varchar column which contains dates, and some of them
are bogus, such as '1968-02-31'. Is it possible to specify a query condition
"cannot be cast to a valid date". (It's legacy data, and I'm trying to clean
it up before importing into the final schema.) There probably aren't that
many, and I can just keep hitting pg errors one row at a time until I work
through this, but...

You can write a function for this, read this chapter in the docu about
trapping errors:

http://www.postgresql.org/docs/8.2/interactive/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING

Andreas
--
Andreas Kretschmer
Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: -> Header)
GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net

#6Brandon Aiken
BAiken@winemantech.com
In reply to: Scott Ribe (#1)
Re: Finding bogus dates

Add a date field to the table.

Run UPDATE "foo" SET "newDate" = to_date("oldDate","MM/DD/YYYY").

Bad dates like 02/31/2006 will be converted to sane dates. 02/31/2006
--> 03/03/2006.

Now run SELECT * FROM "foo" WHERE to_char("newDate","MM/DD/YYYY") <>
"oldDate". If a date got changed for sanity reasons, it'll be
different.

That should get most of 'em.

--
Brandon Aiken
CS/IT Systems Engineer
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Scott Ribe
Sent: Thursday, January 18, 2007 1:26 PM
To: pgsql-general postgresql.org
Subject: [GENERAL] Finding bogus dates

Suppose that I have a varchar column which contains dates, and some of
them
are bogus, such as '1968-02-31'. Is it possible to specify a query
condition
"cannot be cast to a valid date". (It's legacy data, and I'm trying to
clean
it up before importing into the final schema.) There probably aren't
that
many, and I can just keep hitting pg errors one row at a time until I
work
through this, but...

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

#7Scott Ribe
scott_ribe@killerbytes.com
In reply to: Brandon Aiken (#6)
Re: Finding bogus dates

I didn't know to_date would do that. It's better anyway. I just continued
with the "fix and try again" approach and they're only 2 bad dates out
94,000+, so I don't have a huge problem here. I can try to do some research
and find the correct date, but failing that, the to_date approximation is
probably no worse than using null.

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice

#8Brandon Aiken
BAiken@winemantech.com
In reply to: Scott Ribe (#7)
Re: Finding bogus dates

Actually, now that I think about it a second you can find them really
easy just by doing:

SELECT * FROM "foo"
WHERE to_char(to_date("oldDate",'MM/DD/YYYY'),'MM/DD/YYYY') <>
"oldDate";

--
Brandon Aiken
CS/IT Systems Engineer

-----Original Message-----
From: Scott Ribe [mailto:scott_ribe@killerbytes.com]
Sent: Thursday, January 18, 2007 3:48 PM
To: Brandon Aiken; pgsql-general postgresql.org
Subject: Re: [GENERAL] Finding bogus dates

I didn't know to_date would do that. It's better anyway. I just
continued
with the "fix and try again" approach and they're only 2 bad dates out
94,000+, so I don't have a huge problem here. I can try to do some
research
and find the correct date, but failing that, the to_date approximation
is
probably no worse than using null.

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice

#9Scott Ribe
scott_ribe@killerbytes.com
In reply to: Brandon Aiken (#6)
Re: Finding bogus dates

Bad dates like 02/31/2006 will be converted to sane dates.

Actually, that's the core of the direct query!

select * from foo where to_date(olddate,'YYYY-MM-DD')::text <> olddate;

The format of the exported dates matches the default date::text format
because I specified the export that way. If not, the query would be a little
more complicated by having to specify the format, but the same basic idea
still would work.

Which is good, since I'm now on to a table that has 108 bogus dates. I'd
hate to find & fix them one at a time...

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice

#10John D. Burger
john@mitre.org
In reply to: Scott Ribe (#9)
Re: Finding bogus dates

Scott Ribe wrote:

Actually, that's the core of the direct query!

select * from foo where to_date(olddate,'YYYY-MM-DD')::text <>
olddate;

The format of the exported dates matches the default date::text format
because I specified the export that way. If not, the query would be
a little
more complicated by having to specify the format, but the same
basic idea
still would work.

But this won't work if one had a text column of dates in various
formats, right? You could use a disjunction with lots of formats in
your query - or better yet, put all the formats you can think of in a
temp table and join against it:

select * from foo
where not exists (select format from dateformats
where to_char(to_date(olddate, format), format) = olddate);

But the date casting code seems to be even more general, or at least
seems to know about many more formats than I'd be likely to dream up
on my own.

- John D. Burger
MITRE

#11Scott Ribe
scott_ribe@killerbytes.com
In reply to: John D. Burger (#10)
Re: Finding bogus dates

But this won't work if one had a text column of dates in various
formats, right?

Right. In my case I have bad data from a source I didn't control, exported
via code that I do control which happens to output YYYY-MM-DD. Well, except
that I don't do what I need to when MM or DD are more than 2 digits, but I'm
going back to look at that again ;-)

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice

#12Ron Johnson
ron.l.johnson@cox.net
In reply to: Scott Ribe (#11)
PG not rejecting bad dates (was Re: Finding bogus dates)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/18/07 17:22, Scott Ribe wrote:

But this won't work if one had a text column of dates in various
formats, right?

Right. In my case I have bad data from a source I didn't control, exported
via code that I do control which happens to output YYYY-MM-DD. Well, except
that I don't do what I need to when MM or DD are more than 2 digits, but I'm
going back to look at that again ;-)

Why didn't the PG engine reject these bad-date records at INSERT
time. This smacks of something that MySQL would do...

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFFsAX+S9HxQb37XmcRAg9NAJ4mdQXdp0tLDpiLcTaOVVrZgUMdSACgjTti
9iVMxYAWNIKOwtfm3T38Aac=
=Rpbe
-----END PGP SIGNATURE-----

#13David Fetter
david@fetter.org
In reply to: Ron Johnson (#12)
Re: PG not rejecting bad dates (was Re: Finding bogus dates)

On Thu, Jan 18, 2007 at 05:42:54PM -0600, Ron Johnson wrote:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/18/07 17:22, Scott Ribe wrote:

But this won't work if one had a text column of dates in various
formats, right?

Right. In my case I have bad data from a source I didn't control, exported
via code that I do control which happens to output YYYY-MM-DD. Well, except
that I don't do what I need to when MM or DD are more than 2 digits, but I'm
going back to look at that again ;-)

Why didn't the PG engine reject these bad-date records at INSERT
time. This smacks of something that MySQL would do...

I'm pretty sure it didn't accept these as bad dates, but as text
strings. As you point out, it's a MySQLism to take "we are all here
to go into space" as a valid date.

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Skype: davidfetter

Remember to vote!

#14Chad Wagner
chad.wagner@gmail.com
In reply to: Ron Johnson (#12)
Re: PG not rejecting bad dates (was Re: Finding bogus dates)

On 1/18/07, Ron Johnson <ron.l.johnson@cox.net> wrote:

Right. In my case I have bad data from a source I didn't control,

exported

via code that I do control which happens to output YYYY-MM-DD. Well,

except

that I don't do what I need to when MM or DD are more than 2 digits, but

I'm

going back to look at that again ;-)

Why didn't the PG engine reject these bad-date records at INSERT
time. This smacks of something that MySQL would do...

The original poster mentioned that the data type that the "date" was stored
in was a varchar, not really much it can do there if don't use the right
data type :(.

Not to mention how misleading it probably is to use a varchar for a data to
the optimizer for calculating selectivity.

--
Chad
http://www.postgresqlforums.com/

#15Ron Johnson
ron.l.johnson@cox.net
In reply to: David Fetter (#13)
Re: PG not rejecting bad dates (was Re: Finding bogus dates)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/18/07 17:52, David Fetter wrote:

On Thu, Jan 18, 2007 at 05:42:54PM -0600, Ron Johnson wrote:

On 01/18/07 17:22, Scott Ribe wrote:

But this won't work if one had a text column of dates in various
formats, right?

Right. In my case I have bad data from a source I didn't control, exported
via code that I do control which happens to output YYYY-MM-DD. Well, except
that I don't do what I need to when MM or DD are more than 2 digits, but I'm
going back to look at that again ;-)

Why didn't the PG engine reject these bad-date records at INSERT
time. This smacks of something that MySQL would do...

I'm pretty sure it didn't accept these as bad dates, but as text
strings. As you point out, it's a MySQLism to take "we are all here
to go into space" as a valid date.

Ah, the relevant snippet from OP:

I have a varchar column

That would explain everything. Except why it's a VARCHAR instead of
DATE. But that's a whole 'nother discussion.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFFsAtRS9HxQb37XmcRArUSAJ9p519CtEa10tHeOGmr83lXoaRZ1ACcDpl5
VOHvbmhM+kiSARaXKPg5ZfU=
=jz33
-----END PGP SIGNATURE-----

#16David Fetter
david@fetter.org
In reply to: Ron Johnson (#15)
Re: PG not rejecting bad dates (was Re: Finding bogus dates)

On Thu, Jan 18, 2007 at 06:05:37PM -0600, Ron Johnson wrote:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/18/07 17:52, David Fetter wrote:

On Thu, Jan 18, 2007 at 05:42:54PM -0600, Ron Johnson wrote:

On 01/18/07 17:22, Scott Ribe wrote:

But this won't work if one had a text column of dates in various
formats, right?

Right. In my case I have bad data from a source I didn't control, exported
via code that I do control which happens to output YYYY-MM-DD. Well, except
that I don't do what I need to when MM or DD are more than 2 digits, but I'm
going back to look at that again ;-)

Why didn't the PG engine reject these bad-date records at INSERT
time. This smacks of something that MySQL would do...

I'm pretty sure it didn't accept these as bad dates, but as text
strings. As you point out, it's a MySQLism to take "we are all here
to go into space" as a valid date.

Ah, the relevant snippet from OP:

I have a varchar column

That would explain everything. Except why it's a VARCHAR instead of
DATE. But that's a whole 'nother discussion.

As I understood OP, it's a staging table :)

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Skype: davidfetter

Remember to vote!

#17Scott Ribe
scott_ribe@killerbytes.com
In reply to: David Fetter (#16)
Re: PG not rejecting bad dates (was Re: Finding bogus

That would explain everything. Except why it's a VARCHAR instead of
DATE. But that's a whole 'nother discussion.

As I understood OP, it's a staging table :)

Right. And it's exactly because the original source has bogus data that I
need a staging table to load it up and study it and fix it.

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice