to_date() validation
Hi all,
Well, it's been a long time coming, but I've finally got a patch to
improve the error handling of to_date() and to_timestamp(), as
previously discussed on -hackers. [1]http://archives.postgresql.org/pgsql-hackers/2007-02/msg00915.php[2]http://archives.postgresql.org/message-id/37ed240d0707170747p4f5c26ffx63fff2b5750c62e5@mail.gmail.com
The patch is against HEAD, and compiles cleanly and passes all
regression tests on gentoo amd64.
I did some testing to see whether the extra error checks had an
adverse effect on performance. It turns out that my patch actually
performs materially *better*, somehow. About 14.5% better over 1M
calls to to_date in a single statement. Go figure (C compiler
optimisations are completely arcane to me).
What the patch does
====
With this patch I've tried to make DCH_from_char a bit smarter with
regard to parsing and validation of the user input. The goal here is
that if the user makes a mistake and enters something invalid,
to_date() will throw an meaningful error instead of just acting like
everything is okay and then producing an absurd answer.
I've modified the code to error out usefully in a few different scenarios:
* Non-integer value for an integer field
* Out-of-range value for an integer field
* Not enough characters in the input string to provide for all the
format fields
* Two conflicting values given for a field
* Illegal mixture of date conventions (ISO week date and Gregorian)
I've included new regression tests to check that these errors are
occurring when appropriate.
How I went about it
====
In order to detect an illegal mixture of Gregorian and ISO week date
conventions, the code needs to know which fomatting fields are
Gregorian, which are ISO WD and which are date-convention-neutral. To
get this working I added a new enum called FromCharDateMode, and added
a field to both KeyWord and TmFromChar to hold the date mode. In
KeyWord, it tells you which convention the keyword belongs to (YYYY is
Gregorian, IYYY is ISO WD and HH24 has nothing to do with dates). In
TmFromChar it keeps track of which convention is currently in use.
Because TmFromChar now tracks the date mode independently, it is not
necessary to maintain separate fields for the ISO WD values, so
'iyear' and 'iw' are removed.
In order to get some consistent treatment on pulling integers out of a
string and stuffing them into a TmFromChar, I had to refactor
DCH_from_char somewhat. The function is basically a giant switch
statement, with branches for each of the formatting fields. In HEAD
these branches are highly duplicative. The integer fields all perform
a sscanf() on the source string, putting the result straight into the
TmFromChar struct, and then skipping over the number of characters
consumed by sscanf(). I moved this logic into functions:
* from_char_parse_int() uses strtol() rather than sscanf() to parse
an integer from the string, and produces meaningful errors if
something is wrong with the input,
* from_char_set_int() saves an integer into a TmFromChar field, but
throws an error if the field has already been set to some other
non-zero value.
What the patch doesn't do
====
It doesn't consider to_number() at all. I don't have any interest in
the number formatting stuff.
There is plenty of room to educate do_to_timestamp() further about
dates; it's still very naive. Currently you can do something like
to_date('2008-50', 'YYYY-MM') and it won't even bat an eyelid. It'll
just advance the date by 50 months. I suppose you could consider that
a bug or a feature, depending on your point of view. Personally, I
think of it as a bug. If you're giving a month value outside of 1-12
to to_date(), chances are good that it's a user error in the query,
not a deliberate attempt to perform some date arithmetic.
do_to_timestamp() also doesn't try to detect logical mistakes like
'YYYY-MM-DD DDD' where the DDD part conflicts with the MM-DD part. I
have some thoughts about how I would improve on that, but I think it's
best left for a separate patch.
The patch has been added to the September commitfest. Thanks for your time.
Cheers,
BJ
[1]: http://archives.postgresql.org/pgsql-hackers/2007-02/msg00915.php
[2]: http://archives.postgresql.org/message-id/37ed240d0707170747p4f5c26ffx63fff2b5750c62e5@mail.gmail.com
Attachments:
On Fri, Aug 29, 2008 at 7:39 PM, Brendan Jurd <direvus@gmail.com> wrote:
Hi all,
Well, it's been a long time coming, but I've finally got a patch to
improve the error handling of to_date() and to_timestamp(), as
previously discussed on -hackers. [1][2]
BTW -patches is obsolete just submit pathces to -hackers.
Im just following this:
http://wiki.postgresql.org/wiki/Reviewing_a_Patch so lets get started.
Submission review: Everything looks good. Tests seem reasonable patch
applies etc.
Usability review: I think both linked threads in the parent mail give
sufficient justification.
Feature test: Everything seems to work as advertised. However before
via sscanf() most limited the max length of the input. Before they
were just silently ignored now they get added to the result:
patched:
# SELECT to_timestamp('11111', 'HHMM');
to_timestamp
------------------------------
0009-03-19 11:00:00-06:59:56
8.3.3:
# SELECT to_timestamp('11111', 'HHMM');
to_timestamp
---------------------------
0001-11-01 11:00:00-07 BC
Performance review: simple pgbench of to_timestamp did not show any
noticeable differences
Coding review: seemed fine to me, code matched surrounding style,
comments made sense etc
Code review: one minor nit
*** a/src/backend/utils/adt/formatting.c
--- b/src/backend/utils/adt/formatting.c
***************
*** 781,787 **** static const KeyWord DCH_keywords[] = {
{"y", 1, DCH_Y, TRUE, FROM_CHAR_DATE_GREGORIAN},
/* last */
! {NULL, 0, 0, 0}
};
/* ----------
--- 781,787 ----
{"y", 1, DCH_Y, TRUE, FROM_CHAR_DATE_GREGORIAN},
/* last */
! {NULL, 0, 0, 0, 0}
};
/* ----------
***************
On Sun, Sep 7, 2008 at 5:58 AM, Alex Hunsaker <badalex@gmail.com> wrote:
Im just following this:
http://wiki.postgresql.org/wiki/Reviewing_a_Patch so lets get started.
Hi Alex. Thanks for taking the time to review my patch.
Feature test: Everything seems to work as advertised. However before
via sscanf() most limited the max length of the input. Before they
were just silently ignored now they get added to the result:patched:
# SELECT to_timestamp('11111', 'HHMM');
to_timestamp
------------------------------
0009-03-19 11:00:00-06:59:568.3.3:
# SELECT to_timestamp('11111', 'HHMM');
to_timestamp
---------------------------
0001-11-01 11:00:00-07 BC
It's an interesting point. I had considered what would happen if the
input string was too short, but hadn't given much thought to what
would happen if it was too long.
In your example case, it isn't clear whether the user wanted to
specify 11 hours and 11 months (and the final '1' is just junk), or if
they really wanted to specify 11 hours and 111 months.
But consider a case like the following:
# SELECT to_date('07-09-20008', 'DD-MM-YYYY');
The documentation says that 'YYYY' is "4 and more digits", so you have
to assume that the user meant to say the year 20,008. That's why the
code in the patch parses all the remaining digits in the input string
on the final node.
HEAD actually gets this one wrong; in defiance of the documentation it
returns 2000-09-07. So, it seems to me that the patch shifts the
behaviour in the right direction.
Barring actually teaching the code that some nodes (like YYYY) can
take an open-ended number of characters, while others (like MM) must
take an exact number of characters, I'm not sure what can be done to
improve this. Perhaps something for a later patch?
Code review: one minor nit *** a/src/backend/utils/adt/formatting.c --- b/src/backend/utils/adt/formatting.c *************** *** 781,787 **** static const KeyWord DCH_keywords[] = { {"y", 1, DCH_Y, TRUE, FROM_CHAR_DATE_GREGORIAN},/* last */
! {NULL, 0, 0, 0}
};/* ---------- --- 781,787 ---- {"y", 1, DCH_Y, TRUE, FROM_CHAR_DATE_GREGORIAN},/* last */
! {NULL, 0, 0, 0, 0}
};
Ah, thank you for picking that up. I will correct this and send in a
new patch version in the next 24 hours.
Cheers,
BJ
On Mon, Sep 8, 2008 at 6:24 PM, Brendan Jurd <direvus@gmail.com> wrote:
On Sun, Sep 7, 2008 at 5:58 AM, Alex Hunsaker <badalex@gmail.com> wrote:
Code review: one minor nit *** a/src/backend/utils/adt/formatting.c --- b/src/backend/utils/adt/formatting.c *************** *** 781,787 **** static const KeyWord DCH_keywords[] = { {"y", 1, DCH_Y, TRUE, FROM_CHAR_DATE_GREGORIAN},/* last */
! {NULL, 0, 0, 0}
};/* ---------- --- 781,787 ---- {"y", 1, DCH_Y, TRUE, FROM_CHAR_DATE_GREGORIAN},/* last */
! {NULL, 0, 0, 0, 0}
};Ah, thank you for picking that up. I will correct this and send in a
new patch version in the next 24 hours.
New version attached (and link added to wiki).
Cheers,
BJ
Attachments:
On Mon, Sep 08, 2008 at 06:24:14PM +1000, Brendan Jurd wrote:
On Sun, Sep 7, 2008 at 5:58 AM, Alex Hunsaker <badalex@gmail.com> wrote:
Im just following this:
http://wiki.postgresql.org/wiki/Reviewing_a_Patch so lets get started.Hi Alex. Thanks for taking the time to review my patch.
I actually had a look at this patch also, though not as thoroughly as
Alex. There was one part that I had some thoughts about in from_char_parse_int_len:
! /*
! * We need to pull exactly the number of characters given in 'len' out
! * of the string, and convert those.
! */
<snip>
! first = palloc(len + 1);
! strncpy(first, init, len);
! first[len] = '\0';
The use of palloc/pfree in this routine seems excessive. Does len have
upper bound? If so a simple array will do it.
! if (strlen(first) < len)
Here you check the length of the remaining string and complain if it's
too short, but:
<snip>
! result = strtol(first, &last, 10);
! *src += (last - first);
Here you do not note if we didn't convert the entire string. So it
seems you are allowed to provide too few characters, as long as it's
not the last field?
Other than that it looks like a good patch.
Have a ncie day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Show quoted text
Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.
On Tue, Sep 9, 2008 at 7:29 PM, Martijn van Oosterhout
<kleptog@svana.org> wrote:
I actually had a look at this patch also, though not as thoroughly as
Alex. There was one part that I had some thoughts about in from_char_parse_int_len:
Hi Martijn. Thanks for your comments.
The use of palloc/pfree in this routine seems excessive. Does len have
upper bound? If so a simple array will do it.
There isn't any *theoretical* upper bound on len. However, in
practice, with the current set of formatting nodes, the largest len
that will ever be passed to rom_char_parse_int_len() is 6 (for the
microsecond 'US' node).
I suppose I could define a constant FORMATNODE_MAX_LEN, make it
something like 10 and just use that for copying the string, rather
than palloc(). I'll give it a try.
! if (strlen(first) < len)
Here you check the length of the remaining string and complain if it's
too short, but:<snip>
! result = strtol(first, &last, 10);
! *src += (last - first);Here you do not note if we didn't convert the entire string. So it
seems you are allowed to provide too few characters, as long as it's
not the last field?
That's true. The only way to hit that condition would be to provide a
semi-bogus value in a string with no separators between the numbers.
For example:
postgres=# SELECT to_date('20081o13', 'YYYYMMDD');
ERROR: invalid value for "DD" in source string
DETAIL: Value must be an integer.
What happens here is that strtol() happily consumes the '1' for the
format node MM, and as you point out it does not complain that it
expected to consume two characters and only got one. On the next node
(DD), the function tries to start parsing an integer, but the first
character it encounters is 'o', so it freaks out.
Certainly the error message here should be more apropos, and tell the
user that the problem is in the MM node. Blaming the DD node is pure
red herring.
However, if the mistake is at the end of the string, there is no error at all:
postgres=# SELECT to_date('2008101!', 'YYYYMMDD');
to_date
------------
2008-10-01
This is because the end of the string counts as a "separator", so we
just run an unbounded strtol() on whatever characters remain in the
string.
As discussed in my response to Alex's review, I made the end of the
string a separator so that things like 'DD-MM-YYYY' would actually
work for years with more than four digits.
Now I'm wondering if that was the wrong way to go. The case for years
with more than four digits is extremely narrow, and if somebody really
wanted to parse '01-01-20008' as 1 Jan 20,008 they could always use
the 'FM' modifier to get the behaviour they want ('DD-MM-FMYYYY').
I'll send in a new version which addresses these issues.
Cheers,
BJ
On Tue, Sep 9, 2008 at 9:04 PM, Brendan Jurd <direvus@gmail.com> wrote:
On Tue, Sep 9, 2008 at 7:29 PM, Martijn van Oosterhout
<kleptog@svana.org> wrote:The use of palloc/pfree in this routine seems excessive. Does len have
upper bound? If so a simple array will do it.I suppose I could define a constant FORMATNODE_MAX_LEN, make it
something like 10 and just use that for copying the string, rather
than palloc(). I'll give it a try.
Turns out there was already a relevant constant defined in
formatting.c: DCH_MAX_ITEM_SIZ, set to 9. So I just used that, +1 for
the trailing null.
Here you do not note if we didn't convert the entire string. So it
seems you are allowed to provide too few characters, as long as it's
not the last field?That's true. The only way to hit that condition would be to provide a
semi-bogus value in a string with no separators between the numbers.
I've now plugged this hole. I added the following error for
fixed-width inputs that are too short:
postgres=# SELECT to_date('200%1010', 'YYYYMMDD');
ERROR: invalid value for "YYYY" in source string
DETAIL: Field requires 4 characters, but only 3 could be parsed.
HINT: If your source string is not fixed-width, try using the "FM" modifier.
I've attached a new version of the patch (version 3), as well as an
incremental patch to show the differences between versions 2 and 3.
Cheers,
BJ
Attachments:
to-date-validation-2-to-3.diffapplication/octet-stream; name=to-date-validation-2-to-3.diffDownload
*** src/backend/utils/adt/formatting.c
--- src/backend/utils/adt/formatting.c
***************
*** 1679,1685 **** is_next_separator(FormatNode *n)
n++;
if (n->type == NODE_TYPE_END)
! return TRUE;
if (n->type == NODE_TYPE_ACTION)
{
--- 1679,1685 ----
n++;
if (n->type == NODE_TYPE_END)
! return FALSE;
if (n->type == NODE_TYPE_ACTION)
{
***************
*** 1811,1820 **** from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node)
* We need to pull exactly the number of characters given in 'len' out
* of the string, and convert those.
*/
! char *first;
char *last;
- first = palloc(len + 1);
strncpy(first, init, len);
first[len] = '\0';
--- 1811,1820 ----
* We need to pull exactly the number of characters given in 'len' out
* of the string, and convert those.
*/
! char first[DCH_MAX_ITEM_SIZ + 1];
char *last;
+ int used;
strncpy(first, init, len);
first[len] = '\0';
***************
*** 1830,1838 **** from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node)
"using the \"FM\" modifier.")));
result = strtol(first, &last, 10);
! *src += (last - first);
! pfree(first);
}
if (result == 0 && *src == init)
--- 1830,1848 ----
"using the \"FM\" modifier.")));
result = strtol(first, &last, 10);
! used = last - first;
!
! if (used > 0 && used < len)
! ereport(ERROR,
! (errcode(ERRCODE_INVALID_DATETIME_FORMAT),
! errmsg("invalid value for \"%s\" in source string",
! node->key->name),
! errdetail("Field requires %d characters, but only %d "
! "could be parsed.", len, used),
! errhint("If your source string is not fixed-width, try "
! "using the \"FM\" modifier.")));
! *src += used;
}
if (result == 0 && *src == init)
*** src/test/regress/expected/timestamp.out
--- src/test/regress/expected/timestamp.out
***************
*** 1729,1744 **** SELECT '' AS to_timestamp_23, to_timestamp('19971', 'YYYYMMDD');
ERROR: source string too short for "MM" formatting field
DETAIL: Field requires 2 characters, but only 1 remain.
HINT: If your source string is not fixed-width, try using the "FM" modifier.
-- Value clobbering:
! SELECT '' AS to_timestamp_24, to_timestamp('1997-11-Jan-16', 'YYYY-MM-Mon-DD');
ERROR: conflicting value for "Mon" field in formatting string
DETAIL: This value contradicts a previous setting for the same field type.
-- Non-numeric input:
! SELECT '' AS to_timestamp_24, to_timestamp('199711xy', 'YYYYMMDD');
ERROR: invalid value for "DD" in source string
DETAIL: Value must be an integer.
-- Input that doesn't fit in an int:
! SELECT '' AS to_timestamp_24, to_timestamp('10000000000', 'YYYY');
ERROR: value for "YYYY" in source string is out of range
DETAIL: Value must be in the range -2147483648 to 2147483647.
SET DateStyle TO DEFAULT;
--- 1729,1749 ----
ERROR: source string too short for "MM" formatting field
DETAIL: Field requires 2 characters, but only 1 remain.
HINT: If your source string is not fixed-width, try using the "FM" modifier.
+ -- Insufficient digit characters for a single node:
+ SELECT '' AS to_timestamp_24, to_timestamp('19971)24', 'YYYYMMDD');
+ ERROR: invalid value for "MM" in source string
+ DETAIL: Field requires 2 characters, but only 1 could be parsed.
+ HINT: If your source string is not fixed-width, try using the "FM" modifier.
-- Value clobbering:
! SELECT '' AS to_timestamp_25, to_timestamp('1997-11-Jan-16', 'YYYY-MM-Mon-DD');
ERROR: conflicting value for "Mon" field in formatting string
DETAIL: This value contradicts a previous setting for the same field type.
-- Non-numeric input:
! SELECT '' AS to_timestamp_26, to_timestamp('199711xy', 'YYYYMMDD');
ERROR: invalid value for "DD" in source string
DETAIL: Value must be an integer.
-- Input that doesn't fit in an int:
! SELECT '' AS to_timestamp_27, to_timestamp('10000000000', 'FMYYYY');
ERROR: value for "YYYY" in source string is out of range
DETAIL: Value must be in the range -2147483648 to 2147483647.
SET DateStyle TO DEFAULT;
*** src/test/regress/sql/timestamp.sql
--- src/test/regress/sql/timestamp.sql
***************
*** 279,291 **** SELECT '' AS to_timestamp_22, to_timestamp('2005527', 'YYYYIWID');
-- Insufficient characters in the source string:
SELECT '' AS to_timestamp_23, to_timestamp('19971', 'YYYYMMDD');
-- Value clobbering:
! SELECT '' AS to_timestamp_24, to_timestamp('1997-11-Jan-16', 'YYYY-MM-Mon-DD');
-- Non-numeric input:
! SELECT '' AS to_timestamp_24, to_timestamp('199711xy', 'YYYYMMDD');
-- Input that doesn't fit in an int:
! SELECT '' AS to_timestamp_24, to_timestamp('10000000000', 'YYYY');
SET DateStyle TO DEFAULT;
--- 279,294 ----
-- Insufficient characters in the source string:
SELECT '' AS to_timestamp_23, to_timestamp('19971', 'YYYYMMDD');
+ -- Insufficient digit characters for a single node:
+ SELECT '' AS to_timestamp_24, to_timestamp('19971)24', 'YYYYMMDD');
+
-- Value clobbering:
! SELECT '' AS to_timestamp_25, to_timestamp('1997-11-Jan-16', 'YYYY-MM-Mon-DD');
-- Non-numeric input:
! SELECT '' AS to_timestamp_26, to_timestamp('199711xy', 'YYYYMMDD');
-- Input that doesn't fit in an int:
! SELECT '' AS to_timestamp_27, to_timestamp('10000000000', 'FMYYYY');
SET DateStyle TO DEFAULT;
On Mon, Sep 8, 2008 at 2:24 AM, Brendan Jurd <direvus@gmail.com> wrote:
HEAD actually gets this one wrong; in defiance of the documentation it
returns 2000-09-07. So, it seems to me that the patch shifts the
behaviour in the right direction.Barring actually teaching the code that some nodes (like YYYY) can
take an open-ended number of characters, while others (like MM) must
take an exact number of characters, I'm not sure what can be done to
improve this. Perhaps something for a later patch?
Sound good to me and I would probably argue that things like MM should
not be hard coded to take only 2 chars...
But then again to play devils advocate I can just as easily do things
like to_char(...) + '30 months'::interval;
On Tue, Sep 9, 2008 at 6:46 AM, Brendan Jurd <direvus@gmail.com> wrote:
On Tue, Sep 9, 2008 at 9:04 PM, Brendan Jurd <direvus@gmail.com> wrote:
On Tue, Sep 9, 2008 at 7:29 PM, Martijn van Oosterhout
<kleptog@svana.org> wrote:The use of palloc/pfree in this routine seems excessive. Does len have
upper bound? If so a simple array will do it.
Good catch Martijn!
I suppose I could define a constant FORMATNODE_MAX_LEN, make it
something like 10 and just use that for copying the string, rather
than palloc(). I'll give it a try.Turns out there was already a relevant constant defined in
formatting.c: DCH_MAX_ITEM_SIZ, set to 9. So I just used that, +1 for
the trailing null.
Cool.
Here you do not note if we didn't convert the entire string. So it
seems you are allowed to provide too few characters, as long as it's
not the last field?That's true. The only way to hit that condition would be to provide a
semi-bogus value in a string with no separators between the numbers.I've now plugged this hole. I added the following error for
fixed-width inputs that are too short:postgres=# SELECT to_date('200%1010', 'YYYYMMDD');
ERROR: invalid value for "YYYY" in source string
DETAIL: Field requires 4 characters, but only 3 could be parsed.
HINT: If your source string is not fixed-width, try using the "FM" modifier.
I think thats a big improvement.
I've attached a new version of the patch (version 3), as well as an
incremental patch to show the differences between versions 2 and 3.
I looked it over, looks good to me!
Show quoted text
Cheers,
BJ
"Brendan Jurd" <direvus@gmail.com> writes:
I've attached a new version of the patch (version 3), as well as an
incremental patch to show the differences between versions 2 and 3.
Applied with minimal modifications. I changed a couple of error
messages that didn't seem to meet the style guidelines, and I moved all
of the to_timestamp tests into horology.sql, rather than having
essentially duplicate tests in timestamp.sql and timestamptz.sql.
regards, tom lane
On Fri, Sep 12, 2008 at 3:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Applied with minimal modifications. I changed a couple of error
messages that didn't seem to meet the style guidelines,
Great, thanks Tom. And thanks again to Alex and Martijn for helping
me refine the patch.
and I moved all
of the to_timestamp tests into horology.sql, rather than having
essentially duplicate tests in timestamp.sql and timestamptz.sql.
Nice. That will make maintaining/extending the tests easier in future.
Cheers,
BJ