surprising to_timestamp behavior

Started by Robert Haasabout 12 years ago8 messages

robertmhaas@gmail.com

about 12 years ago

1 attachment(s)

It turns out that when you use the to_timestamp function, a space in
the format mask can result in skipping any character at all, even a
digit, in the input string. Consider this example, where 10 hours are
lost:

rhaas=# select to_timestamp('2013-10-29 10:47:18', 'YYYY-MM-DD HH24:MI:SS');
to_timestamp
------------------------
2013-10-29 00:47:18-04
(1 row)

There's already some code in the relevant function (DCH_from_char) to
skip whitespace characters, but that happens *in addition* to the
mandatory one-character skip. I suggest that we revamp the logic so
that we *either* skip whitespace *or* skip exactly one character from
the input string - but not both. That fixes this case, because the
first space in the mask matches the input space, and the second one is
allowed to match nothing instead of being forced to consume the "1"
following the space. See attached patch.

One problem with this is that it breaks the regression tests. Right
now, this works:

SELECT to_timestamp('My birthday-> Year: 1976, Month: May, Day: 16',
'"My birthday-> Year" YYYY, "Month:" FMMonth, "Day:" DD');

First, the quoted string matches exactly. Next, the subsequent space
in the format mask skips the colon in the input. And now it's trying
to match YYYY against " 1976", and it's happy to ignore that leading
space as part of decoding a numeric field. With the proposed patch,
we're no longer willing to match the colon in the input to the space
in the format mask, so it errors out, complaining that the colon
doesn't look like a number. I tend to think that's more correct than
what we're doing now.

The other regression test that fails is:

SELECT to_timestamp('15 "text between quote marks" 98 54 45',
E'HH24 "\\text between quote marks\\"" YY MI SS');

With the patch, this fails with:

ERROR: invalid value "rk" for "YY"

The problem is basically that we're quite happy to skip quoted text
that doesn't match at all, as long as it's the same length. And the
format string has extra cruft in it, so they're not, and we end up in
the wrong spot. Either I'm missing something, or our current behavior
here is nothing short of bizarre.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

to_timestamp_ws.patchtext/x-patch; charset=US-ASCII; name=to_timestamp_ws.patchDownload

diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 36353c3..d00b82c 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -2846,13 +2846,18 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out)
 	{
 		if (n->type != NODE_TYPE_ACTION)
 		{
-			s++;
-			/* Ignore spaces when not in FX (fixed width) mode */
+			/*
+			 * Unless we're in fixed-width mode, upon encountering a space
+			 * in the format string, skip 0 or more whitespace characters
+			 * from the input string.
+			 */
 			if (isspace((unsigned char) n->character) && !fx_mode)
 			{
 				while (*s != '\0' && isspace((unsigned char) *s))
 					s++;
 			}
+			else
+				s++;		/* Otherwise, skip exactly one character. */
 			continue;
 		}

Tom Lane

tgl@sss.pgh.pa.us

about 12 years ago

In reply to: Robert Haas (#1)

Re: surprising to_timestamp behavior

Robert Haas <robertmhaas@gmail.com> writes:

It turns out that when you use the to_timestamp function, a space in
the format mask can result in skipping any character at all, even a
digit, in the input string. Consider this example, where 10 hours are
lost:

rhaas=# select to_timestamp('2013-10-29 10:47:18', 'YYYY-MM-DD HH24:MI:SS');
to_timestamp
------------------------
2013-10-29 00:47:18-04
(1 row)

And that's a bug why? The format says to ignore two characters before the
hours field. I think you're proposing to remove important functionality.

To refine the point a bit, it's absolutely stupid to be using to_timestamp
at all for sane input data like this example. Just cast the string to
timestamp(tz), and the standard datatype input function will do a better
job than to_timestamp ever would. The point of to_timestamp, IMNSHO,
is to extract data successfully from weirdly formatted input; which might
well include cases where there are stray digits you don't want taken as
data. So I'm not on board with proposals to "fix" cases like this by
making the format string's meaning squishier.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Robert Haas

robertmhaas@gmail.com

about 12 years ago

In reply to: Tom Lane (#2)

Re: surprising to_timestamp behavior

On Tue, Oct 29, 2013 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

It turns out that when you use the to_timestamp function, a space in
the format mask can result in skipping any character at all, even a
digit, in the input string. Consider this example, where 10 hours are
lost:

rhaas=# select to_timestamp('2013-10-29 10:47:18', 'YYYY-MM-DD HH24:MI:SS');
to_timestamp
------------------------
2013-10-29 00:47:18-04
(1 row)

And that's a bug why? The format says to ignore two characters before the
hours field. I think you're proposing to remove important functionality.

To refine the point a bit, it's absolutely stupid to be using to_timestamp
at all for sane input data like this example. Just cast the string to
timestamp(tz), and the standard datatype input function will do a better
job than to_timestamp ever would. The point of to_timestamp, IMNSHO,
is to extract data successfully from weirdly formatted input; which might
well include cases where there are stray digits you don't want taken as
data. So I'm not on board with proposals to "fix" cases like this by
making the format string's meaning squishier.

Well, you're the second person to react that way to this proposal, but
the current behavior seems mighty odd to me - even odder, now that I
realize that we'll happily match '"cat'" to 'dog'. I just work here,
though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Jeevan Chalke

jeevan.chalke@enterprisedb.com

about 12 years ago

In reply to: Robert Haas (#3)

Re: surprising to_timestamp behavior

On Tue, Oct 29, 2013 at 11:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 29, 2013 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

It turns out that when you use the to_timestamp function, a space in
the format mask can result in skipping any character at all, even a
digit, in the input string. Consider this example, where 10 hours are
lost:

rhaas=# select to_timestamp('2013-10-29 10:47:18', 'YYYY-MM-DD

HH24:MI:SS');

to_timestamp
------------------------
2013-10-29 00:47:18-04
(1 row)

And that's a bug why? The format says to ignore two characters before

the

hours field. I think you're proposing to remove important functionality.

To refine the point a bit, it's absolutely stupid to be using

to_timestamp

at all for sane input data like this example. Just cast the string to
timestamp(tz), and the standard datatype input function will do a better
job than to_timestamp ever would. The point of to_timestamp, IMNSHO,
is to extract data successfully from weirdly formatted input; which might
well include cases where there are stray digits you don't want taken as
data. So I'm not on board with proposals to "fix" cases like this by
making the format string's meaning squishier.

Well, you're the second person to react that way to this proposal, but
the current behavior seems mighty odd to me - even odder, now that I
realize that we'll happily match '"cat'" to 'dog'. I just work here,
though.

Well, I agree with Tom that user provided two spaces to skip before hours
and this is what we are exactly doing.

Still here are few other observations:

(1) I don't see following as wrong output in postgresql as I already said
above and agreed with Tom. (in input, only one space between DD and HH24,
but
in mask we have 2 spaces)

postgres=# select to_timestamp('2011-03-18 23:38:15', 'YYYY-MM-DD
HH24:MI:SS');
to_timestamp
---------------------------
2011-03-18 03:38:15+05:30
(1 row)

(Note that, time 23 became 03, due to extra space in mask eating 2 in 23,
resulting in 3 for HH24. But fair enough, as expected and thus NO issues)

(2) But I see following buggy (both in input and mask we have 2 spaces
between DD and HH24)

postgres=# select to_timestamp('2011-03-18 23:38:15', 'YYYY-MM-DD
HH24:MI:SS');
to_timestamp
---------------------------
2011-03-18 03:38:15+05:30
(1 row)

(Note that, this time we should not end up with eating 2 from 23 as we have
exact spaces in mask and input. NOT so good and NOT expected, looks like
BUG)

So I think we need to resolve second case.

Thanks

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

--
Jeevan B Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Jeevan Chalke

jeevan.chalke@enterprisedb.com

almost 12 years ago

In reply to: Jeevan Chalke (#4)

Re: [BUGS] surprising to_timestamp behavior

On Thu, Oct 31, 2013 at 10:50 AM, Jeevan Chalke <
jeevan.chalke@enterprisedb.com> wrote:

On Tue, Oct 29, 2013 at 11:05 PM, Robert Haas <robertmhaas@gmail.com>wrote:

On Tue, Oct 29, 2013 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

It turns out that when you use the to_timestamp function, a space in
the format mask can result in skipping any character at all, even a
digit, in the input string. Consider this example, where 10 hours are
lost:

rhaas=# select to_timestamp('2013-10-29 10:47:18', 'YYYY-MM-DD

HH24:MI:SS');

to_timestamp
------------------------
2013-10-29 00:47:18-04
(1 row)

And that's a bug why? The format says to ignore two characters before

the

hours field. I think you're proposing to remove important

functionality.

To refine the point a bit, it's absolutely stupid to be using

to_timestamp

at all for sane input data like this example. Just cast the string to
timestamp(tz), and the standard datatype input function will do a better
job than to_timestamp ever would. The point of to_timestamp, IMNSHO,
is to extract data successfully from weirdly formatted input; which

might

well include cases where there are stray digits you don't want taken as
data. So I'm not on board with proposals to "fix" cases like this by
making the format string's meaning squishier.

Well, you're the second person to react that way to this proposal, but
the current behavior seems mighty odd to me - even odder, now that I
realize that we'll happily match '"cat'" to 'dog'. I just work here,
though.

Well, I agree with Tom that user provided two spaces to skip before hours
and this is what we are exactly doing.

Still here are few other observations:

(1) I don't see following as wrong output in postgresql as I already said
above and agreed with Tom. (in input, only one space between DD and HH24,
but
in mask we have 2 spaces)

postgres=# select to_timestamp('2011-03-18 23:38:15', 'YYYY-MM-DD
HH24:MI:SS');
to_timestamp
---------------------------
2011-03-18 03:38:15+05:30
(1 row)

(Note that, time 23 became 03, due to extra space in mask eating 2 in 23,
resulting in 3 for HH24. But fair enough, as expected and thus NO issues)

(2) But I see following buggy (both in input and mask we have 2 spaces
between DD and HH24)

postgres=# select to_timestamp('2011-03-18 23:38:15', 'YYYY-MM-DD
HH24:MI:SS');
to_timestamp
---------------------------
2011-03-18 03:38:15+05:30
(1 row)

(Note that, this time we should not end up with eating 2 from 23 as we have
exact spaces in mask and input. NOT so good and NOT expected, looks like
BUG)

So I think we need to resolve second case.

Attached patch which fixes this issue.

I have just tweaked the code around ignoring spaces in DCH_from_char()
function.

Adding to CommitFest 2014-01 (Open).

Thanks

Thanks

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

--
Jeevan B Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

--
Jeevan B Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 30589500

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb

This e-mail message (and any attachment) is intended for the use of the
individual or entity to whom it is addressed. This message contains
information from EnterpriseDB Corporation that may be privileged,
confidential, or exempt from disclosure under applicable law. If you are
not the intended recipient or authorized to receive this for the intended
recipient, any use, dissemination, distribution, retention, archiving, or
copying of this communication is strictly prohibited. If you have received
this e-mail in error, please notify the sender immediately by reply e-mail
and delete this message.

Tom Lane

tgl@sss.pgh.pa.us

almost 12 years ago

In reply to: Jeevan Chalke (#5)

Re: [HACKERS] surprising to_timestamp behavior

Jeevan Chalke <jeevan.chalke@enterprisedb.com> writes:

Attached patch which fixes this issue.

I have just tweaked the code around ignoring spaces in DCH_from_char()
function.

Adding to CommitFest 2014-01 (Open).

I went to review this, and found that there's not actually a patch
attached ...

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Jeevan Chalke

jeevan.chalke@enterprisedb.com

almost 12 years ago

In reply to: Tom Lane (#6)

1 attachment(s)

Re: [BUGS] surprising to_timestamp behavior

I went to review this, and found that there's not actually a patch
attached ...

regards, tom lane

Attached. Sorry for that.

--
Jeevan B Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Attachments:

fix_surprising_to_timestamp_behavior.patchtext/x-diff; charset=US-ASCII; name=fix_surprising_to_timestamp_behavior.patchDownload

diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 6ae426b..56fb359 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -2846,16 +2846,18 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out)
 	{
 		if (n->type != NODE_TYPE_ACTION)
 		{
+			/* Separator, consume one character from input string */
 			s++;
-			/* Ignore spaces when not in FX (fixed width) mode */
-			if (isspace((unsigned char) n->character) && !fx_mode)
-			{
-				while (*s != '\0' && isspace((unsigned char) *s))
-					s++;
-			}
 			continue;
 		}
 
+		/* Ignore spaces when not in FX (fixed width) mode */
+		if (!fx_mode)
+		{
+			while (*s != '\0' && isspace((unsigned char) *s))
+				s++;
+		}
+
 		from_char_set_mode(out, n->key->date_mode);
 
 		switch (n->key->id)
diff --git a/src/test/regress/expected/horology.out b/src/test/regress/expected/horology.out
index 87a6951..9ebc0f7 100644
--- a/src/test/regress/expected/horology.out
+++ b/src/test/regress/expected/horology.out
@@ -2965,3 +2965,76 @@ SELECT to_char('2012-12-12 12:00'::timestamptz, 'YYYY-MM-DD HH:MI:SS TZ');
 (1 row)
 
 RESET TIME ZONE;
+-- White spaces in input string, not matching with format string
+SELECT to_timestamp('2011-12-18 23:38:15', 'YYYY-MM-DD  HH24:MI:SS');
+         to_timestamp         
+------------------------------
+ Sun Dec 18 03:38:15 2011 PST
+(1 row)
+
+SELECT to_timestamp('2011-12-18  23:38:15', 'YYYY-MM-DD  HH24:MI:SS');
+         to_timestamp         
+------------------------------
+ Sun Dec 18 23:38:15 2011 PST
+(1 row)
+
+SELECT to_timestamp('2011-12-18   23:38:15', 'YYYY-MM-DD  HH24:MI:SS');
+         to_timestamp         
+------------------------------
+ Sun Dec 18 23:38:15 2011 PST
+(1 row)
+
+SELECT to_timestamp('2011-12-18  23:38:15', 'YYYY-MM-DD HH24:MI:SS');
+         to_timestamp         
+------------------------------
+ Sun Dec 18 23:38:15 2011 PST
+(1 row)
+
+SELECT to_timestamp('2011-12-18  23:38:15', 'YYYY-MM-DD  HH24:MI:SS');
+         to_timestamp         
+------------------------------
+ Sun Dec 18 23:38:15 2011 PST
+(1 row)
+
+SELECT to_timestamp('2011-12-18  23:38:15', 'YYYY-MM-DD   HH24:MI:SS');
+         to_timestamp         
+------------------------------
+ Sun Dec 18 03:38:15 2011 PST
+(1 row)
+
+SELECT to_date('2011 12  18', 'YYYY MM DD');
+  to_date   
+------------
+ 12-18-2011
+(1 row)
+
+SELECT to_date('2011 12  18', 'YYYY MM  DD');
+  to_date   
+------------
+ 12-18-2011
+(1 row)
+
+SELECT to_date('2011 12  18', 'YYYY MM   DD');
+  to_date   
+------------
+ 12-08-2011
+(1 row)
+
+SELECT to_date('2011 12 18', 'YYYY  MM DD');
+  to_date   
+------------
+ 02-18-2011
+(1 row)
+
+SELECT to_date('2011  12 18', 'YYYY  MM DD');
+  to_date   
+------------
+ 12-18-2011
+(1 row)
+
+SELECT to_date('2011   12 18', 'YYYY  MM DD');
+  to_date   
+------------
+ 12-18-2011
+(1 row)
+
diff --git a/src/test/regress/sql/horology.sql b/src/test/regress/sql/horology.sql
index fe9a520..19c1cf4 100644
--- a/src/test/regress/sql/horology.sql
+++ b/src/test/regress/sql/horology.sql
@@ -477,3 +477,20 @@ SELECT '2012-12-12 12:00 America/New_York'::timestamptz;
 SELECT to_char('2012-12-12 12:00'::timestamptz, 'YYYY-MM-DD HH:MI:SS TZ');
 
 RESET TIME ZONE;
+
+-- White spaces in input string, not matching with format string
+SELECT to_timestamp('2011-12-18 23:38:15', 'YYYY-MM-DD  HH24:MI:SS');
+SELECT to_timestamp('2011-12-18  23:38:15', 'YYYY-MM-DD  HH24:MI:SS');
+SELECT to_timestamp('2011-12-18   23:38:15', 'YYYY-MM-DD  HH24:MI:SS');
+
+SELECT to_timestamp('2011-12-18  23:38:15', 'YYYY-MM-DD HH24:MI:SS');
+SELECT to_timestamp('2011-12-18  23:38:15', 'YYYY-MM-DD  HH24:MI:SS');
+SELECT to_timestamp('2011-12-18  23:38:15', 'YYYY-MM-DD   HH24:MI:SS');
+
+SELECT to_date('2011 12  18', 'YYYY MM DD');
+SELECT to_date('2011 12  18', 'YYYY MM  DD');
+SELECT to_date('2011 12  18', 'YYYY MM   DD');
+
+SELECT to_date('2011 12 18', 'YYYY  MM DD');
+SELECT to_date('2011  12 18', 'YYYY  MM DD');
+SELECT to_date('2011   12 18', 'YYYY  MM DD');

Tom Lane

tgl@sss.pgh.pa.us

almost 12 years ago

In reply to: Jeevan Chalke (#7)

Re: [HACKERS] surprising to_timestamp behavior

Jeevan Chalke <jeevan.chalke@enterprisedb.com> writes:

I went to review this, and found that there's not actually a patch
attached ...

Attached. Sorry for that.

This looks good to me except for one thing: if the upcoming node is a
DCH_FX action (ie, turn on fx_mode), I don't think we want to skip
whitespace. The original coding had that bug too, but it was only
exposed if the FX prefix was immediately preceded by whitespace in
the format.

It's a bit hard to think of useful cases where this would make a
difference, because it turns out that from_char_parse_int contains
an internal skip-whitespace action that's applied regardless of FX
mode, so it really only matters for non-integer field types. I
kinda think that maybe now we should remove that extra skip, or at
least count the skipped whitespace against the field width, but
did not experiment to see what the results would be.

Committed with that change and some cosmetic adjustments.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs