Issue in to_timestamp/to_date while handling the quoted literal string

Started by Suraj Kharageover 6 years ago3 messages
#1Suraj Kharage
suraj.kharage@enterprisedb.com
1 attachment(s)

Hi,

I noticed the issue in to_timestamp()/to_date() while handling the double
quote literal string. If any double quote literal characters found in
format, we generate the NODE_TYPE_CHAR in parse format and store that
actual character in FormatNode->character. n DCH_from_char, we just
increment the input string by length of character for NODE_TYPE_CHAR.
We are actually not matching these characters in input string and because
of this, date values get changed if quoted literal string is not identical
in input and format string.

e.g:

postgres@78619=#select to_timestamp('2019-05-24T23:12:45',
'yyyy-mm-dd"TT"hh24:mi:ss');
to_timestamp
---------------------------
2019-05-24 03:12:45+05:30
(1 row)

In above example, the quoted string is 'TT', so it just increment the input
string by 2 while handling these characters and returned the wrong hour
value.

My suggestion is to match the exact characters from quoted literal string
in input string and if doesn't match then throw an error.

Attached is the POC patch which almost works for all scenarios except for
whitespace - as a quote character.

Suggestions?
--
--

Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.

Attachments:

to_timestamp_quoted_string_POC.patchapplication/octet-stream; name=to_timestamp_quoted_string_POC.patchDownload
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index b3115e4..4fed87f 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -167,6 +167,7 @@ typedef struct
 #define NODE_TYPE_CHAR		3
 #define NODE_TYPE_SEPARATOR	4
 #define NODE_TYPE_SPACE		5
+#define NODE_TYPE_QUOTE_CHAR 6
 
 #define SUFFTYPE_PREFIX		1
 #define SUFFTYPE_POSTFIX	2
@@ -1325,7 +1326,7 @@ parse_format(FormatNode *node, const char *str, const KeyWord *kw,
 					if (*str == '\\' && *(str + 1))
 						str++;
 					chlen = pg_mblen(str);
-					n->type = NODE_TYPE_CHAR;
+					n->type = NODE_TYPE_QUOTE_CHAR;
 					memcpy(n->character, str, chlen);
 					n->character[chlen] = '\0';
 					n->key = NULL;
@@ -3070,6 +3071,20 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out)
 			}
 			continue;
 		}
+		else if(n->type == NODE_TYPE_QUOTE_CHAR)
+		{
+			int char_len = strlen(n->character);
+
+			if (strncmp(s, n->character, char_len) == 0)
+			{
+				s = s + char_len;
+				continue;
+			}
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("literal does not match the format string")));
+		}
 		else if (n->type != NODE_TYPE_ACTION)
 		{
 			/*
#2Brendan Jurd
direvus@gmail.com
In reply to: Suraj Kharage (#1)
Re: Issue in to_timestamp/to_date while handling the quoted literal string

Hi Suraj,

I think the documentation is reasonably clear about this behaviour, quote:

" In to_date, to_number, and to_timestamp, literal text and double-quoted
strings result in skipping the number of characters contained in the
string; for example "XX" skips two input characters (whether or not they
are XX)."

I can appreciate that this isn't the behaviour you intuitively expected
from to_timestamp, and I don't think you'd be the first or the last. The
purpose of these functions was never to validate that your input string
precisely matches the non-coding parts of your format pattern. For that, I
think you'd be better served by using regular expressions.

Just as an aside, in the example you gave, the string '2019-05-24T23:12:45'
will cast directly to timestamp just fine, so it isn't the kind of
situation to_timestamp was really intended for. It's more for when your
input string is in an obscure (or ambiguous) format that is known to you in
advance.

I hope that helps.

Cheers
Brendan

On Wed, 24 Jul 2019 at 21:38, Suraj Kharage <suraj.kharage@enterprisedb.com>
wrote:

Show quoted text

Hi,

I noticed the issue in to_timestamp()/to_date() while handling the double
quote literal string. If any double quote literal characters found in
format, we generate the NODE_TYPE_CHAR in parse format and store that
actual character in FormatNode->character. n DCH_from_char, we just
increment the input string by length of character for NODE_TYPE_CHAR.
We are actually not matching these characters in input string and because
of this, date values get changed if quoted literal string is not identical
in input and format string.

e.g:

postgres@78619=#select to_timestamp('2019-05-24T23:12:45',
'yyyy-mm-dd"TT"hh24:mi:ss');
to_timestamp
---------------------------
2019-05-24 03:12:45+05:30
(1 row)

In above example, the quoted string is 'TT', so it just increment the
input string by 2 while handling these characters and returned the wrong
hour value.

My suggestion is to match the exact characters from quoted literal string
in input string and if doesn't match then throw an error.

Attached is the POC patch which almost works for all scenarios except for
whitespace - as a quote character.

Suggestions?
--
--

Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.

#3Suraj Kharage
suraj.kharage@enterprisedb.com
In reply to: Brendan Jurd (#2)
Re: Issue in to_timestamp/to_date while handling the quoted literal string

Thanks for the clarification Brendan, that really helps.

On Wed, Jul 24, 2019 at 6:24 PM Brendan Jurd <direvus@gmail.com> wrote:

Hi Suraj,

I think the documentation is reasonably clear about this behaviour, quote:

" In to_date, to_number, and to_timestamp, literal text and double-quoted
strings result in skipping the number of characters contained in the
string; for example "XX" skips two input characters (whether or not they
are XX)."

I can appreciate that this isn't the behaviour you intuitively expected
from to_timestamp, and I don't think you'd be the first or the last. The
purpose of these functions was never to validate that your input string
precisely matches the non-coding parts of your format pattern. For that, I
think you'd be better served by using regular expressions.

Just as an aside, in the example you gave, the string
'2019-05-24T23:12:45' will cast directly to timestamp just fine, so it
isn't the kind of situation to_timestamp was really intended for. It's
more for when your input string is in an obscure (or ambiguous) format that
is known to you in advance.

I hope that helps.

Cheers
Brendan

On Wed, 24 Jul 2019 at 21:38, Suraj Kharage <
suraj.kharage@enterprisedb.com> wrote:

Hi,

I noticed the issue in to_timestamp()/to_date() while handling the double
quote literal string. If any double quote literal characters found in
format, we generate the NODE_TYPE_CHAR in parse format and store that
actual character in FormatNode->character. n DCH_from_char, we just
increment the input string by length of character for NODE_TYPE_CHAR.
We are actually not matching these characters in input string and because
of this, date values get changed if quoted literal string is not identical
in input and format string.

e.g:

postgres@78619=#select to_timestamp('2019-05-24T23:12:45',
'yyyy-mm-dd"TT"hh24:mi:ss');
to_timestamp
---------------------------
2019-05-24 03:12:45+05:30
(1 row)

In above example, the quoted string is 'TT', so it just increment the
input string by 2 while handling these characters and returned the wrong
hour value.

My suggestion is to match the exact characters from quoted literal string
in input string and if doesn't match then throw an error.

Attached is the POC patch which almost works for all scenarios except for
whitespace - as a quote character.

Suggestions?
--
--

Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.

--
--

Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.