Allow to_date() and to_timestamp() to accept localized names

Started by Juan José Santamaría Flechaover 6 years ago64 messageshackers
Jump to latest
#1Juan José Santamaría Flecha
juanjo.santamaria@gmail.com

Hello,

Going through the current items in the wiki's todo list, I have been
looking into: "Allow to_date () and to_timestamp () to accept
localized month names".

It seems to me that the code is almost there, so I would like to know
if something like the attached patch would be a reasonable way to go.

Regards,

Juan José Santamaría Flecha

Attachments:

0001-Allow-localized-month-names-to_date.patchapplication/octet-stream; name=0001-Allow-localized-month-names-to_date.patchDownload+115-45
#2Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Juan José Santamaría Flecha (#1)
Re: Allow to_date() and to_timestamp() to accept localized names

On Sun, Aug 18, 2019 at 10:42 AM Juan José Santamaría Flecha
<juanjo.santamaria@gmail.com> wrote:

Going through the current items in the wiki's todo list, I have been
looking into: "Allow to_date () and to_timestamp () to accept
localized month names".

I have gone through a second take on this, trying to give it a better
shape but it would surely benefit from some review, so I will open an
item in the commitfest.

Regards,

Juan José Santamaría Flecha

Attachments:

0001-Allow-localized-month-names-to_date-v1.patchapplication/octet-stream; name=0001-Allow-localized-month-names-to_date-v1.patchDownload+191-85
#3Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Juan José Santamaría Flecha (#2)
Re: Allow to_date() and to_timestamp() to accept localized names

On Thu, Aug 22, 2019 at 9:38 PM Juan José Santamaría Flecha
<juanjo.santamaria@gmail.com> wrote:

Going through the current items in the wiki's todo list, I have been
looking into: "Allow to_date () and to_timestamp () to accept
localized month names".

I have gone through a second take on this, trying to give it a better
shape but it would surely benefit from some review, so I will open an
item in the commitfest.

For reviewers, the current aproach for this patch is:

Break seq_search() into two functions:
* seq_search_sqlascii() that supports seq_search() current usage.
* seq_search_localized() similar to the previous but supports multibyte input.
To avoid code duplication most of current seq_search() logic has been
moved to a new function str_compare().
from_char_seq_search() is now responsible to choose between
seq_search_sqlascii() or seq_search_localized().
Also, since localized names is not a null terminated array,
seq_search() now receives the dimension as input and terminating nulls
have been removed from static arrays.

The commitfest item is:

https://commitfest.postgresql.org/24/2255/

Regards,

Juan José Santamaría Flecha

#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Juan José Santamaría Flecha (#2)
Re: Allow to_date() and to_timestamp() to accept localized names

On 2019-Aug-22, Juan Jos� Santamar�a Flecha wrote:

On Sun, Aug 18, 2019 at 10:42 AM Juan Jos� Santamar�a Flecha
<juanjo.santamaria@gmail.com> wrote:

Going through the current items in the wiki's todo list, I have been
looking into: "Allow to_date () and to_timestamp () to accept
localized month names".

I have gone through a second take on this, trying to give it a better
shape but it would surely benefit from some review, so I will open an
item in the commitfest.

I'm confused why we acquire the MONTH_DIM / etc definitions. Can't we
just use lengthof() of the corresponding array? AFAICS it should work
just as well.

I wonder if the "compare first char" thing (seq_search_localized) really
works when there are multibyte chars in the day/month names. I think
the code compares just the first char ... but what if the original
string uses those funny Unicode non-normalized letters and the locale
translation uses normalized letters? My guess is that the first-char
comparison will fail, but surely you'll want the name to match.
(There's no month/day name in Spanish that doesn't start with an ASCII
letter, but I bet there are some in other languages.) I think the
localized comparison should avoid the first-char optimization, just
compare the whole string all the time, and avoid possible weird issues.

But then, I'm not 100% sure of this code. formatting.c is madness
distilled.

Thanks,

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Alvaro Herrera (#4)
Re: Allow to_date() and to_timestamp() to accept localized names

On Fri, Sep 13, 2019 at 10:31 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Thanks for taking a look at this.

I'm confused why we acquire the MONTH_DIM / etc definitions. Can't we
just use lengthof() of the corresponding array? AFAICS it should work
just as well.

It was because of the length difference between ascii-name arrays,
which were all null-ended, and localized-name arrays. The attached
version uses lengthof().

I wonder if the "compare first char" thing (seq_search_localized) really
works when there are multibyte chars in the day/month names. I think
the code compares just the first char ... but what if the original
string uses those funny Unicode non-normalized letters and the locale
translation uses normalized letters? My guess is that the first-char
comparison will fail, but surely you'll want the name to match.
(There's no month/day name in Spanish that doesn't start with an ASCII
letter, but I bet there are some in other languages.) I think the
localized comparison should avoid the first-char optimization, just
compare the whole string all the time, and avoid possible weird issues.

The localized search is reformulated in this version to deal with
multibyte normalization. A regression test for this issue is included.

Regards,

Juan José Santamaría Flecha

Attachments:

0001-Allow-localized-month-names-to_date-v2.patchapplication/octet-stream; name=0001-Allow-localized-month-names-to_date-v2.patchDownload+181-49
#6Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Juan José Santamaría Flecha (#5)
Re: Allow to_date() and to_timestamp() to accept localized names

On Wed, Sep 18, 2019 at 11:09 AM Juan José Santamaría Flecha
<juanjo.santamaria@gmail.com> wrote:

On Fri, Sep 13, 2019 at 10:31 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Thanks for taking a look at this.

I'm confused why we acquire the MONTH_DIM / etc definitions. Can't we
just use lengthof() of the corresponding array? AFAICS it should work
just as well.

It was because of the length difference between ascii-name arrays,
which were all null-ended, and localized-name arrays. The attached
version uses lengthof().

I wonder if the "compare first char" thing (seq_search_localized) really
works when there are multibyte chars in the day/month names. I think
the code compares just the first char ... but what if the original
string uses those funny Unicode non-normalized letters and the locale
translation uses normalized letters? My guess is that the first-char
comparison will fail, but surely you'll want the name to match.
(There's no month/day name in Spanish that doesn't start with an ASCII
letter, but I bet there are some in other languages.) I think the
localized comparison should avoid the first-char optimization, just
compare the whole string all the time, and avoid possible weird issues.

The localized search is reformulated in this version to deal with
multibyte normalization. A regression test for this issue is included.

This version is updated to optimize the need for dynamic allocation.

Show quoted text

Regards,

Juan José Santamaría Flecha

Attachments:

0001-Allow-localized-month-names-to_date-v3.patchapplication/x-patch; name=0001-Allow-localized-month-names-to_date-v3.patchDownload+181-49
#7Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Juan José Santamaría Flecha (#6)
Re: Allow to_date() and to_timestamp() to accept localized names

This patch no longer applies. Can you please rebase?

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#8Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Alvaro Herrera (#7)
Re: Allow to_date() and to_timestamp() to accept localized names

On Wed, Sep 25, 2019 at 9:57 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

This patch no longer applies. Can you please rebase?

Thank you for the notification.

The patch rot after commit 1a950f3, a rebased version is attached.

Regards,

Juan José Santamaría Flecha

Attachments:

0001-Allow-localized-month-names-to_date-v4.patchapplication/x-patch; name=0001-Allow-localized-month-names-to_date-v4.patchDownload+184-42
#9Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Juan José Santamaría Flecha (#8)
Re: Allow to_date() and to_timestamp() to accept localized names

On Thu, Sep 26, 2019 at 08:36:25PM +0200, Juan Jos� Santamar�a Flecha wrote:

On Wed, Sep 25, 2019 at 9:57 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

This patch no longer applies. Can you please rebase?

Thank you for the notification.

The patch rot after commit 1a950f3, a rebased version is attached.

Thanks. I did a quick review of this patch, and I think it's almost RFC.
I only found a couple of minor issue:

- In func.sgml, it seems we've lost this bit:

<para>
<literal>TM</literal> does not include trailing blanks.
<function>to_timestamp</function> and <function>to_date</function> ignore
the <literal>TM</literal> modifier.
</para>

Does that mean the function no longer ignore the TM modifier? That
would be somewhat problematic (i.e. it might break some user code).
But looking at the code I don't see why the patch would have this
effect, so I suppose it's merely a doc bug.

- I don't think we need to include examples how to_timestmap ignores
case, I'd say just stating the fact is clear enough. But if we want to
have examples, I think we should not inline in the para but use the
established pattern:

<para>
Some examples:
<programlisting>
...
</programlisting>
</para>

which is used elsewhere in the func.sgml file.

- In formatting.c the "common/unicode_norm.h" should be right after
includes from "catalog/" to follow the alphabetic order (unless
there's a reason why that does not work).

- I rather dislike the "dim" parameter name, because I immediately think
"dimension" which makes little sense. I suggest renaming to "nitems"
or "nelements" or something like that.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Tomas Vondra (#9)
Re: Allow to_date() and to_timestamp() to accept localized names

On Sat, Jan 11, 2020 at 5:06 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

Thanks. I did a quick review of this patch, and I think it's almost RFC.

Thanks for reviewing.

- In func.sgml, it seems we've lost this bit:

<para>
<literal>TM</literal> does not include trailing blanks.
<function>to_timestamp</function> and <function>to_date</function>
ignore
the <literal>TM</literal> modifier.
</para>

Does that mean the function no longer ignore the TM modifier? That
would be somewhat problematic (i.e. it might break some user code).
But looking at the code I don't see why the patch would have this
effect, so I suppose it's merely a doc bug.

It is intentional. This patch uses the TM modifier to identify the usage of
localized names as input for to_timestamp() and to_date().

- I don't think we need to include examples how to_timestmap ignores
case, I'd say just stating the fact is clear enough. But if we want to
have examples, I think we should not inline in the para but use the
established pattern:

<para>
Some examples:
<programlisting>
...
</programlisting>
</para>

which is used elsewhere in the func.sgml file.

I was trying to match the style surrounding the usage notes for date/time
formatting [1]https://www.postgresql.org/docs/current/functions-formatting.html. Agreed that it is not worth an example on its own, so
dropped.

- In formatting.c the "common/unicode_norm.h" should be right after
includes from "catalog/" to follow the alphabetic order (unless
there's a reason why that does not work).

Fixed.

- I rather dislike the "dim" parameter name, because I immediately think
"dimension" which makes little sense. I suggest renaming to "nitems"
or "nelements" or something like that.

Agreed, using "nelements" as a better style matchup.

Please, find attached a version addressing the above mentioned.

[1]: https://www.postgresql.org/docs/current/functions-formatting.html

Regards,

Juan José Santamaría Flecha

Show quoted text

Attachments:

0001-Allow-localized-month-names-to_date-v5.patchapplication/octet-stream; name=0001-Allow-localized-month-names-to_date-v5.patchDownload+184-44
#11Artur Zakirov
zaartur@gmail.com
In reply to: Juan José Santamaría Flecha (#10)
Re: Allow to_date() and to_timestamp() to accept localized names

Hello!

On 2020/01/13 21:04, Juan José Santamaría Flecha wrote:

Please, find attached a version addressing the above mentioned.

I have some couple picky notes.

+ if (name_len != norm_len)
+ pfree(norm_name);

I'm not sure here. Is it save to assume that if it was allocated a new
norm_name name_len and norm_len always will differ?

static const char *const months_full[] = {
"January", "February", "March", "April", "May", "June", "July",
-	"August", "September", "October", "November", "December", NULL
+	"August", "September", "October", "November", "December"
};

Unfortunately I didn't see from the patch why it was necessary to remove
last null element from months_full, days_short, adbc_strings,
adbc_strings_long and other arrays. I think it is not good because
unnecessarily increases the patch and adds code like the following:

+				from_char_seq_search(&value, &s, months, localized_names,
+									 ONE_UPPER, MAX_MON_LEN, n, have_error,
+									 lengthof(months_full));

Here it passes "months" from datetime.c to from_char_seq_search() and
calculates its length using "months_full" array from formatting.c.

--
Arthur

#12Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Artur Zakirov (#11)
Re: Allow to_date() and to_timestamp() to accept localized names

On Wed, Jan 15, 2020 at 11:20 AM Arthur Zakirov <zaartur@gmail.com> wrote:

I have some couple picky notes.

Thanks for the review.

+ if (name_len != norm_len)
+ pfree(norm_name);

I'm not sure here. Is it save to assume that if it was allocated a new
norm_name name_len and norm_len always will differ?

Good call, that check can be more robust.

static const char *const months_full[] = {
"January", "February", "March", "April", "May", "June", "July",
-     "August", "September", "October", "November", "December", NULL
+     "August", "September", "October", "November", "December"
};

Unfortunately I didn't see from the patch why it was necessary to remove
last null element from months_full, days_short, adbc_strings,
adbc_strings_long and other arrays. I think it is not good because
unnecessarily increases the patch and adds code like the following:

+ from_char_seq_search(&value, &s, months,

localized_names,

+

ONE_UPPER, MAX_MON_LEN, n, have_error,

+

lengthof(months_full));

Here it passes "months" from datetime.c to from_char_seq_search() and
calculates its length using "months_full" array from formatting.c.

With the introduction of seq_search_localized() that can be avoided,
minimizing code churn.

Please, find attached a version addressing the above mentioned.

Regards,

Juan José Santamaría Flecha

Show quoted text

Attachments:

0001-Allow-localized-month-names-to_date-v6.patchapplication/octet-stream; name=0001-Allow-localized-month-names-to_date-v6.patchDownload+164-33
#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Juan José Santamaría Flecha (#12)
Re: Allow to_date() and to_timestamp() to accept localized names

=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?= <juanjo.santamaria@gmail.com> writes:

[ 0001-Allow-localized-month-names-to_date-v6.patch ]

I took a quick look through this.

One thing I completely don't understand is why it's sufficient for
seq_search_localized to do "initcap" semantics. Shouldn't it cover
the all-upper and all-lower cases as well, as the existing seq_search
code does? (That is, why is it okay to ignore the "type" argument?)

On the other hand, you probably *should* be ignoring the "max"
argument, because AFAICS the values that are passed for that are
specific to the English ASCII spellings of the day and month names.
It seems very likely that they could be too small for some sets of
non-English names.

Related to that, the business with

+	mb_max = max * pg_encoding_max_length(encoding);
+	name_len = strlen(name);
+	name_len = name_len < mb_max ? name_len : mb_max;

seems wrong even if we assume that "max" is a sane limit on the number of
characters to consider. This coding is likely to truncate a long "name"
string in the middle of a multibyte character, leading to bad-encoding
errors from later functions.

I also am confused by the "if (mb_max > max ...)" bit. It looks to me
like that's an obscure way of writing "if (pg_encoding_max_length() > 1)",
which is a rather pointless check considering that the if() then goes on
to insist on encoding == PG_UTF8.

A bit further down, you have an "if (name_wlen > norm_wlen)"
condition that seems pretty fishy. Is it really guaranteed that
unicode_normalize_kc cannot transform the string without shortening it?
I don't see that specified in its header comment, for sure.

Also, it's purely accidental if this:

norm_name = (char *) palloc((norm_wlen + 1) * sizeof(pg_wchar));

allocates a long enough string for the conversion back to multibyte form,
because that output is not pg_wchars. I think you want something more
like "norm_wlen * MAX_MULTIBYTE_CHAR_LEN + 1". (I wonder whether we
need to be worrying about integer overflow in any of this.)

It seems like you could eliminate a lot of messiness by extending
localized_abbrev_days[] and related arrays by one element that's
always NULL. That would waste, hmm, four 8-byte pointers on typical
machines --- but you're eating a lot more than 32 bytes of code to
pass around the array lengths, plus it's really ugly that the plain
and localized arrays are handled differently.

I don't think the way you're managing the "localized_names" variable
in DCH_from_char is very sensible. The reader has to do a lot of
reverse engineering just to discover that the value is constant NULL
in most references, something that you've not helped by declaring
it outside the loop rather than inside. I think you could drop
the variable and write constant NULL in most places, with something
like
S_TM(n->suffix) ? localized_full_days : NULL
in those from_char_seq_search calls where it's relevant.

In general I'm displeased with the lack of attention to comments.
Notably, you added arguments to from_char_seq_search without updating
its header comment ... not that that comment is a great API spec, but
at the least you shouldn't be making it worse. I think the bug I
complained of above is directly traceable to the lack of a clear spec
here for what "max" means, so failure to keep these comments up to
speed does have demonstrable bad consequences.

regards, tom lane

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#13)
Re: Allow to_date() and to_timestamp() to accept localized names

I wrote:

One thing I completely don't understand is why it's sufficient for
seq_search_localized to do "initcap" semantics. Shouldn't it cover
the all-upper and all-lower cases as well, as the existing seq_search
code does? (That is, why is it okay to ignore the "type" argument?)

I took a closer look at this and decided you were probably doing that
just because the seq_search code uses initcap-style case folding to
match month and day names, relying on the assumption that the constant
arrays it's passed are cased that way. But we shouldn't (and the patch
doesn't) assume that the localized names we'll get from pg_locale.c are
cased that way.

However ... it seems to me that the way seq_search is defined is
plenty bogus. In the first place, it scribbles on its source string,
which is surely not okay. You can see that in error messages that
are thrown on match failures:

regression=# select to_date('3 JANUZRY 1999', 'DD MONTH YYYY');
ERROR: invalid value "JanuzRY 1" for "MONTH"
DETAIL: The given value did not match any of the allowed values for this field.

Now, maybe somebody out there thinks this is a cute way of highlighting
how much of the string we examined, but it looks more like a bug from
where I sit. It's mere luck that what's being clobbered is a local
string copy created by do_to_timestamp(), and not the original input
data passed to to_date().

In the second place, there's really no need at all for seq_search to
rely on any assumptions about the case state of the array it's
searching. pg_toupper() is pretty cheap already, and we can make it
guaranteed cheap if we use pg_ascii_toupper() instead. So I think
we ought to just remove the "type" argument altogether and have
seq_search dynamically convert all the strings it examines to upper
case (or lower case would work as well, at least for English).

On the other hand, you probably *should* be ignoring the "max"
argument, because AFAICS the values that are passed for that are
specific to the English ASCII spellings of the day and month names.
It seems very likely that they could be too small for some sets of
non-English names.

Closer examination shows that the "max" argument is pretty bogus as
well. It doesn't do anything except confuse the reader, because there
are no cases where the value passed is less than the maximum array entry
length, so it never acts to change seq_search's behavior. So we should
just drop that behavior from seq_search, too, and redefine "max" as
having no purpose except to specify how much of the string to show in
error messages. There's still a question of what that should be for
non-English cases, but at least we now have a clear idea of what we
need the value to do.

I also noted while I was looking at it that from_char_seq_search()
would truncate at "max" bytes even when dealing with multibyte input.
That has a clear risk of generating an invalidly-encoded error message.

The 0001 patch attached cleans up all the above complaints. I felt
that given the business about scribbling on strings we shouldn't,
it would also be wise to const-ify the string pointers if possible.
That seems not too painful, per 0002 attached.

I'm tempted to go a bit further than 0001 does, and remove the 'max'
argument from from_char_seq_search() altogether. Since we only need
'max' in error cases, which don't need to be super-quick, we could drop
the requirement for the caller to specify that and instead compute it
when we do need it as the largest of the constant array's string
lengths. That'd carry over into the localized-names case in a
reasonably straightforward way (though we might need to count characters
not bytes for that to work nicely).

Because of the risk of incorrectly-encoded error messages, I'm rather
tempted to claim that these patches (or at least 0001) are a bug fix
and should be back-patched. In any case I think we should apply this
to HEAD and then rebase the localized-names patch over it. It makes
a whole lot more sense, IMO, for the localized-names comparison logic
to match what this is doing than what seq_search does today.

Comments?

regards, tom lane

Attachments:

0001-formatting-fix-seq-search.patchtext/x-diff; charset=us-ascii; name=0001-formatting-fix-seq-search.patchDownload+51-59
0002-formatting-constify.patchtext/x-diff; charset=us-ascii; name=0002-formatting-constify.patchDownload+16-13
#15Artur Zakirov
zaartur@gmail.com
In reply to: Tom Lane (#14)
Re: Allow to_date() and to_timestamp() to accept localized names

On 2020/01/23 7:11, Tom Lane wrote:

Closer examination shows that the "max" argument is pretty bogus as
well. It doesn't do anything except confuse the reader, because there
are no cases where the value passed is less than the maximum array entry
length, so it never acts to change seq_search's behavior. So we should
just drop that behavior from seq_search, too, and redefine "max" as
having no purpose except to specify how much of the string to show in
error messages. There's still a question of what that should be for
non-English cases, but at least we now have a clear idea of what we
need the value to do.

Shouldn't we just show all remaining string instead of truncating it?
For example I get the following output:

=# select to_timestamp('3 ЯНВАРЯ 1999', 'DD MONTH YYYY');
ERROR: invalid value "ЯНВА" for "MONTH"
DETAIL: The given value did not match any of the allowed values for
this field.

This behavior is reproduced without the patch though (on Postgres 12).

And the English case might confuse too I think:

=# select to_date('3 JANUZRY 1999', 'DD MONTH YYYY');
ERROR: invalid value "JANUZRY 1" for "MONTH"
DETAIL: The given value did not match any of the allowed values for
this field.

Users might think what means "1" in "JANUZRY 1" string.

I attached the draft patch, which shows all remaining string. So the
query above will show the following:

=# select to_timestamp('3 ЯНВАРЯ 1999', 'DD MONTH YYYY');
ERROR: invalid value "ЯНВАРЯ 1999" for "MONTH"
DETAIL: The remaining value did not match any of the allowed values for
this field.

The 0001 patch attached cleans up all the above complaints. I felt
that given the business about scribbling on strings we shouldn't,
it would also be wise to const-ify the string pointers if possible.
That seems not too painful, per 0002 attached.

+1 on the patch.

--
Arthur

Attachments:

0003-formatting-show-remaining.patchtext/plain; charset=UTF-8; name=0003-formatting-show-remaining.patchDownload+2-8
#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Artur Zakirov (#15)
Re: Allow to_date() and to_timestamp() to accept localized names

Arthur Zakirov <zaartur@gmail.com> writes:

On 2020/01/23 7:11, Tom Lane wrote:

Closer examination shows that the "max" argument is pretty bogus as
well. It doesn't do anything except confuse the reader, because there
are no cases where the value passed is less than the maximum array entry
length, so it never acts to change seq_search's behavior. So we should
just drop that behavior from seq_search, too, and redefine "max" as
having no purpose except to specify how much of the string to show in
error messages. There's still a question of what that should be for
non-English cases, but at least we now have a clear idea of what we
need the value to do.

Shouldn't we just show all remaining string instead of truncating it?

That would avoid a bunch of arbitrary decisions, for sure.
Anybody have an objection?

regards, tom lane

#17Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#16)
Re: Allow to_date() and to_timestamp() to accept localized names

On 2020-Jan-22, Tom Lane wrote:

Arthur Zakirov <zaartur@gmail.com> writes:

On 2020/01/23 7:11, Tom Lane wrote:

Closer examination shows that the "max" argument is pretty bogus as
well. It doesn't do anything except confuse the reader, because there
are no cases where the value passed is less than the maximum array entry
length, so it never acts to change seq_search's behavior. So we should
just drop that behavior from seq_search, too, and redefine "max" as
having no purpose except to specify how much of the string to show in
error messages. There's still a question of what that should be for
non-English cases, but at least we now have a clear idea of what we
need the value to do.

Shouldn't we just show all remaining string instead of truncating it?

That would avoid a bunch of arbitrary decisions, for sure.
Anybody have an objection?

I think it would be clearer to search for whitespace starting at the
start of the bogus token and stop there. It might not be perfect,
particularly if any language has whitespace in a month etc name (I don't
know any example but I guess it's not impossible for it to exist;
Portuguese weekday names maybe?), but it seems better than either of the
behaviors shown above.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#17)
Re: Allow to_date() and to_timestamp() to accept localized names

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

On 2020-Jan-22, Tom Lane wrote:

Arthur Zakirov <zaartur@gmail.com> writes:

Shouldn't we just show all remaining string instead of truncating it?

That would avoid a bunch of arbitrary decisions, for sure.
Anybody have an objection?

I think it would be clearer to search for whitespace starting at the
start of the bogus token and stop there. It might not be perfect,
particularly if any language has whitespace in a month etc name (I don't
know any example but I guess it's not impossible for it to exist;
Portuguese weekday names maybe?), but it seems better than either of the
behaviors shown above.

I'm okay with that. It won't work so well for cases like
"1-Januzry-1999", but it's still better than what happens now:

regression=# select to_date('1-Januzry-1999', 'DD MONTH YYYY');
ERROR: invalid value "Januzry-1" for "MONTH"

That particular case could be improved by stopping at a dash ... but
since this code is also used to match strings like "A.M.", we can't
just exclude punctuation in general. Breaking at whitespace seems
like a reasonable compromise.

regards, tom lane

#19Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#18)
Re: Allow to_date() and to_timestamp() to accept localized names

On 2020-Jan-23, Tom Lane wrote:

That particular case could be improved by stopping at a dash ... but
since this code is also used to match strings like "A.M.", we can't
just exclude punctuation in general. Breaking at whitespace seems
like a reasonable compromise.

Yeah, and there are cases where dashes are used in names -- browsing
through glibc for example I quickly found Akan, for which the month names are:

mon "Sanda-<U0186>p<U025B>p<U0254>n";/
"Kwakwar-<U0186>gyefuo";/
"Eb<U0254>w-<U0186>benem";/

and so on. Even whitespace is problematic for some languages, such as Afan,

mon "Qunxa Garablu";/
"Naharsi Kudo";/
"Ciggilta Kudo";/
(etc) but I think whitespace-splitting is going to be more comprehensible
in the vast majority of cases, even if not perfect.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#19)
Re: Allow to_date() and to_timestamp() to accept localized names

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Even whitespace is problematic for some languages, such as Afan,

mon "Qunxa Garablu";/
"Naharsi Kudo";/
"Ciggilta Kudo";/

(etc) but I think whitespace-splitting is going to be more comprehensible
in the vast majority of cases, even if not perfect.

Interesting. Given that to_date et al are often willing to ignore
whitespace in input, I wonder whether we won't have some other issues
with names like that --- not so much that basic cases wouldn't work,
as that people might reasonably expect that, say, we'd be flexible
about the amount of whitespace. Seems like a problem for another day
though.

In the meantime, I agree that "truncate at whitespace" is a better
heuristic for the error messages than what we've got. I'll go make it so.

regards, tom lane

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#19)
#22Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Tom Lane (#21)
#23Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Juan José Santamaría Flecha (#22)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Juan José Santamaría Flecha (#22)
#25Peter Eisentraut
peter_e@gmx.net
In reply to: Alvaro Herrera (#23)
#26Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Eisentraut (#25)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#26)
#28Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#27)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#28)
#30Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Tom Lane (#29)
#31Peter Eisentraut
peter_e@gmx.net
In reply to: Juan José Santamaría Flecha (#30)
#32Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#31)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#32)
#34Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#33)
#35Peter Eisentraut
peter_e@gmx.net
In reply to: Mark Dilger (#34)
#36Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#33)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#35)
#38Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#37)
#39Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Mark Dilger (#38)
#40Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Juan José Santamaría Flecha (#39)
#41Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Eisentraut (#35)
#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#38)
#43Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#42)
#44Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#43)
#45Peter Eisentraut
peter_e@gmx.net
In reply to: Juan José Santamaría Flecha (#39)
#46Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Alvaro Herrera (#41)
#47Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Peter Eisentraut (#45)
#48Tom Lane
tgl@sss.pgh.pa.us
In reply to: Juan José Santamaría Flecha (#47)
#49Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#48)
#50Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Tom Lane (#49)
#51Tom Lane
tgl@sss.pgh.pa.us
In reply to: Juan José Santamaría Flecha (#50)
#52James Coleman
jtc331@gmail.com
In reply to: Tom Lane (#51)
#53Tom Lane
tgl@sss.pgh.pa.us
In reply to: James Coleman (#52)
#54James Coleman
jtc331@gmail.com
In reply to: Tom Lane (#53)
#55Tom Lane
tgl@sss.pgh.pa.us
In reply to: James Coleman (#54)
#56Juan José Santamaría Flecha
juanjo.santamaria@gmail.com
In reply to: Tom Lane (#55)
#57James Coleman
jtc331@gmail.com
In reply to: Tom Lane (#55)
#58James Coleman
jtc331@gmail.com
In reply to: Juan José Santamaría Flecha (#56)
#59James Coleman
jtc331@gmail.com
In reply to: James Coleman (#58)
#60Tom Lane
tgl@sss.pgh.pa.us
In reply to: James Coleman (#59)
#61Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#60)
#62James Coleman
jtc331@gmail.com
In reply to: Tom Lane (#61)
#63Tom Lane
tgl@sss.pgh.pa.us
In reply to: James Coleman (#62)
#64James Coleman
jtc331@gmail.com
In reply to: Tom Lane (#63)