regex match and special characters

Started by Alexey Klyukinover 7 years ago13 messagesgeneral
Jump to latest
#1Alexey Klyukin
alexk@commandprompt.com

Hi,

Here is a simple SQL statement that gives different results on PostgreSQL 9.6 and PostgreSQL 10+. The space character at the end of the string is actually U+2006 SIX-PER-EM SPACE (http://www.fileformat.info/info/unicode/char/2006/index.htm)

test=# select 'abcd ' ~ 'abcd\s';
?column?
----------
t
(1 row)

test=# select version();
version
-------------------------------------------------------------------------------------------------
PostgreSQL 12devel on x86_64-pc-linux-gnu, compiled by gcc (Gentoo 6.4.0-r1 p1.3) 6.4.0, 64-bit
(1 row)

On another server (running on the same system on a different port)

postgres=# select version();
version
-----------------------------------------------------------------------------------------------
PostgreSQL 9.6.9 on x86_64-pc-linux-gnu, compiled by gcc (Gentoo 6.4.0-r1 p1.3) 6.4.0, 64-bit
(1 row)

postgres=# select 'abcd ' ~ 'abcd\s';
?column?
----------
f
(1 row)

For both clusters, the client encoding is UTF8, the database encoding and collation is UTF8 and en_US.utf8 respectively, and the lc_ctype is en_US.utf8. I am accessing the databases running locally by ssh-ing first to the host.

I observed similar issues with other Linux-based servers running Ubuntu, in all cases the regex resulted in true on PostgreSQL 10+ and false on earlier versions (down to 9.3). The query comes from a table check that suddenly stopped accepting rows valid in the older version during the migration. Making it select 'abcd ' ~ E'abcd\\s' doesn't modify the outcome, unsurprisingly.

Is it reproducible for others here as well? Given that it is, Is there a way to make both versions behave the same?

Cheers,
Alex

#2Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Alexey Klyukin (#1)
Re: regex match and special characters

On 08/16/2018 03:59 AM, Alex Kliukin wrote:

Hi,

Here is a simple SQL statement that gives different results on PostgreSQL 9.6 and PostgreSQL 10+. The space character at the end of the string is actually U+2006 SIX-PER-EM SPACE (http://www.fileformat.info/info/unicode/char/2006/index.htm)

test=# select 'abcd ' ~ 'abcd\s';
?column?
----------
t
(1 row)

test=# select version();
version
-------------------------------------------------------------------------------------------------
PostgreSQL 12devel on x86_64-pc-linux-gnu, compiled by gcc (Gentoo 6.4.0-r1 p1.3) 6.4.0, 64-bit
(1 row)

On another server (running on the same system on a different port)

postgres=# select version();
version
-----------------------------------------------------------------------------------------------
PostgreSQL 9.6.9 on x86_64-pc-linux-gnu, compiled by gcc (Gentoo 6.4.0-r1 p1.3) 6.4.0, 64-bit
(1 row)

postgres=# select 'abcd ' ~ 'abcd\s';
?column?
----------
f
(1 row)

For both clusters, the client encoding is UTF8, the database encoding and collation is UTF8 and en_US.utf8 respectively, and the lc_ctype is en_US.utf8. I am accessing the databases running locally by ssh-ing first to the host.

I observed similar issues with other Linux-based servers running Ubuntu, in all cases the regex resulted in true on PostgreSQL 10+ and false on earlier versions (down to 9.3). The query comes from a table check that suddenly stopped accepting rows valid in the older version during the migration. Making it select 'abcd ' ~ E'abcd\\s' doesn't modify the outcome, unsurprisingly.

Is it reproducible for others here as well? Given that it is, Is there a way to make both versions behave the same?

select version();
version

------------------------------------------------------------------------------------
PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (SUSE Linux)
4.8.5, 64-bit

lc_collate | en_US.UTF-8

lc_ctype | en_US.UTF-8

test=# select 'abcd'||chr(2006) ~ E'abcd\s';
?column?
----------
f
(1 row)

In your example you are working on Postgres devel. Have you tried it on
Postgres 10 and/or 11?

Cheers,
Alex

--
Adrian Klaver
adrian.klaver@aklaver.com

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexey Klyukin (#1)
Re: regex match and special characters

Alex Kliukin <alexk@hintbits.com> writes:

Here is a simple SQL statement that gives different results on PostgreSQL 9.6 and PostgreSQL 10+. The space character at the end of the string is actually U+2006 SIX-PER-EM SPACE (http://www.fileformat.info/info/unicode/char/2006/index.htm)

I think the reason for the discrepancy is that in v10 we fixed the regex
locale support so that it could properly classify code points above U+7FF,
cf

https://git.postgresql.org/gitweb/?p=postgresql.git&amp;a=commitdiff&amp;h=c54159d44ceaba26ceda9fea1804f0de122a8f30

So 10 is giving the right answer (i.e. that \s matches U+2006).
9.x is not, but we're not going to back-patch such a large change.

regards, tom lane

#4Andreas Joseph Krogh
andreas@visena.com
In reply to: Adrian Klaver (#2)
Sv: Re: regex match and special characters

På torsdag 16. august 2018 kl. 15:16:52, skrev Adrian Klaver <
adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>>:
On 08/16/2018 03:59 AM, Alex Kliukin wrote:

Hi,

Here is a simple SQL statement that gives different results on PostgreSQL

9.6 and PostgreSQL 10+. The space character at the end of the string is
actually U+2006 SIX-PER-EM SPACE
(http://www.fileformat.info/info/unicode/char/2006/index.htm)

test=# select 'abcd ' ~ 'abcd\s';
   ?column?
----------
   t
(1 row)

test=# select version();
                                               version

-------------------------------------------------------------------------------------------------

   PostgreSQL 12devel on x86_64-pc-linux-gnu, compiled by gcc (Gentoo

6.4.0-r1 p1.3) 6.4.0, 64-bit

(1 row)

On another server (running on the same system on a different port)

postgres=# select version();
                                              version

-----------------------------------------------------------------------------------------------

   PostgreSQL 9.6.9 on x86_64-pc-linux-gnu, compiled by gcc (Gentoo 6.4.0-r1

p1.3) 6.4.0, 64-bit

(1 row)

postgres=# select 'abcd ' ~ 'abcd\s';
   ?column?
----------
   f
(1 row)

For both clusters, the client encoding is UTF8, the database encoding and

collation is UTF8 and en_US.utf8 respectively, and the lc_ctype is en_US.utf8.
I am accessing the databases running locally by ssh-ing first to the host.

I observed similar issues with other Linux-based servers running Ubuntu, in

all cases the regex resulted in true on PostgreSQL 10+ and false on earlier
versions (down to 9.3). The query comes from a table check that suddenly
stopped accepting rows valid in the older version during the migration. Making
it  select 'abcd ' ~ E'abcd\\s' doesn't  modify the outcome, unsurprisingly.

Is it reproducible for others here as well? Given that it is, Is there a

way to make both versions behave the same?

select version();
                                       version

------------------------------------------------------------------------------------
  PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (SUSE Linux)
4.8.5, 64-bit

lc_collate                          | en_US.UTF-8

lc_ctype                            | en_US.UTF-8

test=# select 'abcd'||chr(2006) ~ E'abcd\s';
  ?column?
----------
  f
(1 row)

In your example you are working on Postgres devel. Have you tried it on
Postgres 10 and/or 11?
 
char(2006) produces the wrong character as 2006 is the hex-value. You have to
use 8198:
 
andreak@[local]:5433 10.4 andreak=# select version();

┌────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                            version
                                            │

├────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
7.3.0-16ubuntu3) 7.3.0, 64-bit │

└────────────────────────────────────────────────────────────────────────────────────────────────┘
(1 row)

andreak@[local]:5433 10.4 andreak=# select 'abcd'||chr(8198) ~ 'abcd\s';
┌──────────┐
│ ?column? │
├──────────┤
│ t        │
└──────────┘
(1 row)
 
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andreas@visena.com <mailto:andreas@visena.com>
www.visena.com <https://www.visena.com&gt;
<https://www.visena.com&gt;

 

#5Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Andreas Joseph Krogh (#4)
Re: Sv: Re: regex match and special characters

On 08/16/2018 07:04 AM, Andreas Joseph Krogh wrote:

char(2006) produces the wrong character as 2006 is the hex-value. You
have to use 8198:
andreak@[local]:543310.4 andreak=# select version();
┌────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                            version
                                            │
├────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
7.3.0-16ubuntu3) 7.3.0, 64-bit │
└────────────────────────────────────────────────────────────────────────────────────────────────┘
(1 row)

andreak@[local]:543310.4 andreak=# select 'abcd'||chr(8198) ~ 'abcd\s';
┌──────────┐
│ ?column? │
├──────────┤
│ t        │
└──────────┘
(1 row)

Argh, read the wrong line. Thanks for the correction. Still:

test=# select version();
version

------------------------------------------------------------------------------------
PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (SUSE Linux)
4.8.5, 64-bit
(1 row)

test=# select 'abcd'||chr(8198) ~ E'abcd\s';
?column?
----------
f
(1 row)

--
*Andreas Joseph Krogh*
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andreas@visena.com <mailto:andreas@visena.com>
www.visena.com <https://www.visena.com&gt;
<https://www.visena.com&gt;

--
Adrian Klaver
adrian.klaver@aklaver.com

#6Andreas Joseph Krogh
andreas@visena.com
In reply to: Adrian Klaver (#5)
Sv: Re: Sv: Re: regex match and special characters

På torsdag 16. august 2018 kl. 16:32:40, skrev Adrian Klaver <
adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>>:
On 08/16/2018 07:04 AM, Andreas Joseph Krogh wrote:

char(2006) produces the wrong character as 2006 is the hex-value. You
have to use 8198:
andreak@[local]:543310.4 andreak=# select version();

┌────────────────────────────────────────────────────────────────────────────────────────────────┐

│                                            version
                                              │

├────────────────────────────────────────────────────────────────────────────────────────────────┤

│ PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
7.3.0-16ubuntu3) 7.3.0, 64-bit │

└────────────────────────────────────────────────────────────────────────────────────────────────┘

(1 row)

andreak@[local]:543310.4 andreak=# select 'abcd'||chr(8198) ~ 'abcd\s';
┌──────────┐
│ ?column? │
├──────────┤
│ t        │
└──────────┘
(1 row)

Argh, read the wrong line. Thanks for the correction. Still:

test=# select version();
                                       version

------------------------------------------------------------------------------------
  PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (SUSE Linux)
4.8.5, 64-bit
(1 row)

test=# select 'abcd'||chr(8198) ~ E'abcd\s';
  ?column?
----------
  f
(1 row)
 
When using E-syntax you need to double the backslash for escaping:
 
andreak@[local]:5433 10.4 andreak=# select 'abcd'||chr(8198) ~ E'abcd\\s';
┌──────────┐
│ ?column? │
├──────────┤
│ t        │
└──────────┘
(1 row)
 
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andreas@visena.com <mailto:andreas@visena.com>
www.visena.com <https://www.visena.com&gt;
<https://www.visena.com&gt;

 

#7Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Andreas Joseph Krogh (#6)
Re: Sv: Re: Sv: Re: regex match and special characters

On 08/16/2018 07:47 AM, Andreas Joseph Krogh wrote:

På torsdag 16. august 2018 kl. 16:32:40, skrev Adrian Klaver
<adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>>:

On 08/16/2018 07:04 AM, Andreas Joseph Krogh wrote:

char(2006) produces the wrong character as 2006 is the hex-value. You
have to use 8198:
andreak@[local]:543310.4 andreak=# select version();

┌────────────────────────────────────────────────────────────────────────────────────────────────┐

│                                            version
                                              │

├────────────────────────────────────────────────────────────────────────────────────────────────┤

│ PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
7.3.0-16ubuntu3) 7.3.0, 64-bit │

└────────────────────────────────────────────────────────────────────────────────────────────────┘

(1 row)

andreak@[local]:543310.4 andreak=# select 'abcd'||chr(8198) ~

'abcd\s';

┌──────────┐
│ ?column? │
├──────────┤
│ t        │
└──────────┘
(1 row)

Argh, read the wrong line. Thanks for the correction. Still:

test=# select version();
                                       version

------------------------------------------------------------------------------------
  PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (SUSE Linux)
4.8.5, 64-bit
(1 row)

test=# select 'abcd'||chr(8198) ~ E'abcd\s';
  ?column?
----------
  f
(1 row)

When using E-syntax you need to double the backslash for escaping:
andreak@[local]:543310.4 andreak=# select 'abcd'||chr(8198) ~ E'abcd\\s';
┌──────────┐
│ ?column? │
├──────────┤
│ t        │
└──────────┘
(1 row)

Hmm:

test=# show standard_conforming_strings;
standard_conforming_strings
-----------------------------
off
(1 row)

test=# show escape_string_warning;
escape_string_warning
-----------------------
on
(1 row)

test=# select 'abcd'||chr(8198) ~ 'abcd\s';
WARNING: nonstandard use of escape in a string literal
LINE 1: select 'abcd'||chr(8198) ~ 'abcd\s';
^
HINT: Use the escape string syntax for escapes, e.g., E'\r\n'.
?column?
----------
f
(1 row)

test=# select 'abcd'||chr(8198) ~ 'abcd\\s';
WARNING: nonstandard use of \\ in a string literal
LINE 1: select 'abcd'||chr(8198) ~ 'abcd\\s';
^
HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
?column?
----------
t
(1 row)

test=# set escape_string_warning = 'off';
SET
test=# show escape_string_warning;
escape_string_warning
-----------------------
off
(1 row)

test=# select 'abcd'||chr(8198) ~ 'abcd\s';
?column?
----------
f
(1 row)

test=# set standard_conforming_strings = 'on';
SET
test=# show standard_conforming_strings;
standard_conforming_strings
-----------------------------
on
(1 row)

test=# select 'abcd'||chr(8198) ~ 'abcd\s';
?column?
----------
t
(1 row)

Wonder if the OP has standard_conforming_strings='off' and
escape_string_warning='off'?

--
*Andreas Joseph Krogh*
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andreas@visena.com <mailto:andreas@visena.com>
www.visena.com <https://www.visena.com&gt;
<https://www.visena.com&gt;

--
Adrian Klaver
adrian.klaver@aklaver.com

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andreas Joseph Krogh (#6)
Re: Sv: Re: Sv: Re: regex match and special characters

Andreas Joseph Krogh <andreas@visena.com> writes:

When using E-syntax you need to double the backslash for escaping:
andreak@[local]:5433 10.4 andreak=# select 'abcd'||chr(8198) ~ E'abcd\\s';

Another thing to keep in mind is that this is all locale-dependent
(specifically, LC_CTYPE, I believe). In plain old C locale, nothing
beyond the standard ASCII whitespace characters will match \s. I'm not
sure how universal it is for other locales to treat characters like
U+2006 as whitespace.

regards, tom lane

#9Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Adrian Klaver (#7)
Re: Sv: Re: Sv: Re: regex match and special characters

On 08/16/2018 08:13 AM, Adrian Klaver wrote:

Wonder if the OP has standard_conforming_strings='off' and
escape_string_warning='off'?

In the above referring to 9.6.9 instance.

--
*Andreas Joseph Krogh*
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andreas@visena.com <mailto:andreas@visena.com>
www.visena.com <https://www.visena.com&gt;
<https://www.visena.com&gt;

--
Adrian Klaver
adrian.klaver@aklaver.com

#10Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Adrian Klaver (#9)
Re: Sv: Re: Sv: Re: regex match and special characters

On 08/16/2018 08:19 AM, Adrian Klaver wrote:

On 08/16/2018 08:13 AM, Adrian Klaver wrote:

Wonder if the OP has standard_conforming_strings='off' and
escape_string_warning='off'?

In the above referring to 9.6.9 instance.

Well that theory is no good:

test=# select version();
version

-------------------------------------------------------------------------------------
PostgreSQL 9.6.7 on x86_64-pc-linux-gnu, compiled by gcc (SUSE Linux)
4.8.5, 64-bit
(1 row)

test=# set standard_conforming_strings='off';
SET
test=# show standard_conforming_strings;
standard_conforming_strings
-----------------------------
off

test=# show escape_string_warning;
escape_string_warning
-----------------------
off

test=# select 'abcd'||chr(8198) ~ E'abcd\s';
?column?
----------
f
(1 row)

test=# select 'abcd'||chr(8198) ~ E'abcd\\s';
?column?
----------
f
(1 row)

test=# set escape_string_warning='on';
SET
test=# set standard_conforming_strings='on';
SET

test=# select 'abcd'||chr(8198) ~ 'abcd\s';
?column?
----------
f
(1 row)

test=# select 'abcd'||chr(8198) ~ E'abcd\\s';
?column?
----------
f
(1 row)

--
*Andreas Joseph Krogh*
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andreas@visena.com <mailto:andreas@visena.com>
www.visena.com <https://www.visena.com&gt;
<https://www.visena.com&gt;

--
Adrian Klaver
adrian.klaver@aklaver.com

#11Alexey Klyukin
alexk@commandprompt.com
In reply to: Tom Lane (#3)
Re: regex match and special characters

On 16. Aug 2018, at 16:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alex Kliukin <alexk@hintbits.com> writes:

Here is a simple SQL statement that gives different results on PostgreSQL 9.6 and PostgreSQL 10+. The space character at the end of the string is actually U+2006 SIX-PER-EM SPACE (http://www.fileformat.info/info/unicode/char/2006/index.htm)

I think the reason for the discrepancy is that in v10 we fixed the regex
locale support so that it could properly classify code points above U+7FF,
cf

https://git.postgresql.org/gitweb/?p=postgresql.git&amp;a=commitdiff&amp;h=c54159d44ceaba26ceda9fea1804f0de122a8f30 <https://git.postgresql.org/gitweb/?p=postgresql.git&amp;a=commitdiff&amp;h=c54159d44ceaba26ceda9fea1804f0de122a8f30&gt;

This nails down the cause, thanks a lot for the link! Apparently I missed it from PostgreSQL 10 release notes, where it is present in the “Queries” section, although AFAIK it deserved an entry in the "migration to version 10”, as it may potentially make dump/restore from previous versions to version 10 error out if there are table constraints that use regex classes over the Unicode text fields with code points above U+7FF.

So 10 is giving the right answer (i.e. that \s matches U+2006).
9.x is not

Agreed.

Cheers,
Alex

#12Alexey Klyukin
alexk@commandprompt.com
In reply to: Adrian Klaver (#7)
Re: regex match and special characters

Hi Adrian,

On 16. Aug 2018, at 18:13, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

test=# select 'abcd'||chr(8198) ~ 'abcd\s';
?column?
----------
t
(1 row)

Wonder if the OP has standard_conforming_strings='off' and
escape_string_warning='off'?

Both are set to ‘on’ for me for all versions (I believe those are default settings). I have 12devel indeed on my test system alongside 9.6, but I’ve tried it as well on PostgreSQL 10 running on a different distro with different locale settings and it produced the same result (check being true).

I think Tom’s answer solves it, although I am wondering how did you get true in the statement quoted above on PostgreSQL 9.6, perhaps that result is actually from PostgreSQL 10?

Cheers,
Oleksii

#13Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Alexey Klyukin (#12)
Re: regex match and special characters

On 08/18/2018 08:12 AM, Oleksii Kliukin wrote:

Hi Adrian,

On 16. Aug 2018, at 18:13, Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>> wrote:

test=# select 'abcd'||chr(8198) ~ 'abcd\s';
?column?
----------
t
(1 row)

Wonder if the OP has standard_conforming_strings='off' and
escape_string_warning='off'?

Both are set to ‘on’ for me for all versions (I believe those are
default settings). I have 12devel indeed on my test system alongside
9.6, but I’ve tried it as well on PostgreSQL 10 running on a different
distro with different locale settings and it produced the same result
(check being true).

I think Tom’s answer solves it, although I am wondering how did you get
true in the statement quoted above on PostgreSQL 9.6, perhaps that
result is actually from PostgreSQL 10?

Yes it was on Postgres 10. I got reversed in my thinking. A subsequent post:

/messages/by-id/8b8d9250-1ee3-6aff-b580-592c7c759586@aklaver.com

shows the behavior you saw on 9.6. What happens when I work through my
thought process on line:(

Cheers,
Oleksii

--
Adrian Klaver
adrian.klaver@aklaver.com