Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13

Started by Alexander Farberabout 13 years ago9 messagesgeneral
Jump to latest
#1Alexander Farber
alexander.farber@gmail.com

Hello,

I have prepared an SQL fiddle for my question:
http://sqlfiddle.com/#!11/8a494/4

And also described it in more detail at
http://stackoverflow.com/questions/15500270/string-matching-in-insert-trigger-how-to-use-in-conditionals-to-return-null

Does anybody please know how to check for
UTF8 range \x0410-\x042F in my code below?

I've tried both
new.word !~ '^[\x0410-\x042F]{2,}$'
(fails with syntax error) and
new.word !~ '^[\u0410-\u042F]{2,}$'
(triggers even for correct words):

create table good_words (
word varchar(64) primary key
);

create or replace function keep_clean() returns trigger as $body$
begin
new.word := upper(new.word);

/* next line does not compile? */
IF new.word !~ '^[\x0410-\x042F]{2,}$' THEN
RAISE EXCEPTION 'Not an uppercased Russian word in UTF8';
END IF;

IF new.word ~ '^[ЪЫЬ]' OR new.word ~ 'Ъ$' THEN
return NULL;
END IF;

/* does not return NULL for 'ошибббка'? */
IF new.word ~ '(.)\1\1' AND new.word NOT LIKE '%ШЕЕЕ%'
AND new.word NOT LIKE '%ЗМЕЕЕ%' THEN
return NULL;
END IF;

return new;
end;
$body$ language plpgsql;

Thank you
Alex

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2Alexander Farber
alexander.farber@gmail.com
In reply to: Alexander Farber (#1)
Re: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13

I'm trying at the psql prompt of an UTF8 database in 8.4.13:

# select 'АБВГД' ~ '^[\u0410-\u042F]{2,}$';
WARNING: nonstandard use of escape in a string literal
LINE 1: select 'АБВГД' ~ '^[\u0410-\u042F]{2,}$';
^
HINT: Use the escape string syntax for escapes, e.g., E'\r\n'.
?column?
----------
f
(1 row)

On Tue, Mar 19, 2013 at 4:10 PM, Alexander Farber
<alexander.farber@gmail.com> wrote:

http://sqlfiddle.com/#!11/8a494/4

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Farber (#2)
Re: Re: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13

Alexander Farber <alexander.farber@gmail.com> writes:

I'm trying at the psql prompt of an UTF8 database in 8.4.13:
# select 'АБВГД' ~ '^[\u0410-\u042F]{2,}$';
WARNING: nonstandard use of escape in a string literal

I think Unicode escapes were introduced in 9.0. In 8.4 you'd probably
have to write out the UTF8 equivalent as octal escapes :-(

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Alexander Farber (#1)
Re: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13

Alexander Farber wrote:

I have prepared an SQL fiddle for my question:
http://sqlfiddle.com/#!11/8a494/4

And also described it in more detail at
http://stackoverflow.com/questions/15500270/string-matching-in-insert-trigger-how-to-use-in-
conditionals-to-return-null

Does anybody please know how to check for
UTF8 range \x0410-\x042F in my code below?

I've tried both
new.word !~ '^[\x0410-\x042F]{2,}$'
(fails with syntax error) and
new.word !~ '^[\u0410-\u042F]{2,}$'
(triggers even for correct words):

Strange, it works here (RHEL 6, x86_64, PostgreSQL 9.2.2,
encoding "UTF8", collation and ctype "de_DE.UTF8"):

test=> SELECT 'ПРОВЕРКА' ~ '^[\u0410-\u042F]{2,}$';
?column?
----------
t
(1 row)

test=> SELECT 'ABCDE' ~ '^[\u0410-\u042F]{2,}$';
?column?
----------
f
(1 row)

create table good_words (
word varchar(64) primary key
);

create or replace function keep_clean() returns trigger as $body$
begin
new.word := upper(new.word);

/* next line does not compile? */
IF new.word !~ '^[\x0410-\x042F]{2,}$' THEN
RAISE EXCEPTION 'Not an uppercased Russian word in UTF8';
END IF;

IF new.word ~ '^[ЪЫЬ]' OR new.word ~ 'Ъ$' THEN
return NULL;
END IF;

/* does not return NULL for 'ошибббка'? */
IF new.word ~ '(.)\1\1' AND new.word NOT LIKE '%ШЕЕЕ%'
AND new.word NOT LIKE '%ЗМЕЕЕ%' THEN
return NULL;

This works for me as well:

test=> SELECT 'ошибббка' ~ '(.)\1\1'
AND 'ошибббка' NOT LIKE '%ШЕЕЕ%'
AND 'ошибббка' NOT LIKE '%ЗМЕЕЕ%';
?column?
----------
t
(1 row)

test=> SELECT 'ошиббка' ~ '(.)\1\1'
AND 'ошиббка' NOT LIKE '%ШЕЕЕ%'
AND 'ошиббка' NOT LIKE '%ЗМЕЕЕ%';
?column?
----------
f
(1 row)

END IF;

return new;
end;
$body$ language plpgsql;

What do you get for

SELECT pg_encoding_to_char(encoding),
datcollate,
datctype
FROM pg_database WHERE datname = current_database();

and for

SHOW client_encoding;

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Alexander Farber
alexander.farber@gmail.com
In reply to: Laurenz Albe (#4)
Re: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13

Thanks for trying! I am using CentOS 6.3

It seems to be better in 9.2.x?

Unfortunately I'd like to stay with 8.4.x for now
(because I use the PostgreSQL instance
with other projects at the same host)....

Regards
Alex

On Wed, Mar 20, 2013 at 10:35 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:

Alexander Farber wrote:

I have prepared an SQL fiddle for my question:
http://sqlfiddle.com/#!11/8a494/4

Strange, it works here (RHEL 6, x86_64, PostgreSQL 9.2.2,
encoding "UTF8", collation and ctype "de_DE.UTF8"):

test=> SELECT 'ПРОВЕРКА' ~ '^[\u0410-\u042F]{2,}$';
?column?
----------
t
(1 row)

test=> SELECT 'ABCDE' ~ '^[\u0410-\u042F]{2,}$';
?column?
----------
f
(1 row)

create table good_words (
word varchar(64) primary key
);

create or replace function keep_clean() returns trigger as $body$
begin
new.word := upper(new.word);

/* next line does not compile? */
IF new.word !~ '^[\x0410-\x042F]{2,}$' THEN
RAISE EXCEPTION 'Not an uppercased Russian word in UTF8';
END IF;

IF new.word ~ '^[ЪЫЬ]' OR new.word ~ 'Ъ$' THEN
return NULL;
END IF;

/* does not return NULL for 'ошибббка'? */
IF new.word ~ '(.)\1\1' AND new.word NOT LIKE '%ШЕЕЕ%'
AND new.word NOT LIKE '%ЗМЕЕЕ%' THEN
return NULL;

This works for me as well:

test=> SELECT 'ошибббка' ~ '(.)\1\1'
AND 'ошибббка' NOT LIKE '%ШЕЕЕ%'
AND 'ошибббка' NOT LIKE '%ЗМЕЕЕ%';
?column?
----------
t
(1 row)

test=> SELECT 'ошиббка' ~ '(.)\1\1'
AND 'ошиббка' NOT LIKE '%ШЕЕЕ%'
AND 'ошиббка' NOT LIKE '%ЗМЕЕЕ%';
?column?
----------
f
(1 row)

END IF;

return new;
end;
$body$ language plpgsql;

What do you get for

SELECT pg_encoding_to_char(encoding),
datcollate,
datctype
FROM pg_database WHERE datname = current_database();

and for

SHOW client_encoding;

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Alexander Farber (#5)
Re: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13

Alexander Farber wrote:

It seems to be better in 9.2.x?

Yes, as Tom has pointed out.
I didn't see that you were on 8.4 when I wrote my answer.

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Farber (#5)
Re: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13

Alexander Farber <alexander.farber@gmail.com> writes:

Thanks for trying! I am using CentOS 6.3
It seems to be better in 9.2.x?

As stated upthread, 8.4 doesn't understand \u escapes. You'd need to
put in the characters another way --- either literally, or using octal
escapes to spell out the UTF8 encoding. I think it will work in 8.4
if you do, but not 100% sure.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#8Alexander Farber
alexander.farber@gmail.com
In reply to: Tom Lane (#3)
Re: Re: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13

Hello,

unfortunately octal doesn't seem to work either -

On Tue, Mar 19, 2013 at 7:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alexander Farber <alexander.farber@gmail.com> writes:

# select 'АБВГД' ~ '^[\u0410-\u042F]{2,}$';
WARNING: nonstandard use of escape in a string literal

I think Unicode escapes were introduced in 9.0. In 8.4 you'd probably
have to write out the UTF8 equivalent as octal escapes :-(

# select 'АБВГД' ~ '^[\2020-\2057]{2,}$';
WARNING: nonstandard use of escape in a string literal
LINE 1: select 'АБВГД' ~ '^[\2020-\2057]{2,}$';
^
HINT: Use the escape string syntax for escapes, e.g., E'\r\n'.
ERROR: invalid byte sequence for encoding "UTF8": 0x82
HINT: This error can also happen if the byte sequence does not
match the encoding expected by the server, which is controlled by
"client_encoding".

But writing out UTF8 equivalents seems to work
(trying to detect capitalized Russian letters as per
http://www.unicode.org/charts/PDF/U0400.pdf ):

# select 'АБВГД' ~ '^[А-Я]{2,}$';
?column?
----------
t
(1 row)

And then I try to solve my 2nd problem (detecting 3
letters in a row, a rare case in Russian language):

# select 'ОШИБББКА' ~ '(.)\1\1';
WARNING: nonstandard use of escape in a string literal
LINE 1: select 'ОШИБББКА' ~ '(.)\1\1';
^
HINT: Use the escape string syntax for escapes, e.g., E'\r\n'.
?column?
----------
f
(1 row)

Does anybody please know why this fails in 8.4.13?

According to the table 9-18 in
http://www.postgresql.org/docs/8.4/static/functions-matching.html
it should be ok to use \1 for referencing
parts captured by round brackets?

Regards
Alex

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#9Alexander Farber
alexander.farber@gmail.com
In reply to: Alexander Farber (#8)
Re: Re: Matching uppercased russian words (\x0410-\x042F) in UTF8 database 8.4.13

My insert trigger for 8.4.13 works now:
http://sqlfiddle.com/#!11/c74a1/3

Thank you for you help

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general