Regular expression to UPPER() a lower case string

Started by Eagnaover 3 years ago15 messagesgeneral
Jump to latest
#1Eagna
eagna@protonmail.com

Hi all,

as per the subject, I want a regular expression to do what the UPPER() function does.

Obviously, I know about that function and it is not what I want.

This should be very (very) easy - I don't know what I'm missing - I've done quite complex regular expressions before and I don't know what I'm doing wrong. Brain burping this morning!

Here's a dbfiddle where I've tried a couple of things (https://dbfiddle.uk/G5c_CycU) - but I just can't seem to get it working...

Any input appreciated.

E.

Sent with [Proton Mail](https://proton.me/) secure email.

#2Gianni Ceccarelli
dakkar@thenautilus.net
In reply to: Eagna (#1)
Re: Regular expression to UPPER() a lower case string

On 2022-12-10 Eagna <eagna@protonmail.com> wrote:

This should be very (very) easy - I don't know what I'm missing -
I've done quite complex regular expressions before and I don't know
what I'm doing wrong. Brain burping this morning!

You're missing that:

* `regexp_replace` doesn't work like that, at all
* your logic only works by accident for some languages (try to upcase
a `ß` or a `ı`)

--
Dakkar - <Mobilis in mobile>
GPG public key fingerprint = A071 E618 DD2C 5901 9574
6FE2 40EA 9883 7519 3F88
key id = 0x75193F88

#3Eagna
eagna@protonmail.com
In reply to: Gianni Ceccarelli (#2)
Re: Regular expression to UPPER() a lower case string

Ciao and thanks for your input.

* `regexp_replace` doesn't work like that, at all
* your logic only works by accident for some languages (try to upcase
a `ß` or a `ı`)

If you have any ideas how it could be done indirectly/different strategy - I'm all ears.

You can assume all English characters [a-z][A-Z].

E.

Show quoted text

Dakkar - <Mobilis in mobile>

#4Gianni Ceccarelli
dakkar@thenautilus.net
In reply to: Eagna (#3)
Re: Regular expression to UPPER() a lower case string

On 2022-12-10 Eagna <eagna@protonmail.com> wrote:

If you have any ideas how it could be done indirectly/different
strategy - I'm all ears.

You haven't explained what you're trying to accomplish.

--
Dakkar - <Mobilis in mobile>
GPG public key fingerprint = A071 E618 DD2C 5901 9574
6FE2 40EA 9883 7519 3F88
key id = 0x75193F88

#5Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Gianni Ceccarelli (#2)
Re: Regular expression to UPPER() a lower case string

On 2022-12-10 10:41:41 +0000, Gianni Ceccarelli wrote:

On 2022-12-10 Eagna <eagna@protonmail.com> wrote:

This should be very (very) easy - I don't know what I'm missing -
I've done quite complex regular expressions before and I don't know
what I'm doing wrong. Brain burping this morning!

You're missing that:

* `regexp_replace` doesn't work like that, at all

No it doesn't. But the equivalent operation in some other languages and
tools does, so there is no reason that it couldn't do that (of course
neither is that a reason why it should).

* your logic only works by accident for some languages (try to upcase
a `ß` or a `ı`)

This is also true of upper() and lower() and SQL does provide those.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

#6Eagna
eagna@protonmail.com
In reply to: Gianni Ceccarelli (#4)
Re: Regular expression to UPPER() a lower case string

Hi again, and thanks for sticking with this.

You haven't explained what you're trying to accomplish.

Ok.

CREATE TABLE test(x TEXT);

INSERT INTO test VALUES ('abc');

SELECT REGEXP_REPLACE(x, '<something>', '<something_else>', 'g') FROM test;

Expected result: ABC

See fiddle here: https://dbfiddle.uk/Q2qXXwtF

David Johnston suggested something along these lines:

==========

RegExp by itself cannot do this. You have to match all parts of the input into different capturing groups, then use lower() combined with format() to build a new string. Putting the capturing groups into an array is the most useful option.

===========

But it's a bit above my pay grade to do this - I've tried, but no go! :-( It *_appears_* to me that the string's length would have to be hard coded under this strategy - but if that's the only way, then so be it.

I'd just be interested to see a solution based on DJ's suggestion or any other code that would use REGEXP_REPLACE() to do what I want - preferably without hard coding, but if it's absolutely necessary.

Thanks for any input.

E.

#7Gianni Ceccarelli
dakkar@thenautilus.net
In reply to: Peter J. Holzer (#5)
Re: Regular expression to UPPER() a lower case string

On 2022-12-10 "Peter J. Holzer" <hjp-pgsql@hjp.at> wrote:

* your logic only works by accident for some languages (try to
upcase a `ß` or a `ı`)

This is also true of upper() and lower() and SQL does provide those.

Well…

select upper('ı');

┌───────┐
│ upper │
├───────┤
│ I │
└───────┘
(1 row)

select upper('ß');

┌───────┐
│ upper │
├───────┤
│ ß │
└───────┘
(1 row)

select upper('ä');

┌───────┐
│ upper │
├───────┤
│ Ä │
└───────┘
(1 row)

Of course all of this is dependent of locale, too.

--
Dakkar - <Mobilis in mobile>
GPG public key fingerprint = A071 E618 DD2C 5901 9574
6FE2 40EA 9883 7519 3F88
key id = 0x75193F88

#8Gianni Ceccarelli
dakkar@thenautilus.net
In reply to: Eagna (#6)
Re: Regular expression to UPPER() a lower case string

On 2022-12-10 Eagna <eagna@protonmail.com> wrote:

Hi again, and thanks for sticking with this.

You haven't explained what you're trying to accomplish.

Ok.

CREATE TABLE test(x TEXT);

INSERT INTO test VALUES ('abc');

SELECT REGEXP_REPLACE(x, '<something>', '<something_else>', 'g')
FROM test;

Expected result: ABC

`select upper(x) from test`

You haven't explained why you need regexes at all. If you want to
convert the value of a text column to uppercase, there's a function
for that.

If you want to do something else, please describe the actual thing you
want to do. Not "how", but "what".

David Johnston suggested something along these lines:

==========

RegExp by itself cannot do this. You have to match all parts of the
input into different capturing groups, then use lower() combined
with format() to build a new string. Putting the capturing groups
into an array is the most useful option.

===========

But it's a bit above my pay grade to do this - I've tried, but no go!
:-( It *_appears_* to me that the string's length would have to be
hard coded under this strategy - but if that's the only way, then so
be it.

Maybe you need to look at `regexp_split_to_array` or
`regexp_split_to_table`?

--
Dakkar - <Mobilis in mobile>
GPG public key fingerprint = A071 E618 DD2C 5901 9574
6FE2 40EA 9883 7519 3F88
key id = 0x75193F88

#9Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Gianni Ceccarelli (#7)
Re: Regular expression to UPPER() a lower case string

On 2022-12-10 13:44:37 +0000, Gianni Ceccarelli wrote:

On 2022-12-10 "Peter J. Holzer" <hjp-pgsql@hjp.at> wrote:

* your logic only works by accident for some languages (try to
upcase a `ß` or a `ı`)

This is also true of upper() and lower() and SQL does provide those.

Well…

select upper('ı');

┌───────┐
│ upper │
├───────┤
│ I │
└───────┘
(1 row)

This is I think universally correct. A better example would be
upper('i') which should be 'İ' in Turkish and 'I' in most other
languages.

select upper('ß');

┌───────┐
│ upper │
├───────┤
│ ß │
└───────┘
(1 row)

This is incorrect according to German spelling rules. It should be
either 'SS' (traditionally) or 'ẞ' (since the introduction of the
upper-case sharp s). However, given the long absence of the ẞ from
official German orthography and the lack of reversability of the ß → SS
mapping it has been (and still is) quite common to leave the ß in lower
case.

select upper('ä');

┌───────┐
│ upper │
├───────┤
│ Ä │
└───────┘
(1 row)

Correct (in German[1]Although I have one book which uses ä, ö, ü for lower case but Ae, Oe, Ue for upper case letters. and probably any other language).

So, what's the point you are trying to make?

Of course all of this is dependent of locale, too.

Right. But why would that be different for regexp_replace than for
upper/lower)?

hp

[1]: Although I have one book which uses ä, ö, ü for lower case but Ae, Oe, Ue for upper case letters.
Oe, Ue for upper case letters.

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

#10Eagna
eagna@protonmail.com
In reply to: Gianni Ceccarelli (#8)
Re: Regular expression to UPPER() a lower case string

`select upper(x) from test`

I know about the UPPER() and LOWER() functions - I don't want them!

If you want to do something else, please describe the actual thing you
want to do. Not "how", but "what".

I have described it - I want to do the *_same_* thing as UPPER() does using REGEXP_REPLACE().

In the duplicate thread (apologies to the list for that - my bad!), I wrote:

====================

I want to index on a REGEXP_REPLACE() - I thought using lower -> upper would be a good test.

I could always have used another REGEXP_REPLACE() for my testing, but I then became "obsessed" with the idea of using REGEXP_REPLACE() as a substitute for UPPER() - kind of an obfuscated code competition with myself! :-)

========================

So, I have no actual *_need_* for this, other than a desire to learn and understand what's going on and why.

Maybe you need to look at `regexp_split_to_array` or
`regexp_split_to_table`?

I'm pretty sure that I have a good understanding of these. I thought I understood REGEXP_REPLACE(), but obviously not.

Thanks for any further input.

E.

Show quoted text

Dakkar - <Mobilis in mobile>

#11Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Eagna (#10)
Re: Regular expression to UPPER() a lower case string

On 2022-12-10 14:36:04 +0000, Eagna wrote:

I want to index on a REGEXP_REPLACE() - I thought using lower -> upper would be a good test.

I could always have used another REGEXP_REPLACE() for my testing,
but I then became "obsessed" with the idea of using
REGEXP_REPLACE() as a substitute for UPPER() - kind of an obfuscated
code competition with myself! :-)

========================

So, I have no actual *_need_* for this, other than a desire to learn
and understand what's going on and why.

You can't do that. Well, theoretically you could replace every
individual lower case letter with it's upper case equivalent:

select regexp_replace(...regexp_replace(regexp_replace(s, 'a', 'A'), 'b', 'B')... 'z', 'Z') ...

but that would be insane even for the 26 letters of the basic Latin
alphabet, much less the myriad of accented letters (and other alphabets
like Cyrillic or Greek ...).

On second thought you could probably use NFD normalization to separate
base letters from accents, uppercase the base letters and then
(optionally) NFC normalize everything again. Still insane ;-).

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

#12Gianni Ceccarelli
dakkar@thenautilus.net
In reply to: Peter J. Holzer (#9)
Re: Regular expression to UPPER() a lower case string

On 2022-12-10 "Peter J. Holzer" <hjp-pgsql@hjp.at> wrote:

So, what's the point you are trying to make?

I suspect I mis-understood a thing you wrote… ignore me.

Of course all of this is dependent of locale, too.

Right. But why would that be different for regexp_replace than for
upper/lower?

I was trying to point out (badly) that the code shown, based
on a simplistic application of `regex_replace`, is going to work
even worse than `upper` (as you point out elsethread).

--
Dakkar - <Mobilis in mobile>
GPG public key fingerprint = A071 E618 DD2C 5901 9574
6FE2 40EA 9883 7519 3F88
key id = 0x75193F88

#13Eagna
eagna@protonmail.com
In reply to: Peter J. Holzer (#11)
Re: Regular expression to UPPER() a lower case string

Hi, and thanks for all of the input - I think I'm beginning to grok it.

On second thought you could probably use NFD normalization to separate
base letters from accents, uppercase the base letters and then
(optionally) NFC normalize everything again. Still insane ;-).

As far as I can see, I can only do what I want by using the following.

If there's a monthly prize on this list for the most insanely contrived piece of code, I think this is a strong contender:

test.x = 'abc'

SELECT
UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\1')) ||
UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\2')) ||
UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\3'))
FROM test;

Result: ABC

(see https://dbfiddle.uk/LQ-6udga).

Still a bit of work to do, but I'm getting there :-)

E.

Show quoted text

hp

#14Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Eagna (#13)
Re: Regular expression to UPPER() a lower case string

On 2022-12-10 15:37:01 +0000, Eagna wrote:

Hi, and thanks for all of the input - I think I'm beginning to grok it.

I'm not so sure about that

As far as I can see, I can only do what I want by using the following.

If there's a monthly prize on this list for the most insanely
contrived piece of code, I think this is a strong contender:

test.x = 'abc'

SELECT
UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\1')) ||
UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\2')) ||
UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\3'))
FROM test;

Result: ABC

I don't think this does what you want it to do:

wds=> SELECT x,
UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\1')) ||
UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\2')) ||
UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\3'))
FROM test;
╔═════════════╤═════════════════════════════╗
║ x │ ?column? ║
╟─────────────┼─────────────────────────────╢
║ abc_def_ghi │ A_DEF_GHIB_DEF_GHIC_DEF_GHI ║
╚═════════════╧═════════════════════════════╝

(I admit it took me at least a minute to figure out what was happening
here)

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

#15Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Peter J. Holzer (#11)
Re: Regular expression to UPPER() a lower case string

On 2022-12-10 15:48:58 +0100, Peter J. Holzer wrote:

On second thought you could probably use NFD normalization to separate
base letters from accents, uppercase the base letters and then
(optionally) NFC normalize everything again.

Of course I had to try that:

wds=> select
normalize(
replace(
replace(
replace(
replace(
normalize('Käse', NFD),
's', 'S'
),
'k', 'K'
),
'e', 'E'
),
'a', 'A'
),
NFC
)
;
╔═══════════╗
║ normalize ║
╟───────────╢
║ KÄSE ║
╚═══════════╝
(1 row)

Works as expected.

Still insane ;-).

I haven't changed my mind about that.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"