Fault with initcap

Started by Shaozhong SHIover 4 years ago22 messagesgeneral
Jump to latest
#1Shaozhong SHI
shishaozhong@gmail.com

I tried initcap and found a major problem with it.

Initcap of notemachine is NoteMachine.

Initcap of Sainsbury's Bank is Sainsbury'S bank.

This is not expected.

Anyway to get around this problem?

Regards,

David

#2Shaozhong SHI
shishaozhong@gmail.com
In reply to: Shaozhong SHI (#1)

I tried initcap and found a major problem with it.

Initcap of notemachine is NoteMachine.

Initcap of Sainsbury's Bank is Sainsbury'S bank.

This is not expected.

Anyway to get around this problem?

Regards,

David

#3Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Shaozhong SHI (#1)
Aw: Fault with initcap

I tried initcap and found a major problem with it.
Initcap of notemachine is NoteMachine.
Initcap of Sainsbury's Bank is Sainsbury'S bank.
This is not expected.

Anyway to get around this problem?

 
Sure. Step one: define "expected"

Karsten

#4Rob Sargent
robjsargent@gmail.com
In reply to: Shaozhong SHI (#1)
Re: Fault with initcap

On Oct 12, 2021, at 10:30 AM, Shaozhong SHI <shishaozhong@gmail.com> wrote:

I tried initcap and found a major problem with it.

Initcap of notemachine is NoteMachine.

My version 12 system gives Notemachine

Show quoted text

Initcap of Sainsbury's Bank is Sainsbury'S bank.

This is not expected.

Anyway to get around this problem?

Regards,

David

#5Steve Midgley
science@misuse.org
In reply to: Rob Sargent (#4)
Re: Fault with initcap

On Tue, Oct 12, 2021 at 9:54 AM Rob Sargent <robjsargent@gmail.com> wrote:

On Oct 12, 2021, at 10:30 AM, Shaozhong SHI <shishaozhong@gmail.com>

wrote:

I tried initcap and found a major problem with it.

Initcap of notemachine is NoteMachine.

My version 12 system gives Notemachine

Initcap of Sainsbury's Bank is Sainsbury'S bank.

This is not expected.

Anyway to get around this problem?

Pg v.13 doesn't have a problem with notemachine either, but does seem to
screw up Sainsbury's Bank as Sainsbury'S bank.

https://www.db-fiddle.com/f/m2BMATwVcp6A1ZvD8TyurW/1

The only way I could think of fixing this would be to swap out the
apostrophe with an alpha string not present in the string, and then swap
back when you're done (see example above -- I'm not recommending using
"xyzxyz" - that's just an example. And ideally you'd search for your
placeholder string first to be sure it doesn't exist, or escape it).

Or just use a regexp to solve your casing problem -- maybe break up your
string by whitespace into array elements, uppercase the first letter of
each array and the concat the array elements back?

Steve

#6Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Shaozhong SHI (#1)
Aw: Re: Fault with initcap

Hi David,

Expected are as follows:
Notemachine
Sainsbury's bank.

Now, step two: generalize that exemplary definition.

Karsten

#7Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Shaozhong SHI (#2)
Re: Fault with initcap

On 10/12/21 09:31, Shaozhong SHI wrote:

I tried initcap and found a major problem with it.

What Postgres version?

In version 12 and 14 I get:

Initcap of notemachine is NoteMachine.

select initcap('notemachine');
initcap
-------------
Notemachine

Initcap of Sainsbury's Bank is Sainsbury'S bank.

select initcap('Sainsbury''s Bank');
initcap
------------------
Sainsbury'S Bank

Which follows the definition here:

https://www.postgresql.org/docs/14/functions-string.html

initcap ( text ) → text

Converts the first letter of each word to upper case and the rest to
lower case. Words are sequences of alphanumeric characters separated by
non-alphanumeric characters.

This is not expected.

What is the encoding, collate, ctype for the database?

Can be found in psql using:

\l db_name

Anyway to get around this problem?

Regards,

David

--
Adrian Klaver
adrian.klaver@aklaver.com

#8Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Shaozhong SHI (#1)
Aw: Re: Re: Fault with initcap

Only the first letter of each word to be capitalised/uppercased.

The next step is to not top-post.

Then to keep the list involved if you wish further help.

Then, if you are intent on using regular expressions, look at
the PostgreSQL docs for regexp_replace.

Karsten
 

#9Shaozhong SHI
shishaozhong@gmail.com
In reply to: Adrian Klaver (#7)
Re: Fault with initcap

On Tue, 12 Oct 2021 at 20:34, Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 10/12/21 09:31, Shaozhong SHI wrote:

I tried initcap and found a major problem with it.

What Postgres version?

In version 12 and 14 I get:

Initcap of notemachine is NoteMachine.

select initcap('notemachine');
initcap
-------------
Notemachine

Initcap of Sainsbury's Bank is Sainsbury'S bank.

select initcap('Sainsbury''s Bank');
initcap
------------------
Sainsbury'S Bank

Which follows the definition here:

https://www.postgresql.org/docs/14/functions-string.html

initcap ( text ) → text

Converts the first letter of each word to upper case and the rest to
lower case. Words are sequences of alphanumeric characters separated by
non-alphanumeric characters.

Hi, Adrian Klaver,

It looks like that you replicated the error.

There must be a way to do the following.

a column contains a list of words. Only the first letter of each word
should be capitalised. INITCAP can not do that. How to create a function
just to capitalised each word (substring) in a list of words/strings. This
will be very useful and create great impact.

Regards,

David

#10Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Shaozhong SHI (#9)
Re: Fault with initcap

Am Tue, Oct 12, 2021 at 09:50:16PM +0100 schrieb Shaozhong SHI:

There must be a way to do the following.

[...] Only the first letter of each word should be capitalised.

Indeed, there is. It is called "human brain in cultural
context". "AI" is close nowadays, but, hopefully, not quite
there yet.

Karsten
--
GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B

#11Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Shaozhong SHI (#9)
Re: Fault with initcap

On 10/12/21 13:50, Shaozhong SHI wrote:

On Tue, 12 Oct 2021 at 20:34, Adrian Klaver <adrian.klaver@aklaver.com

Which follows the definition here:

https://www.postgresql.org/docs/14/functions-string.html
<https://www.postgresql.org/docs/14/functions-string.html&gt;

initcap ( text ) → text

Converts the first letter of each word to upper case and the rest to
lower case. Words are sequences of alphanumeric characters separated by
non-alphanumeric characters.

Hi, Adrian Klaver,

It looks like that you replicated the error.

There is no error, initcap is doing what it is documented to.

notemachine is not two words anymore then 'online', 'bluebell',
'network' are.

There must be a way to do the following.

Maybe, but as Karsten says it would involve an AI. One that understands
the mutt language that is English.

a column contains a list of words.  Only the first letter of each word
should be capitalised.  INITCAP can not do that.  How to create a
function just to capitalised each word (substring) in a list of
words/strings.  This will be very useful and create great impact.

From here:

https://www.grammarly.com/blog/14-of-the-longest-words-in-english/

uncopyrightable

where would you split that into words?:

Some 'words' I see:

un
unc
copy
copyright
right
table
able

Regards,

David

--
Adrian Klaver
adrian.klaver@aklaver.com

#12Shaozhong SHI
shishaozhong@gmail.com
In reply to: Adrian Klaver (#11)
Re: Fault with initcap

On Tue, 12 Oct 2021 at 23:02, Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 10/12/21 13:50, Shaozhong SHI wrote:

On Tue, 12 Oct 2021 at 20:34, Adrian Klaver <adrian.klaver@aklaver.com

Which follows the definition here:

https://www.postgresql.org/docs/14/functions-string.html
<https://www.postgresql.org/docs/14/functions-string.html&gt;

initcap ( text ) → text

Converts the first letter of each word to upper case and the rest to
lower case. Words are sequences of alphanumeric characters separated

by

non-alphanumeric characters.

Hi, Adrian Klaver,

It looks like that you replicated the error.

There is no error, initcap is doing what it is documented to.

notemachine is not two words anymore then 'online', 'bluebell',
'network' are.

There must be a way to do the following.

Maybe, but as Karsten says it would involve an AI. One that understands
the mutt language that is English.

a column contains a list of words. Only the first letter of each word
should be capitalised. INITCAP can not do that. How to create a
function just to capitalised each word (substring) in a list of
words/strings. This will be very useful and create great impact.

From here:

https://www.grammarly.com/blog/14-of-the-longest-words-in-english/

uncopyrightable

where would you split that into words?:

Some 'words' I see:

un
unc
copy
copyright
right
table
able

Regards,

David

--
Adrian Klaver
adrian.klaver@aklaver.com

Hi, Adrian Klaver,

In Python, there is a capwords. Do we have an equivalent in Postgres?
Regards, David

#13Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Shaozhong SHI (#12)
Re: Fault with initcap

On 10/12/21 16:03, Shaozhong SHI wrote:

Hi, Adrian Klaver,

In Python, there is  a capwords.  Do we have an equivalent in Postgres?

https://docs.python.org/3/library/string.html?highlight=capwords#string.capwords

string.capwords(s, sep=None)

Split the argument into words using str.split(), capitalize each
word using str.capitalize(), and join the capitalized words using
str.join(). If the optional second argument sep is absent or None, runs
of whitespace characters are replaced by a single space and leading and
trailing whitespace are removed, otherwise sep is used to split and join
the words.

That is not going to do what you are proposing either as it still comes
down to deciding where to split the 'words':

import string

string.capwords('notemachine')

'Notemachine'

There are similar functions to split strings here:

https://www.postgresql.org/docs/14/functions-string.html

Though again they depend on some delimiter or regexp to make the split.

There is no function that just 'knows' that 'notemachine' is two words
and should become 'NoteMachine'.

Regards, David

--
Adrian Klaver
adrian.klaver@aklaver.com

#14Lee Hachadoorian
Lee.Hachadoorian+L@gmail.com
In reply to: Adrian Klaver (#13)
Re: Fault with initcap

On Tue, Oct 12, 2021 at 8:28 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

There is no function that just 'knows' that 'notemachine' is two words
and should become 'NoteMachine'.

Any chance that 'notemachine' is stored with a zero-width space (Unicode
U+200B)? This is common for compound words where the programmer wants to be
able to "know" that they are distinct words that are smashed together.

select initcap('notemachine'), initcap('note' || U&'\200B' || 'machine');

initcap |initcap |
-----------+------------+
Notemachine|NoteMachine|

You could check the length() of the string. If it is 12, there is an
invisible character in there. This would explain the "unexpected" behavior
that no one seems to be able to replicate.

Best,
--Lee

--
Lee Hachadoorian
Assistant Professor of Instruction, Geography and Urban Studies
Assistant Director, Professional Science Master's in GIS
Temple University

#15Metin Ulusinan
metin.ulusinan@ssicilian.net
In reply to: Shaozhong SHI (#1)
Re: Fault with initcap

I tried both on PostgreSQL 11.13 and yes "Sainsbury's bank" 's "S" is
buggy, but notemachine is not.

Will anyone try @ 14?

n s version
Notemachine Sainsbury'S Bank PostgreSQL 11.13 (Debian 11.13-0+deb10u1) on
x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit

On Tue, Oct 12, 2021 at 7:30 PM Shaozhong SHI <shishaozhong@gmail.com>
wrote:

Show quoted text

I tried initcap and found a major problem with it.

Initcap of notemachine is NoteMachine.

Initcap of Sainsbury's Bank is Sainsbury'S bank.

This is not expected.

Anyway to get around this problem?

Regards,

David

#16Pavel Stehule
pavel.stehule@gmail.com
In reply to: Metin Ulusinan (#15)
Re: Fault with initcap

st 13. 10. 2021 v 8:34 odesílatel Metin Ulusinan <
metin.ulusinan@ssicilian.net> napsal:

I tried both on PostgreSQL 11.13 and yes "Sainsbury's bank" 's "S" is
buggy, but notemachine is not.

Will anyone try @ 14?

It cannot to work everywhere - the word is defined (for this function) as
sequence of alpha numeric chars. "'" is not alpha numeric char.

Theoretically it can work if you use ICU collation provider, if it is
supported by ICU.

Regards

Pavel

#17Shaozhong SHI
shishaozhong@gmail.com
In reply to: Metin Ulusinan (#15)
Re: Fault with initcap

On Wed, 13 Oct 2021 at 07:33, Metin Ulusinan <metin.ulusinan@ssicilian.net>
wrote:

I tried both on PostgreSQL 11.13 and yes "Sainsbury's bank" 's "S" is
buggy, but notemachine is not.

Will anyone try @ 14?

n s version
Notemachine Sainsbury'S Bank PostgreSQL 11.13 (Debian 11.13-0+deb10u1) on
x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit

Hello, Metin Ulusinan,

I can confirm that you pin-out the issue.

There is a INITCAP2. Alternative function to PostgreSQL builtin
initcap(text) with support for accented words. · GitHub
<https://gist.github.com/pv8/8291531&gt;

Can this be adapted to capitalise the first letter of each work in a string?

Regards,

David

#18Metin Ulusinan
metin.ulusinan@ssicilian.net
In reply to: Shaozhong SHI (#17)
Re: Fault with initcap

Hi,
Yes, this can be adaptable, and i did simple version of this.
It just split text words with find spaces, capitalise each
pieces(word) and merge together again.
This is a quick and simple work. You can develop over it about your needs.

Try that and tell us about result.

CREATE OR REPLACE FUNCTION initcap2(text)
RETURNS text
LANGUAGE plpgsql
AS $function$
DECLARE
sentence TEXT := '';
word_array TEXT[];
word TEXT;
word_out TEXT;
BEGIN
sentence := $1;

IF sentence is NULL THEN
RETURN NULL;
END IF;

word_array := regexp_split_to_array($1, E'\\s+');
FOREACH word IN ARRAY word_array
LOOP
word_out := upper(left(word, 1)) || lower(substring(word, 2));
sentence := regexp_replace(sentence, word, word_out);
END LOOP;

RETURN trim(sentence);
END;
$function$
;

On Wed, Oct 13, 2021 at 11:48 AM Shaozhong SHI <shishaozhong@gmail.com>
wrote:

Show quoted text

On Wed, 13 Oct 2021 at 07:33, Metin Ulusinan <metin.ulusinan@ssicilian.net>
wrote:

I tried both on PostgreSQL 11.13 and yes "Sainsbury's bank" 's "S" is
buggy, but notemachine is not.

Will anyone try @ 14?

n s version
Notemachine Sainsbury'S Bank PostgreSQL 11.13 (Debian 11.13-0+deb10u1)
on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit

Hello, Metin Ulusinan,

I can confirm that you pin-out the issue.

There is a INITCAP2. Alternative function to PostgreSQL builtin
initcap(text) with support for accented words. · GitHub
<https://gist.github.com/pv8/8291531&gt;

Can this be adapted to capitalise the first letter of each work in a
string?

Regards,

David

#19Jain, Ankit
Ankit.Jain@snapon.com
In reply to: Metin Ulusinan (#18)
RE: Fault with initcap

This was reported in PG 8.1.4 15 years ago - https://pgsql-bugs.postgresql.narkive.com/R7HmTKef/bug-2579-initcap-should-not-capitalize-letter-after-apostrophe

From: Metin Ulusinan <metin.ulusinan@ssicilian.net>
Sent: Wednesday, October 13, 2021 5:39 AM
To: Shaozhong SHI <shishaozhong@gmail.com>
Cc: pgsql-sql <pgsql-sql@lists.postgresql.org>
Subject: Re: Fault with initcap

CAUTION: This email originated from outside of Snap-on. Do not click on links or open attachments unless you have validated the sender, even if it is a known contact. Contact the sender by phone to validate the contents.
Hi,
Yes, this can be adaptable, and i did simple version of this.
It just split text words with find spaces, capitalise each
pieces(word) and merge together again.
This is a quick and simple work. You can develop over it about your needs.

Try that and tell us about result.

CREATE OR REPLACE FUNCTION initcap2(text)
RETURNS text
LANGUAGE plpgsql
AS $function$
DECLARE
sentence TEXT := '';
word_array TEXT[];
word TEXT;
word_out TEXT;
BEGIN
sentence := $1;

IF sentence is NULL THEN
RETURN NULL;
END IF;

word_array := regexp_split_to_array($1, E'\\s+');
FOREACH word IN ARRAY word_array
LOOP
word_out := upper(left(word, 1)) || lower(substring(word, 2));
sentence := regexp_replace(sentence, word, word_out);
END LOOP;

RETURN trim(sentence);
END;
$function$
;

On Wed, Oct 13, 2021 at 11:48 AM Shaozhong SHI <shishaozhong@gmail.com<mailto:shishaozhong@gmail.com>> wrote:

On Wed, 13 Oct 2021 at 07:33, Metin Ulusinan <metin.ulusinan@ssicilian.net<mailto:metin.ulusinan@ssicilian.net>> wrote:
I tried both on PostgreSQL 11.13 and yes "Sainsbury's bank" 's "S" is buggy, but notemachine is not.

Will anyone try @ 14?

n

s

version

Notemachine

Sainsbury'S Bank

PostgreSQL 11.13 (Debian 11.13-0+deb10u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit

Hello, Metin Ulusinan,

I can confirm that you pin-out the issue.

There is a INITCAP2. Alternative function to PostgreSQL builtin initcap(text) with support for accented words. · GitHub<https://urldefense.com/v3/__https:/gist.github.com/pv8/8291531__;!!Lf_9VycLqA!yapR_3tOx6q4nbC3ZDtCTzp8h2LwdrDNJWxNqPOj8WHb5IIUeDb_3WXMt8MOpJOJ$&gt;

Can this be adapted to capitalise the first letter of each work in a string?

Regards,

David

#20Shaozhong SHI
shishaozhong@gmail.com
In reply to: Metin Ulusinan (#18)
Re: Fault with initcap

On Wed, 13 Oct 2021 at 10:39, Metin Ulusinan <metin.ulusinan@ssicilian.net>
wrote:

Hi,
Yes, this can be adaptable, and i did simple version of this.
It just split text words with find spaces, capitalise each
pieces(word) and merge together again.
This is a quick and simple work. You can develop over it about your needs.

Try that and tell us about result.

CREATE OR REPLACE FUNCTION initcap2(text)
RETURNS text
LANGUAGE plpgsql
AS $function$
DECLARE
sentence TEXT := '';
word_array TEXT[];
word TEXT;
word_out TEXT;
BEGIN
sentence := $1;

IF sentence is NULL THEN
RETURN NULL;
END IF;

word_array := regexp_split_to_array($1, E'\\s+');
FOREACH word IN ARRAY word_array
LOOP
word_out := upper(left(word, 1)) || lower(substring(word, 2));
sentence := regexp_replace(sentence, word, word_out);
END LOOP;

RETURN trim(sentence);
END;
$function$
;

Hello, Metin,

See the following testing response.

ERROR: invalid regular expression: parentheses () not balanced CONTEXT:
PL/pgSQL function testinitcap2(text) line 18 at assignment
SQL state: 2201B

Regards,

David

#21Metin Ulusinan
metin.ulusinan@ssicilian.net
In reply to: Shaozhong SHI (#20)
#22Shaozhong SHI
shishaozhong@gmail.com
In reply to: Metin Ulusinan (#21)