Match 2 words and more

Started by Shaozhong SHIover 4 years ago9 messagesgeneral
Jump to latest
#1Shaozhong SHI
shishaozhong@gmail.com

this is supposed to find those to have 2 words and more.

select name FROM a_table where "STREET_NAME" ~ '^[[:alpha:]+ ]+[:alpha:]+$';

But, it finds only one word as well.

It appears that regex is not robust.

Can anyone shed light on this?

Regards,

David

#2Rob Sargent
robjsargent@gmail.com
In reply to: Shaozhong SHI (#1)
Re: Match 2 words and more

On Nov 27, 2021, at 5:27 PM, Shaozhong SHI <shishaozhong@gmail.com> wrote:


this is supposed to find those to have 2 words and more.

select name FROM a_table where "STREET_NAME" ~ '^[[:alpha:]+ ]+[:alpha:]+$';

But, it finds only one word as well.

It appears that regex is not robust.

Can anyone shed light on this?

Regards,

Show the results of the query. Describe how those results differ from what you desire or from what you expect.

Show quoted text

David

#3Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Shaozhong SHI (#1)
Re: Match 2 words and more

On 2021-Nov-28, Shaozhong SHI wrote:

this is supposed to find those to have 2 words and more.

select name FROM a_table where "STREET_NAME" ~ '^[[:alpha:]+ ]+[:alpha:]+$';

But, it finds only one word as well.

How about something like this?

'^([[:<:]][[:alpha:]]+[[:>:]]( |$)){2}$'

You have:
- the ^ is a constraint that matches start of string
- you have a ( ... ){2}$ construct which means "match exactly twice" and
then match end-of-string
- Inside the parens of that construct, you match:
- [[:<:]] which means start-of-word
- [[:alpha:]]+ which means "a non-empty set of alphabetical chars"
- [[:>:]] which means end-of-word
- ( |$) for "either a space or end-of-string"

You can perhaps simplify by removing the [[:<:]] and [[:>:]]
constraints, so '^([[:alpha:]]+( |$)){2}$'

To mean "between two and four", change the {2} to {2,4}. If you want
"two or more", try {2,}.

You could change the ( |$) to ([[:white:]]+|$) in order to accept more
than one space between words, or combinations of space and tabs and
newlines and so on.

With a decent set of data, you could probably notice some other problems
in this regexp, but at least it should be a decent start.

It appears that regex is not robust.

Nah.

--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/

#4Guyren Howe
guyren@gmail.com
In reply to: Shaozhong SHI (#1)
Re: Match 2 words and more

On Nov 27, 2021, at 16:27 , Shaozhong SHI <shishaozhong@gmail.com> wrote:

select name FROM a_table where "STREET_NAME" ~ '^[[:alpha:]+ ]+[:alpha:]+$’;

The simplest thing that does what you says is I think:

select name FROM a_table where "STREET_NAME" ~ ‘^([[:alpha:]]+\s)+[[:alpha:]]+$’;

#5David G. Johnston
david.g.johnston@gmail.com
In reply to: Shaozhong SHI (#1)
Re: Match 2 words and more

On Sat, Nov 27, 2021 at 5:27 PM Shaozhong SHI <shishaozhong@gmail.com>
wrote:

this is supposed to find those to have 2 words and more.

select name FROM a_table where "STREET_NAME" ~ '^[[:alpha:]+
]+[:alpha:]+$';

But, it finds only one word as well.

It appears that regex is not robust.

Can anyone shed light on this?

You put the space inside the first character class (the plus sign within
that class is also a literal plus, not a "one or more" indicator) thus
you've made the space itself optional. Did you maybe mean to use ( )
parens instead of [ ]?

When asking regex questions it is usually more helpful to formulate them
like so:

SELECT 'this should match' ~ '<regex here>';

David J.

#6Ron
ronljohnsonjr@gmail.com
In reply to: Alvaro Herrera (#3)
Re: Match 2 words and more

On 11/27/21 6:49 PM, Alvaro Herrera wrote:

On 2021-Nov-28, Shaozhong SHI wrote:

[snip]

It appears that regex is not robust.

This was my attitude when I first started to learn computer programming. /My
code is perfect, i//t can't be my fault!! There must be a compiler bug!!!/

Nah.

Just incomprehensible line noise!!

--
Angular momentum makes the world go 'round.

#7Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Shaozhong SHI (#1)
Re: Match 2 words and more

On 2021-11-28 00:27:34 +0000, Shaozhong SHI wrote:

this is supposed to find those to have 2 words and more.

select name FROM a_table where "STREET_NAME" ~ '^[[:alpha:]+ ]+[:alpha:]+$';

I think you meant

select name FROM a_table where "STREET_NAME" ~ '^[[:alpha:]+ ]+[[:alpha:]]+$';

Note the extra two brackets.

The character classes (like [:alpha:] or [:digit:] can only be used
within bracket expressions. So you have to put brackets around the
second [[:alpha:], too (like you did for the first one).

But if you look more closely at the first one, you will notice that it
doesn't do what you want, either: It matches any non-empty sequence of
alpabetic characters, plus signs and spaces. But you almost certainly
don't want to match a plus sign, and you don't want space and alphabetic
characters to be interchangable. You want some alphabetic characters
followed by a space. So this becomes

select name FROM a_table where "STREET_NAME" ~ '^([[:alpha:]]+ )+[[:alpha:]]+$';

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

#8Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Shaozhong SHI (#1)
Re: Match 2 words and more

On 2021-11-28 00:27:34 +0000, Shaozhong SHI wrote:

It appears that regex is not robust.

It does reliably what you tell it to do.

I would agree that after almost 50 years of extending it, the syntax has
become a bit of a mess.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

#9Thomas Markus
t.markus@proventis.net
In reply to: Shaozhong SHI (#1)
Re: Match 2 words and more

Am 28.11.21 um 01:27 schrieb Shaozhong SHI:

this is supposed to find those to have 2 words and more.

select name FROM a_table where "STREET_NAME" ~ '^[[:alpha:]+
]+[:alpha:]+$';

But, it finds only one word as well.

It appears that regex is not robust.

Can anyone shed light on this?

Regards,

David

Hi,

It's robust, but syntax is sometimes weired

for words I would use something like (contains numbers too)
"STREET_NAME" ~ '(\w+\s+)+\w+';

or alpha only
"STREET_NAME" ~ '([[:alpha:]]+\s+)+[[:alpha:]]+'

regards
Thomas