Regular Expression For Duplicate Words

Started by Shaozhong SHI about 4 years ago · 5 messages · general
#1Shaozhong SHI
shishaozhong@gmail.com

This link is interesting.

regex - Regular Expression For Duplicate Words - Stack Overflow
<https://stackoverflow.com/questions/2823016/regular-expression-for-duplicate-words>

Is there any example in Postgres?

Regards,

David

#2David G. Johnston
david.g.johnston@gmail.com
In reply to: Shaozhong SHI (#1)
Re: Regular Expression For Duplicate Words

On Wed, Feb 2, 2022 at 1:00 AM Shaozhong SHI <shishaozhong@gmail.com> wrote:

This link is interesting.

regex - Regular Expression For Duplicate Words - Stack Overflow
<https://stackoverflow.com/questions/2823016/regular-expression-for-duplicate-words>

Is there any example in Postgres?

Not that I'm immediately aware of, and I'm not going to search the internet
for you.

The regex capabilities in PostgreSQL are pretty full-featured so a solution
should be possible. You should try translating the SO post concepts into
PostgreSQL yourself and ask specific questions if you get stuck.

David J.

#3jian he
jian.universality@gmail.com
In reply to: David G. Johnston (#2)
Re: Regular Expression For Duplicate Words

It's an interesting question, but I also don't know how to do it in
PostgreSQL. I did figure out alternative solutions outside the database,
though (both match a repeated literal "hello"):

GNU Grep: grep -E '(hello)[[:blank:]]+\1' <<<'one hello hello world'
ripgrep: rg '(hello)[[:blank:]]+\1' --pcre2 <<<'one hello hello world'


#4Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Shaozhong SHI (#1)
Re: Regular Expression For Duplicate Words

On 2022-02-02 08:00:00 +0000, Shaozhong SHI wrote:

regex - Regular Expression For Duplicate Words - Stack Overflow

Is there any example in Postgres?

It's pretty much the same as with other regexp dialects: Use word
boundaries and a word character class to match any word, and then use a
backreference to match a duplicate word. All the building blocks are
described on
https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-POSIX-REGEXP
and, except for [[:<:]] and [[:>:]] for the word boundaries, they are
also pretty standard.

So

    [[:<:]]          start of word
    ([[:alpha:]]+)   one or more alphabetic characters in a capturing group
    [[:>:]]          end of word
    \W+              one or more non-word characters
    [[:<:]]          start of word
    \1               the content of the first (and only) capturing group
    [[:>:]]          end of word

All together:

select * from t where t ~ '[[:<:]]([[:alpha:]]+)[[:>:]]\W+[[:<:]]\1[[:>:]]';
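For anyone who wants to try this without creating a table first, here is a minimal self-contained sketch. The sample sentences and the column name "txt" are made up for illustration; the pattern uses \W+ as in the breakdown, so words separated by more than one space or by punctuation are matched too. It assumes standard_conforming_strings is on (the default), so the backslashes reach the regex engine unchanged, and a PostgreSQL version that has regexp_match (10 or later).

```sql
-- Hypothetical sample data: the CTE name "t" and the column "txt" are
-- assumptions made for this illustration, not part of the original post.
WITH t(txt) AS (
    VALUES ('one hello hello world'),     -- "hello hello" should match
           ('Paris in the the spring'),   -- "the the" should match
           ('this is fine'),              -- "is" after "this" is not a whole-word repeat
           ('no repeats here')            -- nothing repeated
)
SELECT txt,
       -- regexp_match returns the capture groups as text[];
       -- element 1 is the word that was repeated.
       (regexp_match(txt,
            '[[:<:]]([[:alpha:]]+)[[:>:]]\W+[[:<:]]\1[[:>:]]'))[1] AS repeated_word
FROM t
WHERE txt ~ '[[:<:]]([[:alpha:]]+)[[:>:]]\W+[[:<:]]\1[[:>:]]';
```

The ~ operator is case-sensitive, so "The the" would not be reported by this query as written.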

hp


#5Shaozhong SHI
shishaozhong@gmail.com
In reply to: Peter J. Holzer (#4)
Re: Regular Expression For Duplicate Words

Hi, Peter, Interesting.

Give a good example if you can.

Regards,

David