Extension - multilingual_fuzzy_match : Multilingual phonetic matching extension for PostgreSQL

Started by Blessy Thomas | about 1 month ago | 3 messages | hackers, general

blessy456bthomas@gmail.com

about 1 month ago

hackers, general

Hello PostgreSQL Community,

I would like to introduce a PostgreSQL extension called
multilingual_fuzzy_match. This extension enables multilingual name
normalization, transliteration, and fuzzy phonetic matching directly inside
PostgreSQL at query time.

1. What Problem It Solves
In multilingual datasets (especially Indian language datasets), the same
name may appear in:
- Different scripts
- Different transliterations
- Slight spelling variations
- Multiple languages

For example:
राम ≈ Raam ≈ رَام ≈ ராம்
Traditional equality or LIKE queries fail in such cases. Even trigram
matching doesn’t fully address cross-script phonetic similarity.
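
To make the trigram point concrete, here is a minimal Python sketch of
trigram similarity. It only loosely mirrors pg_trgm (two leading spaces, one
trailing space, shared trigrams divided by distinct trigrams); the function
names are hypothetical, and pg_trgm's real handling of non-ASCII text may
differ.

```python
def trigrams(text: str) -> set[str]:
    """Return the set of 3-character windows of the padded, lowercased text."""
    # Two leading spaces and one trailing space, loosely mirroring pg_trgm.
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by total distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


print(trigram_similarity("Raam", "Ram"))   # 0.5: same script, small spelling change
print(trigram_similarity("राम", "Raam"))   # 0.0: no characters in common across scripts
```

Two spellings in the same script still share trigrams, but the same name in
two scripts shares none, so a character-level measure cannot see the match
without a transliteration step first.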

2. What This Extension Does

- Detects the script of the input text
- Performs transliteration and normalization
- Generates a phonetic key
- Uses Levenshtein distance (via python-Levenshtein)
- Returns similarity-scored results
All of this happens inside PostgreSQL using PL/Python (plpython3u).
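
As a rough illustration of the script-detection and phonetic-key steps
listed above, here is a sketch using only the Python standard library. It is
not the extension's code: detect_script and phonetic_key are hypothetical
names, the key rule (collapse repeated letters) is a stand-in I made up for
illustration, and the transliteration step, which the extension delegates to
indic-transliteration, is skipped, so phonetic_key assumes already-romanized
input.

```python
import unicodedata
from collections import Counter


def detect_script(text: str) -> str:
    """Guess the dominant script from Unicode character names."""
    counts = Counter()
    for ch in text:
        # Only letters (L*) and combining marks (M*) carry script information.
        if unicodedata.category(ch)[0] in ("L", "M"):
            name = unicodedata.name(ch, "")
            if name:
                # The first word of the name is the script, e.g. "TAMIL LETTER RA".
                # Note: Unicode says GURMUKHI for Punjabi and ORIYA for Odia.
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def phonetic_key(romanized: str) -> str:
    """Hypothetical key: lowercase ASCII letters, repeated letters collapsed."""
    key = []
    for ch in romanized.lower():
        if ch.isascii() and ch.isalpha() and (not key or key[-1] != ch):
            key.append(ch)
    return "".join(key)


print(detect_script("राम"))    # DEVANAGARI
print(detect_script("ராம்"))   # TAMIL
print(phonetic_key("Raam"))    # ram
```

Urdu text would be reported as ARABIC here, since that is the script name
Unicode uses for those characters.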

3. Key Features
- No schema changes required
- Query-level matching
- Supports major Indian scripts and languages, including:
Devanagari, Tamil, Telugu, Bengali, Urdu, Malayalam, Kannada, Odia,
Gujarati, Punjabi
- Works on existing tables
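
Query-level matching over existing rows could look roughly like the sketch
below. It assumes each row already has a phonetic key, uses a pure-Python
Levenshtein distance (the extension uses python-Levenshtein, whose
Levenshtein.distance could be swapped in), and normalizes the distance by the
longer string's length. The names rank_matches and similarity and the 0.7
threshold are my own assumptions, not the extension's API.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to a 0.0-1.0 score relative to the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def rank_matches(query_key, rows, threshold=0.7):
    """rows: iterable of (row_id, key). Return matches sorted by score, best first."""
    scored = [(row_id, similarity(query_key, key)) for row_id, key in rows]
    kept = [(row_id, score) for row_id, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


rows = [(1, "ram"), (2, "rama"), (3, "shyam")]
print(rank_matches("ram", rows))   # [(1, 1.0), (2, 0.75)]
```

In this sketch every row is scored on every query, so the cost grows
linearly with the table size.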

4. Requirements
- PostgreSQL 17 (compiled with Python support)
- Python 3.12+
- plpython3u
- Python packages:
pip install indic-transliteration python-Levenshtein
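
A quick way to check these requirements might be the sketch below. It only
uses the standard library; check_requirements is a hypothetical helper, not
part of the extension. Note that the import names (indic_transliteration,
Levenshtein) differ from the pip package names.

```python
import importlib.util
import sys


def check_requirements() -> dict[str, bool]:
    """Report whether the running interpreter meets the listed requirements."""
    return {
        "python>=3.12": sys.version_info >= (3, 12),
        # find_spec returns None when a top-level module is not installed.
        "indic_transliteration": importlib.util.find_spec("indic_transliteration") is not None,
        "Levenshtein": importlib.util.find_spec("Levenshtein") is not None,
    }


for requirement, ok in check_requirements().items():
    print(f"{requirement}: {'ok' if ok else 'MISSING'}")
```

The result only describes the interpreter that runs it, so it is meaningful
for plpython3u only when run with the same Python installation that the
PostgreSQL server uses.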

6. Feedback Requested

I would really appreciate feedback from the community on:
- Extension design approach
- Performance considerations
- Suitability for PGXN submission
I would love suggestions, improvements, and any guidance on making this
production-ready. I’m sharing this not just as a project, but as a starting
point for discussion about multilingual data handling inside PostgreSQL.

Looking forward to your thoughts and critiques.
Thank you!

Regards
Blessy Thomas

Andreas Karlsson

andreas.karlsson@percona.com

about 1 month ago

In reply to: Blessy Thomas (#1)

hackers, general

Re: Extension - multilingual_fuzzy_match : Multilingual phonetic matching extension for PostgreSQL

On 3/2/26 8:25 AM, Blessy Thomas wrote:

> 6. Feedback Requested
>
> I would really appreciate feedback from the community on:
> - Extension design approach
> - Performance considerations
> - Suitability for PGXN submission
> I would love suggestions, improvements, and any guidance on making this
> production-ready. I’m sharing this not just as a project, but as a
> starting point for discussion about multilingual data handling inside
> PostgreSQL.
>
> Looking forward to your thoughts and critiques.

Hi,

For this kind of extension I think the main thing to do is to build a
proof of concept and see if there is any interest. Also, -hackers is not
the right place to ask, since this list is mostly concerned with hacking
on PostgreSQL itself rather than with writing extensions.

There are other places where actual PostgreSQL users hang out, e.g. the
-general mailing list, but even there I would still recommend showing up
with a PoC extension. People are much more interested in giving feedback
on actual code than on a plan for something which may not even get built.

Andreas

Blessy Thomas

blessy456bthomas@gmail.com

13 days ago

In reply to: Blessy Thomas (#1)

hackers, general

Fwd: Extension - multilingual_fuzzy_match : Multilingual phonetic matching extension for PostgreSQL

---------- Forwarded message ---------
From: Blessy Thomas <blessy456bthomas@gmail.com>
Date: Mon, 2 Mar 2026 at 12:55
Subject: Extension - multilingual_fuzzy_match : Multilingual phonetic
matching extension for PostgreSQL

Hello PostgreSQL Community,

For example:
राम ≈ Raam ≈ رَام ≈ ராம்
Traditional equality or LIKE queries fail in such cases. Even trigram
matching doesn’t fully address cross-script phonetic similarity.

2. What This Extension Does

4. Requirements
- PostgreSQL 17 (compiled with Python support)
- Python 3.12+
- plpython3u
- Python packages:
pip install indic-transliteration python-Levenshtein

6. Feedback Requested

Looking forward to your thoughts and critiques.
Thank you!

Regards
Blessy Thomas
