Extension - multilingual_fuzzy_match : Multilingual phonetic matching extension for PostgreSQL
Hello PostgreSQL Community,
I would like to introduce a PostgreSQL extension called
multilingual_fuzzy_match. This extension enables multilingual name
normalization, transliteration, and fuzzy phonetic matching directly inside
PostgreSQL at query time.
1. What Problem It Solves
In multilingual datasets (especially Indian language datasets), the same
name may appear in:
- Different scripts
- Different transliterations
- Slight spelling variations
- Multiple languages
For example:
राम ≈ Raam ≈ رَام ≈ ராம்
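Equality and LIKE comparisons fail in such cases because they compare
characters, and the same name in two scripts shares none. Trigram matching
(pg_trgm) has the same limitation across scripts, since strings in different
scripts have no trigrams in common.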
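To make the cross-script problem concrete, here is a minimal sketch in plain Python. The trigrams and trigram_similarity helpers are illustrative names, not part of the extension, and they only approximate how pg_trgm pads and compares words.

```python
def trigrams(text):
    """Return the set of padded character trigrams for each word in text."""
    grams = set()
    for word in text.lower().split():
        # Assumption: pad like pg_trgm does (two spaces before, one after).
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print("राम" == "Raam")                     # False: different code points
    print(trigram_similarity("Raam", "Ram"))   # 0.5: same script, some overlap
    print(trigram_similarity("राम", "Raam"))   # 0.0: no trigram is shared
```

Within one script the score degrades gracefully, but across scripts it drops to zero, which is why a transliteration step is needed before any string distance is useful.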
2. What This Extension Does
- Detects the script of the input text
- Performs transliteration and normalization
- Generates a phonetic key
- Uses Levenshtein distance (via python-Levenshtein)
- Returns similarity-scored results
All of this happens inside PostgreSQL using PL/Python (plpython3u).
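The steps above can be sketched in plain Python as follows. This is not the extension's code: detect_script, normalize, phonetic_key and levenshtein are hypothetical helpers, the normalization rules are guesses based on the sample output, and the transliteration step is left out (the sketch starts from an already romanized string such as rAhula). The pure-Python distance function stands in for python-Levenshtein so the example has no dependencies.

```python
import unicodedata

# Unicode block ranges for the scripts named in the post. Urdu text falls in
# the Arabic block and Punjabi text in the Gurmukhi block.
SCRIPT_RANGES = [
    ("Arabic", 0x0600, 0x06FF), ("Devanagari", 0x0900, 0x097F),
    ("Bengali", 0x0980, 0x09FF), ("Gurmukhi", 0x0A00, 0x0A7F),
    ("Gujarati", 0x0A80, 0x0AFF), ("Odia", 0x0B00, 0x0B7F),
    ("Tamil", 0x0B80, 0x0BFF), ("Telugu", 0x0C00, 0x0C7F),
    ("Kannada", 0x0C80, 0x0CFF), ("Malayalam", 0x0D00, 0x0D7F),
]


def detect_script(text):
    """Return the script that most characters of text belong to."""
    counts = {}
    for ch in text:
        if ch.isascii() and ch.isalpha():
            counts["Latin"] = counts.get("Latin", 0) + 1
            continue
        for name, lo, hi in SCRIPT_RANGES:
            if lo <= ord(ch) <= hi:
                counts[name] = counts.get(name, 0) + 1
                break
    return max(counts, key=counts.get) if counts else "Unknown"


def normalize(romanized):
    """Lowercase, keep ASCII letters, drop a word-final inherent 'a'."""
    decomposed = unicodedata.normalize("NFKD", romanized)
    letters = "".join(c for c in decomposed if c.isascii() and c.isalpha()).lower()
    # Assumption: crude schwa deletion, so that "rahula" and "rahul" agree.
    if len(letters) > 2 and letters.endswith("a"):
        letters = letters[:-1]
    return letters


def phonetic_key(normalized):
    """Collapse runs of the same letter, e.g. 'raam' becomes 'ram'."""
    out = []
    for c in normalized:
        if not out or out[-1] != c:
            out.append(c)
    return "".join(out)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    query_key = phonetic_key(normalize("Rahul"))
    for name, translit in [("राहुल", "rAhula"), ("ರಾಹುಲ್", "rAhul")]:
        key = phonetic_key(normalize(translit))
        print(detect_script(name), key, levenshtein(query_key, key))
```

The final-vowel rule is deliberately crude and would also shorten names that really end in "a", so a real implementation would need language-aware rules at that step.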
3. Key Features
- No schema changes required
- Query-level matching
- Supports major Indian scripts and languages, including:
Devanagari, Tamil, Telugu, Bengali, Urdu, Malayalam, Kannada, Odia,
Gujarati, Punjabi
- Works on existing tables
4. Requirements
- PostgreSQL 17 (compiled with Python support)
- Python 3.12+
- plpython3u
- Python packages:
pip install indic-transliteration python-Levenshtein
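Because plpython3u runs inside the Python interpreter that PostgreSQL was built against, the packages have to be importable from that interpreter, not just from whichever python is first on the PATH. A small check like the one below can help confirm that. It is a sketch: missing_modules is a hypothetical helper, and the import names indic_transliteration and Levenshtein are, to my knowledge, the module names those two packages install.

```python
import importlib.util
import sys

# Assumption: import names provided by the two pip packages listed above.
REQUIRED_MODULES = ["indic_transliteration", "Levenshtein"]


def missing_modules(names):
    """Return the subset of names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    print("Missing:", missing_modules(REQUIRED_MODULES) or "none")
```

Running the same function body from a plpython3u function would report what the server-side interpreter can actually see.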
5. Example Usage
----------------------------------------------------------------------
postgres=# SELECT * FROM fuzzy_match('names_native_dist', 'name', 'Rahul')
           WHERE distance <= 1;
 id | name   | translit | normalized | fuzzy | distance
----+--------+----------+------------+-------+----------
  1 | राहुल  | rAhula   | rahul      | rahul |        0
  2 | রাহুল  | rAhula   | rahul      | rahul |        0
  4 | ರಾಹುಲ್ | rAhul    | rahul      | rahul |        0
  5 | Rahul  | Rahul    | rahul      | rahul |        0
(4 rows)
----------------------------------------------------------------------
6. Feedback Requested
I would really appreciate feedback from the community on:
- Extension design approach
- Performance considerations
- Suitability for PGXN submission
I would love suggestions, improvements, and any guidance on making this
production-ready. I’m sharing this not just as a project, but as a starting
point for discussion about multilingual data handling inside PostgreSQL.
Looking forward to your thoughts and critiques.
Thank you!
Regards
Blessy Thomas
Attachments:
Screenshot from 2026-03-02 12-29-45.png (image/png)
On 3/2/26 8:25 AM, Blessy Thomas wrote:
6. Feedback Requested
I would really appreciate feedback from the community on:
- Extension design approach
- Performance considerations
- Suitability for PGXN submission
I would love suggestions, improvements, and any guidance on making this
production-ready. I’m sharing this not just as a project, but as a
starting point for discussion about multilingual data handling inside
PostgreSQL. Looking forward to your thoughts and critiques.
Hi,
For this kind of extension I think the main thing to do is to build a
proof of concept and see if there is any interest. Also, -hackers is not
the right place to ask, since this list is mostly concerned with hacking
on PostgreSQL itself, not with writing extensions.
There are other places where actual PostgreSQL users hang out, e.g. the
-general mailing list, but in those places I would still recommend
showing up with a PoC extension. People are much more interested in
giving feedback if there is some code rather than giving feedback to a
plan for something which may not even get built.
Andreas
---------- Forwarded message ---------
From: Blessy Thomas <blessy456bthomas@gmail.com>
Date: Mon, 2 Mar 2026 at 12:55
Subject: Extension - multilingual_fuzzy_match : Multilingual phonetic
matching extension for PostgreSQL