dict_synonym.c: fix truncation of multibyte sequence

Started by Jeff Davis20 days ago5 messageshackers

pgsql@j-davis.com

20 days ago

If case_sensitive is false and str_tolower() changes the byte length of
the string, then outlen will be incorrect.

Fortunately, pnstrdup() also stops at a NUL terminator, so it will
never overrun; but if outlen is calculated to be too small, then it
could cause truncation. In any case, the input comes from a trusted
source (dictionary configuration), so it's not very serious.

The correct value of outlen is strlen(d->syn[cur].out). But it's only
ever used in one place, which is a call to pnstrdup(). Given that the
string is NUL-terminated anyway, it's easier to fix it by just changing
that to a pstrdup(). Patch attached, backpatch all the way.

Regards,
Jeff Davis

Tristan Partin

tristan@partin.io

19 days ago

In reply to: Jeff Davis (#1)

Re: dict_synonym.c: fix truncation of multibyte sequence

On Thu Jun 4, 2026 at 10:07 PM UTC, Jeff Davis wrote:

If case_sensitive is false and str_tolower() changes the byte length of
the string, then outlen will be incorrect.

Fortunately, pnstrdup() also stops at a NUL terminator, so it will
never overrun; but if outlen is calculated to be too small, then it
could cause truncation. In any case, the input comes from a trusted
source (dictionary configuration), so it's not very serious.

The correct value of outlen is strlen(d->syn[cur].out). But it's only
ever used in one place, which is a call to pnstrdup(). Given that the
string is NUL-terminated anyway, it's easier to fix it by just changing
that to a pstrdup(). Patch attached, backpatch all the way.

The fix looks and sounds good. Do we have any way to test this, so it
doesn't regress in the future? Do we need to export a module to test
through SQL?

--
Tristan Partin
PostgreSQL Contributors Team
AWS (https://aws.amazon.com)

Jeff Davis

pgsql@j-davis.com

19 days ago

In reply to: Tristan Partin (#2)

Re: dict_synonym.c: fix truncation of multibyte sequence

On Fri, 2026-06-05 at 15:57 +0000, Tristan Partin wrote:

In any case, the input comes from a trusted
source (dictionary configuration), so it's not very serious.

The fix looks and sounds good. Do we have any way to test this, so it
doesn't regress in the future?

-- Ⱥ is 2 bytes, 'ⱥ' is 3 bytes
$ echo "foo barȺ" > /path/to/postgres/share/tsearch_data/mbtest.syn

CREATE TEXT SEARCH DICTIONARY mb_syn (
TEMPLATE = synonym,
SYNONYMS = mbtest);

SELECT ts_lexize('mb_syn', 'foo');

=# SELECT ts_lexize('mb_syn', 'foo'); -- before patch
ts_lexize
-----------
{bar}
(1 row)

=# SELECT ts_lexize('mb_syn', 'foo'); -- after patch
ts_lexize
-----------
{barⱥ}
(1 row)

It requires a specially-crafted synonym file, and I'm not sure it's
worth much effort to add a test for this specific path. If we see
similar bugs, it's more likely to be somewhere else that makes the same
faulty assumption.

If you do think we should add tests, we should probably add a set of
dictionary-related files (.syn, .dict, .ths, etc.) that contain a
variety of adversarial Unicode cases.

I'd be inclined to just commit this fix for now. It needs backpatching,
and I don't think we want to backpatch a large set of tests with it.

Regards,
Jeff Davis

Tristan Partin

tristan@partin.io

19 days ago

In reply to: Jeff Davis (#3)

Re: dict_synonym.c: fix truncation of multibyte sequence

On Fri Jun 5, 2026 at 5:37 PM UTC, Jeff Davis wrote:

On Fri, 2026-06-05 at 15:57 +0000, Tristan Partin wrote:

In any case, the input comes from a trusted
source (dictionary configuration), so it's not very serious.

The fix looks and sounds good. Do we have any way to test this, so it
doesn't regress in the future?

-- Ⱥ is 2 bytes, 'ⱥ' is 3 bytes
$ echo "foo barȺ" > /path/to/postgres/share/tsearch_data/mbtest.syn

CREATE TEXT SEARCH DICTIONARY mb_syn (
TEMPLATE = synonym,
SYNONYMS = mbtest);

SELECT ts_lexize('mb_syn', 'foo');

=# SELECT ts_lexize('mb_syn', 'foo'); -- before patch
ts_lexize
-----------
{bar}
(1 row)

=# SELECT ts_lexize('mb_syn', 'foo'); -- after patch
ts_lexize
-----------
{barⱥ}
(1 row)

It requires a specially-crafted synonym file, and I'm not sure it's
worth much effort to add a test for this specific path. If we see
similar bugs, it's more likely to be somewhere else that makes the same
faulty assumption.

If you do think we should add tests, we should probably add a set of
dictionary-related files (.syn, .dict, .ths, etc.) that contain a
variety of adversarial Unicode cases.

I'd be inclined to just commit this fix for now. It needs backpatching,
and I don't think we want to backpatch a large set of tests with it.

I would say proceed as you see fit. I guess I am generally of the
opinion that additional testing is generally always better, but I don't
want to push for something if others don't see the same value. I'd be
happy to provide a patch for the test in a subsequent discussion if that
is a good middle ground?

--
Tristan Partin
PostgreSQL Contributors Team
AWS (https://aws.amazon.com)

Jeff Davis

pgsql@j-davis.com

19 days ago

In reply to: Tristan Partin (#4)

Re: dict_synonym.c: fix truncation of multibyte sequence

On Fri, 2026-06-05 at 20:46 +0000, Tristan Partin wrote:

I would say proceed as you see fit. I guess I am generally of the
opinion that additional testing is generally always better, but I
don't
want to push for something if others don't see the same value. I'd be
happy to provide a patch for the test in a subsequent discussion if
that
is a good middle ground?

Yes, sounds good.

Regards,
Jeff Davis

dict_synonym.c: fix truncation of multibyte sequence

Attachments: