BUG #19525: In `contrib/dict_int`, handling a token whose first byte is a null byte causes `pnstrdup()` .

Started by PG Bug reporting form · 1 day ago · 2 messages · bugs

noreply@postgresql.org

1 day ago

The following bug has been logged on the website:

Bug reference: 19525
Logged by: Yuelin Wang
Email address: 3020001251@tju.edu.cn
PostgreSQL version: 19beta1
Operating system: Linux (Ubuntu 24.04, x86_64)
Description:

**Component**: `contrib/dict_int/dict_int.c`, function `dintdict_lexize()`
(line 109)

Requires a `SQL_ASCII`-encoded database (to bypass null-byte encoding
checks) and superuser to install the extension and create a helper function
that passes a `bytea` token directly to the lexize callback. Once the
dictionary is created, any role granted `EXECUTE` on the helper can trigger
the crash.

```sql
-- 1. Create SQL_ASCII database (null bytes are not rejected)
CREATE DATABASE vuln_ascii ENCODING 'SQL_ASCII' TEMPLATE template0;
\c vuln_ascii

-- 2. Install extension and create an intdict dictionary with
--    REJECTLONG=false
CREATE EXTENSION dict_int;
CREATE TEXT SEARCH DICTIONARY intdict_test (
    TEMPLATE = intdict_template,
    MAXLEN = 8192,
    REJECTLONG = false
);

-- 3. Create a C helper (raw_lexize.so) that invokes the lexize callback
--    with a raw bytea token, bypassing the text encoding layer.
CREATE FUNCTION raw_lexize(dict regdictionary, token bytea)
RETURNS text[] AS 'raw_lexize', 'raw_lexize' LANGUAGE C STRICT;

-- 4. Trigger: null byte at position 0 causes pnstrdup to allocate 1 byte,
-- but txt[8192] = '\0' writes 8191 bytes past the end of the allocation.
SELECT raw_lexize('intdict_test',
decode('00' || repeat('78', 10000), 'hex'));
-- Server closes connection; ASan reports heap-buffer-overflow WRITE of size 1
-- at dict_int.c:109 in dintdict_lexize.
```

ASan confirmation (server killed the backend; connection dropped):
```
==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x525000052880
WRITE of size 1 at 0x525000052880 thread T0
#0 in dintdict_lexize /data/ylwang/Projects/postgres/contrib/dict_int/dict_int.c:109
#1 in FunctionCall4Coll .../src/backend/utils/fmgr/fmgr.c:1215
#2 in raw_lexize /tmp/raw_lexize.c:37
SUMMARY: AddressSanitizer: heap-buffer-overflow .../contrib/dict_int/dict_int.c:109 in dintdict_lexize
```

`pnstrdup(ptr, len)` uses `strnlen(ptr, len)` internally, so when the token
begins with a null byte it allocates only 1 byte. The variable `len` is not
updated to reflect this and retains the original token length, so the guard
at line 98 (`if (len > d->maxlen)`) passes, and line 109 writes `'\0'` at
offset `d->maxlen` (e.g., 8192) into a 1-byte allocation.

The fix is to recompute the effective length from the allocated buffer after
the `pnstrdup` call, for example by replacing the `if (len > d->maxlen)`
check with `if (strlen(txt) > d->maxlen)`. This ensures the truncation
offset is always within the bounds of what `pnstrdup` actually allocated.
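
The length mismatch and the proposed check can be illustrated outside the server. The following is a minimal sketch, not the actual `dict_int.c` code: `dup_upto()` is a hypothetical stand-in for `pnstrdup()` that uses `malloc()` instead of `palloc()`, and `copy_and_truncate()` is an assumed name for the truncation step.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for pnstrdup(): copies at most n bytes and stops at
 * the first NUL, so the allocation may be much smaller than n + 1 bytes.
 */
static char *
dup_upto(const char *in, size_t n)
{
    /* Same result as strnlen(in, n), written with memchr for portability. */
    const char *nul = memchr(in, '\0', n);
    size_t      used = nul ? (size_t) (nul - in) : n;
    char       *out = malloc(used + 1);

    if (out == NULL)
        abort();
    memcpy(out, in, used);
    out[used] = '\0';
    return out;
}

/*
 * Copy a token and cut the copy to at most maxlen bytes.
 * The caller frees the result.
 */
static char *
copy_and_truncate(const char *token, size_t len, size_t maxlen)
{
    char       *txt = dup_upto(token, len);

    /*
     * Compare the copy's real length, not the caller-supplied len.
     * Testing "len > maxlen" here would write txt[maxlen] even when the
     * copy is only 1 byte long.
     */
    if (strlen(txt) > maxlen)
        txt[maxlen] = '\0';
    return txt;
}
```

With a 16-byte token whose first byte is NUL and `maxlen` set to 8, the copy has length 0 and no write takes place. An ordinary 10-byte token with `maxlen` set to 4 is cut to its first four bytes.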

Ayush Tiwari

ayushtiwari.slg01@gmail.com

about 21 hours ago

In reply to: PG Bug reporting form (#1)

Re: BUG #19525: In `contrib/dict_int`, handling a token whose first byte is a null byte causes `pnstrdup()` .

Hi,

On Thu, 18 Jun 2026 at 18:54, PG Bug reporting form <noreply@postgresql.org>
wrote:

The following bug has been logged on the website:

Bug reference: 19525

Thanks for the report and repro!

`pnstrdup(ptr, len)` uses `strnlen(ptr, len)` internally, so when the token
begins with a null byte it allocates only 1 byte. The variable `len` is not
updated to reflect this and retains the original token length, so the guard
at line 98 (`if (len > d->maxlen)`) passes, and line 109 writes `'\0'` at
offset `d->maxlen` (e.g., 8192) into a 1-byte allocation.

The fix is to recompute the effective length from the allocated buffer after
the `pnstrdup` call, for example by replacing the `if (len > d->maxlen)`
check with `if (strlen(txt) > d->maxlen)`. This ensures the truncation
offset is always within the bounds of what `pnstrdup` actually allocated.

Your analysis seems right to me.

While looking around I think dict_xsyn may have a related issue: in
dxsyn_lexize() the token is copied with pnstrdup() and the original
length is then handed to str_tolower(), which reads that many bytes and
so could read past the shorter copy.
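
The over-read pattern can be sketched in the same way. This is only an illustration under assumptions, not the dict_xsyn code: `lower_n()` is a hypothetical routine that reads exactly `n` bytes (standing in for a length-driven lower-casing call), and the bounded copy is done with `malloc()` rather than `pnstrdup()`.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-driven helper: reads exactly n bytes from buf. */
static char *
lower_n(const char *buf, size_t n)
{
    char       *out = malloc(n + 1);

    if (out == NULL)
        abort();
    for (size_t i = 0; i < n; i++)
        out[i] = (char) tolower((unsigned char) buf[i]);
    out[n] = '\0';
    return out;
}

/* Copy a token up to its first NUL, then lower-case the copy. */
static char *
lower_copy(const char *token, size_t len)
{
    /* Bounded copy that stops at the first NUL, as pnstrdup() does. */
    const char *nul = memchr(token, '\0', len);
    size_t      used = nul ? (size_t) (nul - token) : len;
    char       *copy = malloc(used + 1);
    char       *result;

    if (copy == NULL)
        abort();
    memcpy(copy, token, used);
    copy[used] = '\0';

    /*
     * Pass the copy's own length.  Passing the original len would make
     * lower_n() read len bytes from a buffer of only used + 1 bytes.
     */
    result = lower_n(copy, strlen(copy));
    free(copy);
    return result;
}
```

For a 7-byte token with an embedded NUL after the third byte, only the first three bytes are copied and lower-cased, and nothing beyond the copy is read.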

Attaching a patch that fixes both the above issues.

Regards,
Ayush
