Remove invalid SS2/SS3 handling from EUC-KR routines

Started by SungJun Jang2 months ago7 messageshackers

sjjang112233@gmail.com

2 months ago

Hi,

Per KS X 2901 (formerly KS C 5861-1992), EUC-KR designates only G0
(ASCII) and G1 (KS X 1001). G2 and G3 are not designated; the
single-shift codes SS2 (0x8E) and SS3 (0x8F) therefore cannot appear
as lead bytes, and no 3-byte sequence is ever valid in EUC-KR.

PostgreSQL currently has two inconsistencies with this:

1. Table 23.3 in the documentation lists EUC_KR Bytes/Char as "1-3".
2. pg_euckr_mblen(), pg_euckr_dsplen(), and pg_euckr2wchar_with_len()
delegate to the shared pg_euc_* helpers, which include SS2 (0x8E)
and SS3 (0x8F) handling written for encodings that designate G2/G3
(e.g. EUC-JP, EUC-TW).

The following evidence confirms that SS2/SS3 are not part of EUC-KR:

- KS X 2901 defines EUC-KR with the following code set table
(see attached ksx2901-euc-kr-code-set-table.png):

Code set Code value representation Character set
0 0XXXXXXX KS X 1003 (ASCII)
1 1XXXXXXX 1XXXXXXX KS X 1001
2 SS2 1XXXXXXX [1XXXXXXX [...]] undefined
3 SS3 1XXXXXXX [1XXXXXXX [...]] undefined

The standard states: "In particular, since the character sets for
code set 2 and code set 3 are not defined, they may be defined and
used in the future if necessary."

- pg_euckr_verifychar() (src/common/wchar.c:1044) already has no SS2/SS3
branch; it accepts only 0x00-0x7F (G0, ASCII) and 0xA1-0xFE lead bytes
(G1, KS X 1001). Any 0x8E or 0x8F byte is rejected.

This patch fixes both:

- Replace the three delegating functions with EUC-KR-specific
implementations that recognise only G0 (1 byte) and G1 (2 bytes).
- Set maxmblen from 3 to 2 in pg_wchar_table[PG_EUC_KR].
- Correct Table 23.3 from "1-3" to "1-2".

pg_euckr_verifychar() already has no SS2/SS3 branch, so SS2/SS3 bytes
are never admitted as valid lead bytes. This patch therefore introduces
no behavior change for valid EUC-KR data.

This was discussed in [1]/messages/by-id/CAAAe_zBdGXsALm=GkUPtPx9MLcjcM5hBg3HZU+nh8gKXSjXJJw@mail.gmail.com.

[1]: /messages/by-id/CAAAe_zBdGXsALm=GkUPtPx9MLcjcM5hBg3HZU+nh8gKXSjXJJw@mail.gmail.com
/messages/by-id/CAAAe_zBdGXsALm=GkUPtPx9MLcjcM5hBg3HZU+nh8gKXSjXJJw@mail.gmail.com

--
SungJun Jang

Henson Choi

assam258@gmail.com

2 months ago

In reply to: SungJun Jang (#1)

Re: Remove invalid SS2/SS3 handling from EUC-KR routines

Hi SungJun,

Thanks for the patch. I applied v1 on top of master; it builds
cleanly and the regression tests pass here. I agree with the
direction; a few comments inline.

Per KS X 2901 (formerly KS C 5861-1992), EUC-KR designates only G0

(ASCII) and G1 (KS X 1001). G2 and G3 are not designated; the
single-shift codes SS2 (0x8E) and SS3 (0x8F) therefore cannot appear
as lead bytes, and no 3-byte sequence is ever valid in EUC-KR.

Right. I checked the existing pg_euckr_verifychar() in wchar.c and
it indeed has no SS2/SS3 branch -- it accepts only ASCII and 0xA1-0xFE
lead bytes via IS_EUC_RANGE_VALID(). So the verifier and the new
mblen()/mb2wchar_with_len() now tell the same story for valid input,
which is the goal.

I also did a wider audit for EUC-KR-specific SS2/SS3 handling outside
wchar.c and found none: the UTF-8 <-> EUC-KR conversion proc is
clean, pg_eucjp_increment in mbutils.c is EUC-JP only, and EUC-KR
falls back to pg_generic_charinc which delegates to
pg_euckr_verifychar (already SS2/SS3-free). So the three functions
this patch rewrites are the only entry points; no dangling SS2/SS3
path remains for EUC-KR after the patch.

- Set maxmblen from 3 to 2 in pg_wchar_table[PG_EUC_KR].

This is the only user-visible change, via pg_encoding_max_length('EUC_KR')
(see mbutils.c). The value drops from 3 to 2. I don't think any
real client code relies on 3, but the release notes should mention it.

One small observation: after this patch, EUC-KR's mb routines become
structurally identical to UHC's (1-2 byte Korean, IS_HIGHBIT_SET-only
branch, maxmblen=2), which is a nice consistency win and arguably the
right shape for "Korean 1-2 byte EUC". Could be worth a one-liner in
the commit message.

+1 from me.

---- Side note on EUC-CN ----

GB 2312 under EUC-CN appears to be in the same standards situation
-- the existing pg_euccn_mblen comment in wchar.c states "CS2 and
CS3 are not defined for EUC_CN", and the af79c30dc3e commit message
similarly says "EUC_CN supports only 1- and 2-byte sequences (CS0,
CS1)" -- yet pg_euccn_* still carries SS2/SS3 branches and keeps
maxmblen=3. As I read it, that shape was a
deliberate choice in commit af79c30dc3e ("Fix encoding length for
EUC_CN", CVE-2026-2006) -- the minimal back-patchable fix -- and
the commit message seems to leave the door open for master to
"harmonize in a different direction", though I may be reading more
into it than was intended.

An analogous self-contained cleanup of EUC-CN looks like a natural
follow-up. Historically the EUC code in wchar.c was shaped by
Japanese and Western contributors -- which is why the shared
pg_euc_* helpers carry JIS X 0201/0212/0208 assumptions -- and
EUC-CN inherited that shape by delegation. With the Chinese
contributor community now well established in the project, an
EUC-CN cleanup feels like a natural fit for contributors closer to
that ecosystem, who can also supply native test data, in the same
way KS X 2901 grounds this patch on the Korean side. Noting it
here so the idea stays on the archive; no action requested in this
thread.

Regards,
Henson Choi

Show quoted text

Michael Paquier

michael@paquier.xyz

2 months ago

In reply to: Henson Choi (#2)

Re: Remove invalid SS2/SS3 handling from EUC-KR routines

On Tue, May 12, 2026 at 08:39:04PM +0900, Henson Choi wrote:

One small observation: after this patch, EUC-KR's mb routines become
structurally identical to UHC's (1-2 byte Korean, IS_HIGHBIT_SET-only
branch, maxmblen=2), which is a nice consistency win and arguably the
right shape for "Korean 1-2 byte EUC". Could be worth a one-liner in
the commit message.

I unfortunately cannot read the PNG file added at the top of the
thread, and it is a bit hard to check what you are doing here.

I can however point out two things:
- the existence of src/test/locale/, which may be impacted by what you
are changing here. Perhaps we should try to modernize a bit this
area, so as tests can become easier each time we try to change the
conversion tables?
- the proposed patch has zero regression tests. This could be used to
check the behavior before *and* after the patch.
--
Michael

SungJun Jang

sjjang112233@gmail.com

2 months ago

In reply to: Henson Choi (#2)

Re: Remove invalid SS2/SS3 handling from EUC-KR routines

Hi Henson,

Thanks for the thorough review and the wider audit.

One small observation: after this patch, EUC-KR's mb routines become
structurally identical to UHC's (1-2 byte Korean, IS_HIGHBIT_SET-only
branch, maxmblen=2), which is a nice consistency win and arguably the
right shape for "Korean 1-2 byte EUC". Could be worth a one-liner in
the commit message.

Added to the commit message.

I don't think any real client code relies on 3, but the release notes
should mention it.

Added a release note entry for the pg_encoding_max_length('EUC_KR')
change (3 -> 2) in doc/src/sgml/release-19.sgml.

v2 attached. No other changes from v1.
Thanks again for the review

Regards,
SungJun Jang

2026년 5월 12일 (화) 오후 8:39, Henson Choi <assam258@gmail.com>님이 작성:

Show quoted text

Hi SungJun,

Thanks for the patch. I applied v1 on top of master; it builds
cleanly and the regression tests pass here. I agree with the
direction; a few comments inline.

Per KS X 2901 (formerly KS C 5861-1992), EUC-KR designates only G0

(ASCII) and G1 (KS X 1001). G2 and G3 are not designated; the
single-shift codes SS2 (0x8E) and SS3 (0x8F) therefore cannot appear
as lead bytes, and no 3-byte sequence is ever valid in EUC-KR.

Right. I checked the existing pg_euckr_verifychar() in wchar.c and
it indeed has no SS2/SS3 branch -- it accepts only ASCII and 0xA1-0xFE
lead bytes via IS_EUC_RANGE_VALID(). So the verifier and the new
mblen()/mb2wchar_with_len() now tell the same story for valid input,
which is the goal.

I also did a wider audit for EUC-KR-specific SS2/SS3 handling outside
wchar.c and found none: the UTF-8 <-> EUC-KR conversion proc is
clean, pg_eucjp_increment in mbutils.c is EUC-JP only, and EUC-KR
falls back to pg_generic_charinc which delegates to
pg_euckr_verifychar (already SS2/SS3-free). So the three functions
this patch rewrites are the only entry points; no dangling SS2/SS3
path remains for EUC-KR after the patch.

- Set maxmblen from 3 to 2 in pg_wchar_table[PG_EUC_KR].

This is the only user-visible change, via pg_encoding_max_length('EUC_KR')
(see mbutils.c). The value drops from 3 to 2. I don't think any
real client code relies on 3, but the release notes should mention it.

One small observation: after this patch, EUC-KR's mb routines become
structurally identical to UHC's (1-2 byte Korean, IS_HIGHBIT_SET-only
branch, maxmblen=2), which is a nice consistency win and arguably the
right shape for "Korean 1-2 byte EUC". Could be worth a one-liner in
the commit message.

+1 from me.

---- Side note on EUC-CN ----

GB 2312 under EUC-CN appears to be in the same standards situation
-- the existing pg_euccn_mblen comment in wchar.c states "CS2 and
CS3 are not defined for EUC_CN", and the af79c30dc3e commit message
similarly says "EUC_CN supports only 1- and 2-byte sequences (CS0,
CS1)" -- yet pg_euccn_* still carries SS2/SS3 branches and keeps
maxmblen=3. As I read it, that shape was a
deliberate choice in commit af79c30dc3e ("Fix encoding length for
EUC_CN", CVE-2026-2006) -- the minimal back-patchable fix -- and
the commit message seems to leave the door open for master to
"harmonize in a different direction", though I may be reading more
into it than was intended.

An analogous self-contained cleanup of EUC-CN looks like a natural
follow-up. Historically the EUC code in wchar.c was shaped by
Japanese and Western contributors -- which is why the shared
pg_euc_* helpers carry JIS X 0201/0212/0208 assumptions -- and
EUC-CN inherited that shape by delegation. With the Chinese
contributor community now well established in the project, an
EUC-CN cleanup feels like a natural fit for contributors closer to
that ecosystem, who can also supply native test data, in the same
way KS X 2901 grounds this patch on the Korean side. Noting it
here so the idea stays on the archive; no action requested in this
thread.

Regards,
Henson Choi

SungJun Jang

sjjang112233@gmail.com

2 months ago

In reply to: Michael Paquier (#3)

Re: Remove invalid SS2/SS3 handling from EUC-KR routines

Hi Michael,

Thanks for the review.

I unfortunately cannot read the PNG file added at the top of the
thread, and it is a bit hard to check what you are doing here.

Apologies for that. KS standards are published in Korean only; the
following material is provided to assist verification.

Standards referenced:
- KS X 2901: "UNIX — Hangeul environment"
- KS X 1001: "Code for Information Interchange (Hangeul and Hanja)"
- KS X 1003: "Code for Information Interchange (Latin)"

The relevant clause is at:

https://standard.go.kr/KSCI/api/std/viewMachine.do?reformNo=07&tmprKsNo=KSX2901&formType=STD

Navigation: left panel -> "4 확장 유닉스 부호계-한국 EUC-KR(Extended UNIX
Code-KoRea)" -> "4.2 부호계 1(문자 집합 1)"

The table at that clause reads:

Code set Code value representation Character set used
0 0XXXXXXX KS X 1003 (ASCII)
1 1XXXXXXX 1XXXXXXX KS X 1001
2 SS2 1XXXXXXX [1XXXXXXX [...]] Undefined
3 SS3 1XXXXXXX [1XXXXXXX [...]] Undefined

Accompanying normative text:
Korean: "유닉스 시스템에서 한글을 지원하기 위해서는 한글의 확장
유닉스 부호계-한국(EUC-KR) 표현 형식을 지원해 주어야 한다.
특히 부호계(문자 집합) 2, 3에 쓰는 부호계가 정의되어 있지
않으므로 앞으로 필요하면 부호계를 정의하여 쓸 수 있다."
English: "To support Korean in UNIX systems, the EUC-KR format must
be supported. In particular, since the character sets for
code set 2 and code set 3 are not defined, they may be
defined and used in the future if necessary."

The PNG is a screenshot of this exact clause; the text above is its
full content.

- the existence of src/test/locale/, which may be impacted by what
you are changing here. Perhaps we should try to modernize a bit
this area, so as tests can become easier each time we try to
change the conversion tables?

src/test/locale/ is a collation/ctype test infrastructure: it
validates sort order and libc character-class functions (isalpha and
friends) using locale-dependent data. This patch changes the
encoding layer (maxmblen, mblen, dsplen), which that infrastructure
does not exercise. Adding an EUC-KR directory there would not verify
the invariants this patch is locking down.

There is also a practical concern: src/test/locale/ uses a 1998-era
shell harness, is not integrated into check-world, and is not run by
the buildfarm. Tests added there would provide no ongoing coverage.

I agree that modernizing src/test/locale/ is worthwhile. I would
suggest doing that as a separate follow-up thread after this patch
merges, rather than blocking on it here.

- the proposed patch has zero regression tests. This could be used
to check the behavior before *and* after the patch.

Agreed. Before submitting a new version, I would like to confirm
the approach.

I am planning to split into two patches:

0001 -- Baseline capture. Adds four tests to
src/test/regress/sql/euc_kr.sql covering
pg_encoding_max_length, 1-byte (G0/ASCII) handling,
2-byte (G1/KS X 1001) handling, and SS2 (0x8E) lead-byte
rejection. The expected output records the current
(pre-fix) behavior, so pg_encoding_max_length shows 3.
This patch is not a bug fix -- it captures what master
does today. It applies to master as-is and all tests pass.

0002 -- Actual fix: wchar.c, charset.sgml, and release notes.
The expected output changes by exactly one line:

            - 3
            + 2

All other test results remain unchanged, which is
evidence that valid EUC_KR data is unaffected.

The reason for putting the wrong value (3) in 0001's expected output
is so that 0002's diff speaks for itself: exactly one line changes,
nothing else moves. Without 0001, it would be unclear why the wrong
value appeared in the expected output in the first place.

Does this structure work for you, or would you prefer a different
approach?

Regards,
SungJun Jan

2026년 5월 13일 (수) 오후 2:14, Michael Paquier <michael@paquier.xyz>님이 작성:

Show quoted text

On Tue, May 12, 2026 at 08:39:04PM +0900, Henson Choi wrote:

One small observation: after this patch, EUC-KR's mb routines become
structurally identical to UHC's (1-2 byte Korean, IS_HIGHBIT_SET-only
branch, maxmblen=2), which is a nice consistency win and arguably the
right shape for "Korean 1-2 byte EUC". Could be worth a one-liner in
the commit message.

I unfortunately cannot read the PNG file added at the top of the
thread, and it is a bit hard to check what you are doing here.

I can however point out two things:
- the existence of src/test/locale/, which may be impacted by what you
are changing here. Perhaps we should try to modernize a bit this
area, so as tests can become easier each time we try to change the
conversion tables?
- the proposed patch has zero regression tests. This could be used to
check the behavior before *and* after the patch.
--
Michael

Henson Choi

assam258@gmail.com

2 months ago

In reply to: SungJun Jang (#5)

Re: Remove invalid SS2/SS3 handling from EUC-KR routines

Hi Michael, SungJun,

One framing point first.

Correctness of the rewritten routines is the easy part -- for valid
EUC-KR data there is no behavior change. What deserves scrutiny is
the side-effect surface: callers that have been getting mblen=2 or 3
for 0x8E/0x8F, and the fact that such bytes can already be on disk
via routes that skip pg_verify_mbstr. After the patch those rows
are measured differently by length/substring/LIKE.

Suggested review principles:

(P1) Indirect-entry completeness -- pg_wchar_table[] should be
confirmed as the only callback registry into the EUC-KR slots.
(P2) Direct-call review -- callers of pg_mblen / pg_dsplen /
pg_encoding_mblen / pg_verify_mbstr / pg_mb*cliplen /
pg_encoding_max_length and their cstring variants should be
enumerated and each call site checked for any reliance on
the old mblen=2/3 return for 0x8E/0x8F.
(P3) Verify-bypass:
(a) call-time -- code that runs mblen/dsplen on bytes not
verified in the same call chain (copyto.c file_encoding
scan, cstring pg_mblen in formatting.c / varlena.c) should
be identified.
(b) load-time -- byte-preserving carry-over that skips
pg_verify_mbstr (pg_upgrade relfilenode preservation,
physical replication / WAL replay, pre-patch on-disk data
from earlier versions) should be reasoned about.
(P4) Cross-encoding contamination -- it should be confirmed that no
Japanese-shaped assumption (SS2/SS3, JIS X 0201/0212,
pg_eucjp_increment, shared pg_euc_* helpers) can reach EUC-KR
data after the split.
(P5) User-visible width -- pg_encoding_max_length('EUC_KR') changes
from 3 to 2; buffer-sizing / width-bounded consumers should be
checked.

Review-coverage list (not a test matrix):

Encoding-layer surface (P5)
- pg_wchar_table[PG_EUC_KR].maxmblen and readers of
pg_encoding_max_length() that switch on EUC-KR width
- pg_database_encoding_max_length() callers in mbutils.c

Direct callers of pg_mblen / pg_dsplen (P2)
- varlena.c: text_substring, text_position, text_left/right,
text_reverse, length-family
- oracle_compat.c: lpad/rpad/ltrim/rtrim/btrim/translate
- formatting.c: to_char width handling (pg_mblen_cstr)
- varchar.c: bpchar / varchar length enforcement
(pg_mbcharcliplen)
- xml.c, jsonfuncs.c: pg_mblen on textual content

Pattern matching (P2)
- like.c: multibyte LIKE

Call-time verify-bypass (P3a)
- copyto.c: file_encoding delimiter scan
- formatting.c / varlena.c: cstring pg_mblen callers

Load-time verify-bypass (P3b) -- highest follow-up priority
- pg_upgrade: relfilenode preservation (bytes as-is)
- Physical replication / WAL replay
- Pre-patch on-disk data from earlier versions

Pre-existing-data behavior (P3b consequence)
- length/char_length/octet_length, substring/left/right/position,
LIKE / regex on rows containing 0x8E/0x8F
- Sort order / btree text indexes built before the patch
(bytewise comparator, likely safe but worth naming)

tsearch (P2)
- spell.c, ts_locale.c, wparser_def.c, dict_thesaurus.c

Conversion proc sanity
- utf8_and_euc_kr (table-driven, expected unchanged)

These are the items I think are worth a line or two of reasoning
from someone who has read the code, though I don't think they
need to block this patch -- treating the review-coverage list as
background and following up separately is a perfectly fine
outcome. I plan to work through them myself gradually, as time
allows, and share what I find along the way; other eyes would
be welcome.

----------------------------------------------------------------------

On 2026-05-13, Michael Paquier wrote:

I unfortunately cannot read the PNG file [...]

I read KS X 2901 clause 4.2 directly in Korean; SungJun's later
transcription matches the normative text verbatim.

(On src/test/locale/: agreed with SungJun -- a separate concern
from this patch.)

- the proposed patch has zero regression tests.

Agreed on the coverage gap. SungJun's proposed 0001 baseline
tests (pg_encoding_max_length, G0/ASCII handling, G1/KS X 1001
handling, SS2 rejection) are a sensible starting point and worth
adopting as-is. Beyond that, the review-coverage list above is
what I plan to walk as follow-up: some items will naturally
suggest further regression tests, while others (notably the
load-time bypass paths in P3b) can only be reasoned about in
review. Any additional tests can come out of that pass.

----------------------------------------------------------------------

On 2026-05-13, SungJun Jang wrote (v2):

v2 attached. [commit message and release note updated]

Commit-message update LGTM. On the release-19.sgml hunk: release
notes are curated by the release manager from commit messages
after the fact, not written into the patch. It would probably
be cleaner for v3 to omit that hunk and let the commit message
carry the user-visible change instead (pg_encoding_max_length
3 -> 2, plus the pre-existing 0x8E/0x8F behavior shift via the
P3b routes). The charset.sgml Table 23.3 fix should stay --
that is documented behavior, not a release note.

----------------------------------------------------------------------

On 2026-05-13, SungJun Jang wrote (reply to Michael):

[KS X 2901 clause text quoted in Korean and English]

Confirmed -- both match standard.go.kr.

I am planning to split into two patches:
0001 -- Baseline capture [...]
0002 -- Actual fix [...]

The split is a nice shape for review -- characterization-first
makes the behavior delta easy to see, and "exactly one line moves"
is a useful audit signal. Whether to keep the split or squash at
commit time is a committer call.

Regards,
Henson Choi

SungJun Jang

sjjang112233@gmail.com

2 months ago

In reply to: Henson Choi (#6)

Re: Remove invalid SS2/SS3 handling from EUC-KR routines

Hi Henson,

Thanks for the thorough review notes and the detailed coverage list.

I plan to work through them myself gradually, as time
allows, and share what I find along the way; other eyes would
be welcome.

Agreed. I will follow the same approach -- work through the list
gradually and share findings as I go.

On the release-19.sgml hunk: release
notes are curated by the release manager from commit messages
after the fact, not written into the patch. It would probably
be cleaner for v3 to omit that hunk and let the commit message
carry the user-visible change instead.

Thanks for the clarification.. v3 attached: release-19.sgml hunk
removed. The commit message now carries the user-visible change:

"The only user-visible effect is that pg_encoding_max_length('EUC_KR')
now returns 2 instead of 3."

The charset.sgml Table 23.3 fix is retained.

Regards,
SungJun Jang

2026년 5월 14일 (목) 오전 10:56, Henson Choi <assam258@gmail.com>님이 작성:

Show quoted text

Hi Michael, SungJun,

One framing point first.

Correctness of the rewritten routines is the easy part -- for valid
EUC-KR data there is no behavior change. What deserves scrutiny is
the side-effect surface: callers that have been getting mblen=2 or 3
for 0x8E/0x8F, and the fact that such bytes can already be on disk
via routes that skip pg_verify_mbstr. After the patch those rows
are measured differently by length/substring/LIKE.

Suggested review principles:

(P1) Indirect-entry completeness -- pg_wchar_table[] should be
confirmed as the only callback registry into the EUC-KR slots.
(P2) Direct-call review -- callers of pg_mblen / pg_dsplen /
pg_encoding_mblen / pg_verify_mbstr / pg_mb*cliplen /
pg_encoding_max_length and their cstring variants should be
enumerated and each call site checked for any reliance on
the old mblen=2/3 return for 0x8E/0x8F.
(P3) Verify-bypass:
(a) call-time -- code that runs mblen/dsplen on bytes not
verified in the same call chain (copyto.c file_encoding
scan, cstring pg_mblen in formatting.c / varlena.c) should
be identified.
(b) load-time -- byte-preserving carry-over that skips
pg_verify_mbstr (pg_upgrade relfilenode preservation,
physical replication / WAL replay, pre-patch on-disk data
from earlier versions) should be reasoned about.
(P4) Cross-encoding contamination -- it should be confirmed that no
Japanese-shaped assumption (SS2/SS3, JIS X 0201/0212,
pg_eucjp_increment, shared pg_euc_* helpers) can reach EUC-KR
data after the split.
(P5) User-visible width -- pg_encoding_max_length('EUC_KR') changes
from 3 to 2; buffer-sizing / width-bounded consumers should be
checked.

Review-coverage list (not a test matrix):

Encoding-layer surface (P5)
- pg_wchar_table[PG_EUC_KR].maxmblen and readers of
pg_encoding_max_length() that switch on EUC-KR width
- pg_database_encoding_max_length() callers in mbutils.c

Direct callers of pg_mblen / pg_dsplen (P2)
- varlena.c: text_substring, text_position, text_left/right,
text_reverse, length-family
- oracle_compat.c: lpad/rpad/ltrim/rtrim/btrim/translate
- formatting.c: to_char width handling (pg_mblen_cstr)
- varchar.c: bpchar / varchar length enforcement
(pg_mbcharcliplen)
- xml.c, jsonfuncs.c: pg_mblen on textual content

Pattern matching (P2)
- like.c: multibyte LIKE

Call-time verify-bypass (P3a)
- copyto.c: file_encoding delimiter scan
- formatting.c / varlena.c: cstring pg_mblen callers

Load-time verify-bypass (P3b) -- highest follow-up priority
- pg_upgrade: relfilenode preservation (bytes as-is)
- Physical replication / WAL replay
- Pre-patch on-disk data from earlier versions

Pre-existing-data behavior (P3b consequence)
- length/char_length/octet_length, substring/left/right/position,
LIKE / regex on rows containing 0x8E/0x8F
- Sort order / btree text indexes built before the patch
(bytewise comparator, likely safe but worth naming)

tsearch (P2)
- spell.c, ts_locale.c, wparser_def.c, dict_thesaurus.c

Conversion proc sanity
- utf8_and_euc_kr (table-driven, expected unchanged)

These are the items I think are worth a line or two of reasoning
from someone who has read the code, though I don't think they
need to block this patch -- treating the review-coverage list as
background and following up separately is a perfectly fine
outcome. I plan to work through them myself gradually, as time
allows, and share what I find along the way; other eyes would
be welcome.

----------------------------------------------------------------------

On 2026-05-13, Michael Paquier wrote:

I unfortunately cannot read the PNG file [...]

I read KS X 2901 clause 4.2 directly in Korean; SungJun's later
transcription matches the normative text verbatim.

(On src/test/locale/: agreed with SungJun -- a separate concern
from this patch.)

- the proposed patch has zero regression tests.

Agreed on the coverage gap. SungJun's proposed 0001 baseline
tests (pg_encoding_max_length, G0/ASCII handling, G1/KS X 1001
handling, SS2 rejection) are a sensible starting point and worth
adopting as-is. Beyond that, the review-coverage list above is
what I plan to walk as follow-up: some items will naturally
suggest further regression tests, while others (notably the
load-time bypass paths in P3b) can only be reasoned about in
review. Any additional tests can come out of that pass.

----------------------------------------------------------------------

On 2026-05-13, SungJun Jang wrote (v2):

v2 attached. [commit message and release note updated]

Commit-message update LGTM. On the release-19.sgml hunk: release
notes are curated by the release manager from commit messages
after the fact, not written into the patch. It would probably
be cleaner for v3 to omit that hunk and let the commit message
carry the user-visible change instead (pg_encoding_max_length
3 -> 2, plus the pre-existing 0x8E/0x8F behavior shift via the
P3b routes). The charset.sgml Table 23.3 fix should stay --
that is documented behavior, not a release note.

----------------------------------------------------------------------

On 2026-05-13, SungJun Jang wrote (reply to Michael):

[KS X 2901 clause text quoted in Korean and English]

Confirmed -- both match standard.go.kr.

I am planning to split into two patches:
0001 -- Baseline capture [...]
0002 -- Actual fix [...]

The split is a nice shape for review -- characterization-first
makes the behavior delta easy to see, and "exactly one line moves"
is a useful audit signal. Whether to keep the split or squash at
commit time is a committer call.

Regards,
Henson Choi

Remove invalid SS2/SS3 handling from EUC-KR routines

Attachments:

Attachments:

Attachments: