BUG #19354: JOHAB rejects valid byte sequences

Started by PG Bug reporting form · 4 months ago · 20 messages · bugs
#1 PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 19354
Logged by: Jeroen Vermeulen
Email address: jtvjtv@gmail.com
PostgreSQL version: 18.1
Operating system: Debian unstable x86-64, macOS, Windows, etc.
Description:

Calling libpq, connecting to a UTF8 database and successfully setting client
encoding to JOHAB, this statement:

PQexec(connection, "SELECT '\x8a\x5c'");

Returned an empty result with this error message:

ERROR: invalid byte sequence for encoding "JOHAB": 0x8a 0x5c

AFAICT, 0x8a 0x5c is a valid JOHAB sequence making up Hangul character "굎".
Easily verified in Python:

print(b'\x8a\x5c'.decode('johab'))

It's the same story for some other valid sequences I tried, including this
character's "neighbours" 0x8a 0x5b and 0x8a 0x5d.
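The neighbouring sequences can be checked the same way; a quick sketch using Python's built-in johab codec (based on the CP1361 mapping):

```python
# Decode the reported sequence and its neighbours with Python's
# built-in johab codec.
for raw in (b'\x8a\x5b', b'\x8a\x5c', b'\x8a\x5d'):
    print(raw.hex(), raw.decode('johab'))
```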

My test code did work with similar two-byte characters in BIG5, GB18030,
UTF-8, SJIS, and UHC. It just breaks with these JOHAB characters on all of
these x86-64 docker images: "archlinux", "debian", "debian:unstable",
"fedora", and "ubuntu". And I got the same results on macOS+homebrew,
Windows+MinGW with pacman-installed postgres, and a native Windows VM with
whatever-postgres-they-preinstall.

#2 Robert Haas
robertmhaas@gmail.com
In reply to: PG Bug reporting form (#1)
Re: BUG #19354: JOHAB rejects valid byte sequences

On Sat, Dec 13, 2025 at 2:12 PM PG Bug reporting form
<noreply@postgresql.org> wrote:

Calling libpq, connecting to a UTF8 database and successfully setting client
encoding to JOHAB, this statement:

PQexec(connection, "SELECT '\x8a\x5c'");

Returned an empty result with this error message:

ERROR: invalid byte sequence for encoding "JOHAB": 0x8a 0x5c

AFAICT, 0x8a 0x5c is a valid JOHAB sequence making up Hangul character "굎".
Easily verified in Python:

print(b'\x8a\x5c'.decode('johab'))

It's the same story for some other valid sequences I tried, including this
character's "neighbours" 0x8a 0x5b and 0x8a 0x5d.

My reading of pg_johab_verifystr() is that it accepts any character
without the high bit set as a single-byte character. Otherwise, it
calls pg_johab_mblen() to determine the length of the character, and
that in turn calls pg_euc_mblen(), which returns 3 if the first byte
is 0x8f and otherwise 2. Whatever the answer, it then wants each byte
to pass IS_EUC_RANGE_VALID() which allows for bytes from 0xa1 to 0xfe.
Your byte string doesn't match that rule, so it makes sense that it
fails.
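The rules described above can be re-expressed in Python for clarity (the function names mirror the ones in wchar.c, but this is a sketch, not the actual C source):

```python
# EUC-style rules as described above: 0x8F starts a 3-byte sequence,
# any other high-bit byte starts a 2-byte one, and every trailing
# byte must fall in 0xA1-0xFE.
def euc_style_mblen(b0):
    if b0 < 0x80:
        return 1
    return 3 if b0 == 0x8F else 2

def is_euc_range_valid(b):
    return 0xA1 <= b <= 0xFE

def current_johab_verifychar(seq):
    """Return the character length, or -1 if the sequence is rejected."""
    n = euc_style_mblen(seq[0])
    if len(seq) < n:
        return -1
    if all(is_euc_range_valid(b) for b in seq[1:n]):
        return n
    return -1

print(current_johab_verifychar(b'\x8a\x5c'))  # -1: trail byte 0x5C fails the range test
print(current_johab_verifychar(b'\x89\xef'))  # 2: trail byte 0xEF is in 0xA1-0xFE
```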

What confuses me is that
https://en.wikipedia.org/wiki/KS_X_1001#Johab_encoding seems to say
that the encoding is always a 2-byte encoding and that any 2-byte
sequence with the high bit set on the first character is a valid
character. So the rules we're implementing don't seem to match that at
all. But unfortunately the intent behind the current code is not
clear. It was introduced by Bruce in 2002 in commit
a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, but I don't see comments
there or elsewhere explaining what the thought was behind the way the
code works, so I don't know if this is some weird variant of JOHAB
that intentionally works differently or if this was just never
correct.

--
Robert Haas
EDB: http://www.enterprisedb.com

#3 Jeroen Vermeulen
jtvjtv@gmail.com
In reply to: Robert Haas (#2)
Re: BUG #19354: JOHAB rejects valid byte sequences

Hi Robert. Thanks for following up.

The original author of the support code in libpqxx also noted that there
was a discrepancy. Python does accept these 2-byte sequences, and decodes
them to Hangul characters.

The way I read the Wikipedia section, Johab isn't like the EUC encodings in
that it adds characters that contain ASCII-like values in the second byte.
I guess that was needed to support Chinese characters in addition to
Hangul. Unit-testing for the embedded-backslash hazard was what led me to
find the problem.
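The hazard in question can be shown in two lines (a sketch with Python's johab codec):

```python
# The embedded-backslash hazard: in JOHAB the byte 0x5C ('\') can be
# the second byte of a Hangul character, so a byte-oriented scanner
# that is unaware of the encoding sees a spurious backslash.
s = '굎'.encode('johab')      # b'\x8a\x5c'
print(s, b'\\' in s)          # the backslash byte appears mid-character
```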

This bit worries me: "Other, vendor-defined, Johab variants also exist" —
such as an EBCDIC-based one and a stateful one!

Jeroen

On Mon, Dec 15, 2025, 18:46 Robert Haas <robertmhaas@gmail.com> wrote:


#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeroen Vermeulen (#3)
Re: BUG #19354: JOHAB rejects valid byte sequences

Jeroen Vermeulen <jtvjtv@gmail.com> writes:

This bit worries me: "Other, vendor-defined, Johab variants also exist" —
such as an EBCDIC-based one and a stateful one!

Yeah. So what we have here is:

1. Our JOHAB implementation has apparently been wrong since day one.

2. Wrongness may be in the eye of the beholder, since there are
multiple versions of JOHAB.

3. Your complaint is the first, AFAIR.

4. That wikipedia page says "Following the introduction of Unified
Hangul Code by Microsoft in Windows 95, and Hangul Word Processor
abandoning Johab in favour of Unicode in 2000, Johab ceased to be
commonly used."

Given these things, I wonder if we shouldn't desupport JOHAB
rather than attempt to fix it. Fixing would likely be a significant
amount of work: if we don't even have the character lengths right,
how likely is it that our conversions to other character sets are
correct? I also worry that if different PG versions have different
ideas of the mapping, there could be room for dump/reload problems,
and maybe even security problems related to the backslash issue.

regards, tom lane

#5 VASUKI M
vasukianand0119@gmail.com
In reply to: Tom Lane (#4)
Re: BUG #19354: JOHAB rejects valid byte sequences

Thanks all. That analysis makes a lot of sense.

Given the lack of a clear spec, the existence of multiple JOHAB
variants, and how long this has apparently been "working" without
anyone noticing, IMHO desupporting it does seem like the least risky
option. At this point, trying to fix JOHAB variants feels like
opening a pretty big can of worms, especially with the potential for
dump/reload surprises or subtle parsing/security issues.

I don't have additional data to add, but +1 on removal or deprecation
being a reasonable outcome here, given how obscure and effectively
dead the encoding is nowadays.

Thanks for digging into this.

Cheers,
Vasuki M

On Tue, Dec 16, 2025 at 11:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:


#6 Jeroen Vermeulen
jtvjtv@gmail.com
In reply to: VASUKI M (#5)
Re: BUG #19354: JOHAB rejects valid byte sequences

My one worry is perhaps Johab is on the list because one important user
needed it.

But even then that requirement may have gone away?

Jeroen

On Tue, Dec 16, 2025, 07:23 VASUKI M <vasukianand0119@gmail.com> wrote:


#7 Robert Haas
robertmhaas@gmail.com
In reply to: Jeroen Vermeulen (#6)
Re: BUG #19354: JOHAB rejects valid byte sequences

On Tue, Dec 16, 2025 at 2:42 AM Jeroen Vermeulen <jtvjtv@gmail.com> wrote:

My one worry is perhaps Johab is on the list because one important user needed it.

But even then that requirement may have gone away?

Well, that was over 20 years ago. There's a very good chance that even
if somebody was using JOHAB back then, they're not still using it now.

What's mystifying to me is that, presumably, somebody had a reason at
the time for thinking that this was correct. I know that our quality
standards were a whole lot looser back then, but I still don't quite
understand why someone would have spent time and effort writing code
based on a purely fictitious encoding scheme. So I went looking for
where we got the mapping tables from. UCS_to_JOHAB.pl expects to read
from a file JOHAB.TXT, of which the latest version seems to be found
here:

https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/JOHAB.TXT

And indeed, if I run UCS_to_JOHAB.pl on that JOHAB.txt file, it
regenerates the current mapping files. Playing with it a bit:

rhaas=# select convert_from(e'\\x8a5c'::bytea, 'johab');
ERROR: invalid byte sequence for encoding "JOHAB": 0x8a 0x5c
rhaas=# select convert_from(e'\\x8444'::bytea, 'johab');
ERROR: invalid byte sequence for encoding "JOHAB": 0x84 0x44
rhaas=# select convert_from(e'\\x89ef'::bytea, 'johab');
 convert_from
--------------
 괦
(1 row)

So, \x8a5c is the original example, which does appear in JOHAB.TXT,
and \x8444 is the first multi-byte character in that file, and both of
them fail. But 89ef, which also appears in that file, doesn't fail,
and from what I can tell the mapping is correct. So apparently we've
got the "right" mappings, but you can only actually use the ones that
match the code's rules for something to be a valid multi-byte
character, which aren't actually in sync with the mapping table. I'm
left with the conclusions that (1) nobody ever actually tried using
this encoding for anything real until 3 days ago and (2) we don't have
any testing infrastructure that verifies that the characters in the
mapping tables are actually accepted by pg_verifymbstr(). I wonder how
many other encodings we have that don't actually work?
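One way to gauge how much of the repertoire is unreachable under those rules is a rough cross-check using Python's johab codec as a stand-in for JOHAB.TXT (a sketch, not a rigorous count against the actual mapping files):

```python
# Count two-byte sequences that Python's johab codec decodes, split
# by whether the EUC-style rule (lead != 0x8F, trail in 0xA1-0xFE)
# would accept them.  'rejected' is the unreachable part of the map.
accepted = rejected = 0
for lead in range(0x80, 0x100):
    for trail in range(0x100):
        try:
            bytes([lead, trail]).decode('johab')
        except UnicodeDecodeError:
            continue
        if lead != 0x8F and 0xA1 <= trail <= 0xFE:
            accepted += 1
        else:
            rejected += 1
print(accepted, rejected)  # both counts are large: much of the map is unreachable
```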

--
Robert Haas
EDB: http://www.enterprisedb.com

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#7)
Re: BUG #19354: JOHAB rejects valid byte sequences

Robert Haas <robertmhaas@gmail.com> writes:

... So I went looking for
where we got the mapping tables from. UCS_to_JOHAB.pl expects to read
from a file JOHAB.TXT, of which the latest version seems to be found
here:
https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/JOHAB.TXT
And indeed, if I run UCS_to_JOHAB.pl on that JOHAB.txt file, it
regenerates the current mapping files.

Thanks for doing that research!

So apparently we've
got the "right" mappings, but you can only actually use the ones that
match the code's rules for something to be a valid multi-byte
character, which aren't actually in sync with the mapping table.

Yeah. Looking at the code in wchar.c, it's clear that it thinks
that JOHAB has the same character-length rules as EUC_KR, which is
something that one might guess based on available documentation that
says it's related to that encoding. So I can see how we got here.

However, that doesn't mean we can fix pg_johab_mblen() and we're done.
I'm still quite afraid that we'd be introducing security-grade
inconsistencies of interpretation between different PG versions.

I'm
left with the conclusions that (1) nobody ever actually tried using
this encoding for anything real until 3 days ago and (2) we don't have
any testing infrastructure that verifies that the characters in the
mapping tables are actually accepted by pg_verifymbstr(). I wonder how
many other encodings we have that don't actually work?

Indeed. Anyone want to do some testing?

regards, tom lane

#9 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#8)
Re: BUG #19354: JOHAB rejects valid byte sequences

On Tue, Dec 16, 2025 at 10:41 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

However, that doesn't mean we can fix pg_johab_mblen() and we're done.
I'm still quite afraid that we'd be introducing security-grade
inconsistencies of interpretation between different PG versions.

I understand that fear, but I do not have an opinion either way on
whether there would be an actual vulnerability.

I think there is a good chance that the right going-forward fix is to
deprecate the encoding, because according to
https://www.unicode.org/Public/MAPPINGS/EASTASIA/ReadMe.txt this and
everything else that's now under
https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/ were
deprecated in 2001. By the time v19 is released, the deprecation will
be a quarter-century old, and the fact that it doesn't work is good
evidence that few people will miss it, though perhaps the original
poster will want to put forward an argument for why we should still
care about this.

What to do in the back branches is a more difficult question. Since
this is a client-only encoding, there's no issue of what is already
stored in the database, and we would not be proposing to change any of
the mappings, just allow the ones that don't currently work to do so.
I *think* that fixing pg_johab_mblen() would be "forward compatible":
the subset of the encoding that already works would continue to behave
in the same way, and the rest of it would begin working as well.

And, I don't really like throwing up our hands and deciding that
already-released features are free to continue not working. That's
what bug-fix releases are for.

On the other hand, fixing this bug which apparently affects very few
users, and in the process creating a scarier, CVE-worthy bug would not
win us many friends, especially in view of the apparently-low uptake
of this encoding.

--
Robert Haas
EDB: http://www.enterprisedb.com

#10 Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#8)
Re: BUG #19354: JOHAB rejects valid byte sequences

On Tue, Dec 16, 2025 at 10:41:46AM -0500, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I'm
left with the conclusions that (1) nobody ever actually tried using
this encoding for anything real until 3 days ago and (2) we don't have
any testing infrastructure that verifies that the characters in the
mapping tables are actually accepted by pg_verifymbstr(). I wonder how
many other encodings we have that don't actually work?

Indeed. Anyone want to do some testing?

FWIW, I was made aware a couple of weeks ago by a colleague that
SJIS and SHIFT_JIS_2004 are used by some customers, and that the
conversion mappings in the tree are many years out of date, with
Postgres not understanding some of their characters. These two are
marginal in the mostly-UTF8 world we live in these days, but it's
annoying for byte sequences that should not change across the years
and just need to be refreshed with new data.
--
Michael

#11 Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#9)
Re: BUG #19354: JOHAB rejects valid byte sequences

On Wed, Dec 17, 2025 at 7:43 AM Robert Haas <robertmhaas@gmail.com> wrote:

I think there is a good chance that the right going-forward fix is to
deprecate the encoding, because according to
https://www.unicode.org/Public/MAPPINGS/EASTASIA/ReadMe.txt this and
everything else that's now under
https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/ were
deprecated in 2001. By the time v19 is released, the deprecation will
be a quarter-century old, and the fact that it doesn't work is good
evidence that few people will miss it, though perhaps the original
poster will want to put forward an argument for why we should still
care about this.

Right, that stuff was withdrawn, along with the BIG5 and JIS X 0212
mappings (here's some interesting discussion about their normative
status[1]). From what I can figure out, JOHAB was an MS-DOS codepage
(1361), obsoleted by UHC (949) some time around MS-DOS 6.22 or MS-DOS
7 and Windows 95.

So +1 from me, set the phasers to git rm. Based on the comments for
enum pg_enc, we don't need to worry about numerical stability of
client-only encodings, so I just deleted it (unlike PG_MULE_INTERNAL
which became PG_UNUSED_1). I didn't mention it in
doc/src/sgml/appendix-obsolete.sgml: the decision criterion for that
seems to be that there was an SGML id that appeared in a URL, which is
not the case here. The release notes seem like enough of a tombstone
for something that we strongly suspect has 0 users. Wait until 20, or
just do it now?

I don't have an opinion yet whether the code in the back-branches
might be dangerous, or "fixing" it might be more dangerous, but it's
an interesting question...

[1]: https://unicode.org/mail-arch/unicode-ml/y2002-m03/0691.html

Attachments:

0001-Remove-JOHAB-encoding.patch.gz (application/gzip)

#12 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#11)
Re: BUG #19354: JOHAB rejects valid byte sequences

On 14/04/2026 09:30, Thomas Munro wrote:

On Wed, Dec 17, 2025 at 7:43 AM Robert Haas <robertmhaas@gmail.com> wrote:

I think there is a good chance that the right going-forward fix is to
deprecate the encoding, because according to
https://www.unicode.org/Public/MAPPINGS/EASTASIA/ReadMe.txt this and
everything else that's now under
https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/ were
deprecated in 2001. By the time v19 is released, the deprecation will
be a quarter-century old, and the fact that it doesn't work is good
evidence that few people will miss it, though perhaps the original
poster will want to put forward an argument for why we should still
care about this.

Right, that stuff was withdrawn, along with the BIG5 and JIS X 0212
mappings (here's some interesting discussion about their normative
status[1]). From what I can figure out, JOHAB was an MS-DOS codepage
(1361), obsoleted by UHC (949) some time around MS-DOS 6.22 or MS-DOS
7 and Windows 95.

So +1 from me, set the phasers to git rm.

+1

Based on the comments for enum pg_enc, we don't need to worry about
numerical stability of client-only encodings, so I just deleted it
(unlike PG_MULE_INTERNAL which became PG_UNUSED_1).

Ok. I hope there are no 8.2-era client programs out there that are still
abusing pg_wchar.h. I think we're good, but we've never really exercised
the strategy that was laid out in commit 8468146b03c8.

Wait until 20, or just do it now?

Let's just do it now.
- Heikki

#13 Henson Choi
assam258@gmail.com
In reply to: Heikki Linnakangas (#12)
Re: BUG #19354: JOHAB rejects valid byte sequences

Hi all,

I'd like to ask for a brief hold on the removal before the patch
is committed.

JOHAB is formally specified in KS C 5601-1992 Annex 3 (later
renumbered KS X 1001:1992), a Korean national standard published
by the Korean Agency for Technology and Standards (KATS). This
document is written in Korean, and I believe a native Korean
speaker reviewing the original standard directly is the most
reliable way to verify the correct byte ranges and resolve the
ambiguity that has been mentioned in this thread.

I am currently looking into the original standard document and
will follow up with a precise specification and a fix patch
shortly.

One point worth considering before removal: JOHAB may be obsolete
for new systems, but legacy Korean data from the 1990s does exist
in old archives, government records, and university systems. As a
client encoding, JOHAB support in PostgreSQL provides a useful
migration path for loading such data. Removing it closes that door
permanently with no way back.

I am Korean, and I would like to take responsibility for verifying
the standard and submitting a proper fix.

Regards,
Henson

#14 Henson Choi
assam258@gmail.com
In reply to: Heikki Linnakangas (#12)
Re: BUG #19354: JOHAB rejects valid byte sequences

Hi hackers,

So +1 from me, set the phasers to git rm.

+1

Wait until 20, or just do it now?

Let's just do it now.

Following up on my earlier note with an actual review of the primary
Korean national standard and a fix patch. The result turns out to be
small, and I believe it resolves the ambiguity that drove the removal
proposal.

Standard reference
------------------

The authoritative specification for JOHAB is Annex 3 of KS X 1001
(originally KS C 5601-1992 Annex 3, renumbered KS X 1001:1992 and
republished as KS X 1001:2004), published by the Korean Agency for
Technology and Standards (KATS) and available from the national
e-standards portal:

https://standard.go.kr/KSCI/api/std/viewMachine.do?reformNo=08&tmprKsNo=KSX1001&formType=STD

The decisive passages are quoted below in the original Korean with an
English translation, so non-Korean readers can verify the byte ranges
the fix implements.

Two terms from the standard recur throughout the quoted passages:

* 완성형 부호계 (romanised "WANSUNG", literally "completion-form
code set"). Each Hangul syllable is assigned a single code point
drawn from a fixed table of pre-composed syllables. The main
body of KS X 1001 defines such a table of 2,350 syllables; per
the standard's commentary, that subset was chosen by frequency
analysis over samples from publishing, print media, industry,
academia and dictionaries at the time of the 1987 revision,
which is why some valid modern syllables (e.g. 뢔, 쌰, 쎼, 쓔,
쬬) were deliberately excluded. EUC-KR is the packed 8-bit form
of that WANSUNG table, and Microsoft's CP949 / UHC is a later
superset that fills in additional syllables.

* 조합형 부호계 (romanised "JOHAB", literally "combinational code
set"). Each Hangul syllable is constructed at encoding time
from 5-bit codes for the initial consonant, medial vowel, and
final consonant packed into two bytes, so all 11,172 modern
syllables are directly representable without a lookup table.
This is what Annex 3 defines and what PostgreSQL ships under
the encoding name JOHAB.

In short: completion form is a frequency-curated lookup, combinational
form is an algorithmic composition that covers the full modern Hangul
space. Unicode later adopted the combinational form's coverage as a
completion-form table: the Hangul Syllables block (U+AC00 - U+D7A3)
encodes exactly the same 11,172 modern syllables, as precomposed code
points. So today the three Korean-related encodings PostgreSQL
supports sit along this spectrum: EUC_KR (curated completion form),
UHC (extended completion form), and JOHAB (algorithmic combinational
form).
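The coverage claim is easy to confirm with Python's johab codec, under the assumption that the codec follows the Annex 3 composition:

```python
# Every precomposed syllable in the Unicode Hangul Syllables block
# (U+AC00-U+D7A3, 11,172 code points) should round-trip through JOHAB
# as a two-byte sequence with a lead byte in the Hangul range.
count = 0
for cp in range(0xAC00, 0xD7A4):
    b = chr(cp).encode('johab')
    assert len(b) == 2 and 0x84 <= b[0] <= 0xD3
    assert b.decode('johab') == chr(cp)
    count += 1
print(count)  # 11172
```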

부속서 3 보조 부호계 (2바이트 조합형 부호계)
[Annex 3. Supplementary code set (two-byte combinational code)]

1. 적용 범위
[Scope]

이 부속서에서는 기본 부호계인 2바이트 완성형 부호계의 보조 부호계로서,
2바이트 조합형 부호계를 규정한다.
[This annex specifies the two-byte combinational code set as the
supplementary code set to the two-byte completion-form code set that
constitutes the main body of the standard.]

2. 도형 문자
[Graphic characters]

a) 한 글
[Hangul]
부속서 3 표 2에 규정된 첫소리 글자 19자, 가운뎃소리 글자 21자,
끝소리 글자 27자로 조합 가능한, 모든 현대 한글 글자 마디(11 172자)
및 현대 한글 낱자(67자)
[All modern Hangul syllables (11,172) and modern Hangul jamo (67)
that can be composed from the 19 initials, 21 medials, and 27
finals defined in Annex 3 Table 2.]
b) 한 자
[Hanja]
2바이트 완성형 부호계에서 규정한 한자(4 888자)
[The 4,888 Hanja defined in the two-byte completion-form code
set.]
c) 그 밖의 문자
[Other characters]
2바이트 완성형 부호계에서 규정한 문자 중에서 현대 한글 글자 마디
및 현대 한글 낱자, 한자를 제외한 도형 문자(937자)
[The 937 graphic characters defined in the completion-form code
set other than modern Hangul syllables, modern Hangul jamo, and
Hanja.]

3. 도형 문자의 배치 영역
[Graphic-character placement]

도형 문자의 배치 영역은 부속서 3 표 1과 같다.
[The placement of the graphic characters is given in Annex 3
Table 1.]

부속서 3 표 1 도형 문자의 배치 영역
[Annex 3 Table 1. Placement of graphic characters]

구 분 첫째 바이트 둘째 바이트
[Category] [Lead byte] [Trail byte]
---------------- ----------- --------------------
한글 글자마디 84H–D3H 41H–7EH, 81H–FEH
[Hangul syllables]
사용자 정의 영역 D8H 31H–7EH, 91H–FEH
[User-defined area]
기타 문자 D9H–DEH 31H–7EH, 91H–FEH
[Other characters]
한 자 E0H–F9H 31H–7EH, 91H–FEH
[Hanja]

비 고 16진수를 나타내기 위하여 맨 뒤에 H를 적는다
(10 H는 10진법으로 16이다).
[Note: a trailing H denotes a hexadecimal value
(e.g. 10H equals 16 in decimal).]

4. 한글 글자 마디의 부호값 구성 및 배열
[Encoding and layout of Hangul syllables]

각 한글 글자 마디의 부호값은 2바이트 내에 첫소리 글자 5비트,
가운뎃소리 글자 5비트, 끝소리 글자 5비트로 하여, 한글 낱자를 조합하여
표현한 값으로 정의한다. 각 한글 낱자의 순서는 최상위 비트(MSB)를 1로
하고 나서 첫소리, 가운뎃소리, 끝소리 글자가 순서대로 나오도록
구성한다.
[The code value of each Hangul syllable is defined as the composition
of the Hangul letters within two bytes: 5 bits for the initial
consonant, 5 bits for the medial vowel, and 5 bits for the final
consonant, laid out with the most-significant bit set to 1 followed
by the initial, medial, and final in that order.]
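The layout described above can be illustrated by unpacking the bug report's sequence (the 5-bit field-to-jamo assignments from Annex 3 Table 2 are omitted here; only the packing is shown):

```python
# Unpack a two-byte JOHAB Hangul code into MSB + three 5-bit fields:
# 1 | iiiii | mmmmm | fffff  (initial, medial, final).
def unpack(b1, b2):
    v = (b1 << 8) | b2
    assert v & 0x8000, "Hangul codes have the most-significant bit set"
    return (v >> 10) & 0x1F, (v >> 5) & 0x1F, v & 0x1F

initial, medial, final = unpack(0x8A, 0x5C)
print(initial, medial, final)                 # the three 5-bit jamo codes
print(bytes([0x8A, 0x5C]).decode('johab'))    # 굎
```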

Annex 3 continues with Table 2 (5-bit jamo codes), Table 3 (row-wise
mapping between completion-form and combinational-form for Hanja and
other characters), and usage notes. Those are not needed for the
verifier fix, but they do confirm that the mapping tables we already
ship in johab_to_utf8.map line up with the standard; the same is true
of the data under unicode.org's JOHAB.TXT that Robert pointed to
earlier in the thread.

On "multiple variants": the KS national standard for JOHAB (Annex 3)
is singular and authoritative, and the mapping tables we ship match
it. The Wikipedia note about EBCDIC-based and stateful JOHAB variants
refers to niche vendor encodings that PostgreSQL never implemented.

The historical "variant" churn in Korean encoding is in fact not about
JOHAB but about the completion-form main body of KS X 1001 and its
packed form EUC-KR: Microsoft's CP949 / UHC extended WANSUNG with
additional Hangul syllables, and different vendors disagreed at the
edges. PostgreSQL already separates those concerns by carrying
EUC_KR and UHC as distinct encodings, so fixing JOHAB does not
re-open that family of ambiguities.

Diagnosis
---------

pg_johab_mblen() in src/common/wchar.c delegates to pg_euc_mblen(),
whose relevant branches treat 0x8F (EUC's SS3) as a 3-byte prefix and
any other high-bit byte as a 2-byte prefix. pg_johab_verifychar()
then requires each trail byte to satisfy IS_EUC_RANGE_VALID(), defined
in the same file as ((c) >= 0xa1 && (c) <= 0xfe). Neither rule
corresponds to the standard:

* JOHAB has no three-byte sequences. 0x8F is simply a valid Hangul
lead byte (it lies in the 0x84-0xD3 Hangul syllable range from
Table 1) that begins a normal 2-byte sequence; EUC's SS3 handling
spuriously inflates its length to 3.
* Hangul trail bytes are 0x41-0x7E or 0x81-0xFE; the other three
categories use 0x31-0x7E or 0x91-0xFE. Restricting trail bytes to
0xA1-0xFE rejects large portions of the standard, including the
sequences in the bug report. 0x5C (ASCII backslash) is a valid
Hangul trail byte, which is exactly what Jeroen's unit test
surfaced.

The consequence is that a substantial portion of johab_to_utf8.map is
unreachable today: the verifier rejects the byte sequences before
conversion is attempted. That matches Robert's observation that the
"right" mapping existed but was gated behind an incorrect rule.
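A minimal re-expression of the corrected rules in Python (the ranges are those of Annex 3 Table 1; the helper names follow the patch description, and this sketch is illustrative rather than the C implementation):

```python
def is_johab_lead_hangul(b):
    return 0x84 <= b <= 0xD3                  # Hangul syllables / jamo

def is_johab_lead_other(b):
    # user-defined (0xD8), other characters (0xD9-0xDE), Hanja (0xE0-0xF9)
    return 0xD8 <= b <= 0xDE or 0xE0 <= b <= 0xF9

def fixed_johab_verifychar(seq):
    """Return the character length, or -1 if the sequence is rejected."""
    b0 = seq[0]
    if b0 < 0x80:
        return 1                              # ASCII pass-through
    if len(seq) < 2:
        return -1                             # incomplete sequence
    b1 = seq[1]
    if is_johab_lead_hangul(b0):
        ok = 0x41 <= b1 <= 0x7E or 0x81 <= b1 <= 0xFE
    elif is_johab_lead_other(b0):
        ok = 0x31 <= b1 <= 0x7E or 0x91 <= b1 <= 0xFE
    else:
        ok = False                            # lead gaps: 0x80-0x83, 0xD4-0xD7, 0xDF, 0xFA-0xFF
    return 2 if ok else -1

print(fixed_johab_verifychar(b'\x8a\x5c'))    # 2: the bug report's sequence is now valid
print(fixed_johab_verifychar(b'\xe0\x31'))    # 2: Hanja with trail byte 0x31
print(fixed_johab_verifychar(b'\x80\xa1'))    # -1: lead-byte gap
```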

Patch
-----

The attached 0001-Fix-JOHAB-encoding-validation.txt makes these
changes:

src/common/wchar.c
Rewrite pg_johab_mblen() to return 2 when the lead byte falls in
any of the ranges listed in Annex 3 Table 1, and 1 otherwise
(ASCII pass-through). Rewrite pg_johab_verifychar() to apply the
correct trail-byte range depending on whether the lead byte is a
Hangul lead byte (trail 0x41-0x7E or 0x81-0xFE) or a non-Hangul
lead byte (trail 0x31-0x7E or 0x91-0xFE). Two helper macros
IS_JOHAB_LEAD_HANGUL() and IS_JOHAB_LEAD_OTHER() express the
lead-byte classification once and are shared between mblen and
verifychar. A comment block above the implementation reproduces
Table 1 for future maintainers. Also correct
pg_wchar_table[PG_JOHAB].maxmblen from 3 to 2 so that callers
sizing buffers from maxmblen do not over-allocate and so that the
value matches the spec.

doc/src/sgml/charset.sgml
Update the JOHAB row in the character-set table to show the
maximum character length as 1-2 instead of 1-3, matching the
standard and the corrected maxmblen.

src/test/regress/sql/johab.sql
src/test/regress/expected/johab.out
src/test/regress/expected/johab_1.out
src/test/regress/parallel_schedule
A new regression test, modelled on euc_kr.sql, that runs in UTF8
databases and skips otherwise. It covers:

- the original bug sequences \x8A\x5B, \x8A\x5C, \x8A\x5D
decoding to 굍, 굎, 굏;
- the first multibyte character from JOHAB.TXT (\x84\x44 -> ㄳ),
previously rejected;
- byte sequences that already decoded under the old rules
(\x89\xEF -> 괦, \x89\xA1 -> 고) to guard against regression;
- Hanja trail bytes that used to be rejected (\xE0\x31,
\xE0\x7E, \xE0\x91);
- one representative of the "other characters" category
(\xD9\x31);
- each lead-byte gap (0x80, 0xD5, 0xDF, 0xFA) producing an
"invalid byte sequence" error;
- every trail-byte gap for both Hangul (0x40, 0x7F, 0x80) and
the non-Hangul categories (0x30, 0x7F, 0x90, 0xFF);
- an incomplete trailing byte for a valid lead byte.
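Several of those cases can also be cross-checked against an independent
implementation, CPython's built-in johab codec (a sanity check only;
the codec's tables are maintained separately from PostgreSQL's):

```python
# The bug report's sequences: all three should decode.
assert b'\x8a\x5b'.decode('johab') == '굍'
assert b'\x8a\x5c'.decode('johab') == '굎'
assert b'\x8a\x5d'.decode('johab') == '굏'

# A sequence the old verifier already accepted (trail in 0xA1-0xFE).
assert b'\x89\xa1'.decode('johab') == '고'

# Lead-byte and trail-byte gaps are rejected there too.
for bad in (b'\x80\x41', b'\x8a\x40', b'\x8a\x7f'):
    try:
        bad.decode('johab')
    except UnicodeDecodeError:
        pass
    else:
        raise AssertionError(f'{bad!r} unexpectedly decoded')

print('codec cross-check passed')
```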

Compatibility
-------------

The mapping tables themselves are unchanged. Byte sequences that
decode successfully today continue to decode to the same characters;
the change is strictly additive in that previously-rejected sequences
now succeed. Because JOHAB is a client-only encoding there is no
on-disk representation to reconcile, so back-branch behaviour would
move from a strict subset of valid JOHAB to full valid JOHAB, without
reinterpreting any byte sequence that was previously accepted. I
believe that is safe to back-patch, but confining the change to v19
is also entirely reasonable if the project prefers to limit the
exposure.

Why keep it rather than remove it
---------------------------------

I understand the appeal of simply deleting a dead-looking encoding,
and Thomas' removal patch is clean work. However, Korean archival
data from the 1990s (government records, academic repositories, early
online corpora) does exist as JOHAB bytes; as a client encoding, JOHAB
in PostgreSQL provides a straightforward ingest path
(client_encoding=JOHAB, convert_from, then store as UTF-8). Once
removed, that path closes with no obvious alternative short of
preprocessing outside PostgreSQL. Fixing the verifier preserves the
capability at the cost of a ~30-line correction plus tests.
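For comparison, the client-side equivalent of that ingest path in pure
Python is small (illustrative only; the server-side path does the same
conversion via client_encoding or convert_from without extra tooling):

```python
raw = b'\x8a\x5c'              # JOHAB bytes from an archival source
text = raw.decode('johab')     # decode as JOHAB
utf8 = text.encode('utf-8')    # what would be stored in a UTF8 database
print(text, utf8.hex())        # 굎 eab58e
```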

Happy to iterate on the patch, the commit message, or the tests.
Thanks to everyone for the careful analysis that preceded this; I
recognise that the consensus was leaning toward removal, and I would
appreciate a chance to have this fix considered as an alternative.

Regards,
Henson

Attachments:

0001-Fix-JOHAB-encoding-validation.txt (text/plain; +209 -19)
#15Thomas Munro
thomas.munro@gmail.com
In reply to: Henson Choi (#14)
Re: BUG #19354: JOHAB rejects valid byte sequences

On Wed, Apr 15, 2026 at 1:20 PM Henson Choi <assam258@gmail.com> wrote:

In short: completion form is a frequency-curated lookup, combinational
form is an algorithmic composition that covers the full modern Hangul
space. Unicode later adopted the combinational form's coverage as a
completion-form table: the Hangul Syllables block (U+AC00 - U+D7A3)
encodes exactly the same 11,172 modern syllables, as precomposed code
points. So today the three Korean-related encodings PostgreSQL
supports sit along this spectrum: EUC_KR (curated completion form),
UHC (extended completion form), and JOHAB (algorithmic combinational
form).

Thank you! Yes, that makes total sense. Here are my own notes
(compiled from English-language Wikipedia articles), which say
essentially the same thing + some notes about Hancom:

The Korean writing system:
1. Hanja: Chinese characters used in names, legal and historical
documents, and to disambiguate homonyms. The number of characters in
use is difficult to pin down (as in Japan and China).
2. Hangul: a phonetic system used for almost all modern Korean text.
Hangul characters are composed of 2-5 "jamo", commonly 2-3 in modern
texts, each representing a consonant/vowel.

Character set standards:
1. KS X 1001: 4,888 Hanja (of the vast number of hard-to-count CJK
ideographs) + 2,350 precomposed Hangul (of 11,172 theoretically
possible jamo combinations).
2. KS X 1002: added some more but no one ever implemented it,
possibly because...
3. Unicode: all 11,172 possible precomposed Hangul + individual jamo
for composition + all Hanja/Kanji/Hanzi characters known to humanity
(still growing).

Encodings:
1. EUC-KR, AKA Wansung (= "precomposed"): directly encoded KS X 1001.
2. JOHAB (= "combining"): deferred to KS X 1001 for Hanja, but
described all possible Hangul as jamo stored in bitfields.
3. UHC (= "Unified Hangul Code", invented by Microsoft): used EUC-KR
as a base but supplied all possible pre-composed Hangul and 8,222
Hanja (complete CJK as of Unicode 2.0).
4. UTF-8, UTF-16, UTF-32: Unicode.

Realpolitik that fed back into standards:
1. The Hancom "Hangul" word processor used de facto standard JOHAB
encoding, and dominated.
2. KS X 1001 recognised this and added that annex.
3. MS-DOS/Windows recognised this and called it CP1361.
4. MS-DOS/Windows switched to UHC/CP949 alongside Unicode some time
in the early to mid 90s.
5. Hancom switched to Unicode around the turn of the millennium.

I will study your patch and your analysis. It looks good on first read.

Why keep it rather than remove it
---------------------------------

I understand the appeal of simply deleting a dead-looking encoding,
and Thomas' removal patch is clean work. However, Korean archival
data from the 1990s (government records, academic repositories, early
online corpora) does exist as JOHAB bytes; as a client encoding, JOHAB
in PostgreSQL provides a straightforward ingest path
(client_encoding=JOHAB, convert_from, then store as UTF-8). Once
removed, that path closes with no obvious alternative short of
preprocessing outside PostgreSQL. Fixing the verifier preserves the
capability at the cost of a ~30-line correction plus tests.

The counter argument would be that you could use iconv
--from-code=JOHAB ..., or libiconv, or the codecs available in Python,
Java, etc for dealing with historical archived data, something that
data archivists must be very aware of. And for old Hancom word
processor files, not really of relevance to PostgreSQL, apparently
they can be imported by modern word processors.

Happy to iterate on the patch, the commit message, or the tests.
Thanks to everyone for the careful analysis that preceded this; I
recognise that the consensus was leaning toward removal, and I would
appreciate a chance to have this fix considered as an alternative.

Cool. For now I'll leave the removal on ice, and look into committing
your patch. Thanks for working on it!

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#15)
Re: BUG #19354: JOHAB rejects valid byte sequences

Thomas Munro <thomas.munro@gmail.com> writes:

On Wed, Apr 15, 2026 at 1:20 PM Henson Choi <assam258@gmail.com> wrote:

I understand the appeal of simply deleting a dead-looking encoding,
and Thomas' removal patch is clean work. However, Korean archival
data from the 1990s (government records, academic repositories, early
online corpora) does exist as JOHAB bytes; as a client encoding, JOHAB
in PostgreSQL provides a straightforward ingest path
(client_encoding=JOHAB, convert_from, then store as UTF-8). Once
removed, that path closes with no obvious alternative short of
preprocessing outside PostgreSQL. Fixing the verifier preserves the
capability at the cost of a ~30-line correction plus tests.

The counter argument would be that you could use iconv
--from-code=JOHAB ..., or libiconv, or the codecs available in Python,
Java, etc for dealing with historical archived data, something that
data archivists must be very aware of.

Sure. But it's not comfortable to remove a user-visible feature
we've had for decades. My own primary concern about it was that a
correct fix could require non-backwards-compatible behavior changes.
Henson's analysis says that that's not a problem. So assuming this
patch withstands review, I'd be much happier to see it applied than
to remove JOHAB.

No opinion at the moment about whether to back-patch.

regards, tom lane

#17Henson Choi
assam258@gmail.com
In reply to: Thomas Munro (#15)
Re: BUG #19354: JOHAB rejects valid byte sequences

3. UHC (= "Unified Hangul Code", invented by Microsoft): used EUC-KR
as a base but supplied all possible pre-composed Hangul and 8,222
Hanja (complete CJK as of Unicode 2.0).

Small correction: UHC's additions over EUC-KR are on the Hangul side,
not Hanja. UHC adds 8,822 pre-composed Hangul (taking Hangul coverage
from EUC-KR's 2,350 up to the full 11,172) and leaves Hanja unchanged
at KS X 1001's 4,888. I enumerated all three encodings against
PostgreSQL's current conversion tables to double-check:

Encoding Hangul Hanja
EUC_KR 2,350 4,888
UHC 11,172 4,888
JOHAB 11,172 4,888 (after this patch)

"Complete CJK as of Unicode 2.0" is off too -- Unicode 2.0's CJK
Unified Ideographs block had roughly 20,900 characters, so UHC and
JOHAB both carry only the KS X 1001 Hanja subset. The 8,222 figure
looks like it got swapped with the 8,822 Hangul number.

Realpolitik that fed back into standards:

1. The Hancom "Hangul" word processor used de facto standard JOHAB
encoding, and dominated.
2. KS X 1001 recognised this and added that annex.

Minor nit on the sequence: KS C 5601 already had a combinational annex
in its 1982 revision, but with a different bit layout from the one
Hancom's word processor used. The 1992 revision swapped the annex's
bit layout to the commercial combinational form (상용 조합형) that
the industry -- Hancom included -- had popularised. The KS X
1001:2004 commentary documents this transition explicitly ("비트
조합을 널리 쓰고 있는 이른바 상용 조합형으로 바꿈", roughly:
"changed to the so-called commercial combinational form, whose bit
combination is in wide use"). So "KS
recognised the de facto standard" applies to 1992, not to the annex's
first appearance.

Worth mentioning for atmosphere: that period was the tail end of the
Apple II clone / MSX era and the rise of IBM PC compatibles in Korea,
and contemporary Korean computer magazines ran running debates over
Wansung vs Johab on three axes at once -- the encoding, the keyboard
layout (두벌식 vs 세벌식, i.e. Dubeolsik vs Sebeolsik -- the Korean
QWERTY-vs-Dvorak argument), and
the font rendering strategy (per-syllable bitmap tables for Wansung
vs jamo-composition for Johab) -- right alongside their game reviews.
The 1992 annex revision landed in the middle of that churn, not
ahead of it.

One further observation that fits your KS X 1002 note. EUC-KR isn't
really a single standard but a layered stack -- KS X 1001 (the
character set) + ISO/IEC 2022 (the code-extension skeleton) + the
AT&T-era EUC convention of pinning G0 to ASCII and G1 to the 8-bit
region, later formalised in Korea as KS X 2901. That informal
layering is precisely what let UHC land so easily: Microsoft extended
the same 8-bit region with additional Hangul, and every EUC-KR
decoder silently kept working for the covered subset.

KS X 1002 tried the opposite approach -- a formally separated
supplementary set, designated via a distinct ISO-2022 escape
sequence. The design was cleaner on paper but required every
consumer to implement set-switching for a supplementary character
range that nobody was motivated to support. UHC sidestepped this
entirely by just filling in the unused 8-bit slots. So the
structural reason 1002 lost to UHC isn't just market power; it is
that UHC matched EUC-KR's informal extensibility while 1002 demanded
strict ISO-2022 compliance. JOHAB (Annex 3) sits at the other end of
that spectrum -- a self-contained spec where a single document nails
down character set, byte layout, and composition algorithm, which is
what makes the verifier fix tractable.
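Concretely, the composition algorithm the Annex nails down fits in a
few lines. A sketch (the 5-bit jamo code tables are transcribed from
Annex 3, with gap codes simply absent, so a KeyError means an invalid
jamo field; the final step is the standard Unicode Hangul Syllables
formula):

```python
# JOHAB packs one Hangul syllable into 16 bits:
#   1 | initial (5 bits) | medial (5 bits) | final (5 bits)
# The 5-bit fields are sparse codes, not sequential indices, so each
# needs a small translation to the sequential index Unicode uses.

def field_table(codes):
    return {c: i for i, c in enumerate(codes)}

INITIAL = field_table(range(2, 21))                           # 19 initials
MEDIAL = field_table([c for c in range(3, 30)
                      if c not in (8, 9, 16, 17, 24, 25)])    # 21 medials
FINAL = field_table([c for c in range(2, 30) if c != 18])     # 27 finals

def johab_hangul_to_unicode(b1: int, b2: int) -> str:
    v = (b1 << 8) | b2
    ini = INITIAL[(v >> 10) & 0x1F]
    med = MEDIAL[(v >> 5) & 0x1F]
    fin_code = v & 0x1F
    fin = 0 if fin_code == 1 else FINAL[fin_code] + 1  # code 1 = "no final"
    # Unicode Hangul Syllables: U+AC00 + (initial*21 + medial)*28 + final
    return chr(0xAC00 + (ini * 21 + med) * 28 + fin)

print(johab_hangul_to_unicode(0x8A, 0x5C))   # 굎, the bug report's character
print(johab_hangul_to_unicode(0x89, 0xA1))   # 고
print(19 * 21 * 28)                          # 11172 possible syllables
```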

A small downstream consequence of UHC's slot-filling approach is that
byte-wise comparison no longer matches Korean dictionary order: the
8,822 added Hangul land in the low 0x81-0xA0 range, ahead of the
ganada-ordered EUC-KR region. Unicode's Hangul Syllables block
(U+AC00-U+D7A3) later restored that by assigning all 11,172 syllables
algorithmically in ganada order, so UTF-8 memcmp once again
produces Korean lexicographic order -- one of the quieter practical
drivers of Korea's Unicode migration.

Credit where it's due on that outcome: getting all 11,172 precomposed
Hangul into the BMP in algorithmic ganada order (the "Korean
Hangul Mess" cleanup in Unicode 2.0, 1996) wasn't inevitable.
Engineers at Microsoft's Korean office were notable advocates for
that arrangement alongside Korean standards-body contributors and
other vendors, and the Korean computing world has been quietly
benefiting from it ever since. It's a nice detail given who's
reading this thread.

Everything else in the summary matches what I had -- thanks for the
independent write-up, and for taking another look at the patch.

The counter argument would be that you could use iconv
--from-code=JOHAB ..., or libiconv, or the codecs available in Python,
Java, etc for dealing with historical archived data, something that
data archivists must be very aware of.

Sure. But it's not comfortable to remove a user-visible feature
we've had for decades. My own primary concern about it was that a
correct fix could require non-backwards-compatible behavior changes.
Henson's analysis says that that's not a problem. So assuming this
patch withstands review, I'd be much happier to see it applied than
to remove JOHAB.

Thank you -- the backward-compat angle was the hinge I was hoping
would carry, and I'm glad the analysis held up. On the size of the
remaining audience: niche Korean standards have a small but stubborn
user base, much the way Dvorak users persist in the West. There are
still 세벌식 (Sebeolsik) keyboard users in Korea who keep hand-cut
stickers over their QWERTY-printed keycaps rather than switch back;
the JOHAB data holdouts are that kind of tail -- vanishingly small in
absolute numbers, but without a graceful alternative if we close the
door. A correctly-working JOHAB serves that tail at near-zero
ongoing cost, which is ultimately what the patch is arguing for.

No opinion at the moment about whether to back-patch.

Happy to defer on back-patching. The behaviour change is strictly
additive (previously-rejected sequences start accepting, nothing is
reinterpreted), so the back-branches are technically safe, but v19-
only is a perfectly reasonable policy call if the project prefers
minimum surface area on the first cycle.

If you do want back-patches, I'm happy to produce per-branch
versions. Given how long the JOHAB code has been stable (as noted
earlier in the thread), my feeling is that the same patch should
apply cleanly down to PG 14 without modification. Happy to verify
that and post the set if it would help.

One personal aside: reading KS X 1001 Annex 3 end-to-end for this fix
turned out to be an unexpectedly cheerful detour -- it felt a bit
like cracking open a 6502 assembly reference from roughly the same
era. Back then I also had a popular neural-networks book that
convinced teenage-me computers would never approach human cognition
because they could never match the brain's memory scale -- a
prediction that, looking around in 2026, has aged about as well as
you'd expect. Thanks to everyone on the thread for making that
side-quest worthwhile.

Regards,
Henson

#18Henson Choi
assam258@gmail.com
In reply to: Henson Choi (#17)
Re: BUG #19354: JOHAB rejects valid byte sequences

Subject: Fix and expand comments for Korean encodings in encnames.c

Hi hackers,

While reading through the encoding alias table in src/common/encnames.c,
I noticed a few long-standing inaccuracies and omissions in the per-entry
comments for the three Korean encodings.

The most visible issue is the JOHAB entry, whose comment describes it as
"Extended Unix Code for simplified Chinese" -- apparently a copy/paste
slip from a neighboring EUC entry. JOHAB is in fact the Korean
combining-style encoding defined in KS X 1001 annex 3.

The attached 0002 patch makes comment-only adjustments to the three
Korean encodings:

* JOHAB: replace the incorrect "simplified Chinese" description with
a correct one that identifies it as the Korean combining (Johab)
encoding standardized in KS X 1001 annex 3.

* EUC_KR: drop a stray space before the comma in the existing
comment, and note that the encoding covers the KS X 1001
precomposed (Wansung) form.

* UHC: spell out "Unified Hangul Code", clarify that it is
Microsoft Windows CodePage 949, and describe its relationship to
EUC-KR (superset covering all 11,172 precomposed Hangul syllables).

No behavior change, no catalog change, no pg_wchar.h change -- this
touches comments in src/common/encnames.c only. pgindent is clean.

Thanks,
Henson Choi

Attachments:

0002-Fix-and-expand-comments-for-Korean-encodings.txt (text/plain; +8 -6)
#19Thomas Munro
thomas.munro@gmail.com
In reply to: Henson Choi (#18)
Re: BUG #19354: JOHAB rejects valid byte sequences

On Wed, Apr 15, 2026 at 5:58 PM Henson Choi <assam258@gmail.com> wrote:

While reading through the encoding alias table in src/common/encnames.c,
I noticed a few long-standing inaccuracies and omissions in the per-entry
comments for the three Korean encodings.

LGTM, so I will go ahead and push this to all branches.

#20Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Henson Choi (#14)
Re: BUG #19354: JOHAB rejects valid byte sequences

Hi Henson,

Thank you for the patch!

Diagnosis
---------

pg_johab_mblen() in src/common/wchar.c delegates to pg_euc_mblen(),
whose relevant branches treat 0x8F (EUC's SS3) as a 3-byte prefix and
any other high-bit byte as a 2-byte prefix. pg_johab_verifychar()
then requires each trail byte to satisfy IS_EUC_RANGE_VALID(), defined
in the same file as ((c) >= 0xa1 && (c) <= 0xfe). Neither rule
corresponds to the standard:

* JOHAB has no three-byte sequences. 0x8F is simply a valid Hangul
lead byte (it lies in the 0x84-0xD3 Hangul syllable range from
Table 1) that begins a normal 2-byte sequence; EUC's SS3 handling
spuriously inflates its length to 3.
* Hangul trail bytes are 0x41-0x7E or 0x81-0xFE; the other three
categories use 0x31-0x7E or 0x91-0xFE. Restricting trail bytes to
0xA1-0xFE rejects large portions of the standard, including the
sequences in the bug report. 0x5C (ASCII backslash) is a valid
Hangul trail byte, which is exactly what Jeroen's unit test
surfaced.

From what he showed in the post, I think the analysis is correct.

Patch
-----

The attached 0001-Fix-JOHAB-encoding-validation.txt makes these
changes:

The patch looks good to me. The regression tests also passed here.

Compatibility
-------------

The mapping tables themselves are unchanged. Byte sequences that
decode successfully today continue to decode to the same characters;
the change is strictly additive in that previously-rejected sequences
now succeed. Because JOHAB is a client-only encoding there is no
on-disk representation to reconcile, so back-branch behaviour would
move from a strict subset of valid JOHAB to full valid JOHAB, without
reinterpreting any byte sequence that was previously accepted. I
believe that is safe to back-patch, but confining the change to v19
is also entirely reasonable if the project prefers to limit the
exposure.

* Category              Lead byte      Trail byte
* --------------------  -------------  ------------------------
* Hangul syllables      0x84 - 0xD3    0x41 - 0x7E, 0x81 - 0xFE
* User-defined area A   0xD8           0x31 - 0x7E, 0x91 - 0xFE
* Other characters      0xD9 - 0xDE    0x31 - 0x7E, 0x91 - 0xFE
* Hanja                 0xE0 - 0xF9    0x31 - 0x7E, 0x91 - 0xFE

The current JOHAB verify function accepts byte sequences falling into
one of these three categories (ASCII aside):

(2-byte): SS2(0x8E) + 0xA1 - 0xDF
(2-byte): 0xA1 - 0xFE + 0xA1 - 0xFE
(3-byte): SS3(0x8F) + 0xA1 - 0xFE + 0xA1 - 0xFE

The 2-byte sequences fall into one of the JOHAB categories above. The
3-byte sequences may fall into one of the JOHAB categories if the
subsequent byte (the 4th byte) happens to be in the ASCII range.
Otherwise, they will be rejected while converting to UTF-8, before the
data is stored in the database.

Despite the fact that the current JOHAB verify function is wrong, all
byte sequences that have already been accepted are also in the valid
JOHAB range, as Henson said. This means that an existing UTF-8
database populated with the client encoding set to JOHAB can safely be
used after patching.
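For the record, that subset property can be checked mechanically with
Python's johab codec, which follows the same Annex 3 layout (a sanity
check, not a proof that the codec's tables agree with PostgreSQL's
maps everywhere): every 2-byte sequence the old rule accepted either
fails to decode at all, or lies inside the Table 1 ranges.

```python
def in_table1(lead: int, trail: int) -> bool:
    # Lead/trail ranges from KS X 1001 Annex 3 Table 1.
    if 0x84 <= lead <= 0xD3:
        return 0x41 <= trail <= 0x7E or 0x81 <= trail <= 0xFE
    if 0xD8 <= lead <= 0xDE or 0xE0 <= lead <= 0xF9:
        return 0x31 <= trail <= 0x7E or 0x91 <= trail <= 0xFE
    return False

# Old rule: both bytes in 0xA1-0xFE (ignoring the bogus SS2/SS3 cases).
violations = []
for lead in range(0xA1, 0xFF):
    for trail in range(0xA1, 0xFF):
        try:
            bytes((lead, trail)).decode('johab')
        except UnicodeDecodeError:
            continue   # would be rejected at conversion time anyway
        if not in_table1(lead, trail):
            violations.append((hex(lead), hex(trail)))

print(violations)   # [] -- everything that decoded was already valid JOHAB
```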

Why keep it rather than remove it
---------------------------------

I understand the appeal of simply deleting a dead-looking encoding,
and Thomas' removal patch is clean work. However, Korean archival
data from the 1990s (government records, academic repositories, early
online corpora) does exist as JOHAB bytes; as a client encoding, JOHAB
in PostgreSQL provides a straightforward ingest path
(client_encoding=JOHAB, convert_from, then store as UTF-8). Once
removed, that path closes with no obvious alternative short of
preprocessing outside PostgreSQL. Fixing the verifier preserves the
capability at the cost of a ~30-line correction plus tests.

+1.
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp