BUG #16926: initdb fails on Windows when binary path contains certain non-ASCII characters

Started by PG Bug reporting form, about 5 years ago. 5 messages. List: bugs

noreply@postgresql.org

about 5 years ago

The following bug has been logged on the website:

Bug reference: 16926
Logged by: Eirik Bakke
Email address: ebakke@ultorg.com
PostgreSQL version: 13.1
Operating system: Windows 10
Description:

On PostgreSQL 13.1 on US English Windows 10, "initdb" will fail with the
following error if the initdb.exe executable is located on a path that
contains certain non-ASCII characters, and "--encoding=UTF8" is specified.
In the following example, I am executing initdb.exe from a folder called
"C:\Users\ebakke\ZRoot\PostgresTest\FolderÆØÅ\pgsql\bin":

=================================================
C:\Users\ebakke\ZRoot\PostgresTest\FolderÆØÅ\pgsql\bin>initdb -D
C:\Users\ebakke\Deletables\pgtest --encoding=UTF8 --locale=en_US
The files belonging to this database system will be owned by user
"ebakke".
This user must also own the server process.

The database cluster will be initialized with locale "en_US".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory C:/Users/ebakke/Deletables/pgtest ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... windows
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... US/Eastern
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... 2021-03-13 19:23:58.206 EST
[18296]: FATAL: invalid byte sequence for encoding "UTF8": 0xc6 0xd8
child process exited with exit code 1
initdb: removing data directory "C:/Users/ebakke/Deletables/pgtest"
=================================================

The problem does _not_ occur on Linux or macOS (I tested on all three OSes).
The error goes away if an ASCII-only path is used. This may not always be an
option, however, as Windows users might only have access to their home
directories, and home directory names are permitted to contain such
characters. The error also goes away if "--encoding=UTF8" is removed, but
this forces the entire database to use the Windows-1252 character set, which
is not acceptable.

The Windows binaries for PostgreSQL 13.1 were downloaded from
https://www.enterprisedb.com/download-postgresql-binaries . The error is
easy to reproduce, and occurs every time. In case the non-ASCII letters are
mangled in this message, the non-ASCII characters I tested with are the last
three ones in the Norwegian and Danish alphabet, which you can copy and
paste from https://en.wikipedia.org/wiki/Danish_and_Norwegian_alphabet .

This bug was discovered while working on Ultorg ( https://twitter.com/ultorg
), a desktop application which bundles PostgreSQL binaries and requires them
to run from wherever the user unzips the application package.

Tom Lane

tgl@sss.pgh.pa.us

about 5 years ago

In reply to: PG Bug reporting form (#1)

Re: BUG #16926: initdb fails on Windows when binary path contains certain non-ASCII characters

PG Bug reporting form <noreply@postgresql.org> writes:

> On PostgreSQL 13.1 on US English Windows 10, "initdb" will fail with the
> following error if the initdb.exe executable is located on a path that
> contains certain non-ASCII characters, and "--encoding=UTF8" is specified.
> In the following example, I am executing initdb.exe from a folder called
> "C:\Users\ebakke\ZRoot\PostgresTest\FolderÆØÅ\pgsql\bin":

I'm afraid this is largely a case of "Doctor, it hurts when I do this!"
... "So don't do that." Although we could possibly fix initdb to not
fail under these circumstances, I think you'd have continuing pain from
not being able to represent the installation path in the database
encoding. References to, for example, the script files for standard
extensions would be impossible to write in SQL.

I was curious enough to dig for exactly where initdb has a problem,
and I found that it's where it generates a COPY command to populate
information_schema.sql_features from .../share/sql_features.txt in
the installation file tree. The backend's expecting the COPY to be
entirely valid UTF8 text, but the pathname string won't be.
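To illustrate why the backend rejects the command: in Windows-1252 the letters Æ, Ø, Å are the single bytes 0xC6, 0xD8, 0xC5, whereas UTF-8 encodes them as 0xC3 0x86, 0xC3 0x98, 0xC3 0x85. A UTF-8 decoder reads 0xC6 as the start of a two-byte sequence and then finds 0xD8, which is not a continuation byte, hence the "0xc6 0xd8" in the error. The following is a minimal sketch of such a check, not PostgreSQL's actual verifier; the function name `is_valid_utf8` is made up for illustration, and the check is simplified (it does not reject every overlong or surrogate encoding).

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified UTF-8 well-formedness check (illustrative only).
 * Verifies lead bytes and continuation bytes; it does not catch
 * every overlong form or encoded surrogate.
 */
bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need;            /* number of continuation bytes expected */

        if (c < 0x80)
            need = 0;           /* ASCII */
        else if (c >= 0xC2 && c <= 0xDF)
            need = 1;           /* two-byte sequence */
        else if (c >= 0xE0 && c <= 0xEF)
            need = 2;           /* three-byte sequence */
        else if (c >= 0xF0 && c <= 0xF4)
            need = 3;           /* four-byte sequence */
        else
            return false;       /* stray continuation or invalid lead byte */

        if (need > len - i - 1)
            return false;       /* sequence truncated at end of input */

        for (size_t k = 1; k <= need; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return false;   /* e.g. 0xC6 followed by 0xD8 */

        i += need + 1;
    }
    return true;
}
```

Passing the three Windows-1252 bytes for "ÆØÅ" returns false at the 0xC6 0xD8 pair, while the six-byte UTF-8 form of the same letters passes.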

I thought of using COPY FROM STDIN, but that doesn't work in a
standalone backend, and I doubt we want to expend the effort to
make it do so. Or we could have initdb convert the data to a large
INSERT...VALUES command. But on the whole, given the likely follow-on
issues people would have with this sort of situation, it doesn't seem
worth putting effort into fixing just this particular place in initdb.
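For readers wondering what the INSERT...VALUES alternative might look like, here is a rough sketch of turning one tab-separated line into a VALUES tuple, so that no file path would need to appear in the SQL at all. This is not what initdb does; the function name `tsv_line_to_values` and the sample row in the note below are hypothetical, every field is emitted as a quoted string, and COPY-specific details such as `\N` nulls and backslash escapes are ignored.

```c
#include <stddef.h>

/*
 * Convert one tab-separated line into a SQL VALUES tuple such as
 * ('a', 'b', 'c'), doubling embedded single quotes.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if the output buffer is too small.
 * Hypothetical helper: ignores COPY escapes and NULL markers.
 */
int tsv_line_to_values(const char *line, char *out, size_t outsz)
{
    size_t n = 0;

/* Append one character, always leaving room for the trailing NUL. */
#define PUT(ch) do { if (n + 1 >= outsz) return -1; out[n++] = (ch); } while (0)

    PUT('(');
    PUT('\'');
    for (const char *p = line; *p && *p != '\n' && *p != '\r'; p++) {
        if (*p == '\t') {
            /* field separator: close this literal, open the next */
            PUT('\''); PUT(','); PUT(' '); PUT('\'');
        } else if (*p == '\'') {
            /* escape a single quote by doubling it */
            PUT('\''); PUT('\'');
        } else {
            PUT(*p);
        }
    }
    PUT('\'');
    PUT(')');
#undef PUT

    out[n] = '\0';
    return (int) n;
}
```

A made-up input line `X001<TAB>It's a feature<TAB>YES` would come out as `('X001', 'It''s a feature', 'YES')`; a caller would then join such tuples with commas after an `INSERT INTO ... VALUES` prefix.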

Thinking about bigger-picture solutions, the first idea that comes
to mind is that maybe file paths ought to be treated as bytea rather
than text, since we have no good reason to expect that they are in
any particular encoding. But making that happen would be a research
project, and it'd likely result in some unpleasant compatibility
breakage.

In short, this doesn't seem likely to get improved any time soon.

regards, tom lane

Eirik Bakke

ebakke@ultorg.com

about 5 years ago

In reply to: Tom Lane (#2)

RE: BUG #16926: initdb fails on Windows when binary path contains certain non-ASCII characters

Thanks for investigating!

The problem may be unavoidable if the user's home directory name contains special characters. This is common on Windows, where the user name is often the user's real, human name (e.g. "Bjørn Dæhlie").

> I think you'd have continuing pain from not being able to represent the installation path in the database encoding

The installation path can certainly be represented in UTF-8, but a character set conversion is necessary. Wouldn't "SET CLIENT_ENCODING" accomplish this? For example, couldn't initdb in this case do a "SET CLIENT_ENCODING TO 'Windows-1252'" before issuing the COPY command?

On the server side, I'd expect the COPY command to similarly convert the path from the character set used in the client protocol to whichever character set is expected by the file system. But I don't know if this is done...

> the first idea that comes to mind is that maybe file paths ought to be treated as bytea rather than text

Doing the appropriate character set conversions would avoid this--file paths can still be treated as "text". And UTF-8 will happily encode every character of every other supported encoding.
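As a concrete illustration of the conversion being proposed, the sketch below re-encodes a Windows-1252 byte string as UTF-8. It is only a sketch under simplifying assumptions: the name `win1252_to_utf8` is invented here, and it handles ASCII plus the 0xA0..0xFF range (where Windows-1252 byte values equal the Unicode code points) while rejecting 0x80..0x9F, which would need a lookup table. A real implementation would more likely rely on the platform's or PostgreSQL's own conversion routines.

```c
#include <stddef.h>

/*
 * Re-encode Windows-1252 bytes as UTF-8 (illustrative subset).
 * Handles 0x00..0x7F and 0xA0..0xFF; bytes 0x80..0x9F are rejected
 * because they need a mapping table (e.g. 0x80 is the euro sign).
 * Returns bytes written excluding the NUL, or -1 on error.
 */
long win1252_to_utf8(const unsigned char *in, size_t inlen,
                     char *out, size_t outsz)
{
    size_t n = 0;

    for (size_t i = 0; i < inlen; i++) {
        unsigned char c = in[i];

        if (c < 0x80) {
            if (n + 1 >= outsz)
                return -1;                      /* no room for byte + NUL */
            out[n++] = (char) c;
        } else if (c >= 0xA0) {
            if (n + 2 >= outsz)
                return -1;                      /* no room for 2 bytes + NUL */
            out[n++] = (char) (0xC0 | (c >> 6));    /* lead byte */
            out[n++] = (char) (0x80 | (c & 0x3F));  /* continuation byte */
        } else {
            return -1;                          /* 0x80..0x9F not handled */
        }
    }

    if (n >= outsz)
        return -1;
    out[n] = '\0';
    return (long) n;
}
```

With this, the bytes 0xC6 0xD8 0xC5 from the failing path become 0xC3 0x86 0xC3 0x98 0xC3 0x85, which is valid UTF-8. The sketch still presupposes that the input really is Windows-1252.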

-- Eirik


Tom Lane

tgl@sss.pgh.pa.us

about 5 years ago

In reply to: Eirik Bakke (#3)

Re: BUG #16926: initdb fails on Windows when binary path contains certain non-ASCII characters

Eirik Bakke <ebakke@ultorg.com> writes:

> The installation path can certainly be represented in UTF-8, but a character set conversion is necessary. Wouldn't "SET CLIENT_ENCODING" accomplish this? For example, couldn't initdb in this case do a "SET CLIENT_ENCODING TO 'Windows-1252'" before issuing the COPY command?

The issues are (1) how should initdb know that the path name ought to be
taken as WIN1252, rather than some other encoding? And then (2) how
should the backend know that it must convert the path-name-represented-
in-UTF8 to WIN1252 before passing it to the file system? The lack of
any standardization about what encoding file names are in is the core
of the difficulty.

I don't know much about Windows, so it's possible that there actually
are platform-specific ways to answer these questions. But it's a generic
problem, and we're unlikely to be interested in building a single-platform
fix.

regards, tom lane

Eirik Bakke

ebakke@ultorg.com

about 5 years ago

In reply to: Tom Lane (#4)

RE: BUG #16926: initdb fails on Windows when binary path contains certain non-ASCII characters

> The issues are (1) how should initdb know that the path name ought to be taken as WIN1252, rather than some other encoding?

Researching this some more: on Windows, the old codepage-dependent APIs are essentially deprecated in favor of functions that take UTF-16 wide-character strings (wchar_t/LPWSTR). So instead of getting the binary path location from argv[0], one would use GetCommandLineW and CommandLineToArgvW, or more directly, GetModuleFileNameW. That way one never has to guess or detect which encoding is being used.

https://docs.microsoft.com/en-us/windows/win32/api/processenv/nf-processenv-getcommandlinew
https://docs.microsoft.com/en-us/windows/win32/api/shellapi/nf-shellapi-commandlinetoargvw
https://docs.microsoft.com/en-us/windows/win32/api/libloaderapi/nf-libloaderapi-getmodulefilenamew
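A minimal sketch of that approach: ask Windows for the executable path as UTF-16 via GetModuleFileNameW, then encode it as UTF-8. The helper names `utf16_to_utf8` and `get_exe_path_utf8` are made up for this example and are not PostgreSQL functions. On Windows the conversion step could equally use WideCharToMultiByte with CP_UTF8; a hand-written encoder is used here only so the conversion logic also compiles on other platforms. The fixed 1024-character buffer is an assumption; longer paths are simply reported as an error.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode a NUL-terminated UTF-16 string as UTF-8.
 * Returns bytes written excluding the NUL, or -1 on an unpaired
 * surrogate or insufficient output space.
 */
long utf16_to_utf8(const uint16_t *in, char *out, size_t outsz)
{
    size_t n = 0;

    while (*in) {
        uint32_t cp = *in++;

        if (cp >= 0xD800 && cp <= 0xDBFF) {         /* high surrogate */
            uint16_t lo = *in;
            if (lo < 0xDC00 || lo > 0xDFFF)
                return -1;                          /* missing low surrogate */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t) (lo - 0xDC00);
            in++;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return -1;                              /* stray low surrogate */
        }

        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (n + need >= outsz)
            return -1;                              /* keep room for NUL */

        if (need == 1) {
            out[n++] = (char) cp;
        } else if (need == 2) {
            out[n++] = (char) (0xC0 | (cp >> 6));
            out[n++] = (char) (0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[n++] = (char) (0xE0 | (cp >> 12));
            out[n++] = (char) (0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char) (0x80 | (cp & 0x3F));
        } else {
            out[n++] = (char) (0xF0 | (cp >> 18));
            out[n++] = (char) (0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (char) (0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char) (0x80 | (cp & 0x3F));
        }
    }

    if (n >= outsz)
        return -1;
    out[n] = '\0';
    return (long) n;
}

#ifdef _WIN32
#include <windows.h>

/*
 * Fetch the running executable's path as UTF-8, independent of the
 * active code page. Returns the UTF-8 length or -1 on failure.
 */
long get_exe_path_utf8(char *out, size_t outsz)
{
    wchar_t wpath[1024];    /* assumed limit for this sketch */
    DWORD len = GetModuleFileNameW(NULL, wpath, 1024);

    if (len == 0 || len >= 1024)
        return -1;          /* API failure or path truncated */
    /* wchar_t is a 16-bit UTF-16 code unit on Windows */
    return utf16_to_utf8((const uint16_t *) wpath, out, outsz);
}
#endif
```

Because the path never passes through the ANSI code page, the result does not depend on whether the system locale is Windows-1252 or something else.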

> And then (2) how should the backend know that it must convert the path-name-represented-in-UTF8 to WIN1252 before passing it to the file system?

Similarly, on Windows one is expected to use the wchar_t version of the relevant file system calls, e.g. _wfopen instead of fopen.

https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen
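A sketch of what such a wrapper could look like, assuming paths are carried around as UTF-8 inside the program: on Windows the path is converted to UTF-16 with MultiByteToWideChar and opened with _wfopen; elsewhere the bytes are handed to fopen unchanged. The names `fopen_utf8` and `utf8_to_wide` are hypothetical, and this is not how PostgreSQL currently opens files.

```c
#include <stdio.h>

#ifdef _WIN32
#include <stdlib.h>
#include <wchar.h>
#include <windows.h>

/* Convert a NUL-terminated UTF-8 string to a malloc'd UTF-16 string. */
static wchar_t *utf8_to_wide(const char *s)
{
    /* First call: ask how many wide characters are needed (incl. NUL). */
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);
    wchar_t *w;

    if (n <= 0)
        return NULL;                    /* invalid UTF-8 or other failure */
    w = malloc((size_t) n * sizeof(wchar_t));
    if (w != NULL &&
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, w, n) <= 0) {
        free(w);
        w = NULL;
    }
    return w;
}
#endif

/*
 * Open a file whose name is given in UTF-8.
 * Windows: convert to UTF-16 and call _wfopen, bypassing the code page.
 * Other platforms: pass the bytes straight to fopen.
 */
FILE *fopen_utf8(const char *path, const char *mode)
{
#ifdef _WIN32
    wchar_t *wpath = utf8_to_wide(path);
    wchar_t *wmode = utf8_to_wide(mode);
    FILE *fp = (wpath != NULL && wmode != NULL) ? _wfopen(wpath, wmode) : NULL;

    free(wpath);                        /* free(NULL) is a no-op */
    free(wmode);
    return fp;
#else
    return fopen(path, mode);
#endif
}
```

The `#ifdef _WIN32` split keeps the platform-specific part confined to one small wrapper, which is roughly the shape a Windows-only fix would take.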

> But it's a generic problem, and we're unlikely to be interested in building a single-platform fix.

Alas! I'd imagine there would be some "#ifdef WIN32". But thanks for responding.

-- Eirik
