splitting pg_resetwal output strings

Started by Alvaro Herrera6 months ago11 messageshackers

alvherre@2ndquadrant.com

6 months ago

Hello,

For a long time I've been annoyed that certain tools such as (but not
only) pg_resetwal have translatable strings full of whitespace that
translators have to be careful about so that things look nice.
For Spanish I have made the lines a bit wider so that everything fits,
which means that every time someone adds a new line that's not yet
translated, the output looks bad; and if I want to add one more space
to align a new longer string, then I have to edit every single one of
them. This is horrible.

Here's a proposal. The idea is to have a separate file (entries.h right
now but proposal for better names are welcome) which lists those
strings, together with the printf specifiers needed to actually print
them. This way, we can measure the length of each exactly as they
translate before printing anything, and then line up everything to the
same output length.

One curious thing of note is that I had to add an internal_wcswidth()
function, to avoid having to link libpq just to be able to do
pg_wcswidth().

This is not complete. It modifies PrintControlValues(), which is easy
because I just print each item in the entries.h file in the same order
they appear there. But for PrintNewControlValues() I'll need to add a
symbolic identifier to each string, that the code can use to scan the
array and print just those elements. I'll do that if there are no
objections to this idea here. Also, pg_controldata could probably
something very similar or maybe the same entries.h file.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"If you want to have good ideas, you must have many ideas. Most of them
will be wrong, and what you have to learn is which ones to throw away."
(Linus Pauling)

Alvaro Herrera

alvherre@2ndquadrant.com

6 months ago

In reply to: Alvaro Herrera (#1)

Re: splitting pg_resetwal output strings

On 2026-Jan-31, Álvaro Herrera wrote:

This is not complete. It modifies PrintControlValues(), which is easy
because I just print each item in the entries.h file in the same order
they appear there. But for PrintNewControlValues() I'll need to add a
symbolic identifier to each string, that the code can use to scan the
array and print just those elements. I'll do that if there are no
objections to this idea here.

It looks more or less like this. Patch 0001 is the same as before, and
0002 adds the symbolic names, which is used to create an enum and then
to search for the correct lines to print in PrintNewControlValues.

I decided to reuse simple_oid_list to store the integers values, which
is kinda icky, so getting to something committable I think would have me
add simple_int_list. Also, there's a few blocks that are duplicate
cases of the same line measuring and printing logic; I suppose I should
have a routine to simplify.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Most hackers will be perfectly comfortable conceptualizing users as entropy
sources, so let's move on." (Nathaniel Smith)
https://mail.gnu.org/archive/html/monotone-devel/2007-01/msg00080.html

Chao Li

li.evan.chao@gmail.com

6 months ago

In reply to: Alvaro Herrera (#2)

Re: splitting pg_resetwal output strings

On Jan 31, 2026, at 23:08, Álvaro Herrera <alvherre@kurilemu.de> wrote:

On 2026-Jan-31, Álvaro Herrera wrote:

This is not complete. It modifies PrintControlValues(), which is easy
because I just print each item in the entries.h file in the same order
they appear there. But for PrintNewControlValues() I'll need to add a
symbolic identifier to each string, that the code can use to scan the
array and print just those elements. I'll do that if there are no
objections to this idea here.

It looks more or less like this. Patch 0001 is the same as before, and
0002 adds the symbolic names, which is used to create an enum and then
to search for the correct lines to print in PrintNewControlValues.

I decided to reuse simple_oid_list to store the integers values, which
is kinda icky, so getting to something committable I think would have me
add simple_int_list. Also, there's a few blocks that are duplicate
cases of the same line measuring and printing logic; I suppose I should
have a routine to simplify.

Hi Alvaro,

I don't think we necessarily need to add a simple_int_list. Since OIDs are not globally unique across all object types, we can reasonably treat ControlData items as "objects" and assign them OIDs within this context.

If we take that approach, the enum could be named ControlDataOid:
```
/* Define the string enums */
#define CONTROLDATA_LINE(symbol, description, fmt, ...) \
symbol,
enum {
#include "entries.h”
} ControlDataOid;
#undef CONTROLDATA_LINE
```

This makes reusing simple_oid_list feel appropriate. If you aren't fond of that idea, I’d suggest naming the enum ControlDataSymbol rather than ControlDataStrings, as the former better reflects what the members actually are.

Regarding patch 0002:
```
+#define CONTROLDATA_LINE(symbol, description, fmt, ...)			\
+	if (simple_oid_list_member(&toprint, symbol))				\
+	{															\
+	thislen = internal_wcswidth(_(description),					\
+								strlen(_(description)),			\
+								encoding);						\
+	if (thislen > maxlen)										\
+		maxlen = thislen;										\
 	}
+#include "entries.h"
+#undef CONTROLDATA_LINE
```
The code block inside the first if statement is missing an indentation level.

Otherwise, the patch looks good to me. I've tested it and confirmed that calculating the padding at runtime generates better-looking output.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Alvaro Herrera

alvherre@2ndquadrant.com

6 months ago

In reply to: Chao Li (#3)

Re: splitting pg_resetwal output strings

Hello Evan,

On 2026-Feb-02, Chao Li wrote:

I don't think we necessarily need to add a simple_int_list. Since OIDs
are not globally unique across all object types, we can reasonably
treat ControlData items as "objects" and assign them OIDs within this
context.

If we take that approach, the enum could be named ControlDataOid:
```
/* Define the string enums */
#define CONTROLDATA_LINE(symbol, description, fmt, ...) \
symbol,
enum {
#include "entries.h”
} ControlDataOid;
#undef CONTROLDATA_LINE
```

This makes reusing simple_oid_list feel appropriate.

True, but that's a bit cheating with the naming. Really, type Oid is
related to identifying database objects, that is, what is used in each
catalog row. Someday we may want to decide to enlarge that to 64 bits,
and if that happens and yet the C compiler continues to use 32-bit
integers to implement C-level enums, which is what this patch uses, what
then?

If I go with just simple_int_list, then that's always (by definition)
going to be a C integer, which is (presumably) always going to be what
the C compiler uses to implement a C enum.

(I think in reality we're unlikely to change Oid to be 64-bit, because
there's no reason to do so since it's quite unusual to have such very
many database objects. I think it's more likely that we'd change the
type that underlies TOAST pointers, which is where the real problem with
Oid resides. Even so, it seems ... dangerous? ... to me to do what
you propose.)

If you aren't fond of that idea, I’d suggest naming the enum
ControlDataSymbol rather than ControlDataStrings, as the former better
reflects what the members actually are.

Yeah, this is probably a good idea anyway, given that these objects are
not strings anymore.

Thanks for the review!

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/

Peter Eisentraut

peter_e@gmx.net

5 months ago

In reply to: Alvaro Herrera (#1)

Re: splitting pg_resetwal output strings

On 31.01.26 10:41, Álvaro Herrera wrote:

Here's a proposal. The idea is to have a separate file (entries.h right
now but proposal for better names are welcome) which lists those
strings, together with the printf specifiers needed to actually print
them. This way, we can measure the length of each exactly as they
translate before printing anything, and then line up everything to the
same output length.

I think this basically also applies to most --help output and some other
things. So maybe having a general way to print a list of things
automatically aligned in two columns would be good.

Alvaro Herrera

alvherre@2ndquadrant.com

5 months ago

In reply to: Peter Eisentraut (#5)

Re: splitting pg_resetwal output strings

On 2026-Feb-13, Peter Eisentraut wrote:

On 31.01.26 10:41, Álvaro Herrera wrote:

Here's a proposal. The idea is to have a separate file (entries.h right
now but proposal for better names are welcome) which lists those
strings, together with the printf specifiers needed to actually print
them. This way, we can measure the length of each exactly as they
translate before printing anything, and then line up everything to the
same output length.

I think this basically also applies to most --help output and some other
things. So maybe having a general way to print a list of things
automatically aligned in two columns would be good.

Yeah, that was my thinking as well, but I wanted to start small in order
to ensure it works properly everywhere and then have a month long
emergency if everything breaks; also, reaching consensus on exactly how
to use it in --help and other places is going to require some longer
discussion.

So I propose we commit this for 19, with some obvious cosmetic cleanups,
then we make the technique more general (in src/common/ or something)
and use it in other tools for 20.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

Jonathan Gonzalez V.

jonathan.abdiel@gmail.com

2 months ago

In reply to: Alvaro Herrera (#2)

Re: splitting pg_resetwal output strings

Hello!

I've tested the patch, and I found that in the first patch, the nsl.mk
file is changed to match the first argument of CONTROLDATA_LINE as the
string to match:

+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS) \
+				  CONTROLDATA_LINE:1

but in the second patch the CONTROLDATA_LINE moves the strings to the
second argument:

-CONTROLDATA_LINE("pg_control version number",
+CONTROLDATA_LINE(CD_CONTROL_VERSION, "pg_control version number",
 				"%u", ControlFile.pg_control_version)

While keeping the argument as it is, I suggest to switch to
CONTROLDATA_LINE:2 like this:

+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS) \
+				  CONTROLDATA_LINE:2

This will make `make update-po` take the strings from entries.h, which
is what I was expecting.

During testing, I found a couple of things that may be interesting, or
at least worth keeping in mind for the future implementation already
mentioned [1]/messages/by-id/202509300530.jnwbfyrdxaah@alvherre.pgsql.

Even if the spacing calculation is correct, I found that the font size
is not the same for different languages, so if a strings or part of it,
is translated into, for example, Hindi, and most of it it's in English,
it will affect how it look in the terminal, meaning that it's probably
that a the second column will be moved to the left generating a
misalignment in the printed data

Regards!

[1]: /messages/by-id/202509300530.jnwbfyrdxaah@alvherre.pgsql
/messages/by-id/202509300530.jnwbfyrdxaah@alvherre.pgsql
[2]: https://sw.kovidgoyal.net/kitty/

--
Jonathan Gonzalez V.
EDB: https://www.enterprisedb.com

Alvaro Herrera

alvherre@2ndquadrant.com

2 months ago

In reply to: Jonathan Gonzalez V. (#7)

Re: splitting pg_resetwal output strings

On 2026-May-23, Jonathan Gonzalez V. wrote:

Hello!

I've tested the patch, and I found that in the first patch, the nsl.mk
file is changed to match the first argument of CONTROLDATA_LINE as the
string to match:

[...]

While keeping the argument as it is, I suggest to switch to
CONTROLDATA_LINE:2 like this:

Yeah, you're correct, this is a bug in 0002. Thanks!

During testing, I found a couple of things that may be interesting, or
at least worth keeping in mind for the future implementation already
mentioned [1].

Even if the spacing calculation is correct, I found that the font size
is not the same for different languages, so if a strings or part of it,
is translated into, for example, Hindi, and most of it it's in English,
it will affect how it look in the terminal, meaning that it's probably
that a the second column will be moved to the left generating a
misalignment in the printed data

I don't think varying font width is something we should care about. The
multibyte system has a mechanism to tell us how wide the characters are,
in units of some monospace font. This is what the pg_wcswidth calls are
there for (or internal_wcslength in the patch). If the terminal doesn't
align the characters correctly according to the declared width because
different fonts are used, that's not our bug.

The idea behind the patch is to measure the length of all strings after
translation. If a number of the strings are not translated, then we
still measure them in their final presentation form. Then we align the
two columns based on the maximum length of all strings. I don't think
this is any worse than what you get with the original code; if a
translator has chosen a different alignment than what English has, and
some strings are not translated, then the output will appear misaligned.
In fact, with the code it will be better (assuming no font issues),
because we will try to align correctly even the strings that aren't
translated. In fact, if fonts differ, there's nothing the translator
can do, because there's no way to know what's the user going to do at
runtime -- users are going to have different terminals and different
fonts.

[2] https://sw.kovidgoyal.net/kitty/

Hmm, I'm not sure what you wanted to show with this URL -- can you
elaborate?

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

Jonathan Gonzalez V.

jonathan.abdiel@gmail.com

2 months ago

In reply to: Alvaro Herrera (#8)

Re: splitting pg_resetwal output strings

[...]

I don't think varying font width is something we should care about.
The
multibyte system has a mechanism to tell us how wide the characters
are,
in units of some monospace font. This is what the pg_wcswidth calls
are
there for (or internal_wcslength in the patch). If the terminal
doesn't
align the characters correctly according to the declared width
because
different fonts are used, that's not our bug.

Exactly, I forgot to clarify that this is to keep in mind if someone
else want to test and reach the same point and had the same questions.

[2] https://sw.kovidgoyal.net/kitty/

Hmm, I'm not sure what you wanted to show with this URL -- can you
elaborate?

Just an example of a terminal if someone wants to test something
different, but yeah lack of context.

--
Jonathan Gonzalez V.
EDB: https://www.enterprisedb.com

#10

Jonathan Gonzalez V.

jonathan.abdiel@gmail.com

2 months ago

In reply to: Jonathan Gonzalez V. (#9)

Re: splitting pg_resetwal output strings

Hello,

On a second round a notice these two strings:

+               char *str = "First log segment after reset";
+               thislen = internal_wcswidth(_(str), strlen(_(str)),
encoding);
+               if (thislen > maxlen)

and

+               char *str = "First log segment after reset";
+               thislen = internal_wcswidth(_(str), strlen(_(str)),
encoding);
+               if (thislen > maxlen)

Are never catch by the `make update-po` command, instead, the strings
are there in the `.po.new` file commented.

Should we put them between `_()`? or is expected that it's being catch
by the second line with the `_()` ?

--
Jonathan Gonzalez V.
EDB: https://www.enterprisedb.com

#11

Alvaro Herrera

alvherre@2ndquadrant.com

2 months ago

In reply to: Jonathan Gonzalez V. (#10)

Re: splitting pg_resetwal output strings

Hi,

On 2026-May-24, Jonathan Gonzalez V. wrote:

On a second round a notice these two strings:

+               char *str = "First log segment after reset";
+               thislen = internal_wcswidth(_(str), strlen(_(str)),
encoding);
+               if (thislen > maxlen)

and [...]

Are never catch by the `make update-po` command, instead, the strings
are there in the `.po.new` file commented.

Should we put them between `_()`? or is expected that it's being catch
by the second line with the `_()` ?

Hmm, yeah, they need to be wrapped in gettext_noop() so that they are
not immediately translated (because that will be done by the _() call in
the other lines) but are picked up by msgmerge. (I think the second one
you reference is "NextXID epoch" but it was victim of a copy-paste
failure.)

Thanks again for looking,

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/

splitting pg_resetwal output strings

Attachments:

Attachments: