Lift line-length limit for pg_service.conf
The pg_service.conf parsing thread [0] made me realize that we have a hardwired
line length of max 256 bytes. Lifting this would be in line with recent work
for ecpg, pg_regress and pg_hba (784b1ba1a2 and 8f8154a50). The attached moves
pg_service.conf to use the new pg_get_line_append API and a StringInfo to lift
the restriction. Any reason not to do that while we're lifting other such limits?

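To make the failure mode concrete, here is a standalone sketch (plain C, not libpq code; the helper name is made up) of what a fixed-size fgets() buffer does with an over-long line:

```c
#include <stdio.h>

/*
 * Demonstration of the limitation being lifted: fgets() with a
 * fixed-size buffer silently splits an over-long line across several
 * reads, so the caller must detect "did I actually reach end-of-line?"
 * or mis-parse.  The 256 mirrors the old hardwired limit;
 * count_fgets_chunks() is a made-up helper, not libpq code.
 */
static int
count_fgets_chunks(FILE *fp)
{
    char buf[256];
    int  chunks = 0;

    while (fgets(buf, sizeof(buf), fp) != NULL)
        chunks++;
    return chunks;
}
```

A 600-byte logical line comes back as three separate fgets() results; pg_get_line_append() with a StringInfo instead grows the buffer until the newline is actually seen.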
cheers ./daniel
Attachments:
0001-Remove-arbitrary-line-length-in-pg_service.conf-pars.patch (application/octet-stream)
Daniel Gustafsson <daniel@yesql.se> writes:
> The pg_service.conf parsing thread [0] made me realize that we have a hardwired
> line length of max 256 bytes. Lifting this would be in line with recent work
> for ecpg, pg_regress and pg_hba (784b1ba1a2 and 8f8154a50). The attached moves
> pg_service.conf to use the new pg_get_line_append API and a StringInfo to lift
> the restriction. Any reason not to do that while we're lifting other such limits?
+1. I'd been thinking of looking around at our fgets calls to see
which ones need this sort of work, but didn't get to it yet.
Personally, I'd avoid depending on StringInfo.cursor here, as the
dependency isn't really buying you anything. Instead you could do
line = linebuf.data;
just after the trim-trailing-whitespace loop and then leave the
"ignore leading whitespace too" code as it stands.
Also, the need for inserting the pfree into multiple exit paths kind
of makes me itch. I wonder if we should change the ending code to
look like
exit:
fclose(f);
pfree(linebuf.data);
return result;
and then the early exit spots would be replaced with "result = x;
goto exit". (Some of them could use "break", but not all, so
it's probably better to consistently use "goto".)
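A standalone sketch of that shape, with hypothetical names and error codes (free() stands in for pfree()):

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Sketch of the single-exit cleanup pattern: every early return
 * becomes "result = x; goto exit", so the fclose/free pair appears
 * exactly once.  parse_stream() and the error code 3 are hypothetical
 * stand-ins for the pg_service.conf parser, not the actual code.
 */
static int
parse_stream(FILE *f)
{
    int   result = 0;
    char *linebuf;

    if (f == NULL)
        return 1;               /* nothing allocated yet, plain return is fine */

    linebuf = malloc(256);

    for (;;)
    {
        if (fgets(linebuf, 256, f) == NULL)
            goto exit;          /* EOF: result stays 0 */

        if (linebuf[0] == '#' || linebuf[0] == '\n')
            continue;           /* skip comments and blank lines */

        if (linebuf[0] == '!')
        {
            result = 3;         /* hypothetical "syntax error" code */
            goto exit;
        }
    }

exit:
    fclose(f);
    free(linebuf);
    return result;
}
```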
regards, tom lane
On 21 Sep 2020, at 17:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Daniel Gustafsson <daniel@yesql.se> writes:
>> The pg_service.conf parsing thread [0] made me realize that we have a hardwired
>> line length of max 256 bytes. Lifting this would be in line with recent work
>> for ecpg, pg_regress and pg_hba (784b1ba1a2 and 8f8154a50). The attached moves
>> pg_service.conf to use the new pg_get_line_append API and a StringInfo to lift
>> the restriction. Any reason not to do that while we're lifting other such limits?
> +1. I'd been thinking of looking around at our fgets calls to see
> which ones need this sort of work, but didn't get to it yet.
I took a quick look and found the TOC parsing in pg_restore, which used a 100-byte
buffer and then did some juggling to find the EOL for lines longer than 100 bytes.
There we won't see a bug report for exceeded line length, but simplifying the code
seemed like a win to me, so I included that in the updated patch as well.
> Personally, I'd avoid depending on StringInfo.cursor here, as the
> dependency isn't really buying you anything.
Fair enough, I was mainly a bit excited at finally finding a use for .cursor =)
Fixed.
> Also, the need for inserting the pfree into multiple exit paths kind
> of makes me itch. I wonder if we should change the ending code to
> look like
> exit:
> fclose(f);
> pfree(linebuf.data);
> return result;
> and then the early exit spots would be replaced with "result = x;
> goto exit". (Some of them could use "break", but not all, so
> it's probably better to consistently use "goto".)
Agreed, fixed. I was a bit tempted to use something less magic and more
descriptive than "result = 3", but in the end I opted for keeping changes to one
thing at a time.
cheers ./daniel
Attachments:
0001-Refactor-pg_service.conf-and-pg_restore-TOC-file-par.patch (application/octet-stream)
Daniel Gustafsson <daniel@yesql.se> writes:
> [ 0001-Refactor-pg_service.conf-and-pg_restore-TOC-file-par.patch ]
I reviewed this and noticed that you'd missed adding resetStringInfo
calls in some code paths, which made me realize that while
pg_get_line_append() is great for its original customer in hba.c,
it kinda sucks for most other callers. Having to remember to do
resetStringInfo in every path through a loop is too error-prone,
and it's unnecessary. So I made another subroutine that just adds
that step, and updated the existing callers that could use it.
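For anyone following along, the shape of that subroutine looks roughly like this (in core it became pg_get_line_buf() layered over pg_get_line_append(); the types below are simplified standalone stand-ins for StringInfo, not the actual code):

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    char   *data;
    size_t  len;
    size_t  cap;
} LineBuf;

static void
linebuf_init(LineBuf *b)
{
    b->cap = 64;
    b->len = 0;
    b->data = malloc(b->cap);
    b->data[0] = '\0';
}

static void
linebuf_reset(LineBuf *b)
{
    b->len = 0;
    b->data[0] = '\0';
}

/* Append-style primitive: adds one line to whatever is already there.
 * Great for hba.c-style continuation handling, error-prone elsewhere. */
static bool
get_line_append(LineBuf *b, FILE *fp)
{
    int     c;
    size_t  start = b->len;

    while ((c = fgetc(fp)) != EOF)
    {
        if (b->len + 2 > b->cap)    /* room for the char plus NUL */
        {
            b->cap *= 2;
            b->data = realloc(b->data, b->cap);
        }
        b->data[b->len++] = (char) c;
        if (c == '\n')
            break;
    }
    b->data[b->len] = '\0';
    return b->len > start;          /* false only at immediate EOF */
}

/* The wrapper: resets first, so a loop cannot forget to. */
static bool
get_line_buf(LineBuf *b, FILE *fp)
{
    linebuf_reset(b);
    return get_line_append(b, fp);
}
```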
Pushed with that correction.
regards, tom lane
In the same vein, here's a patch to remove the hard-coded line length
limit for tsearch dictionary files.
regards, tom lane
Attachments:
0001-stringinfo-in-t_readline.patch (text/x-diff)
On 22 Sep 2020, at 23:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In the same vein, here's a patch to remove the hard-coded line length
> limit for tsearch dictionary files.
LGTM. I like the comment on why not to return buf.data directly; that detail
would be easy to miss.
cheers ./daniel
Daniel Gustafsson <daniel@yesql.se> writes:
> On 22 Sep 2020, at 23:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In the same vein, here's a patch to remove the hard-coded line length
>> limit for tsearch dictionary files.
> LGTM. I like the comment on why not to return buf.data directly; that detail
> would be easy to miss.
Yeah. In a quick scan, it appears that there is only one caller that
tries to save the result directly. So I considered making that caller
do a pstrdup and eliminating the extra thrashing in t_readline itself.
But it seemed too fragile; somebody would get it wrong and then have
excess space consumption for their dictionary.
regards, tom lane
I wrote:
> Yeah. In a quick scan, it appears that there is only one caller that
> tries to save the result directly. So I considered making that caller
> do a pstrdup and eliminating the extra thrashing in t_readline itself.
> But it seemed too fragile; somebody would get it wrong and then have
> excess space consumption for their dictionary.
I had a better idea: if we get rid of t_readline() itself, which has
been deprecated for years anyway, we can have the calling layer
tsearch_readline_xxx maintain a StringInfo across the whole file
read process and thus get rid of alloc/free cycles for the StringInfo.
However ... while working on that, I realized that the usage of
the "curline" field in that code is completely unsafe. We save
a pointer to the result of tsearch_readline() for possible use
by the error context callback, *but the caller is likely to free that
string at some point*. So there is a window where an error would
result in trying to print already-freed data.
It looks like all of the core-code dictionaries free the result
string at the bottoms of their loops, so that the window for
trouble is pretty much empty. But contrib/dict_xsyn doesn't
do it like that, and so it's surely at risk. Any external
dictionaries that have copied that code would also be at risk.
So the attached adds a pstrdup/pfree to ensure that "curline"
has its own storage, putting us right back at two palloc/pfree
cycles per line. I don't think there's a lot of choice though;
in fact, I'm leaning to the idea that we need to back-patch
that part of this. The odds of trouble in a production build
probably aren't high, but still...
regards, tom lane
Attachments:
rework-tsearch-readline-2.patch (text/x-diff)
I wrote:
> So the attached adds a pstrdup/pfree to ensure that "curline"
> has its own storage, putting us right back at two palloc/pfree
> cycles per line. I don't think there's a lot of choice though;
> in fact, I'm leaning to the idea that we need to back-patch
> that part of this. The odds of trouble in a production build
> probably aren't high, but still...
So I did that, but while looking at the main patch I realized that
a few things were still being left on the table:
1. We can save a palloc/pfree cycle in the case where no encoding
conversion need be performed, by allowing "curline" to point at
the StringInfo buffer instead of necessarily being a separate
palloc'd string. (This seems safe since no other code should
manipulate the StringInfo before the next tsearch_readline call;
so we can't get confused about whether "curline" has its own storage.)
2. In the case where encoding conversion is performed, we still have
to pstrdup the result to have a safe copy for "curline". But I
realized that we're making a poor choice of which copy to return to
the caller. The output of encoding conversion is likely to be a good
bit bigger than necessary, cf. pg_do_encoding_conversion. So if the
caller is one that saves the output string directly into a long-lived
dictionary structure, this wastes space. We should return the pstrdup
result instead, and keep the conversion result as "curline", where
we'll free it next time through.
3. AFAICT, it's completely useless for tsearch_readline to call
pg_verify_mbstr, because pg_any_to_server will either do that exact
thing itself, or verify the input encoding as part of conversion.
Some quick testing says that you don't even get different error
messages. So we should just drop that step. (It's likely that the
separate call was actually needed when this code was written; I think
we tightened up the expectations for verification within encoding
conversion somewhere along the line. But we demonstrably don't need
it now.)
So those considerations lead me to the attached.
regards, tom lane
Attachments:
rework-tsearch-readline-3.patch (text/x-diff)
On 23 Sep 2020, at 21:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> So the attached adds a pstrdup/pfree to ensure that "curline"
>> has its own storage, putting us right back at two palloc/pfree
>> cycles per line. I don't think there's a lot of choice though;
>> in fact, I'm leaning to the idea that we need to back-patch
>> that part of this. The odds of trouble in a production build
>> probably aren't high, but still...
> So I did that, but while looking at the main patch I realized that
> a few things were still being left on the table:
> 2. In the case where encoding conversion is performed, we still have
> to pstrdup the result to have a safe copy for "curline". But I
> realized that we're making a poor choice of which copy to return to
> the caller. The output of encoding conversion is likely to be a good
> bit bigger than necessary, cf. pg_do_encoding_conversion. So if the
> caller is one that saves the output string directly into a long-lived
> dictionary structure, this wastes space. We should return the pstrdup
> result instead, and keep the conversion result as "curline", where
> we'll free it next time through.
I wonder if we have other call sites of pg_any_to_server which could benefit
from trimming the returned allocation; a quick look didn't spot any, but today
has become yesterday here and tiredness might interfere.
> So those considerations lead me to the attached.
Eyeing memory used by callers of tsearch_readline validates your observations;
I don't see any case where this patch isn't safe, and nothing sticks out. +1 on
this, nice one!
cheers ./daniel
Daniel Gustafsson <daniel@yesql.se> writes:
> On 23 Sep 2020, at 21:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 2. In the case where encoding conversion is performed, we still have
>> to pstrdup the result to have a safe copy for "curline". But I
>> realized that we're making a poor choice of which copy to return to
>> the caller. The output of encoding conversion is likely to be a good
>> bit bigger than necessary, cf. pg_do_encoding_conversion. So if the
>> caller is one that saves the output string directly into a long-lived
>> dictionary structure, this wastes space. We should return the pstrdup
>> result instead, and keep the conversion result as "curline", where
>> we'll free it next time through.
> I wonder if we have other call sites of pg_any_to_server which could benefit
> from trimming the returned allocation; a quick look didn't spot any, but today
> has become yesterday here and tiredness might interfere.
After looking more closely, I've realized that actually none of the
existing core-code callers will save the returned string directly.
readstoplist() could do so, depending on what "wordop" is, but
all its existing callers pass lowerstr() which will always make a
new output string. (Which itself could be a bit bloated :-()
So the concern I expressed above is really just hypothetical.
Still, the code is simpler this way and no slower, so I still
think it's an improvement.
(The bigger picture here is that the API for dictionary init
methods is pretty seriously misdesigned from a memory-consumption
standpoint. Running the entire init process in the dictionary's
long-lived context goes against everything we've learned about
avoiding memory leaks. We should run that code in a short-lived
command execution context, and explicitly copy just the data we
want into the long-lived context. But changing that would be
a pretty big deal, breaking third-party dictionaries. So I'm
not sure it's enough of a problem to justify the change.)
regards, tom lane