Cleaning up array_in()
This is in response to Alexander's observation at [1]/messages/by-id/9cd163da-d096-7e9e-28f6-f3620962a660@gmail.com, but I'm
starting a fresh thread to keep this patch separate from the plperl
fixes in the cfbot's eyes.
Alexander Lakhin <exclusion@gmail.com> writes:
I continue watching the array handling bugs dancing Sirtaki too. Now it's
another asymmetry:
select '{{1},{{2}}}'::int[];
{{{1}},{{2}}}
but:
select '{{{1}},{2}}'::int[];
{}
Bleah. Both of those should be rejected, for sure, but it's the same
situation as in the PLs: we weren't doing anything to enforce that all
the scalar elements appear at the same nesting depth.
I spent some time examining array_in(), and was pretty disheartened
by what a mess it is. It looks like back in the dim mists of the
Berkeley era, there was an intentional attempt to allow
non-rectangular array input, with the missing elements automatically
filled out as NULLs. Since that was undocumented, we concluded it was
a bug and plastered on some code to check for rectangularity of the
input. I don't quibble with enforcing rectangularity, but the
underlying logic should have been simplified while we were at it.
The element-counting logic was basically magic (why is it okay to
increment temp[ndim - 1] when the current nest_level might be
different from that?) and the extra layers of checks didn't make it
any more intelligible. Plus, ReadArrayStr was expending far more
cycles than it needs to given the assumption of rectangularity.
So, here's a rewrite.
Although I view this as a bug fix, AFAICT the only effects are to
accept input that should be rejected. So again I don't advocate
back-patching. But should we sneak it into v16, or wait for v17?
regards, tom lane
[1]: /messages/by-id/9cd163da-d096-7e9e-28f6-f3620962a660@gmail.com
Attachments:
v1-0001-Simplify-and-speed-up-ReadArrayStr.patchtext/x-diff; charset=us-ascii; name=v1-0001-Simplify-and-speed-up-ReadArrayStr.patchDownload+33-37
v1-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patchtext/x-diff; charset=us-ascii; name*0=v1-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.p; name*1=atchDownload+253-164
On Tue, May 02, 2023 at 11:41:27AM -0400, Tom Lane wrote:
It looks like back in the dim mists of the
Berkeley era, there was an intentional attempt to allow
non-rectangular array input, with the missing elements automatically
filled out as NULLs. Since that was undocumented, we concluded it was
a bug and plastered on some code to check for rectangularity of the
input.
Interesting.
Although I view this as a bug fix, AFAICT the only effects are to
accept input that should be rejected. So again I don't advocate
back-patching. But should we sneak it into v16, or wait for v17?
I think it'd be okay to sneak it into v16, given it is technically a bug
fix.
(This leaves ArrayGetOffset0() unused, but I'm unsure whether to
remove that.)
Why's that? Do you think it is likely to be used again in the future?
Otherwise, 0001 LGTM.
I haven't had a chance to look at 0002 closely yet.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
02.05.2023 18:41, Tom Lane wrote:
So, here's a rewrite.
Although I view this as a bug fix, AFAICT the only effects are to
accept input that should be rejected. So again I don't advocate
back-patching. But should we sneak it into v16, or wait for v17?
I've tested the patch from a user perspective and found no interesting cases
that were valid before, but not accepted with the patch (or vice versa):
The only thing that confused me, is the error message (it's not new, too):
select '{{{{{{{{{{1}}}}}}}}}}'::int[];
or even:
select '{{{{{{{{{{'::int[];
ERROR: number of array dimensions (7) exceeds the maximum allowed (6)
Maybe it could be reworded like that?:
too many opening braces defining dimensions (maximum dimensions allowed: 6)
Beside that, I would like to note the following error text changes
(all of these are feasible, I think):
select '{{1},{{'::int[];
Before:
ERROR: malformed array literal: "{{1},{{"
LINE 1: select '{{1},{{'::int[];
^
DETAIL: Unexpected end of input.
After:
ERROR: malformed array literal: "{{1},{{"
LINE 1: select '{{1},{{'::int[];
^
DETAIL: Multidimensional arrays must have sub-arrays with matching dimensions.
---
select '{{1},{{{{{{'::int[];
Before:
ERROR: number of array dimensions (7) exceeds the maximum allowed (6)
After:
ERROR: malformed array literal: "{{1},{{{{{{"
LINE 1: select '{{1},{{{{{{'::int[];
^
DETAIL: Multidimensional arrays must have sub-arrays with matching dimensions.
---
select '{{1},{}}}'::int[];
Before:
ERROR: malformed array literal: "{{1},{}}}"
LINE 1: select '{{1},{}}}'::int[];
^
DETAIL: Unexpected "}" character.
After:
ERROR: malformed array literal: "{{1},{}}}"
LINE 1: select '{{1},{}}}'::int[];
^
DETAIL: Multidimensional arrays must have sub-arrays with matching dimensions.
---
select '{{}}}'::int[];
Before:
ERROR: malformed array literal: "{{}}}"
LINE 1: select '{{}}}'::int[];
^
DETAIL: Unexpected "}" character.
After:
ERROR: malformed array literal: "{{}}}"
LINE 1: select '{{}}}'::int[];
^
DETAIL: Junk after closing right brace.
Best regards,
Alexander
Alexander Lakhin <exclusion@gmail.com> writes:
The only thing that confused me, is the error message (it's not new, too):
select '{{{{{{{{{{1}}}}}}}}}}'::int[];
or even:
select '{{{{{{{{{{'::int[];
ERROR: number of array dimensions (7) exceeds the maximum allowed (6)
Yeah, I didn't touch that, but it's pretty bogus because the first
number will always be "7" even if you wrote more than 7 left braces,
since the code errors out immediately upon finding that it's seen
too many braces.
The equivalent message in the PLs just says "number of array dimensions
exceeds the maximum allowed (6)". I'm inclined to do likewise in
array_in, but didn't touch it here.
Beside that, I would like to note the following error text changes
(all of these are feasible, I think):
I'll look into whether we can improve those, unless you had a patch
in mind already?
regards, tom lane
09.05.2023 06:06, Tom Lane wrote:
Alexander Lakhin <exclusion@gmail.com> writes:
The only thing that confused me, is the error message (it's not new, too):
select '{{{{{{{{{{1}}}}}}}}}}'::int[];
or even:
select '{{{{{{{{{{'::int[];
ERROR: number of array dimensions (7) exceeds the maximum allowed (6)Yeah, I didn't touch that, but it's pretty bogus because the first
number will always be "7" even if you wrote more than 7 left braces,
since the code errors out immediately upon finding that it's seen
too many braces.The equivalent message in the PLs just says "number of array dimensions
exceeds the maximum allowed (6)". I'm inclined to do likewise in
array_in, but didn't touch it here.
I think that, strictly speaking, we have no array dimensions in the string
'{{{{{{{{{{'; there are only characters (braces) during the string parsing.
While in the PLs we definitely deal with real arrays, which have dimensions.
Beside that, I would like to note the following error text changes
(all of these are feasible, I think):I'll look into whether we can improve those, unless you had a patch
in mind already?
Those messages looked more or less correct to me, I just wanted to note how they are
changing (and haven't highlighted messages, that are not), but if you see here room
for improvement, please look into it (I have no good formulations yet).
Best regards,
Alexander
I took a look at 0002 because I attempted a similar but more surgical
fix in [0]/messages/by-id/CAPWqQZRHsFuvWJj=czXuKEB03LF4ctPpDE1k3CoexweEFicBKQ@mail.gmail.com.
I spotted a few opportunities for further reducing state tracked by
`ArrayCount`. You may not find all of these suggestions to be
worthwhile.
1) `in_quotes` appears to be wholly redundant with `parse_state ==
ARRAY_QUOTED_ELEM_STARTED`.
2) The `empty_array` special case does not seem to be important to
ArrayCount's callers, which don't even special case `ndims == 0` but
rather `ArrayGetNItemsSafe(..) == 0`. Perhaps this is a philosophical
question as to whether `ArrayCount('{{}, {}}')` should return
(ndims=2, dims=[2, 0]) or (ndims=0). Obviously someone needs to do
that normalization, but `ArrayCount` could leave that normalization to
`ReadArrayStr`.
3) `eoArray` could be replaced with a new `ArrayParseState` of
`ARRAY_END`. Just a matter of taste, but "end of array" feels like a
parser state to me.
I also have a sense that `ndims_frozen` made the distinction between
`ARRAY_ELEM_DELIMITED` and `ARRAY_LEVEL_DELIMITED` unnecessary, and
the two states could be merged into a single `ARRAY_DELIMITED` state,
but I've not pulled on this thread hard enough to say so confidently.
Thanks for doing the serious overhaul. As you say, the element
counting logic is much easier to follow now. I'm much more confident
that your patch is correct than mine.
Cheers,
Nikhil
[0]: /messages/by-id/CAPWqQZRHsFuvWJj=czXuKEB03LF4ctPpDE1k3CoexweEFicBKQ@mail.gmail.com
Show quoted text
On Tue, May 9, 2023 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
09.05.2023 06:06, Tom Lane wrote:
Alexander Lakhin <exclusion@gmail.com> writes:
The only thing that confused me, is the error message (it's not new, too):
select '{{{{{{{{{{1}}}}}}}}}}'::int[];
or even:
select '{{{{{{{{{{'::int[];
ERROR: number of array dimensions (7) exceeds the maximum allowed (6)Yeah, I didn't touch that, but it's pretty bogus because the first
number will always be "7" even if you wrote more than 7 left braces,
since the code errors out immediately upon finding that it's seen
too many braces.The equivalent message in the PLs just says "number of array dimensions
exceeds the maximum allowed (6)". I'm inclined to do likewise in
array_in, but didn't touch it here.I think that, strictly speaking, we have no array dimensions in the string
'{{{{{{{{{{'; there are only characters (braces) during the string parsing.
While in the PLs we definitely deal with real arrays, which have dimensions.Beside that, I would like to note the following error text changes
(all of these are feasible, I think):I'll look into whether we can improve those, unless you had a patch
in mind already?Those messages looked more or less correct to me, I just wanted to note how they are
changing (and haven't highlighted messages, that are not), but if you see here room
for improvement, please look into it (I have no good formulations yet).Best regards,
Alexander
Nikhil Benesch <nikhil.benesch@gmail.com> writes:
I took a look at 0002 because I attempted a similar but more surgical
fix in [0].
I spotted a few opportunities for further reducing state tracked by
`ArrayCount`.
Wow, thanks for looking! I've not run these suggestions to ground
(and won't have time for a few days), but they sound very good.
regards, tom lane
On Mon, Jun 5, 2023 at 9:48 AM Nikhil Benesch <nikhil.benesch@gmail.com> wrote:
I took a look at 0002 because I attempted a similar but more surgical
fix in [0].I spotted a few opportunities for further reducing state tracked by
`ArrayCount`. You may not find all of these suggestions to be
worthwhile.
I pull ArrayCount into a separate C function, regress test (manually)
based on patch regress test.
1) `in_quotes` appears to be wholly redundant with `parse_state ==
ARRAY_QUOTED_ELEM_STARTED`.
removing it works as expected.
3) `eoArray` could be replaced with a new `ArrayParseState` of
`ARRAY_END`. Just a matter of taste, but "end of array" feels like a
parser state to me.
works. (reduce one variable.)
I also have a sense that `ndims_frozen` made the distinction between
`ARRAY_ELEM_DELIMITED` and `ARRAY_LEVEL_DELIMITED` unnecessary, and
the two states could be merged into a single `ARRAY_DELIMITED` state,
but I've not pulled on this thread hard enough to say so confidently.
merging these states into one work as expected.
based on Nikhil Benesch idea.
The attached diff is based on
v1-0002-Rewrite-ArrayCount-to-make-dimensionality-checks.patch.
diff compare v1-0002:
select '{{1,{2}},{2,3}}'::text[];
ERROR: malformed array literal: "{{1,{2}},{2,3}}"
LINE 1: select '{{1,{2}},{2,3}}'::text[];
^
-DETAIL: Unexpected "{" character.
+DETAIL: Multidimensional arrays must have sub-arrays with matching dimensions.
----------------------------------------------------
select E'{{1,2},\\{2,3}}'::text[];
ERROR: malformed array literal: "{{1,2},\{2,3}}"
LINE 1: select E'{{1,2},\\{2,3}}'::text[];
^
-DETAIL: Unexpected "\" character.
+DETAIL: Multidimensional arrays must have sub-arrays with matching dimensions.
-------
new errors details kind of make sense.
Attachments:
0003-changed-based-on-Nikhil-Benesch-idea.patchtext/x-patch; charset=US-ASCII; name=0003-changed-based-on-Nikhil-Benesch-idea.patchDownload+14-26
Nikhil Benesch <nikhil.benesch@gmail.com> writes:
I spotted a few opportunities for further reducing state tracked by
`ArrayCount`. You may not find all of these suggestions to be
worthwhile.
I found some time today to look at these points.
1) `in_quotes` appears to be wholly redundant with `parse_state ==
ARRAY_QUOTED_ELEM_STARTED`.
I agree that it is redundant, but I'm disinclined to remove it because
the in_quotes logic matches that in ReadArrayStr. I think it's better
to keep those two functions in sync. The parse_state represents an
independent set of checks that need not be repeated by ReadArrayStr,
but both functions have to track quoting. The same for eoArray.
2) The `empty_array` special case does not seem to be important to
ArrayCount's callers, which don't even special case `ndims == 0` but
rather `ArrayGetNItemsSafe(..) == 0`. Perhaps this is a philosophical
question as to whether `ArrayCount('{{}, {}}')` should return
(ndims=2, dims=[2, 0]) or (ndims=0). Obviously someone needs to do
that normalization, but `ArrayCount` could leave that normalization to
`ReadArrayStr`.
This idea I do like. While looking at the callers, I also noticed
that it's impossible currently to write an empty array with explicit
specification of bounds. It seems to me that you ought to be able
to write, say,
SELECT '[1:0]={}'::int[];
but up to now you got "upper bound cannot be less than lower bound";
and if you somehow got past that, you'd get "Specified array
dimensions do not match array contents." because of ArrayCount's
premature optimization of "one-dimensional array with length zero"
to "zero-dimensional array". We can fix that by doing what you said
and adjusting the initial bounds restriction to be "upper bound cannot
be less than lower bound minus one".
I also have a sense that `ndims_frozen` made the distinction between
`ARRAY_ELEM_DELIMITED` and `ARRAY_LEVEL_DELIMITED` unnecessary, and
the two states could be merged into a single `ARRAY_DELIMITED` state,
but I've not pulled on this thread hard enough to say so confidently.
I looked at jian he's implementation of that and was not impressed:
I do not think the logic gets any clearer, and it seems to me that
this makes a substantial dent in ArrayCount's ability to detect syntax
errors. The fact that one of the test case error messages got better
seems pretty accidental to me. We can get the same result in a more
purposeful way by giving a different error message for
ARRAY_ELEM_DELIMITED.
So I end up with the attached. I went ahead and dropped
ArrayGetOffset0() as part of 0001, and I split 0002 into two patches
where the new 0002 avoids re-indenting any existing code in order
to ease review, and then 0003 is just a mechanical application
of pgindent.
I still didn't do anything about "number of array dimensions (7)
exceeds the maximum allowed (6)". There are quite a few instances
of that wording, not only array_in's, and I'm not sure whether to
change the rest. In any case that looks like something that
could be addressed separately. The other error message wording
changes here seem to me to be okay.
regards, tom lane
Attachments:
v2-0001-Simplify-and-speed-up-ReadArrayStr.patchtext/x-diff; charset=us-ascii; name=v2-0001-Simplify-and-speed-up-ReadArrayStr.patchDownload+33-53
v2-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patchtext/x-diff; charset=us-ascii; name*0=v2-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.p; name*1=atchDownload+230-93
v2-0003-Re-indent-ArrayCount.patchtext/x-diff; charset=us-ascii; name=v2-0003-Re-indent-ArrayCount.patchDownload+106-109
I wrote:
So I end up with the attached. I went ahead and dropped
ArrayGetOffset0() as part of 0001, and I split 0002 into two patches
where the new 0002 avoids re-indenting any existing code in order
to ease review, and then 0003 is just a mechanical application
of pgindent.
That got sideswiped by ae6d06f09, so here's a trivial rebase to
pacify the cfbot.
regards, tom lane
#text/x-diff; name="v3-0001-Simplify-and-speed-up-ReadArrayStr.patch" [v3-0001-Simplify-and-speed-up-ReadArrayStr.patch] /home/tgl/pgsql/v3-0001-Simplify-and-speed-up-ReadArrayStr.patch
#text/x-diff; name="v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch" [v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch] /home/tgl/pgsql/v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch
#text/x-diff; name="v3-0003-Re-indent-ArrayCount.patch" [v3-0003-Re-indent-ArrayCount.patch] /home/tgl/pgsql/v3-0003-Re-indent-ArrayCount.patch
I wrote:
That got sideswiped by ae6d06f09, so here's a trivial rebase to
pacify the cfbot.
Sigh, this time with patch.
regards, tom lane
Attachments:
v3-0001-Simplify-and-speed-up-ReadArrayStr.patchtext/x-diff; charset=us-ascii; name=v3-0001-Simplify-and-speed-up-ReadArrayStr.patchDownload+33-53
v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patchtext/x-diff; charset=us-ascii; name*0=v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.p; name*1=atchDownload+230-93
v3-0003-Re-indent-ArrayCount.patchtext/x-diff; charset=us-ascii; name=v3-0003-Re-indent-ArrayCount.patchDownload+106-109
On 08/07/2023 19:08, Tom Lane wrote:
I wrote:
So I end up with the attached. I went ahead and dropped
ArrayGetOffset0() as part of 0001, and I split 0002 into two patches
where the new 0002 avoids re-indenting any existing code in order
to ease review, and then 0003 is just a mechanical application
of pgindent.That got sideswiped by ae6d06f09, so here's a trivial rebase to
pacify the cfbot.#text/x-diff; name="v3-0001-Simplify-and-speed-up-ReadArrayStr.patch" [v3-0001-Simplify-and-speed-up-ReadArrayStr.patch] /home/tgl/pgsql/v3-0001-Simplify-and-speed-up-ReadArrayStr.patch
#text/x-diff; name="v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch" [v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch] /home/tgl/pgsql/v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch
#text/x-diff; name="v3-0003-Re-indent-ArrayCount.patch" [v3-0003-Re-indent-ArrayCount.patch] /home/tgl/pgsql/v3-0003-Re-indent-ArrayCount.patch
Something's wrong with your attachments.
Hmm, I wonder if ae6d06f09 had a negative performance impact. In an
unquoted array element, scanner_isspace() function is called for every
character, so it might be worth inlining.
On the patches: They are a clear improvement, thanks for that. That
said, I still find the logic very hard to follow, and there are some
obvious performance optimizations that could be made.
ArrayCount() interprets low-level quoting and escaping, and tracks the
dimensions at the same time. The state machine is pretty complicated.
And when you've finally finished reading and grokking that function, you
see that ReadArrayStr() repeats most of the same logic. Ugh.
I spent some time today refactoring it for readability and speed. I
introduced a separate helper function to tokenize the input. It deals
with whitespace, escapes, and backslashes. Then I merged ArrayCount()
and ReadArrayStr() into one function that parses the elements and
determines the dimensions in one pass. That speeds up parsing large
arrays. With the tokenizer function, the logic in ReadArrayStr() is
still quite readable, even though it's now checking the dimensions at
the same time.
I also noticed that we used atoi() to parse the integers in the
dimensions, which doesn't do much error checking. Some funny cases were
accepted because of that, for example:
postgres=# select '[1+-+-+-+-+-+:2]={foo,bar}'::text[];
text
-----------
{foo,bar}
(1 row)
I tightened that up in the passing.
Attached are your patches, rebased to fix the conflicts with ae6d06f09
like you intended. On top of that, my patches. My patches need more
testing, benchmarking, and review, so if we want to sneak something into
v16, better go with just your patches. If we're tightening up the
accepted inputs, maybe fix that atoi() sloppiness, though.
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments:
v3-0001-Simplify-and-speed-up-ReadArrayStr.patchtext/x-patch; charset=UTF-8; name=v3-0001-Simplify-and-speed-up-ReadArrayStr.patchDownload+33-53
v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patchtext/x-patch; charset=UTF-8; name=v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patchDownload+230-93
v3-0003-Re-indent-ArrayCount.patchtext/x-patch; charset=UTF-8; name=v3-0003-Re-indent-ArrayCount.patchDownload+106-109
v3-0004-Extract-loop-to-read-array-dimensions-to-subrouti.patchtext/x-patch; charset=UTF-8; name=v3-0004-Extract-loop-to-read-array-dimensions-to-subrouti.patchDownload+152-78
v3-0005-Determine-array-dimensions-and-parse-the-elements.patchtext/x-patch; charset=UTF-8; name=v3-0005-Determine-array-dimensions-and-parse-the-elements.patchDownload+431-600
Heikki Linnakangas <hlinnaka@iki.fi> writes:
On 08/07/2023 19:08, Tom Lane wrote:
That got sideswiped by ae6d06f09, so here's a trivial rebase to
pacify the cfbot.
Something's wrong with your attachments.
Yeah, I forgot to run mhbuild :-(
I spent some time today refactoring it for readability and speed. I
introduced a separate helper function to tokenize the input. It deals
with whitespace, escapes, and backslashes. Then I merged ArrayCount()
and ReadArrayStr() into one function that parses the elements and
determines the dimensions in one pass. That speeds up parsing large
arrays. With the tokenizer function, the logic in ReadArrayStr() is
still quite readable, even though it's now checking the dimensions at
the same time.
Oh, thanks for taking a look!
I also noticed that we used atoi() to parse the integers in the
dimensions, which doesn't do much error checking.
Yup, I'd noticed that too but not gotten around to doing anything
about it. I agree with nailing it down better as long as we're
tightening things in this area.
Attached are your patches, rebased to fix the conflicts with ae6d06f09
like you intended. On top of that, my patches. My patches need more
testing, benchmarking, and review, so if we want to sneak something into
v16, better go with just your patches.
At this point I'm only proposing this for v17, so additional cleanup
is welcome.
BTW, what's your opinion of allowing "[1:0]={}" ? Although that was
my proposal to begin with, I'm having second thoughts about it now.
The main reason is that the input transformation would be lossy,
eg "[1:0]={}" and "[101:100]={}" would give the same results, which
seems a little ugly. Given the lack of field complaints, maybe we
should leave that alone.
regards, tom lane
On 08/07/2023 22:49, Tom Lane wrote:
BTW, what's your opinion of allowing "[1:0]={}" ? Although that was
my proposal to begin with, I'm having second thoughts about it now.
The main reason is that the input transformation would be lossy,
eg "[1:0]={}" and "[101:100]={}" would give the same results, which
seems a little ugly.
Hmm, yeah, that would feel wrong if you did something like this:
select ('[2:1]={}'::text[]) || '{x}'::text[];
and expected it to return '[2:2]={x}'.
I guess we could allow "[1:0]={}" as a special case, but not
"[101:100]={}", but that would be weird too.
Given the lack of field complaints, maybe we should leave that
alone.
+1 to leave it alone. It's a little weird either way, so better to stay
put. We can revisit it later if we want to, but I wouldn't want to go
back and forth on it.
--
Heikki Linnakangas
Neon (https://neon.tech)
On Sun, Jul 9, 2023 at 3:38 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 08/07/2023 19:08, Tom Lane wrote:
I wrote:
So I end up with the attached. I went ahead and dropped
ArrayGetOffset0() as part of 0001, and I split 0002 into two patches
where the new 0002 avoids re-indenting any existing code in order
to ease review, and then 0003 is just a mechanical application
of pgindent.That got sideswiped by ae6d06f09, so here's a trivial rebase to
pacify the cfbot.#text/x-diff; name="v3-0001-Simplify-and-speed-up-ReadArrayStr.patch" [v3-0001-Simplify-and-speed-up-ReadArrayStr.patch] /home/tgl/pgsql/v3-0001-Simplify-and-speed-up-ReadArrayStr.patch
#text/x-diff; name="v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch" [v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch] /home/tgl/pgsql/v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch
#text/x-diff; name="v3-0003-Re-indent-ArrayCount.patch" [v3-0003-Re-indent-ArrayCount.patch] /home/tgl/pgsql/v3-0003-Re-indent-ArrayCount.patchSomething's wrong with your attachments.
Hmm, I wonder if ae6d06f09 had a negative performance impact. In an
unquoted array element, scanner_isspace() function is called for every
character, so it might be worth inlining.On the patches: They are a clear improvement, thanks for that. That
said, I still find the logic very hard to follow, and there are some
obvious performance optimizations that could be made.ArrayCount() interprets low-level quoting and escaping, and tracks the
dimensions at the same time. The state machine is pretty complicated.
And when you've finally finished reading and grokking that function, you
see that ReadArrayStr() repeats most of the same logic. Ugh.I spent some time today refactoring it for readability and speed. I
introduced a separate helper function to tokenize the input. It deals
with whitespace, escapes, and backslashes. Then I merged ArrayCount()
and ReadArrayStr() into one function that parses the elements and
determines the dimensions in one pass. That speeds up parsing large
arrays. With the tokenizer function, the logic in ReadArrayStr() is
still quite readable, even though it's now checking the dimensions at
the same time.I also noticed that we used atoi() to parse the integers in the
dimensions, which doesn't do much error checking. Some funny cases were
accepted because of that, for example:postgres=# select '[1+-+-+-+-+-+:2]={foo,bar}'::text[];
text
-----------
{foo,bar}
(1 row)I tightened that up in the passing.
Attached are your patches, rebased to fix the conflicts with ae6d06f09
like you intended. On top of that, my patches. My patches need more
testing, benchmarking, and review, so if we want to sneak something into
v16, better go with just your patches. If we're tightening up the
accepted inputs, maybe fix that atoi() sloppiness, though.--
Heikki Linnakangas
Neon (https://neon.tech)
your idea is so clear!!!
all the Namings are way more descriptive. ArrayToken, personally
something with "state", "type" will be more clear.
/*
* FIXME: Is this still required? I believe all the checks it performs are
* redundant with other checks in ReadArrayDimension() and ReadArrayStr()
*/
nitems_according_to_dims = ArrayGetNItemsSafe(ndim, dim, escontext);
if (nitems_according_to_dims < 0)
PG_RETURN_NULL();
if (nitems != nitems_according_to_dims)
elog(ERROR, "mismatch nitems, %d vs %d", nitems, nitems_according_to_dims);
if (!ArrayCheckBoundsSafe(ndim, dim, lBound, escontext))
PG_RETURN_NULL();
--first time run
select '[0:3][0:2]={{1,2,3}, {4,5,6}, {7,8,9},{1,2,3}}'::int[];
INFO: 253 after ReadArrayDimensions dim: 4 3 71803430 21998
103381120 21998 ndim: 2
INFO: 770 after ReadArrayStr: dim: 4 3 71803430 21998 103381120
21998 nitems:12, ndim:2
--second time run.
INFO: 253 after ReadArrayDimensions dim: 4 3 0 0 0 0 ndim: 2
INFO: 770 after ReadArrayStr: dim: 4 3 0 0 0 0 nitems:12, ndim:2
select '{{1,2,3}, {4,5,6}, {7,8,9},{1,2,3}}'::int[]; --every time run,
the result is the same.
INFO: 253 after ReadArrayDimensions dim: 0 0 0 0 0 0 ndim: 0
INFO: 770 after ReadArrayStr: dim: 4 3 -1 -1 -1 -1 nitems:12, ndim:2
I think the reason is that the dim int array didn't explicitly assign
value when initializing it.
/* Now it's safe to compute ub + 1 */
if (ub + 1 < lBound[ndim])
ereturn(escontext, false,
(errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR),
errmsg("upper bound cannot be less than lower bound minus one")));
This part seems safe against cases like select
'[-2147483649:-2147483648]={1,2}'::int[];
but I am not sure. If so, then ArrayCheckBoundsSafe is unnecessary.
Another corner case successed: select '{1,}'::int[]; should fail.
hi.
based on Heikki v3.
I made some changes:
array_in: dim[6] all initialize with -1, lBound[6] all initialize with 1.
if ReadArrayDimensions called, then corresponding dimension lBound
will replace the initialized default 1 value.
ReadArrayStr, since array_in main function initialized dim array,
dimensions_specified true or false, I don't need to initialize again,
so I deleted that part.
to solve corner cases like '{{1,},{1},}'::text[]. in ReadArrayStr
main switch function, like other ArrayToken, first evaluate
expect_delim then assign expect_delim.
In ATOK_LEVEL_END. if non-empty array, closing bracket either precede
with an element or another closing element. In both cases, the
previous expect_delim should be true.
in
* FIXME: Is this still required? I believe all the checks it
performs are
* redundant with other checks in ReadArrayDimension() and
ReadArrayStr()
*/
I deleted
- nitems_according_to_dims = ArrayGetNItemsSafe(ndim, dim, escontext);
- if (nitems_according_to_dims < 0)
- PG_RETURN_NULL();
- if (nitems != nitems_according_to_dims)
- elog(ERROR, "mismatch nitems, %d vs %d", nitems,
nitems_according_to_dims);
but I am not sure if the following is necessary.
if (!ArrayCheckBoundsSafe(ndim, dim, lBound, escontext))
PG_RETURN_NULL();
I added some corner case tests like select '{{1,},{1},}'::text[];
some changes broken:
select '{{1},{}}'::text[];
-DETAIL: Multidimensional arrays must have sub-arrays with matching dimensions.
+DETAIL: Unexpected "," character.
I added some error checks in ATOK_LEVEL_END. The first expect_delim
part check will first generate an error, the dimension error part will
not be reached.
Attachments:
v4-0001-based-on-Heikki-verion3.patchtext/x-patch; charset=US-ASCII; name=v4-0001-based-on-Heikki-verion3.patchDownload+55-17
hi.
attached v4.
v4, 0001 to 0005 is the same as v3 in
/messages/by-id/5859ce4e-2be4-92b0-c85c-e1e24eab57c6@iki.fi
v4-0006 doing some modifications to address the corner case mentioned
in the previous thread (like select '{{1,},{1},}'::text[]).
also fixed all these FIXME, Heikki mentioned in the code.
v4-0007 refactor ReadDimensionInt. to make the array dimension bound
variables within the INT_MIN and INT_MAX. so it will make select
'[21474836488:21474836489]={1,2}'::int[]; fail.
Attachments:
v4-0006-refactor-to-make-corner-case-1-int-fail.patchtext/x-patch; charset=US-ASCII; name=v4-0006-refactor-to-make-corner-case-1-int-fail.patchDownload+73-20
v4-0007-refactor-ReadDimensionInt.patchtext/x-patch; charset=US-ASCII; name=v4-0007-refactor-ReadDimensionInt.patchDownload+10-11
v4-0004-Extract-loop-to-read-array-dimensions-to-subrouti.patchtext/x-patch; charset=US-ASCII; name=v4-0004-Extract-loop-to-read-array-dimensions-to-subrouti.patchDownload+152-78
v4-0005-Determine-array-dimensions-and-parse-the-elements.patchtext/x-patch; charset=US-ASCII; name=v4-0005-Determine-array-dimensions-and-parse-the-elements.patchDownload+431-600
v4-0003-Re-indent-ArrayCount.patchtext/x-patch; charset=US-ASCII; name=v4-0003-Re-indent-ArrayCount.patchDownload+106-109
v4-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patchtext/x-patch; charset=US-ASCII; name=v4-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patchDownload+230-93
v4-0001-Simplify-and-speed-up-ReadArrayStr.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Simplify-and-speed-up-ReadArrayStr.patchDownload+33-53
On Mon, Sep 4, 2023 at 8:00 AM jian he <jian.universality@gmail.com> wrote:
hi.
attached v4.
v4, 0001 to 0005 is the same as v3 in
/messages/by-id/5859ce4e-2be4-92b0-c85c-e1e24eab57c6@iki.fiv4-0006 doing some modifications to address the corner case mentioned
in the previous thread (like select '{{1,},{1},}'::text[]).
also fixed all these FIXME, Heikki mentioned in the code.v4-0007 refactor ReadDimensionInt. to make the array dimension bound
variables within the INT_MIN and INT_MAX. so it will make select
'[21474836488:21474836489]={1,2}'::int[]; fail.
attached, same as v4, but delete unused variable {nitems_according_to_dims}.
Attachments:
v5-0005-Determine-array-dimensions-and-parse-the-elements.patchtext/x-patch; charset=US-ASCII; name=v5-0005-Determine-array-dimensions-and-parse-the-elements.patchDownload+431-600
v5-0001-Simplify-and-speed-up-ReadArrayStr.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Simplify-and-speed-up-ReadArrayStr.patchDownload+33-53
v5-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patchtext/x-patch; charset=US-ASCII; name=v5-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patchDownload+230-93
v5-0003-Re-indent-ArrayCount.patchtext/x-patch; charset=US-ASCII; name=v5-0003-Re-indent-ArrayCount.patchDownload+106-109
v5-0004-Extract-loop-to-read-array-dimensions-to-subrouti.patchtext/x-patch; charset=US-ASCII; name=v5-0004-Extract-loop-to-read-array-dimensions-to-subrouti.patchDownload+152-78
v5-0006-resolve-corner-case.patchtext/x-patch; charset=US-ASCII; name=v5-0006-resolve-corner-case.patchDownload+73-21
v5-0007-refactor-ReadDimensionInt.patchtext/x-patch; charset=US-ASCII; name=v5-0007-refactor-ReadDimensionInt.patchDownload+10-11
Hello Jian,
05.09.2023 02:53, jian he wrote:
On Mon, Sep 4, 2023 at 8:00 AM jian he <jian.universality@gmail.com> wrote:
hi.
attached v4.
v4, 0001 to 0005 is the same as v3 in
/messages/by-id/5859ce4e-2be4-92b0-c85c-e1e24eab57c6@iki.fiv4-0006 doing some modifications to address the corner case mentioned
in the previous thread (like select '{{1,},{1},}'::text[]).
also fixed all these FIXME, Heikki mentioned in the code.v4-0007 refactor ReadDimensionInt. to make the array dimension bound
variables within the INT_MIN and INT_MAX. so it will make select
'[21474836488:21474836489]={1,2}'::int[]; fail.attached, same as v4, but delete unused variable {nitems_according_to_dims}.
Please look at the differences, I've observed with the latest patches
applied, old vs new behavior:
Case 1:
SELECT '{1,'::integer[];
ERROR: malformed array literal: "{1,"
LINE 1: SELECT '{1,'::integer[];
^
DETAIL: Unexpected end of input.
vs
ERROR: malformed array literal: "{1,"
LINE 1: SELECT '{1,'::integer[];
^
(no DETAIL)
Case 2:
SELECT '{{},}'::text[];
ERROR: malformed array literal: "{{},}"
LINE 1: SELECT '{{},}'::text[];
^
DETAIL: Unexpected "}" character
vs
text
------
{}
(1 row)
Case 3:
select '{\{}'::text[];
text
-------
{"{"}
(1 row)
vs
text
------
{""}
Best regards,
Alexander