Re: make_greater_string() does not return a string in some cases
Hello. May I continue with this topic?
This glitch is hard to ignore for those of us storing CJK (Chinese,
Japanese, and Korean) characters in a database. Speaking of Japanese
under standard usage, about a hundred characters out of seven thousand
make make_greater_string() fail. That is not frequent, but neither is
it rare enough to ignore.
I think this glitch arises because deriving the `next character' is
fundamentally specific to each encoding, yet make_greater_string()
currently applies a single method, extended from the one-byte ASCII
case, to all encodings alike. So I think it is reasonable for the
encoding info table (struct pg_wchar_tbl) to hold the function that
does this.
How about this idea?
The points to realize this are as follows:
- pg_wchar_tbl@pg_wchar.c has a new element `charinc' that holds a
function to increment a character of this encoding.
- Basically, the value of charinc is a `generic' increment
function that does what make_greater_string() does in the current
implementation.
- make_greater_string() now uses charinc for database encoding to
increment characters instead of the code directly written in
it.
- Give UTF-8 a special increment function.
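To illustrate the last point, here is a sketch of the idea only (not the code in the attached patch): a UTF-8 incrementer can decode the final character to a code point, add one while hopping over the surrogate range, and re-encode, giving up when the successor would not fit in the same number of bytes.

```c
/* Sketch of a UTF-8-specific character incrementer.  Increments the
 * character stored in charptr (len bytes, 1..4) to the next code
 * point with the same encoded length.  Returns 1 on success, 0 when
 * no same-length successor exists. */
static int
utf8_increment(unsigned char *charptr, int len)
{
    unsigned int cp;

    /* decode to a code point */
    if (len == 1)
        cp = charptr[0];
    else if (len == 2)
        cp = ((charptr[0] & 0x1F) << 6) | (charptr[1] & 0x3F);
    else if (len == 3)
        cp = ((charptr[0] & 0x0F) << 12) | ((charptr[1] & 0x3F) << 6)
            | (charptr[2] & 0x3F);
    else if (len == 4)
        cp = ((charptr[0] & 0x07) << 18) | ((charptr[1] & 0x3F) << 12)
            | ((charptr[2] & 0x3F) << 6) | (charptr[3] & 0x3F);
    else
        return 0;

    cp++;
    if (cp >= 0xD800 && cp <= 0xDFFF)   /* hop over UTF-16 surrogates */
        cp = 0xE000;

    /* re-encode only if the successor still needs exactly len bytes */
    if (len == 1 && cp <= 0x7F)
        charptr[0] = cp;
    else if (len == 2 && cp >= 0x80 && cp <= 0x7FF)
    {
        charptr[0] = 0xC0 | (cp >> 6);
        charptr[1] = 0x80 | (cp & 0x3F);
    }
    else if (len == 3 && cp >= 0x800 && cp <= 0xFFFF)
    {
        charptr[0] = 0xE0 | (cp >> 12);
        charptr[1] = 0x80 | ((cp >> 6) & 0x3F);
        charptr[2] = 0x80 | (cp & 0x3F);
    }
    else if (len == 4 && cp >= 0x10000 && cp <= 0x10FFFF)
    {
        charptr[0] = 0xF0 | (cp >> 18);
        charptr[1] = 0x80 | ((cp >> 12) & 0x3F);
        charptr[2] = 0x80 | ((cp >> 6) & 0x3F);
        charptr[3] = 0x80 | (cp & 0x3F);
    }
    else
        return 0;               /* successor needs a longer encoding */

    return 1;
}
```

For example, the generic bump of the last byte turns 0xC3 0xBF (U+00FF) into the invalid sequence 0xC3 0xC0, whereas this routine produces 0xC4 0x80 (U+0100).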
As a consequence of this modification, make_greater_string() becomes
somewhat simpler, since the sequence that handles bare bytes in the
string disappears. And doing `increment character' with knowledge of
the encoding can be straightforward, light, and backtrack-free, with
fewer glitches than the generic method.
# Disappointingly, though, the handling for BYTEAOID remains.
Some glitches still remain, but I think it would be overkill to
perform a conversion that changes the byte length of the character.
Only 5 code points out of 17 thousand (roughly all BMP characters,
under the current method) still fail, and none of them is a Japanese
character :-)
The attached patch is a sample implementation of this idea.
What do you think of it?
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Jul 8, 2011 at 5:21 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@oss.ntt.co.jp> wrote:
Points to realize this follows,
Please add your patch to the next CommitFest.
https://commitfest.postgresql.org/action/commitfest_view/open
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thanks for your suggestion, I'll do so.
At Fri, 8 Jul 2011 23:28:32 -0400, Robert Haas <robertmhaas@gmail.com> wrote:
Please add your patch to the next CommitFest.
https://commitfest.postgresql.org/action/commitfest_view/open
--
Kyotaro Horiguchi
NTT Open Source Software Center
This is an updated patch for the next CommitFest, 2011/09.
Please ignore this message.
1. Additional feature: EUC-JP incrementer.
2. Bug fixes: bytea incrementer, libpq compilation.
--
Kyotaro Horiguchi
NTT Open Source Software Center
This is a rebased patch of `Allow encoding specific character
incrementer' (https://commitfest.postgresql.org/action/patch_view?id=602).
In addition to the patch, a sanity-check program for the increments of
the new functions pg_generic_charinc and pg_utf8_increment is
attached.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Sep 13, 2011 at 10:13 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@oss.ntt.co.jp> wrote:
This is rebased patch of `Allow encoding specific character
incrementer' (https://commitfest.postgresql.org/action/patch_view?id=602).
Addition to the patch, increment sanity check program for new
functions pg_generic_charinc and pg_utf8_increment is attached.
I took a look at this patch ... and the thread ... and the previous
thread with the same subject line:
http://archives.postgresql.org/pgsql-bugs/2010-06/msg00303.php
As I understand it, the issue here is that when we try to generate
suitable > and < quals for a LIKE expression, we need to find a string
which is greater than the prefix we're searching for, and the existing
algorithm sometimes fails. But there seems to be some dispute over
how likely this is to occur. Tom argues that the case is so rare that
we shouldn't worry about it:
http://archives.postgresql.org/pgsql-bugs/2010-06/msg00336.php
...while Kyotaro Horiguchi clearly feels otherwise, citing the
statistic that about 100 out of 7000 Japanese characters fail to work
properly:
http://archives.postgresql.org/pgsql-bugs/2011-07/msg00064.php
That statistic seems to justify some action, but what? Ideas:
1. Adopt the patch as proposed, or something like it.
2. Instead of installing encoding-specific character incrementing
functions, we could try to come up with a more reliable generic
algorithm. Not sure exactly what, though.
3. Come up with some way to avoid needing to do this in the first place.
One random idea I have is - instead of generating > and < clauses,
could we define a "prefix match" operator - i.e. a ### b iff substr(a,
1, length(b)) = b? We'd need to do something about the selectivity,
but I don't see why that would be a problem.
Thoughts?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
As I understand it, the issue here is that when we try to generate
suitable > and < quals for a LIKE expression, we need to find a string
which is greater than the prefix we're searching for, and the existing
algorithm sometimes fails. But there seems to be some dispute over
how likely this is to occur. Tom argues that the case is so rare that
we shouldn't worry about it:
http://archives.postgresql.org/pgsql-bugs/2010-06/msg00336.php
Well, I didn't like the amount of overhead that was implied by that
proposal. I don't have a great deal of difficulty with the idea of
encoding-specific character incrementers, especially since they could
presumably be more efficient than the current brute force approach
wherein we call pg_verifymbstr on the entire string after mangling
just the last byte.
The main risk that I foresee with the proposed approach is that if you
have, say, a 4-byte final character, you could iterate through a *whole
lot* (millions) of larger encoded characters, with absolutely no hope of
making a greater string that way when the determining difference occurs
at some earlier character. And then when you do run out, you could
waste just as much time at the immediately preceding character, etc etc.
The existing algorithm is a compromise between thoroughness of
investigation of different values of the last character versus speed of
falling back to change the preceding character instead. I'd be the
first to say that incrementing only the last byte is a very
quick-and-dirty heuristic for making that happen. But I think it would
be unwise to allow the thing to consider more than a few hundred values
for a character position before giving up. Possibly the
encoding-specific incrementer could be designed to run through all legal
values of the last byte, then start skipping larger and larger ranges
--- maybe just move directly to incrementing the first byte.
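A hedged sketch of that bounded search, assuming UTF-8 (charinc_limited is a hypothetical name, not anything in the patch): run the trailing byte through its remaining legal values, then jump straight to bumping the lead byte instead of walking the millions of encoded characters in between.

```c
/* Bounded-search incrementer sketch for UTF-8 (illustrative only).
 * Returns 1 if charptr (len bytes) was advanced, 0 to tell the
 * caller to truncate the string and retry on the previous character. */
static int
charinc_limited(unsigned char *charptr, int len)
{
    /* UTF-8 continuation bytes run 0x80..0xBF; a lone byte tops out
     * at 0x7F. */
    unsigned char last_max = (len == 1) ? 0x7F : 0xBF;
    unsigned char first_max;

    /* Phase 1: bump the trailing byte through its legal range. */
    if (charptr[len - 1] < last_max)
    {
        charptr[len - 1]++;
        return 1;
    }

    /* Phase 2: trailing byte exhausted -- skip ahead by bumping the
     * lead byte, staying within the legal lead range for this
     * sequence length. */
    first_max = (len == 1) ? 0x7F :
                (len == 2) ? 0xDF :
                (len == 3) ? 0xEF : 0xF4;
    if (charptr[0] < first_max)
    {
        charptr[0]++;
        /* steer clear of the surrogate and beyond-U+10FFFF holes */
        if (charptr[0] == 0xED && charptr[1] > 0x9F)
            charptr[1] = 0x9F;
        if (charptr[0] == 0xF4 && charptr[1] > 0x8F)
            charptr[1] = 0x8F;
        return 1;
    }
    return 0;
}
```

This considers at most two candidate positions per character, at the cost of overshooting the true successor by a large range of code points.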
Aside from that issue, the submitted patch seems to need quite a lot of
detail work; for instance, I noted an unconstrained memcpy into a 4-byte
local buffer, as well as lots and lots of violations of PG house style.
That's certainly all fixable but somebody will have to go through it.
One random idea I have is - instead of generating > and < clauses,
could we define a "prefix match" operator - i.e. a ### b iff substr(a,
1, length(b)) = b? We'd need to do something about the selectivity,
but I don't see why that would be a problem.
The problem is that you'd need to make that a btree-indexable operator.
regards, tom lane
On Wed, Sep 21, 2011 at 9:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The main risk that I foresee with the proposed approach is that if you
have, say, a 4-byte final character, you could iterate through a *whole
lot* (millions) of larger encoded characters, with absolutely no hope of
making a greater string that way when the determining difference occurs
at some earlier character. And then when you do run out, you could
waste just as much time at the immediately preceding character, etc etc.
The existing algorithm is a compromise between thoroughness of
investigation of different values of the last character versus speed of
falling back to change the preceding character instead. I'd be the
first to say that incrementing only the last byte is a very
quick-and-dirty heuristic for making that happen. But I think it would
be unwise to allow the thing to consider more than a few hundred values
for a character position before giving up. Possibly the
encoding-specific incrementer could be designed to run through all legal
values of the last byte, then start skipping larger and larger ranges
--- maybe just move directly to incrementing the first byte.
I'm a bit perplexed as to why we can't find a non-stochastic way of doing this.
Aside from that issue, the submitted patch seems to need quite a lot of
detail work; for instance, I noted an unconstrained memcpy into a 4-byte
local buffer, as well as lots and lots of violations of PG house style.
That's certainly all fixable but somebody will have to go through it.
I noticed that, too, but figured we should agree on the basic approach
first. In the first instance, "someone" will hopefully be the patch
author. (Help from anyone else is also welcome, of course... we have
almost 40 patches left to crawl through and, at the moment, very few
reviewers.)
One random idea I have is - instead of generating > and < clauses,
could we define a "prefix match" operator - i.e. a ### b iff substr(a,
1, length(b)) = b? We'd need to do something about the selectivity,
but I don't see why that would be a problem.The problem is that you'd need to make that a btree-indexable operator.
Well, right. Without that, there's not much point. But do you think
that's prohibitively difficult?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
I'm a bit perplexed as to why we can't find a non-stochastic way of doing this.
Because the behavior of common collation algorithms is so wacky as to
approach stochasticity. In particular, as soon as your string contains
a mix of letter and non-letter characters, "dictionary" algorithms tend
to behave really strangely. We went around on this quite a few times
before settling on the current approach; we kept thinking exactly what
you're thinking, namely that we ought to be able to predict more than
nothing about what the collation algorithm would consider larger.
An example of the weirdness is that in en_US collation, we have
postgres=# select '/root/' < '/root0';
?column?
----------
t
(1 row)
postgres=# select '/root/t' < '/root0';
?column?
----------
f
(1 row)
It was cases like this that forced us to give up on using non-C-locale
indexes for LIKE optimization, because the derived greater-than and
less-than conditions have to be 100% correct for that. What we're still
using make_greater_string for in non-C locales is to generate estimates
about the selectivity of LIKE prefix conditions; for that, some amount
of error is tolerable and so we just live with effects like the above.
(We have to deal with the locale's sort order, whatever the heck it is,
because that's what the pg_statistic histogram is computed in.)
Now, having said that, I'm starting to wonder again why it's worth our
trouble to fool with encoding-specific incrementers. The exactness of
the estimates seems unlikely to be improved very much by doing this.
One random idea I have is - instead of generating > and < clauses,
could we define a "prefix match" operator - i.e. a ### b iff substr(a,
1, length(b)) = b? We'd need to do something about the selectivity,
but I don't see why that would be a problem.
The problem is that you'd need to make that a btree-indexable operator.
Well, right. Without that, there's not much point. But do you think
that's prohibitively difficult?
The problem is that you'd just be shifting all these same issues into
the btree index machinery, which is not any better equipped to cope with
them, and would not be a good place to be adding overhead.
regards, tom lane
Thank you for your understanding on that point.
At Wed, 21 Sep 2011 20:35:02 -0400, Robert Haas <robertmhaas@gmail.com> wrote
...while Kyotaro Horiguchi clearly feels otherwise, citing the
statistic that about 100 out of 7000 Japanese characters fail to work
properly:
http://archives.postgresql.org/pgsql-bugs/2011-07/msg00064.php
That statistic seems to justify some action, but what? Ideas:
In addition to those figures: based on the whole character set defined
in JIS X 0208, which has traditionally been used for information
exchange in Japan (though it is becoming history now), narrowing to
the commonly-used characters (the `Jouyou Kanji', expected to be
learned by high school graduates in Japan), 35 out of 2100 are
affected.
# On the other hand, widening to JIS X 0213, which is roughly
# compatible with Unicode and defines more than 12K characters, I
# have not counted, but the additional 5K characters can be assumed
# to fail with lower probability than the characters in JIS X 0208.
1. Adopt the patch as proposed, or something like it.
2. Instead of installing encoding-specific character incrementing
functions, we could try to come up with a more reliable generic
algorithm. Not sure exactly what, though.
3. Come up with some way to avoid needing to do this in the first place.
One random idea I have is - instead of generating > and < clauses,
could we define a "prefix match" operator - i.e. a ### b iff substr(a,
1, length(b)) = b? We'd need to do something about the selectivity,
but I don't see why that would be a problem.
Thoughts?
I am a newcomer to PostgreSQL, but from a general point of view, I
think the most radical and clean way to fix this behavior is to build
forward (prefix) matching for strings into the indexes themselves,
ignoring whatever overhead that might impose, which I cannot judge.
This could cover all the failures this patch leaves unfixed, assuming
that the `greater string' need not be a `valid string' merely for
searching the btree.
Another idea I can think of is to add a new operator meaning "test
whether the string value is smaller than the `greater string' of the
parameter". Such an operator could also defer constructing the
`greater string' until just before searching the btree, summing up
histogram entries, or comparing with column values. If the assumption
above holds, the `greater string' could then be made without regard to
the character encoding. This seems to have a smaller impact than a
"prefix match" operator.
# But, hmm, the more I investigate, the smaller the difference seems
# to me... Anyway, this is beyond my knowledge for now; I need more
# study.
On the other hand, if no further encoding-specific `character
increment functions' are expected to appear, the modification of
pg_wchar_table can be cancelled and make_greater_string() can select
the `character increment function' via 'switch
(GetDatabaseEncoding()) { case PG_UTF8: ... }'. That would also get
rid of the pg_generic_charinc tweak for libpq.
At Wed, 21 Sep 2011 21:49:27 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote
detail work; for instance, I noted an unconstrained memcpy into a 4-byte
local buffer, as well as lots and lots of violations of PG house style.
That's certainly all fixable but somebody will have to go through it.
Sorry for the style violations in the patch. I will fix them.
Regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Sep 22, 2011 at 12:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
I'm a bit perplexed as to why we can't find a non-stochastic way of doing this.
[ collations suck ]
Ugh.
Now, having said that, I'm starting to wonder again why it's worth our
trouble to fool with encoding-specific incrementers. The exactness of
the estimates seems unlikely to be improved very much by doing this.
Well, so the problem is that the frequency with which the algorithm
fails altogether seems to be disturbingly high for certain kinds of
characters. I agree it might not be that important to get the
absolutely best next string, but it does seem important not to fail
outright. Kyotaro Horiguchi gives the example of UTF-8 characters
ending with 0xbf.
One random idea I have is - instead of generating > and < clauses,
could we define a "prefix match" operator - i.e. a ### b iff substr(a,
1, length(b)) = b? We'd need to do something about the selectivity,
but I don't see why that would be a problem.The problem is that you'd need to make that a btree-indexable operator.
Well, right. Without that, there's not much point. But do you think
that's prohibitively difficult?
The problem is that you'd just be shifting all these same issues into
the btree index machinery, which is not any better equipped to cope with
them, and would not be a good place to be adding overhead.
My thought was that it would avoid the need to do any character
incrementing at all. You could just start scanning forward as if the
operator were >= and then stop when you hit the first string that
doesn't have the same initial substring.
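That scan idea can be sketched over a bytewise-sorted array standing in for the index (scan_prefix is a hypothetical helper, and this only works when common-prefix strings really are contiguous, which is the point of contention below):

```c
#include <string.h>

/* Sketch of the "scan forward from >= prefix" idea: over a
 * bytewise-sorted array of strings, start at the first entry >= the
 * prefix and stop at the first entry that no longer begins with it.
 * Under bytewise (C-locale) ordering all prefix matches are
 * contiguous, so this finds exactly the matching entries. */
static int
scan_prefix(const char **sorted, int n, const char *prefix,
            const char **out, int outmax)
{
    int i, found = 0;
    size_t plen = strlen(prefix);

    /* seek: stand-in for the btree descent to the lower bound */
    for (i = 0; i < n && strcmp(sorted[i], prefix) < 0; i++)
        ;
    /* scan: stop at the first non-matching string */
    for (; i < n && strncmp(sorted[i], prefix, plen) == 0; i++)
        if (found < outmax)
            out[found++] = sorted[i];
    return found;
}
```

Note that no `greater string' is ever computed here; the stopping condition is the prefix test itself.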
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 22, 2011 at 1:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
My thought was that it would avoid the need to do any character
incrementing at all. You could just start scanning forward as if the
operator were >= and then stop when you hit the first string that
doesn't have the same initial substring.
But the whole problem is that not all the strings with the initial
substring are in a contiguous block. The best we can hope for is that
they're fairly dense within a block without too many non-matching
strings. The example with / shows how that can happen.
If you're looking for foo/% and you start with foo/ you'll find:
foo/
foo0
foo/0
foo1
foo/1
...
Even just case-insensitive collations don't put all the strings with a
common prefix in a contiguous block. If you're searching for foo%
you'll find:
foo
Foobar
foobar
--
greg
On Thu, Sep 22, 2011 at 8:59 AM, Greg Stark <stark@mit.edu> wrote:
On Thu, Sep 22, 2011 at 1:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
My thought was that it would avoid the need to do any character
incrementing at all. You could just start scanning forward as if the
operator were >= and then stop when you hit the first string that
doesn't have the same initial substring.
But the whole problem is that not all the strings with the initial
substring are in a contiguous block. The best we can hope for is that
they're fairly dense within a block without too many non-matching
strings. The example with / shows how that can happen.
If you're looking for foo/% and you start with foo/ you'll find:
foo/
foo0
foo/0
foo1
foo/1
...
Even just case-insensitive collations don't put all the strings with a
common prefix in a contiguous block. If you're searching for foo%
you'll find:
foo
Foobar
foobar
If that were true for the sorts of indexes we're using for LIKE
queries, the existing approach wouldn't work either. All we're doing
is translating:
a LIKE 'foo/%'
to
a ~>=~ 'foo/' AND a ~<~ 'foo0'
...where ~>=~ and ~<~ are just text-pattern-ops versions of >= and <
that ignore the normal collation rules and just compare bytes.
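The bytewise comparison behind those operators can be demonstrated directly (a sketch, not the server code; strcmp compares as unsigned bytes, which is exactly the text-pattern-ops ordering):

```c
#include <string.h>

/* Bytewise "text_pattern_ops"-style range test: every string that
 * begins with "foo/" compares >= "foo/" and < "foo0" under unsigned
 * byte order, because '0' (0x30) is the byte immediately after
 * '/' (0x2F).  Locale rules (which would interleave foo0 between
 * foo/ and foo/0) are deliberately ignored. */
static int
in_like_range(const char *s)
{
    return strcmp(s, "foo/") >= 0 && strcmp(s, "foo0") < 0;
}
```

This is why the contiguity problem Greg describes does not arise for the index opclasses LIKE optimization actually uses.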
In general, if we wanted to get rid of text_pattern_ops and make all
of this work with arbitrary indexes, yeah, that would be very
difficult.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Sep 22, 2011 at 8:59 AM, Greg Stark <stark@mit.edu> wrote:
But the whole problem is that not all the strings with the initial
substring are in a contiguous block.
If that were true for the sorts of indexes we're using for LIKE
queries, the existing approach wouldn't work either.
Right. Since it's not a problem for the sorts of indexes with which we
can use LIKE, moving knowledge of LIKE into the btree machinery doesn't
buy us a darn thing, except more complexity in a place where we can ill
afford it. The essential problem here is "when can you stop scanning,
given a pattern with this prefix?", and btree doesn't know any more
about that than make_greater_string does; it would in fact have to use
make_greater_string or something isomorphic to it.
regards, tom lane
On Thu, Sep 22, 2011 at 2:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The essential problem here is "when can you stop scanning,
given a pattern with this prefix?", and btree doesn't know any more
about that than make_greater_string does; it would in fact have to use
make_greater_string or something isomorphic to it.
Hm, as long as btree_pattern_ops is the only opclass that behaves this
way that's more or less true. But Robert's right that if btree just
stops when it finds something that doesn't match it doesn't need to
hard code any knowledge of what the "next" value would be. If there
were any other op classes that had this abstract property of always
putting strings with common prefixes in a contiguous block then it
would continue to work without having to know where to find the
boundaries of that contiguous block.
Just as an example, if you had a pattern_ops opclass that sorted the
string assuming it was in some other encoding like, say, EBCDIC, then
make_greater_string would have to learn about it but Robert's model
would just work.
This isn't entirely facetious. Sorting by EBCDIC ordering would be
silly, but I vaguely recall there being some examples that wouldn't
be. And perhaps some collations could actually be marked as
acceptable even if they don't sort in pure ASCII ordering and
make_greater_string doesn't actually know about them.
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Sep 22, 2011 at 12:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Now, having said that, I'm starting to wonder again why it's worth our
trouble to fool with encoding-specific incrementers. The exactness of
the estimates seems unlikely to be improved very much by doing this.
Well, so the problem is that the frequency with which the algorithm
fails altogether seems to be disturbingly high for certain kinds of
characters. I agree it might not be that important to get the
absolutely best next string, but it does seem important not to fail
outright. Kyotaro Horiguchi gives the example of UTF-8 characters
ending with 0xbf.
[ thinks for a bit ... ] Yeah, it's certainly true that such a
character might be relatively small in the overall sort order. The
assumption underlying what we're doing now is that dropping the last
character and incrementing the next-to-last one instead isn't terribly
catastrophic from an estimation accuracy standpoint. I can see that
there are cases where that would fail to be true, but I'm not exactly
convinced that they're worse than all the other cases where we'll get
a poor estimate.
Anyway, I won't stand in the way of the patch as long as it's modified
to limit the number of values considered for any one character position
to something reasonably small.
regards, tom lane
Greg Stark <stark@mit.edu> writes:
On Thu, Sep 22, 2011 at 2:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The essential problem here is "when can you stop scanning,
given a pattern with this prefix?", and btree doesn't know any more
about that than make_greater_string does; it would in fact have to use
make_greater_string or something isomorphic to it.
Hm, as long as btree_pattern_ops is the only opclass that behaves this
way that's more or less true. But Robert's right that if btree just
stops when it finds something that doesn't match it doesn't need to
hard code any knowledge of what the "next" value would be.
But you've added mechanism (and hence cycles) to btree searches,
and *you haven't actually gained anything*. If the feature is
restricted to only work for sort orderings in which common-prefix
strings are contiguous, then it doesn't do anything we can't do
just as well with the existing mechanism. Moreover, you'll still
need make_greater_string because of the problem of trying to extract
LIKE selectivity estimates from locale-dependent pg_statistic data.
regards, tom lane
On Thu, Sep 22, 2011 at 10:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Anyway, I won't stand in the way of the patch as long as it's modified
to limit the number of values considered for any one character position
to something reasonably small.
One thing I was thinking about is that it would be useful to have some
metric for judging how well any given algorithm that we might pick
here actually works. For example, if we were to try all possible
three character strings in some encoding and run make_greater_string()
on each one of them, we could then measure the failure percentage. Or
if that's too many cases to crank through then we could limit it some
way - but the point is, without some kind of test harness here, we
have no way of measuring the trade-off between spending more CPU time
and improving accuracy. Maybe you have a better feeling for what's
reasonable there than I do, but I'm not prepared to take a stab in the
dark without benefit of some real measurements.
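A minimal harness along those lines might look like this, under the assumption that we measure the naive last-byte bump against a small UTF-8 validator (utf8_encode, utf8_valid, and count_generic_failures are hypothetical names; pg_verifymbstr is only approximated here):

```c
/* Encode cp as UTF-8 into buf; returns the encoded length (1..4). */
static int
utf8_encode(unsigned int cp, unsigned char *buf)
{
    if (cp <= 0x7F)   { buf[0] = cp; return 1; }
    if (cp <= 0x7FF)  { buf[0] = 0xC0 | (cp >> 6);
                        buf[1] = 0x80 | (cp & 0x3F); return 2; }
    if (cp <= 0xFFFF) { buf[0] = 0xE0 | (cp >> 12);
                        buf[1] = 0x80 | ((cp >> 6) & 0x3F);
                        buf[2] = 0x80 | (cp & 0x3F); return 3; }
    buf[0] = 0xF0 | (cp >> 18);
    buf[1] = 0x80 | ((cp >> 12) & 0x3F);
    buf[2] = 0x80 | ((cp >> 6) & 0x3F);
    buf[3] = 0x80 | (cp & 0x3F);
    return 4;
}

/* Tiny validity check (rejects bad continuations, overlongs,
 * surrogates, and beyond-U+10FFFF leads). */
static int
utf8_valid(const unsigned char *b, int len)
{
    if (len == 1)
        return b[0] <= 0x7F;
    if (b[1] < 0x80 || b[1] > 0xBF) return 0;
    if (len == 2)
        return b[0] >= 0xC2 && b[0] <= 0xDF;
    if (b[2] < 0x80 || b[2] > 0xBF) return 0;
    if (len == 3)
        return b[0] >= 0xE0 && b[0] <= 0xEF &&
               !(b[0] == 0xE0 && b[1] < 0xA0) &&   /* overlong */
               !(b[0] == 0xED && b[1] > 0x9F);     /* surrogate */
    if (b[3] < 0x80 || b[3] > 0xBF) return 0;
    return b[0] >= 0xF0 && b[0] <= 0xF4 &&
           !(b[0] == 0xF0 && b[1] < 0x90) &&
           !(b[0] == 0xF4 && b[1] > 0x8F);
}

/* Count code points in [lo, hi] for which naively bumping the last
 * encoded byte -- the generic strategy -- yields invalid UTF-8. */
static int
count_generic_failures(unsigned int lo, unsigned int hi)
{
    unsigned int cp;
    int fails = 0;

    for (cp = lo; cp <= hi; cp++)
    {
        unsigned char buf[4];
        int len;

        if (cp >= 0xD800 && cp <= 0xDFFF)
            continue;           /* not encodable code points */
        len = utf8_encode(cp, buf);
        buf[len - 1]++;         /* the "generic" last-byte bump */
        if (!utf8_valid(buf, len))
            fails++;
    }
    return fails;
}
```

For instance, in the two-byte range the bump fails exactly for characters whose final byte is 0xBF (one in 64), which matches the Japanese-character failure pattern cited earlier in the thread.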
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
One thing I was thinking about is that it would be useful to have some
metric for judging how well any given algorithm that we might pick
here actually works.
Well, the metric that we were indirectly using earlier was the
number of characters in a given locale for which the algorithm
fails to find a greater one (excluding whichever character is "last",
I guess, or you could just recognize there's always at least one).
For example, if we were to try all possible
three character strings in some encoding and run make_greater_string()
on each one of them, we could then measure the failure percentage. Or
if that's too many cases to crank through then we could limit it some
way -
Even in UTF8 there's only a couple million assigned code points, so for
test purposes anyway it doesn't seem like we couldn't crank through them
all. Also, in many cases you could probably figure it out by analysis
instead of brute-force testing every case.
A more reasonable objection might be that a whole lot of those code
points are things nobody cares about, and so we need to weight the
results somehow by the actual popularity of the character. Not sure
how to take that into account.
Another issue here is that we need to consider not just whether we find
a greater character, but "how much greater" it is. This would apply to
my suggestion of incrementing the top byte without considering
lower-order bytes --- we'd be skipping quite a lot of code space for
each increment, and it's conceivable that that would be quite hurtful in
some cases. Not sure how to account for that either. An extreme
example here is an "incrementer" that just immediately returns the last
character in the sort order for any lesser input.
regards, tom lane
On Thu, Sep 22, 2011 at 11:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Well, the metric that we were indirectly using earlier was the
number of characters in a given locale for which the algorithm
fails to find a greater one (excluding whichever character is "last",
I guess, or you could just recognize there's always at least one).
What about characters that sort differently in sequence than individually?
For example, if we were to try all possible
three character strings in some encoding and run make_greater_string()
on each one of them, we could then measure the failure percentage. Or
if that's too many cases to crank through then we could limit it some
way -
Even in UTF8 there's only a couple million assigned code points, so for
test purposes anyway it doesn't seem like we couldn't crank through them
all. Also, in many cases you could probably figure it out by analysis
instead of brute-force testing every case.
A more reasonable objection might be that a whole lot of those code
points are things nobody cares about, and so we need to weight the
results somehow by the actual popularity of the character. Not sure
how to take that into account.
I guess whether that's a problem in practice will depend somewhat on
the quality of the algorithms we're able to find. If our best
algorithms still have a 1% failure rate, then yeah, that's an issue,
but in that case I'd suggest that our best algorithms suck and we need
to think harder about alternate solutions. If we're talking about
failing on 5 characters out of a million we can just eyeball them.
I'm not trying to reduce this testing to something that is entirely
mechanical in every way; I'm just saying that I'm not optimistic about
my ability to judge which algorithms will work best in practice
without some kind of automated aid.
Another issue here is that we need to consider not just whether we find
a greater character, but "how much greater" it is. This would apply to
my suggestion of incrementing the top byte without considering
lower-order bytes --- we'd be skipping quite a lot of code space for
each increment, and it's conceivable that that would be quite hurtful in
some cases. Not sure how to account for that either. An extreme
example here is an "incrementer" that just immediately returns the last
character in the sort order for any lesser input.
Right... well, this is why I'm not wild about doing this by
incrementing in the first place.
But now that I think about it, what about using some
slightly-less-stupid version of that approach as a fallback strategy?
For example, we could pick, oh, say, 20 characters out of the space of
code points, about evenly distributed under whatever collations we
think are likely to be in use. In the incrementer, we try some kind
of increment-the-bytes strategy for a while and if it doesn't pan out,
we zip through the array and try substituting each of the fallback
characters. If more than one works, we test the survivors against
each other until we're left with just one winner. The bound might not
be real tight, but as long as it's good enough to make the planner
pick an index scan it might not matter very much.
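A sketch of that fallback (hypothetical: fallback_greater and the probe table are illustrative, and strcmp stands in for the real locale-aware comparison): substitute each probe for the final character and keep the smallest substitute that still compares greater than the original prefix.

```c
#include <string.h>

/* Fallback strategy sketch: when incrementing the last character
 * fails, try a handful of probe characters assumed to be roughly
 * evenly spread across the sort order.  Writes the smallest probing
 * result that compares greater than prefix into out; returns 1 on
 * success, 0 if no probe works. */
static int
fallback_greater(const char *prefix, char *out, size_t outlen)
{
    static const char probes[] = { '0', 'A', 'a', 'n', 'z', '~' };
    size_t n = strlen(prefix);
    char cand[256], best[256];
    int have = 0;
    size_t i;

    if (n == 0 || n + 1 > sizeof(cand) || n + 1 > outlen)
        return 0;

    for (i = 0; i < sizeof(probes); i++)
    {
        memcpy(cand, prefix, n);
        cand[n - 1] = probes[i];    /* substitute the final character */
        cand[n] = '\0';
        /* keep the tightest (smallest) candidate that is still greater */
        if (strcmp(cand, prefix) > 0 &&
            (!have || strcmp(cand, best) < 0))
        {
            strcpy(best, cand);
            have = 1;
        }
    }
    if (!have)
        return 0;
    strcpy(out, best);
    return 1;
}
```

The bound is loose whenever the winning probe is far above the true successor, but as noted above, a loose upper bound may still be enough to let the planner choose an index scan.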
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company