Doing better at HINTing an appropriate column within errorMissingColumn()
With the addition of LATERAL subqueries, Tom fixed up the mechanism
for keeping track of which relations are visible for column references
while the FROM clause is being scanned. That allowed
errorMissingColumn() to give a more useful error than the one produced
by the prior coding of that mechanism, with an errhint sometimes
proffering: 'There is a column named "foo" in table "bar", but it
cannot be referenced from this part of the query'.
I wondered how much further this could be taken. Attached patch
modifies contrib/fuzzystrmatch, moving its Levenshtein distance code
into core without actually moving the relevant SQL functions too. That
change allowed me to modify errorMissingColumn() to make more useful
suggestions as to what might have been intended under other
circumstances, like when someone fat-fingers a column name. psql tab
completion is good, but not so good that this doesn't happen all the
time. It's good practice to consistently name columns and tables such
that it's possible to intuit the names of columns from the names of
tables and so on, but it's still pretty common to forget if a column
name from the table "orders" is "order_id", "orderid", or "ordersid",
particularly if you're someone who regularly interacts with many
databases. This problem is annoying in a low intensity kind of way.
Consider the following sample sessions of mine, made with the
dellstore2 sample database:
[local]/postgres=# select * from orders o join orderlines ol on
o.orderid = ol.orderids limit 1;
ERROR: 42703: column ol.orderids does not exist
LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid...
^
HINT: Perhaps you meant to reference the column "ol"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:2989
[local]/postgres=# select * from orders o join orderlines ol on
o.orderid = ol.orderid limit 1;
 orderid | orderdate  | customerid | netamount |  tax  | totalamount | orderlineid | orderid | prod_id | quantity | orderdate
---------+------------+------------+-----------+-------+-------------+-------------+---------+---------+----------+------------
       1 | 2004-01-27 |       7888 |    313.24 | 25.84 |      339.08 |           1 |       1 |    9117 |        1 | 2004-01-27
(1 row)
[local]/postgres=# select ordersid from orders o join orderlines ol on
o.orderid = ol.orderid limit 1;
ERROR: 42703: column "ordersid" does not exist
LINE 1: select ordersid from orders o join orderlines ol on o.orderi...
^
HINT: Perhaps you meant to reference the column "o"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:2999
[local]/postgres=# select ol.ordersid from orders o join orderlines ol
on o.orderid = ol.orderid limit 1;
ERROR: 42703: column ol.ordersid does not exist
LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord...
^
HINT: Perhaps you meant to reference the column "ol"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:2989
We try to give the most useful possible HINT here, charging extra for
a non-matching alias, and going through the range table in order and
preferring the first column observed to any subsequent column whose
name is of the same distance as an earlier Var. The fuzzy string
matching works well enough that it seems possible in practice to
successfully have the parser make the right suggestion, even when the
user's original guess was fairly far off. I've found it works best to
charge half as much for a character deletion, so that's what is
charged.
I have some outstanding concerns about the proposed patch:
* Dense logosyllabic or morphographic writing systems (Kanji, for
example) might consistently present, say, Japanese
users with a suggestion that just isn't very useful, to the point of
being annoying. Perhaps some Japanese hackers can comment on the
actual risks here.
* Perhaps I should have moved the Levenshtein distance functions into
core and be done with it. I thought that given the present restriction
that the implementation imposes on source and target string lengths,
it would be best to leave the user-facing SQL functions in contrib.
That restriction is not relevant to the internal use of Levenshtein
distance added here, though.
Thoughts?
--
Peter Geoghegan
Hello
I see only one risk - it can slow down exception processing somewhat.
Sometimes you can have code like
BEGIN
  WHILE ...
  LOOP
    BEGIN
      INSERT INTO ...;
    EXCEPTION WHEN others THEN
      NULL; /* ignore this error */
    END;
  END LOOP;
END;
Aside from this risk, the proposed feature is nice, but it should be fast.
Regards
Pavel
2014-03-27 20:10 GMT+01:00 Peter Geoghegan <pg@heroku.com>:
On Fri, Mar 28, 2014 at 1:00 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I see only one risk - it can do some slowdown of exception processing.
I think it's unlikely that you'd see ERRCODE_UNDEFINED_COLUMN in
procedural code like that in practice. In any case it's worth noting
that I continually pass back a "max" to the Levenshtein distance
implementation, which is the current shortest distance observed. The
implementation is therefore not obliged to exhaustively find a
distance that is already known to be of no use. See commit 604ab0.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
2014-03-28 9:22 GMT+01:00 Peter Geoghegan <pg@heroku.com>:
if it is related to ERRCODE_UNDEFINED_COLUMN then it should be ok (from
performance perspective)
but a second issue could be usage from PL/pgSQL, where SQL identifiers
and plpgsql variables are mixed.
Pavel
--
Peter Geoghegan
Very interesting idea; I'd think about optionally adding similarity
hinting support to psql tab completion. With, say, 80% similarity
matching, it shouldn't be very annoying. For interactive usage there is
no risk of slowdown.
On Mar 27, 2014 11:11 PM, "Peter Geoghegan" <pg@heroku.com> wrote:
Peter Geoghegan wrote:
With the addition of LATERAL subqueries, Tom fixed up the mechanism
for keeping track of which relations are visible for column references
while the FROM clause is being scanned. That allowed
errorMissingColumn() to give a more useful error to the one produced
by the prior coding of that mechanism, with an errhint sometimes
proffering: 'There is a column named "foo" in table "bar", but it
cannot be referenced from this part of the query'.
I wondered how much further this could be taken. Attached patch
modifies contrib/fuzzystrmatch, moving its Levenshtein distance code
into core without actually moving the relevant SQL functions too. That
change allowed me to modify errorMissingColumn() to make more useful
suggestions as to what might have been intended under other
circumstances, like when someone fat-fingers a column name.
[local]/postgres=# select * from orders o join orderlines ol on o.orderid = ol.orderids limit 1;
ERROR: 42703: column ol.orderids does not exist
LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid...
^
HINT: Perhaps you meant to reference the column "ol"."orderid".
This sounds like a mild version of DWIM:
http://www.jargondb.org/glossary/dwim
Maybe it is just me, but I get uncomfortable when a program tries
to second-guess what I really want.
Yours,
Laurenz Albe
Re: Albe Laurenz 2014-03-28 <A737B7A37273E048B164557ADEF4A58B17CE8DEA@ntex2010i.host.magwien.gv.at>
ERROR: 42703: column ol.orderids does not exist
LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid...
^
HINT: Perhaps you meant to reference the column "ol"."orderid".
This sounds like a mild version of DWIM:
http://www.jargondb.org/glossary/dwim
Maybe it is just me, but I get uncomfortable when a program tries
to second-guess what I really want.
I find it very annoying when zsh asks me "did you mean foo [y/n]" and
I need to confirm that, but I'd find a mere HINT that I can easily
ignore a very useful feature. +1 for the idea.
Christoph
--
cb@df7cb.de | http://www.df7cb.de/
On Fri, Mar 28, 2014 at 4:10 AM, Peter Geoghegan <pg@heroku.com> wrote:
What about the overhead that this processing creates if error
processing needs to scan a schema with, let's say, hundreds of tables?
* It may be the case that dense logosyllabic or morphographic writing
systems, for example Kanji might consistently present, say, Japanese
users with a suggestion that just isn't very useful, to the point of
being annoying. Perhaps some Japanese hackers can comment on the
actual risks here.
As long as Hiragana-only words (basic alphabet for Japanese words),
and more particularly Katakana only-words (to write phonetically
foreign words) are compared (even Kanji-only things compared),
Levenshtein could play its role pretty well. But once a comparison is
made between two words using different alphabets, well, Levenshtein is
not going to work well. A simple example is 'ramen' (Japanese noodles),
which you can find written sometimes in Hiragana, or even in Katakana,
and here Levenshtein performs poorly:
=# select levenshtein('ラーメン', 'らあめん');
levenshtein
-------------
4
(1 row)
Regards,
--
Michael
On Fri, Mar 28, 2014 at 5:57 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
What about the overhead that this processing creates if error
processing needs to scan a schema with let's say hundreds of tables?
It doesn't work that way. I've extended searchRangeTableForCol() so
that when it calls scanRTEForColumn(), it considers Levenshtein
distance, and not just plain string equality, which is what happens
today. The code only looks through ParseState.
--
Peter Geoghegan
On Fri, Mar 28, 2014 at 4:47 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
This sounds like a mild version of DWIM:
http://www.jargondb.org/glossary/dwim
Maybe it is just me, but I get uncomfortable when a program tries
to second-guess what I really want.
It's not really DWIM, because the backend is still throwing an error.
It's just trying to help you sort out the error, along the way.
Still, I share some of your discomfort. I see Peter's patch as an
example of a broader class of things that we could do - but I'm not
altogether sure that we want to do them. There's a risk of adding not
only CPU cycles but also clutter. If we do things that encourage
people to crank the log verbosity down, I think that's going to be bad
more often than it's good. It strains credulity to think that this
patch alone would have that effect, but there might be quite a few
similar improvements that are possible. So I think it would be good
to consider how far we want to go in this direction and where we think
we might want to stop. That's not to say, let's not ever do this,
just, let's think carefully about where we want to end up.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Apr 1, 2014 at 7:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
There's a risk of adding not
only CPU cycles but also clutter. If we do things that encourage
people to crank the log verbosity down, I think that's going to be bad
more often than it's good.
While I share your concern here, I think that this is something that
is only likely to be seen in an interactive psql session, where it
occurs quite frequently. I am reasonably confident that it's highly
unusual to see ERRCODE_UNDEFINED_COLUMN in other settings. Not having
to do a mental context switch when writing an ad-hoc query has
considerable value. Even C compilers like Clang have this kind of
feedback. This is a patch that was written out of personal
frustration with the experience of interacting with many different
databases. Things like the Python REPL don't do so much of this kind
of thing, but presumably that's because of Python's dynamic typing.
This is a HINT that can be given with fairly high confidence that
it'll be helpful - there just won't be that many things that the user
could have meant to choose from. I think it's even useful when the
suggested column is distant from the original suggestion (i.e.
errorMissingColumn() offers only what is clearly a "wild guess"),
because then the user knows that he or she has got it quite wrong.
Frequently, this will be because the wrong synonym for what should
have been written was used.
It strains credulity to think that this
patch alone would have that effect, but there might be quite a few
similar improvements that are possible. So I think it would be good
to consider how far we want to go in this direction and where we think
we might want to stop. That's not to say, let's not ever do this,
just, let's think carefully about where we want to end up.
Fair enough.
--
Peter Geoghegan
On 4/1/14, 1:04 PM, Peter Geoghegan wrote:
It strains credulity to think that this
patch alone would have that effect, but there might be quite a few
similar improvements that are possible. So I think it would be good
to consider how far we want to go in this direction and where we think
we might want to stop. That's not to say, let's not ever do this,
just, let's think carefully about where we want to end up.
Fair enough.
I agree with the concern, but also have to say that I can't count how many times I could have used this. A big +1, at least in this case.
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
Normally I'm not for adding gucs that just gate new features. But I think a
simple guc to turn this on or off would be fine and alleviate any concerns.
I think users would appreciate it quite a lot.
It would even have a positive effect of helping raise awareness of the
feature. I often scan the list of config options to get an idea of new
features when I'm installing new software or upgrading.
--
greg
On 1 Apr 2014 17:38, "Jim Nasby" <jim@nasby.net> wrote:
On 2014-04-02 21:08:47 +0100, Greg Stark wrote:
Normally I'm not for adding gucs that just gate new features. But I think a
simple guc to turn this on or off would be fine and alleviate any concerns.
I think users would appreciate it quite a lot
I don't have strong feelings about the feature, but introducing a guc
for it feels entirely ridiculous to me. This is a minor detail in an
error message, not more.
It would even have a positive effect of helping raise awareness of the
feature. I often scan the list of config options to get an idea of new
features when I'm installing new software or upgrading.
Really? Should we now add GUCs for every feature then?
Greetings,
Andres Freund
PS: Could you please start to properly quote again? You seem to have
stopped doing that entirely in the last few months.
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Apr 2, 2014 at 4:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:
I don't have strong feelings about the feature, but introducing a guc
for it feels entirely ridiculous to me. This is a minor detail in an
error message, not more.
I agree. It's just a HINT. It's quite helpful in certain particular
contexts, but in the grand scheme of things isn't all that important.
I am being quite conservative in trying to anticipate cases where on
balance it'll actually hurt more than it will help. I doubt that there
actually are any.
--
Peter Geoghegan
On Wed, Apr 2, 2014 at 4:16 PM, Andres Freund <andres@2ndquadrant.com>wrote:
PS: Could you please start to properly quote again? You seem to have
stopped doing that entirely in the last few months.
I've been responding a lot from the phone. Unfortunately the Gmail client
on the phone makes it nearly impossible to format messages well. I'm
beginning to think it would be better to just not quote at all any more.
I'm normally not doing a point-by-point response anyways.
--
greg
On 2014-04-03 00:48:12 -0400, Greg Stark wrote:
On Wed, Apr 2, 2014 at 4:16 PM, Andres Freund <andres@2ndquadrant.com>wrote:
PS: Could you please start to properly quote again? You seem to have
stopped doing that entirely in the last few months.
I've been responding a lot from the phone. Unfortunately the Gmail client
on the phone makes it nearly impossible to format messages well. I'm
beginning to think it would be better to just not quote at all any more.
I'm normally not doing a point-by-point response anyways.
I really don't care where you're answering from TBH. It's unreadable,
misses context and that's it. If $device doesn't work for you, don't use
it.
I don't mind an occasional quick answer that's badly formatted, but for
other things it's really annoying.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 04/02/2014 01:16 PM, Andres Freund wrote:
On 2014-04-02 21:08:47 +0100, Greg Stark wrote:
Normally I'm not for adding gucs that just gate new features. But I think a
simple guc to turn this on or off would be fine and alleviate any concerns.
I think users would appreciate it quite a lot
I don't have strong feelings about the feature, but introducing a guc
for it feels entirely ridiculous to me. This is a minor detail in an
error message, not more.
It would even have a positive effect of helping raise awareness of the
feature. I often scan the list of config options to get an idea of new
features when I'm installing new software or upgrading.
Really? Should we now add GUCs for every feature then?
-1 for having a GUC for this.
+1 on the feature.
Review with functional test coming up.
Question: How should we handle the issues with East Asian languages
(i.e. Japanese, Chinese) and this Hint? Should we just avoid hinting
for a selected list of languages which don't work well with levenshtein?
If so, how do we get that list?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Mon, Jun 16, 2014 at 4:04 PM, Josh Berkus <josh@agliodbs.com> wrote:
Question: How should we handle the issues with East Asian languages
(i.e. Japanese, Chinese) and this Hint? Should we just avoid hinting
for a selected list of languages which don't work well with levenshtein?
If so, how do we get that list?
I think that how useful Levenshtein distance is for users based in
East Asia generally, and how useful this patch is to those users are
two distinct questions. I have no idea how common it is for Japanese
users to just use Roman characters as table and attribute names. Since
they're very probably already writing application code that uses Roman
characters (except in the comments, user strings and so on), it might
make sense to do the same in the database. I would welcome further
input on that question. I don't know what the trends are in the real
world.
Also note that the patch scans the range table parse state to pick the
most probable candidate among all Vars/columns that already appear
there. The query would raise an error at an earlier point if a
non-existent relation was referenced, for example. We're only choosing
from a minimal list of possibilities, picking one that is very
probably what was intended. Even if Levenshtein distance works badly
with Kanji (which is not obviously the case, at least to me), it might
not matter here.
--
Peter Geoghegan
On 14/06/17 8:31, Peter Geoghegan wrote:
On Mon, Jun 16, 2014 at 4:04 PM, Josh Berkus <josh@agliodbs.com> wrote:
Question: How should we handle the issues with East Asian languages
(i.e. Japanese, Chinese) and this Hint? Should we just avoid hinting
for a selected list of languages which don't work well with levenshtein?
If so, how do we get that list?
I think that how useful Levenshtein distance is for users based in
east Asia generally, and how useful this patch is to those users are
two distinct questions. I have no idea how common it is for Japanese
users to just use Roman characters as table and attribute names. Since
they're very probably already writing application code that uses Roman
characters (except in the comments, user strings and so on), it might
make sense to do the same in the database. I would welcome further
input on that question. I don't know what the trends are in the real
world.
From what I've seen in the wild in Japan, Roman/ASCII characters are
widely used for object/attribute names, as generally it's much less
hassle than switching between input methods, dealing with different
encodings etc. The only place where I've seen Japanese characters widely
used is in tutorials, examples etc. However that's only my personal
observation for one particular non-Roman language.
Regards
Ian Barwick
--
Ian Barwick http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jun 17, 2014 at 9:30 AM, Ian Barwick <ian@2ndquadrant.com> wrote:
From what I've seen in the wild in Japan, Roman/ASCII characters are
widely used for object/attribute names, as generally it's much less
hassle than switching between input methods, dealing with different
encodings etc. The only place where I've seen Japanese characters widely
used is in tutorials, examples etc. However that's only my personal
observation for one particular non-Roman language.
And I agree with this remark; it's a PITA to manage database object
names with Japanese characters directly. I have seen some applications
define objects that way in the past, though not *that* many, I concur.
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
On Tue, Jun 17, 2014 at 9:30 AM, Ian Barwick <ian@2ndquadrant.com> wrote:
From what I've seen in the wild in Japan, Roman/ASCII characters are
widely used for object/attribute names, as generally it's much less
hassle than switching between input methods, dealing with different
encodings etc. The only place where I've seen Japanese characters widely
used is in tutorials, examples etc. However that's only my personal
observation for one particular non-Roman language.
And I agree to this remark, that's a PITA to manage database object
names with Japanese characters directly. I have ever seen some
applications using such ways to define objects though in the past, not
*that* many I concur..
What exactly is the rationale for thinking that Levenshtein distance is
useless in non-Roman alphabets? AFAIK it just counts insertions and
deletions of characters, which seems like a concept rather independent
of what those characters are.
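For reference, the textbook dynamic-programming formulation (a Python sketch for illustration, not the fuzzystrmatch code) charges one unit per insertion, deletion, or substitution, and since it compares whole characters it is indeed indifferent to which script they come from:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic DP over code points: cost 1 per insertion, deletion,
    # or substitution, regardless of whether the characters are
    # ASCII, kana, or kanji.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

print(levenshtein("orderids", "orderid"))  # 1
print(levenshtein("成長", "清聴"))           # 2: both characters differ
```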
regards, tom lane
On 14/06/17 9:53, Tom Lane wrote:
Michael Paquier <michael.paquier@gmail.com> writes:
On Tue, Jun 17, 2014 at 9:30 AM, Ian Barwick <ian@2ndquadrant.com> wrote:
From what I've seen in the wild in Japan, Roman/ASCII characters are
widely used for object/attribute names, as generally it's much less
hassle than switching between input methods, dealing with different
encodings etc. The only place where I've seen Japanese characters widely
used is in tutorials, examples etc. However that's only my personal
observation for one particular non-Roman language.
And I agree to this remark, that's a PITA to manage database object
names with Japanese characters directly. I have ever seen some
applications using such ways to define objects though in the past, not
*that* many I concur..
What exactly is the rationale for thinking that Levenshtein distance is
useless in non-Roman alphabets? AFAIK it just counts insertions and
deletions of characters, which seems like a concept rather independent
of what those characters are.
With Japanese (which doesn't have an alphabet, but two syllabaries and
a bunch of logographic characters), Levenshtein distance is pretty useless
for examining similarities with words which can be written in either
syllabary (Michael's "ramen" example earlier in the thread); and when
catching "typos" caused by erroneous conversion from phonetic input to
characters - e.g. intending to input "成長" (seichou, growth) but
accidentally selecting "清聴" (seichou, courteous attention).
However in this particular use case, as long as it doesn't produce false
positives (I haven't looked at the patch) I don't think it would cause
any problems (of the kind which would require actively excluding certain
languages/character sets), it just wouldn't be quite as useful.
Regards
Ian Barwick
--
Ian Barwick http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jun 16, 2014 at 7:09 PM, Ian Barwick <ian@2ndquadrant.com> wrote:
However in this particular use case, as long as it doesn't produce false
positives (I haven't looked at the patch) I don't think it would cause
any problems (of the kind which would require actively excluding certain
languages/character sets), it just wouldn't be quite as useful.
I'm not sure what you mean by false positives. The patch just shows a
HINT, where before there was none. It's possible for any number of
reasons that it isn't the most useful possible suggestion, since
Levenshtein distance is used as opposed to any other scheme that might
be better sometimes. I think that the hint given is a generally useful
piece of information in the event of an ERRCODE_UNDEFINED_COLUMN
error. Obviously I think the patch is worthwhile, but fundamentally
the HINT given is just a guess, as with the existing HINTs.
--
Peter Geoghegan
On 14/06/17 11:57, Peter Geoghegan wrote:
On Mon, Jun 16, 2014 at 7:09 PM, Ian Barwick <ian@2ndquadrant.com> wrote:
However in this particular use case, as long as it doesn't produce false
positives (I haven't looked at the patch) I don't think it would cause
any problems (of the kind which would require actively excluding certain
languages/character sets), it just wouldn't be quite as useful.
I'm not sure what you mean by false positives. The patch just shows a
HINT, where before there was none. It's possible for any number of
reasons that it isn't the most useful possible suggestion, since
Levenshtein distance is used as opposed to any other scheme that might
be better sometimes. I think that the hint given is a generally useful
piece of information in the event of an ERRCODE_UNDEFINED_COLUMN
error. Obviously I think the patch is worthwhile, but fundamentally
the HINT given is just a guess, as with the existing HINTs.
I mean, does it come up with a suggestion in every case, even if there is
no remotely similar column? E.g. would
SELECT foo FROM some_table
bring up column "bar" as a suggestion if "bar" is the only column in
the table?
Anyway, is there an up-to-date version of the patch available? The one from
March doesn't seem to apply cleanly to HEAD.
Thanks
Ian Barwick
--
Ian Barwick http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Peter Geoghegan <pg@heroku.com> writes:
On Mon, Jun 16, 2014 at 7:09 PM, Ian Barwick <ian@2ndquadrant.com> wrote:
However in this particular use case, as long as it doesn't produce false
positives (I haven't looked at the patch) I don't think it would cause
any problems (of the kind which would require actively excluding certain
languages/character sets), it just wouldn't be quite as useful.
I'm not sure what you mean by false positives. The patch just shows a
HINT, where before there was none. It's possible for any number of
reasons that it isn't the most useful possible suggestion, since
Levenshtein distance is used as opposed to any other scheme that might
be better sometimes. I think that the hint given is a generally useful
piece of information in the event of an ERRCODE_UNDEFINED_COLUMN
error. Obviously I think the patch is worthwhile, but fundamentally
the HINT given is just a guess, as with the existing HINTs.
Not having looked at the patch, but: I think the probability of
useless-noise HINTs could be substantially reduced if the code prints a
HINT only when there is a single available alternative that is clearly
better than the others in Levenshtein distance. I'm not sure how much
better is "clearly better", but I exclude "zero" from that. I see that
the original description of the patch says that it will arbitrarily
choose one alternative when there are several with equal Levenshtein
distance, and I'd say that's a bad idea.
You could possibly answer this objection by making the HINT list *all*
the alternatives meeting the minimum Levenshtein distance. But I think
that's probably overcomplicated and of uncertain value anyhow. I'd rather
have a rule that "we print only the choice that is at least K units better
than any other choice", where K remains to be determined exactly.
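That rule could be sketched as follows (a hypothetical illustration only; `pick_hint` and its inputs are invented names, not anything from the patch):

```python
def pick_hint(scored, K=2):
    """scored: list of (column_name, levenshtein_distance) pairs for
    all visible columns. Suggest the single name that is at least K
    units closer than every other candidate; otherwise suggest
    nothing, rather than picking arbitrarily among near-ties."""
    if not scored:
        return None
    ranked = sorted(scored, key=lambda pair: pair[1])
    if len(ranked) == 1 or ranked[1][1] - ranked[0][1] >= K:
        return ranked[0][0]
    return None

# A clear winner gets suggested; an exact tie yields no hint at all.
print(pick_hint([("orderid", 1), ("orderdate", 4)]))    # orderid
print(pick_hint([("orderid", 2), ("orderlineid", 2)]))  # None
```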
regards, tom lane
On Mon, Jun 16, 2014 at 8:38 PM, Ian Barwick <ian@2ndquadrant.com> wrote:
I mean, does it come up with a suggestion in every case, even if there is
no remotely similar column? E.g. would
SELECT foo FROM some_table
bring up column "bar" as a suggestion if "bar" is the only column in
the table?
Yes, it would, but I think that's the correct behavior.
Anyway, is there an up-to-date version of the patch available? The one from
March doesn't seem to apply cleanly to HEAD.
Are you sure? I think it might just be that patch is confused about
the deleted file contrib/fuzzystrmatch/levenshtein.c.
--
Peter Geoghegan
On Mon, Jun 16, 2014 at 8:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Not having looked at the patch, but: I think the probability of
useless-noise HINTs could be substantially reduced if the code prints a
HINT only when there is a single available alternative that is clearly
better than the others in Levenshtein distance. I'm not sure how much
better is "clearly better", but I exclude "zero" from that. I see that
the original description of the patch says that it will arbitrarily
choose one alternative when there are several with equal Levenshtein
distance, and I'd say that's a bad idea.
I disagree. I happen to think that making some guess is better than no
guess at all here, given the fact that there aren't too many
possibilities to choose from. I think that it might be particularly
annoying to not show some suggestion in the event of a would-be
ambiguous column reference where the column name is itself wrong,
since both mistakes are common. For example, "order_id" was specified
instead of one of either "o.orderid" or "ol.orderid", as in my
original examples. If some correct alias was specified, that would
make the new code prefer the appropriate Var, but it might not be, and
that should be okay in my view.
I'm not trying to remove the need for human judgement here. We've all
heard stories about people who did things like input "Portland" into
their GPS only to end up in Maine rather than Oregon, but I think in
general you can only go so far in worrying about those cases.
--
Peter Geoghegan
On Tue, Jun 17, 2014 at 12:51 AM, Peter Geoghegan <pg@heroku.com> wrote:
On Mon, Jun 16, 2014 at 8:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Not having looked at the patch, but: I think the probability of
useless-noise HINTs could be substantially reduced if the code prints a
HINT only when there is a single available alternative that is clearly
better than the others in Levenshtein distance. I'm not sure how much
better is "clearly better", but I exclude "zero" from that. I see that
the original description of the patch says that it will arbitrarily
choose one alternative when there are several with equal Levenshtein
distance, and I'd say that's a bad idea.
I disagree. I happen to think that making some guess is better than no
guess at all here, given the fact that there aren't too many
possibilities to choose from. I think that it might be particularly
annoying to not show some suggestion in the event of a would-be
ambiguous column reference where the column name is itself wrong,
since both mistakes are common. For example, "order_id" was specified
instead of one of either "o.orderid" or "ol.orderid", as in my
original examples. If some correct alias was specified, that would
make the new code prefer the appropriate Var, but it might not be, and
that should be okay in my view.
I'm not trying to remove the need for human judgement here. We've all
heard stories about people who did things like input "Portland" into
their GPS only to end up in Maine rather than Oregon, but I think in
general you can only go so far in worrying about those cases.
Emitting a suggestion with a large distance seems like it could be
rather irritating. If the user types in SELECT prodct_id FROM orders,
and that column does not exist, suggesting "product_id", if such a
column exists, will likely be well-received. Suggesting a column
named, say, "price", however, will likely make at least some users say
"no I didn't mean that you stupid @%!#" - because probably the issue
there is that the user selected from the completely wrong table,
rather than getting 6 of the 9 characters they typed incorrect.
One existing tool that does something along these lines is 'git',
which seems to have some kind of a heuristic to know when to give up:
[rhaas pgsql]$ git gorp
git: 'gorp' is not a git command. See 'git --help'.
Did you mean this?
grep
[rhaas pgsql]$ git goop
git: 'goop' is not a git command. See 'git --help'.
Did you mean this?
grep
[rhaas pgsql]$ git good
git: 'good' is not a git command. See 'git --help'.
[rhaas pgsql]$ git puma
git: 'puma' is not a git command. See 'git --help'.
Did you mean one of these?
pull
push
I suspect that the maximum useful distance is a function of the string
length. Certainly, if the distance is greater than or equal to the
length of one of the strings involved, it's just a totally unrelated
string and thus not worth suggesting. A useful heuristic might be
something like "distance at most 3, or at most half the string length,
whichever is less".
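That heuristic could be sketched as follows (illustrative only; `max_useful_distance` is a hypothetical name, and taking the shorter of the two names as "the string length" is one possible reading):

```python
def max_useful_distance(typed: str, candidate: str) -> int:
    # "distance at most 3, or at most half the string length,
    # whichever is less"; here "the string" is taken to be the
    # shorter of the two names, an assumption on my part.
    return min(3, min(len(typed), len(candidate)) // 2)

def worth_suggesting(typed: str, candidate: str, distance: int) -> bool:
    return distance <= max_useful_distance(typed, candidate)

print(worth_suggesting("prodct_id", "product_id", 1))  # True
print(worth_suggesting("prodct_id", "price", 6))       # False
```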
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Tue, Jun 17, 2014 at 12:51 AM, Peter Geoghegan <pg@heroku.com> wrote:
I disagree. I happen to think that making some guess is better than no
guess at all here, given the fact that there aren't too many
possibilities to choose from.
Emitting a suggestion with a large distance seems like it could be
rather irritating. If the user types in SELECT prodct_id FROM orders,
and that column does not exist, suggesting "product_id", if such a
column exists, will likely be well-received. Suggesting a column
named, say, "price", however, will likely make at least some users say
"no I didn't mean that you stupid @%!#" - because probably the issue
there is that the user selected from the completely wrong table,
rather than getting 6 of the 9 characters they typed incorrect.
Yeah, that's my point exactly. There's no very good reason to assume that
the intended answer is in fact among the set of column names we can see;
and if it *is* there, the Levenshtein distance to it isn't going to be
all that large. I think that suggesting "foobar" when the user typed
"glorp" is not only not helpful, but makes us look like idiots.
One existing tool that does something along these lines is 'git',
which seems to have some kind of a heuristic to know when to give up:
I wouldn't necessarily hold up git as a model of user interface
engineering ;-) ... but still, it might be interesting to take a look
at exactly what heuristics they used here. I'm sure there are other
precedents we could look at, too.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wouldn't necessarily hold up git as a model of user interface
engineering ;-) ... but still, it might be interesting to take a
look at exactly what heuristics they used here. I'm sure there
are other precedents we could look at, too.
On my Ubuntu machine, bash does something similar. A few examples
chosen completely arbitrarily:
kgrittn@Kevin-Desktop:~$ got
No command 'got' found, did you mean:
Command 'go' from package 'golang-go' (universe)
Command 'gout' from package 'scotch' (universe)
Command 'jot' from package 'athena-jot' (universe)
Command 'go2' from package 'go2' (universe)
Command 'git' from package 'git' (main)
Command 'gpt' from package 'gpt' (universe)
Command 'gom' from package 'gom' (universe)
Command 'goo' from package 'goo' (universe)
Command 'gst' from package 'gnu-smalltalk' (universe)
Command 'dot' from package 'graphviz' (main)
Command 'god' from package 'god' (universe)
Command 'god' from package 'ruby-god' (universe)
got: command not found
kgrittn@Kevin-Desktop:~$ groupad
No command 'groupad' found, did you mean:
Command 'groupadd' from package 'passwd' (main)
Command 'groupd' from package 'cman' (main)
groupad: command not found
kgrittn@Kevin-Desktop:~$ asdf
No command 'asdf' found, did you mean:
Command 'asdfg' from package 'aoeui' (universe)
Command 'sadf' from package 'sysstat' (main)
Command 'sdf' from package 'sdf' (universe)
asdf: command not found
kgrittn@Kevin-Desktop:~$ zxcv
zxcv: command not found
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 06/17/2014 01:59 PM, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
Emitting a suggestion with a large distance seems like it could be
rather irritating. If the user types in SELECT prodct_id FROM orders,
and that column does not exist, suggesting "product_id", if such a
column exists, will likely be well-received. Suggesting a column
named, say, "price", however, will likely make at least some users say
"no I didn't mean that you stupid @%!#" - because probably the issue
there is that the user selected from the completely wrong table,
rather than getting 6 of the 9 characters they typed incorrect.
Yeah, that's my point exactly. There's no very good reason to assume that
the intended answer is in fact among the set of column names we can see;
and if it *is* there, the Levenshtein distance to it isn't going to be
all that large. I think that suggesting "foobar" when the user typed
"glorp" is not only not helpful, but makes us look like idiots.
Well, there's two different issues:
(1) offering a suggestion which is too different from what the user
typed. This is easily limited by having a max distance (most likely a
distance/length ratio, with a max of say, 0.5). The only drawback of
this would be the extra cpu cycles to calculate it, and some arguments
about what the max distance should be. But for the sake of the
children, let's not have a GUC for it.
(2) If there are multiple columns with the same Levenshtein distance,
which one do you suggest? The current code picks a random one, which
I'm OK with. The other option would be to list all of the columns.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Jun 17, 2014 at 1:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, that's my point exactly. There's no very good reason to assume that
the intended answer is in fact among the set of column names we can see;
and if it *is* there, the Levenshtein distance to it isn't going to be
all that large. I think that suggesting "foobar" when the user typed
"glorp" is not only not helpful, but makes us look like idiots.
Maybe that's just a matter of phrasing the message appropriately. A
more guarded message, one that merely presents "foobar" as the *best*
match, is correct at least on its own terms (terms that are self-evident).
This does pretty effectively communicate to the user that they should
totally rethink not just the column name, but perhaps the entire
query. On the other hand, showing nothing communicates nothing.
--
Peter Geoghegan
Josh Berkus <josh@agliodbs.com> writes:
(2) If there are multiple columns with the same Levenshtein distance,
which one do you suggest? The current code picks a random one, which
I'm OK with. The other option would be to list all of the columns.
I objected to that upthread. I don't think that picking a random one is
sane at all. Listing them all might be OK (I notice that that seems to be
what both bash and git do).
Another issue is whether to print only those having exactly the minimum
observed Levenshtein distance, or to print everything less than some
cutoff. The former approach seems to me to be placing a great deal of
faith in something that's only a heuristic.
regards, tom lane
On 06/17/2014 02:36 PM, Tom Lane wrote:
Josh Berkus <josh@agliodbs.com> writes:
(2) If there are multiple columns with the same Levenshtein distance,
which one do you suggest? The current code picks a random one, which
I'm OK with. The other option would be to list all of the columns.
I objected to that upthread. I don't think that picking a random one is
sane at all. Listing them all might be OK (I notice that that seems to be
what both bash and git do).
Another issue is whether to print only those having exactly the minimum
observed Levenshtein distance, or to print everything less than some
cutoff. The former approach seems to me to be placing a great deal of
faith in something that's only a heuristic.
Well, that depends on what the cutoff is. If it's high, like 0.5, that
could be a LOT of columns. Like, I plan to test this feature with a
3-table join that has a combined 300 columns. I can completely imagine
coming up with a string which is within 0.5 or even 0.3 of 40 column names.
So if we want to list everything below a cutoff, we'd need to make that
cutoff fairly narrow, like 0.2. But that means we'd miss a lot of
potential matches on short column names.
I really think we're overthinking this: it is just a HINT, and we can
improve it in future PostgreSQL versions, and most of our users will
ignore it anyway because they'll be using a client which doesn't display
HINTs.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Peter Geoghegan <pg@heroku.com> writes:
Maybe that's just a matter of phrasing the message appropriately. A
more guarded message, that suggests that "foobar" is the *best* match
is correct at least on its own terms (terms that are self evident).
This does pretty effectively communicate to the user that they should
totally rethink not just the column name, but perhaps the entire
query. On the other hand, showing nothing communicates nothing.
I don't especially buy that argument. As soon as the user's gotten used
to hints of this sort, the absence of a hint communicates plenty.
In any case, people have now cited two different systems with suggestion
capability, and neither of them behaves as you're arguing for. The lack
of precedent should give you pause, unless you can point to widely-used
systems that do what you have in mind.
regards, tom lane
Josh Berkus <josh@agliodbs.com> writes:
On 06/17/2014 02:36 PM, Tom Lane wrote:
Another issue is whether to print only those having exactly the minimum
observed Levenshtein distance, or to print everything less than some
cutoff. The former approach seems to me to be placing a great deal of
faith in something that's only a heuristic.
Well, that depends on what the cutoff is. If it's high, like 0.5, that
could be a LOT of columns. Like, I plan to test this feature with a
3-table join that has a combined 300 columns. I can completely imagine
coming up with a string which is within 0.5 or even 0.3 of 40 column names.
I think Levenshtein distances are integers, though that's just a minor
point.
So if we want to list everything below a cutoff, we'd need to make that
cutoff fairly narrow, like 0.2. But that means we'd miss a lot of
potential matches on short column names.
I'm not proposing an immutable cutoff. Something that scales with the
string length might be a good idea, or we could make it a multiple of
the minimum observed distance, or probably there are a dozen other things
we could do. I'm just saying that if we have an alternative at distance
3, and another one at distance 4, it's not clear to me that we should
assume that the first one is certainly what the user had in mind.
Especially not if all the other alternatives are distance 10 or more.
I really think we're overthinking this: it is just a HINT, and we can
improve it in future PostgreSQL versions, and most of our users will
ignore it anyway because they'll be using a client which doesn't display
HINTs.
Agreed that we can make it better later. But whether it prints exactly
one suggestion, and whether it does that no matter how silly the
suggestion is, are rather fundamental decisions.
regards, tom lane
On 06/17/2014 02:53 PM, Tom Lane wrote:
Josh Berkus <josh@agliodbs.com> writes:
On 06/17/2014 02:36 PM, Tom Lane wrote:
Another issue is whether to print only those having exactly the minimum
observed Levenshtein distance, or to print everything less than some
cutoff. The former approach seems to me to be placing a great deal of
faith in something that's only a heuristic.
Well, that depends on what the cutoff is. If it's high, like 0.5, that
could be a LOT of columns. Like, I plan to test this feature with a
3-table join that has a combined 300 columns. I can completely imagine
coming up with a string which is within 0.5 or even 0.3 of 40 column names.
I think Levenshtein distances are integers, though that's just a minor
point.
I was giving distance/length ratios. That is, 0.5 would mean that up to
50% of the characters could be replaced/changed. 0.2 would mean that
only one character could be changed at lengths of five characters. Etc.
The problem with these ratios is that they behave differently with long
strings than short ones. I think realistically we'd need a double
threshold, i.e. ( distance <= 2 OR ratio <= 0.4 ). Otherwise the
obvious case, getting two characters wrong in a 4-character column name
(or one in a two character name), doesn't get a HINT.
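Josh's double threshold is easy to express concretely. The sketch below is ours, not code from any posted patch, and it assumes the comparison he presumably intended: accept when the distance is at most 2, or when the distance/length ratio is at most 0.4.

```c
#include <string.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of Josh's double threshold (hint_accepts is our
 * invented name): tolerate up to two edits outright, and otherwise fall
 * back to a relative test so long names get proportionally more slack.
 */
static bool
hint_accepts(int distance, const char *candidate)
{
    size_t len = strlen(candidate);

    if (distance <= 2)
        return true;        /* covers short names like "aa" or "id" */
    return len > 0 && (double) distance / (double) len <= 0.4;
}
```

Under this rule a 4-character name with two wrong characters still gets a hint, while a distance of 3 against "orderid" (ratio about 0.43) does not.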
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Jun 17, 2014 at 2:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm not proposing an immutable cutoff. Something that scales with the
string length might be a good idea, or we could make it a multiple of
the minimum observed distance, or probably there are a dozen other things
we could do. I'm just saying that if we have an alternative at distance
3, and another one at distance 4, it's not clear to me that we should
assume that the first one is certainly what the user had in mind.
Especially not if all the other alternatives are distance 10 or more.
The patch just looks for the match with the lowest distance, passing
the lowest observed distance so far as a "max" to the distance
calculation function. That could have some value in certain cases.
People have already raised general concerns about added cycles and/or
clutter.
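The "max" short-circuit Peter mentions amounts to the standard early-exit form of the dynamic-programming algorithm: once every cell in the current row exceeds the cap, the final distance cannot come in under it. A much-simplified single-byte sketch (our own, not the patch's varstr_leven_less_equal(), which also handles multibyte text and custom costs):

```c
#include <string.h>
#include <stdlib.h>

/*
 * Levenshtein distance with an early-exit cap: if every entry in the
 * current DP row already exceeds max_d, give up and report "too far"
 * as max_d + 1.  Simplified illustration only.
 */
static int
leven_less_equal(const char *s, const char *t, int max_d)
{
    int m = (int) strlen(s);
    int n = (int) strlen(t);
    int *prev = malloc((n + 1) * sizeof(int));
    int *curr = malloc((n + 1) * sizeof(int));
    int result;

    for (int j = 0; j <= n; j++)
        prev[j] = j;

    for (int i = 1; i <= m; i++)
    {
        int row_min = i;

        curr[0] = i;
        for (int j = 1; j <= n; j++)
        {
            int sub = prev[j - 1] + (s[i - 1] != t[j - 1]);
            int del = prev[j] + 1;
            int ins = curr[j - 1] + 1;
            int best = sub < del ? sub : del;

            curr[j] = best < ins ? best : ins;
            if (curr[j] < row_min)
                row_min = curr[j];
        }
        if (row_min > max_d)
        {
            free(prev);
            free(curr);
            return max_d + 1;   /* caller only cares that it's "too far" */
        }
        /* swap rows */
        int *tmp = prev;
        prev = curr;
        curr = tmp;
    }
    result = prev[n];
    free(prev);
    free(curr);
    return result;
}
```

A caller scanning candidate columns would pass the best distance seen so far as max_d, letting hopeless comparisons bail out after only a few rows of the matrix.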
--
Peter Geoghegan
On 18/06/14 10:05, Peter Geoghegan wrote:
On Tue, Jun 17, 2014 at 2:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm not proposing an immutable cutoff. Something that scales with the
string length might be a good idea, or we could make it a multiple of
the minimum observed distance, or probably there are a dozen other things
we could do. I'm just saying that if we have an alternative at distance
3, and another one at distance 4, it's not clear to me that we should
assume that the first one is certainly what the user had in mind.
Especially not if all the other alternatives are distance 10 or more.
The patch just looks for the match with the lowest distance, passing
the lowest observed distance so far as a "max" to the distance
calculation function. That could have some value in certain cases.
People have already raised general concerns about added cycles and/or
clutter.
How about a list of misspellings and the likely targets.
(grop, grap, ...) ==> (grep, grape, grope...)
type of thing? Possibly with some kind of adaptive learning algorithm.
I suspect that while this might be a useful research project, it is out
of scope for the current discussion!
Cheers,
Gavin
On Tue, Jun 17, 2014 at 5:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Josh Berkus <josh@agliodbs.com> writes:
(2) If there are multiple columns with the same Levenshtein distance,
which one do you suggest? The current code picks a random one, which
I'm OK with. The other option would be to list all of the columns.
I objected to that upthread. I don't think that picking a random one is
sane at all. Listing them all might be OK (I notice that that seems to be
what both bash and git do).
What bash does is annoying and stupid, and any time I find a system
with that obnoxious behavior enabled I immediately disable it, so I
don't consider that a good precedent for anything. I think what the
bash algorithm demonstrates is that while it may be sane to list more
than one option, listing 10 or 20 or 150 is unbearably obnoxious.
Filling the user's *entire terminal window* with a list of suggestions
when they make a minor typo is more like a punishment than an aid.
git's behavior of limiting itself to one or two options, while
somewhat useless, is at least not annoying.
Another issue is whether to print only those having exactly the minimum
observed Levenshtein distance, or to print everything less than some
cutoff. The former approach seems to me to be placing a great deal of
faith in something that's only a heuristic.
Well, we've got lots of heuristics. Many of them serve us quite well.
I might do something like this:
(1) Set the maximum Levenshtein distance to half the length of the
string, rounded down, but not more than 3.
(2) If there are more than 2 matches, reduce the maximum distance by 1
and repeat this step.
(3) If there are no remaining matches, print no hint; else print the 1
or 2 matching items.
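Robert's three steps could be sketched roughly as follows. This is our own illustration over a bare array of precomputed distances (pick_cutoff and count_within are invented names), not what any posted patch does:

```c
#include <string.h>

/* Count candidates whose distance falls within the cutoff. */
static int
count_within(const int *dist, int ncand, int cutoff)
{
    int n = 0;

    for (int i = 0; i < ncand; i++)
        if (dist[i] <= cutoff)
            n++;
    return n;
}

/*
 * Sketch of Robert's proposed rule: start with a cutoff of half the
 * misspelled name's length, capped at 3; while more than two candidates
 * qualify, tighten the cutoff.  The caller prints a hint only if one or
 * two candidates remain at the returned cutoff.
 */
static int
pick_cutoff(const char *misspelled, const int *dist, int ncand)
{
    int cutoff = (int) strlen(misspelled) / 2;

    if (cutoff > 3)
        cutoff = 3;
    while (cutoff >= 0 && count_within(dist, ncand, cutoff) > 2)
        cutoff--;
    return cutoff;              /* may reach -1: no hint at all */
}
```

Note how three exactly-tied candidates tighten the cutoff until nothing qualifies, so no hint is printed, which matches step (3).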
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 17, 2014 at 5:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
What bash does is annoying and stupid, and any time I find a system
with that obnoxious behavior enabled I immediately disable it, so I
don't consider that a good precedent for anything.
I happen to totally agree with you here. Bash sometimes does awful
things with its completion.
Another issue is whether to print only those having exactly the minimum
observed Levenshtein distance, or to print everything less than some
cutoff. The former approach seems to me to be placing a great deal of
faith in something that's only a heuristic.
Well, we've got lots of heuristics. Many of them serve us quite well.
I might do something like this:
(1) Set the maximum Levenshtein distance to half the length of the
string, rounded down, but not more than 3.
(2) If there are more than 2 matches, reduce the maximum distance by 1
and repeat this step.
(3) If there are no remaining matches, print no hint; else print the 1
or 2 matching items.
I could do that. I can prepare a revision if others feel that's
acceptable. My only concern with this is that a more sophisticated
scheme implies more clutter in the parser, although it should not
imply wasted cycles.
What I particularly wanted to avoid in our choice of completion scheme
is doing nothing because there is an ambiguity about what is best,
which Tom suggested. In practice, that ambiguity will frequently be
something that our users will not care about, and not really see as an
ambiguity, as in my "o.orderid or ol.orderid?" example. However, if
there are 3 equally distant Vars, and not just 2, that's very probably
because none are useful, and so we really ought to show nothing. This
seems most sensible.
--
Peter Geoghegan
So, what's the status of this patch?
There's been quite a lot of discussion (though only about the approach;
no formal code/usage review has yet been posted), but as far as I can
tell, it just tapered off without any particular consensus.
-- Abhijit
Abhijit Menon-Sen <ams@2ndQuadrant.com> writes:
So, what's the status of this patch?
There's been quite a lot of discussion (though only about the approach;
no formal code/usage review has yet been posted), but as far as I can
tell, it just tapered off without any particular consensus.
AFAICT, people generally agree that this would probably be useful,
but there's not consensus on how far the code should be willing to
"reach" for a match, nor on what to do when there are multiple
roughly-equally-plausible candidates.
Although printing all candidates seems to be what's preferred by
existing systems with similar facilities, I can see the point that
constructing the message in a translatable fashion might be difficult.
So personally I'd be willing to abandon insistence on that. I still
think though that printing candidates with very large distances
would be unhelpful.
regards, tom lane
On Sun, Jun 29, 2014 at 7:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Although printing all candidates seems to be what's preferred by
existing systems with similar facilities, I can see the point that
constructing the message in a translatable fashion might be difficult.
So personally I'd be willing to abandon insistence on that. I still
think though that printing candidates with very large distances
would be unhelpful.
Attached revision factors in everyone's concerns here, I think.
I've addressed your concern about the closeness of the match proposed
in the HINT - the absolute as opposed to relative quality of the
match. There is a normalized distance threshold that must always be
exceeded to prevent ludicrous suggestions. This works along similar
lines to those sketched by Robert. Furthermore, I've made it
occasionally possible to see 2 suggestions, when they're equally
distant and when each suggestion comes from a different range table
entry. However, if the two best suggestions (overall or within an RTE)
come from within the same RTE, then that RTE is ignored for the
purposes of picking a suggestion (although the lowest observed
distance from an ignored RTE may still be used as the distance for
later RTEs to beat to get their attributes suggested in the HINT).
The idea here is that this quality-bar for suggestions doesn't come at
the cost of ignoring my concern about the presumably somewhat common
case where there is an unqualified and therefore ambiguous column
reference that happens to also be misspelled. An ambiguous column
reference and an incorrectly spelled column name are both very common,
and so it seems likely that momentary lapses where the user gets both
things wrong at once are also common. We do all this without going
overboard, since as outlined by Robert, when there are 3 or more
equally distant candidates (even if they all come from different
RTEs), we give no HINT at all. The big picture here is to make mental
context switches cheap when writing ad-hoc queries in psql.
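Condensed into code, the selection policy described above might look like the following. This is a heavy simplification in our own terms (RteBest and n_suggestions are invented; the real patch tracks this state while scanning range table entries in the parser):

```c
#include <limits.h>

typedef struct
{
    int best;       /* lowest Levenshtein distance within this RTE */
    int ties;       /* how many of the RTE's columns share that distance */
} RteBest;

/*
 * Returns how many suggestions (0, 1, or 2) the HINT would carry.
 * Each RTE contributes at most one candidate: its single closest column.
 * An RTE with an internal tie is disqualified, but its best distance
 * still sets the bar that other RTEs must meet.  Three or more equally
 * distant candidates across RTEs yield no hint at all.
 */
static int
n_suggestions(const RteBest *rte, int nrte)
{
    int overall = INT_MAX;
    int matches = 0;

    /* even a disqualified RTE (internal tie) sets the bar */
    for (int i = 0; i < nrte; i++)
        if (rte[i].best < overall)
            overall = rte[i].best;

    for (int i = 0; i < nrte; i++)
        if (rte[i].best == overall && rte[i].ties == 1)
            matches++;

    return matches <= 2 ? matches : 0;  /* 3+ equidistant: stay silent */
}
```

An RTE whose two closest columns tie disqualifies itself but still raises the bar, which is why a lone table whose two columns are equally distant produces no hint under this policy.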
A lot of the HINTs that popped up in the regression tests that seemed
kind of questionable no longer appear. These new measures make the
coding somewhat more complex than that of the initial version,
although overall the parser code added by this patch is almost
entirely confined to code paths concerned only with producing
diagnostic messages to help users.
--
Peter Geoghegan
At 2014-07-02 15:51:08 -0700, pg@heroku.com wrote:
Attached revision factors in everyone's concerns here, I think.
Is anyone planning to review Peter's revised patch?
These new measures make the coding somewhat more complex than that of
the initial version, although overall the parser code added by this
patch is almost entirely confined to code paths concerned only with
producing diagnostic messages to help users.
Yes, the new patch looks quite a bit more involved than earlier, but if
that's what it takes to provide a useful HINT, I guess it's not too bad.
-- Abhijit
On Sat, Jul 5, 2014 at 12:46 AM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:
At 2014-07-02 15:51:08 -0700, pg@heroku.com wrote:
Attached revision factors in everyone's concerns here, I think.
Is anyone planning to review Peter's revised patch?
I have been doing some functional tests, and looked quickly at the
code to understand what it does:
1) Compiles without warnings, passes regression tests
2) The checking process goes through all the existing columns of a
relation even when a difference of 1 with some other column(s) has
already been found. As we try to limit the number of hints returned,
this seems like a waste of resources.
3) distanceName could be improved, for example by having some checks
on the string lengths of the target and source columns, immediately
rejecting the match if, say, the length of the source string is
double or half the length of the target.
4) This is not nice, could it be possible to remove the stuff from varlena.c?
+/* Expand each Levenshtein distance variant */
+#include "levenshtein.c"
+#define LEVENSHTEIN_LESS_EQUAL
+#include "levenshtein.c"
+#undef LEVENSHTEIN_LESS_EQUAL
Part of the same comment: only varstr_leven_less_equal is used to
calculate the distance, should we really move varstr_leven to core?
This clearly needs to be reworked as not just a copy-paste of the
things in fuzzystrmatch.
The flag LEVENSHTEIN_LESS_EQUAL should be left within fuzzystrmatch I think.
5) Do we want hints on system columns as well? For example here we
could get tableoid as column hint:
=# select tablepid from foo;
ERROR: 42703: column "tablepid" does not exist
LINE 1: select tablepid from foo;
^
LOCATION: errorMissingColumn, parse_relation.c:3123
Time: 0.425 ms
6) Sometimes no hints are returned... Even in simple cases like this one:
=# create table foo (aa int, bb int);
CREATE TABLE
=# select ab from foo;
ERROR: 42703: column "ab" does not exist
LINE 1: select ab from foo;
^
LOCATION: errorMissingColumn, parse_relation.c:3123
7) Performance penalty with a table with 1600 columns:
=# CREATE FUNCTION create_long_table(tabname text, columns int)
RETURNS void
LANGUAGE plpgsql
as $$
declare
first_col bool = true;
count int;
query text;
begin
query := 'CREATE TABLE ' || tabname || ' (';
for count in 0..columns loop
query := query || 'col' || count || ' int';
if count <> columns then
query := query || ', ';
end if;
end loop;
query := query || ')';
execute query;
end;
$$;
=# SELECT create_long_table('aa', 1599);
create_long_table
-------------------
(1 row)
Then tested queries like that: SELECT col888a FROM aa;
Patched version: 2.100ms~2.200ms
master branch (6048896): 0.956 ms~0.990 ms
So the performance impact seems limited.
Regards,
--
Michael
Hi,
On Tue, Jul 8, 2014 at 6:58 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
2) The checking process goes through all the existing columns of a
relation even when a difference of 1 with some other column(s) has
already been found. As we try to limit the number of hints returned,
this seems like a waste of resources.
In general it's possible that an exact match will later be found
within the RTE, and exact matches don't have to pay the "wrong alias"
penalty, and are immediately returned. It is therefore not a waste of
resources, but even if it was that would be pretty inconsequential as
your benchmark shows.
3) distanceName could be improved, for example by having some checks
on the string lengths of the target and source columns, immediately
rejecting the match if, say, the length of the source string is
double or half the length of the target.
I don't think it's a good idea to tie distanceName() to the ultimate
behavior of errorMissingColumn() hinting, since there may be other
callers in the future. Besides, that isn't going to help much.
4) This is not nice, could it be possible to remove the stuff from varlena.c?
+/* Expand each Levenshtein distance variant */
+#include "levenshtein.c"
+#define LEVENSHTEIN_LESS_EQUAL
+#include "levenshtein.c"
+#undef LEVENSHTEIN_LESS_EQUAL
Part of the same comment: only varstr_leven_less_equal is used to
calculate the distance, should we really move varstr_leven to core?
This clearly needs to be reworked as not just a copy-paste of the
things in fuzzystrmatch.
The flag LEVENSHTEIN_LESS_EQUAL should be left within fuzzystrmatch I think.
So there'd be one variant within core and one within
contrib/fuzzystrmatch? I don't think that's an improvement.
5) Do we want hints on system columns as well?
I think it's obvious that the answer must be no. That's going to
frequently result in suggestions of columns that users will complain
aren't even there. If you know about the system columns, you can just
get it right. They're supposed to be hidden for most purposes.
6) Sometimes no hints are returned... Even in simple cases like this one:
=# create table foo (aa int, bb int);
CREATE TABLE
=# select ab from foo;
ERROR: 42703: column "ab" does not exist
LINE 1: select ab from foo;
^
LOCATION: errorMissingColumn, parse_relation.c:3123
That's because those two candidates come from a single RTE and have an
equal distance -- you'd see both suggestions if you joined two tables
with each candidate, assuming that each table being joined didn't
individually have the same issue. I think that that's probably
considered the correct behavior by most.
--
Peter Geoghegan
Peter Geoghegan wrote:
6) Sometimes no hints are returned... Even in simple cases like this one:
=# create table foo (aa int, bb int);
CREATE TABLE
=# select ab from foo;
ERROR: 42703: column "ab" does not exist
LINE 1: select ab from foo;
^
LOCATION: errorMissingColumn, parse_relation.c:3123
That's because those two candidates come from a single RTE and have an
equal distance -- you'd see both suggestions if you joined two tables
with each candidate, assuming that each table being joined didn't
individually have the same issue. I think that that's probably
considered the correct behavior by most.
It seems pretty silly to me actually. Was this designed by a committee?
I agree with the general principle that showing a large number of
candidates (a la bash) is a bad idea, but failing to show two of them ...
Words fail me.
--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jul 8, 2014 at 1:42 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
That's because those two candidates come from a single RTE and have an
equal distance -- you'd see both suggestions if you joined two tables
with each candidate, assuming that each table being joined didn't
individually have the same issue. I think that that's probably
considered the correct behavior by most.
It seems pretty silly to me actually. Was this designed by a committee?
I agree with the general principle that showing a large number of
candidates (a la bash) is a bad idea, but failing to show two of them ...
I guess it was designed by a committee. But we don't fail to show both
because they're equally distant. Rather, it's because they're equally
distant and from the same RTE. This is a contrived example, but
typically showing equally distant columns is useful when they're in a
foreign-key relationship - I was worried about the common case where a
column name is misspelled that would otherwise be ambiguous, which is
why that shows a HINT while the single RTE case doesn't. I think that
in most realistic cases it wouldn't be all that useful to show two
columns from the same table when they're equally distant. It's easy to
imagine such a tie reflecting that no match is good in absolute terms, and
we're somewhat conservative about showing any match. While I think
this general behavior is defensible, I must admit that it did suit me
to write it that way because to do otherwise would have necessitated
more invasive code in the existing general purpose scanRTEForColumn()
function.
--
Peter Geoghegan
On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan <pg@heroku.com> wrote:
I was worried about the common case where a
column name is misspelled that would otherwise be ambiguous, which is
why that shows a HINT while the single RTE case doesn't
To be clear - I mean a HINT with two suggestions rather than just one.
If there are 3 or more equally distant suggestions (even if they're
all from different RTEs) we also give no HINT in the proposed patch.
--
Peter Geoghegan
On Wed, Jul 9, 2014 at 5:56 AM, Peter Geoghegan <pg@heroku.com> wrote:
On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan <pg@heroku.com> wrote:
I was worried about the common case where a
column name is misspelled that would otherwise be ambiguous, which is
why that shows a HINT while the single RTE case doesn't.
To be clear - I mean a HINT with two suggestions rather than just one.
If there are 3 or more equally distant suggestions (even if they're
all from different RTEs) we also give no HINT in the proposed patch.
Showing up to 2 hints is fine as it does not pollute the error output with
perhaps unnecessary messages. That's even more protective than for example
git that prints all the equidistant candidates. However I can't understand
why it does not show hints even when there are two equidistant candidates
from the same RTE. I think it should.
--
Michael
On Wed, Jul 9, 2014 at 1:49 AM, Peter Geoghegan <pg@heroku.com> wrote:
4) This is not nice, could it be possible to remove the stuff from
varlena.c?
+/* Expand each Levenshtein distance variant */
+#include "levenshtein.c"
+#define LEVENSHTEIN_LESS_EQUAL
+#include "levenshtein.c"
+#undef LEVENSHTEIN_LESS_EQUAL
Part of the same comment: only varstr_leven_less_equal is used to
calculate the distance, should we really move varstr_leven to core?
This clearly needs to be reworked as not just a copy-paste of the
things in fuzzystrmatch.
The flag LEVENSHTEIN_LESS_EQUAL should be left within fuzzystrmatch I think.
So there'd be one variant within core and one within
contrib/fuzzystrmatch? I don't think that's an improvement.
No. The main difference between varstr_leven_less_equal and varstr_leven is
the use of the extra argument max_d in the former. My argument here is that
instead of blindly cut-and-pasting into core the code you are interested in
for evaluating string distances, you should refactor it into a single
function and leave the business with LEVENSHTEIN_LESS_EQUAL within
contrib/fuzzystrmatch. This will require some reshuffling of the distance
function, but looking at this patch I am getting the feeling that this
is necessary, and it should even be split into a first patch for
fuzzystrmatch that would facilitate its integration into core.
Also why is rest_of_char_same within varlena.c?
5) Do we want hints on system columns as well?
I think it's obvious that the answer must be no. That's going to
frequently result in suggestions of columns that users will complain
aren't even there. If you know about the system columns, you can just
get it right. They're supposed to be hidden for most purposes.
This may sound ridiculous, but I have already found myself mistyping ctid
by tid and cid while working on patches and modules that played with page
format, and needing a couple of minutes to understand what was going on
(bad morning). I would have welcomed such hints in those cases.
--
Michael
On Tue, Jul 8, 2014 at 11:10 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Showing up to 2 hints is fine as it does not pollute the error output with
perhaps unnecessary messages. That's even more protective than for example
git that prints all the equidistant candidates. However I can't understand
why it does not show up hints even if there are two equidistant candidates
from the same RTE. I think it should.
Everyone is going to have an opinion on something like that. I was
showing deference to the general concern about the absolute (as
opposed to relative) quality of the HINTs in the event of equidistant
matches by having no two suggestions come from within a single RTE,
while still covering the case I thought was important by having two
suggestions if there were two equidistant matches across RTEs. I think
that's marginally better than what you propose, because your case
deals with two equidistant though distinct columns, bringing into
question the validity of both would-be suggestions. I'll defer to
whatever the consensus is.
--
Peter Geoghegan
On Tue, Jul 8, 2014 at 11:25 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
So there'd be one variant within core and one within
contrib/fuzzystrmatch? I don't think that's an improvement.
No. The main difference between varstr_leven_less_equal and varstr_leven is
the use of the extra argument max_d in the former. My argument here is
instead of blindly cut-pasting into core the code you are interested in to
evaluate the string distances, is to refactor it to have a unique function,
and to let the business with LEVENSHTEIN_LESS_EQUAL within
contrib/fuzzystrmatch. This will require some reshuffling of the distance
function, but by looking at this patch I am getting the feeling that this is
necessary, and should even be split into a first patch for fuzzystrmatch
that would facilitate its integration into core.
Also why is rest_of_char_same within varlena.c?
Just as before, rest_of_char_same() exists for the express purpose of
being called by the two variants varstr_leven_less_equal() and
varstr_leven(). Why wouldn't I copy it over too along with those two?
Where do you propose to put it?
Obviously the existing macro hacks (that I haven't changed) that build
the two variants are not terribly pretty, but they're not arbitrary
either. They reflect the fact that there is no natural way to add
callbacks or something like that. If you pretended that the core code
didn't have to care about one case or the other, and that contrib was
somehow obligated to hook in its own handler for the
!LEVENSHTEIN_LESS_EQUAL case that it now only cares about, then you'd
end up with an even bigger mess. Besides, with the patch the core code
is calling varstr_leven_less_equal(), which is the bigger of the two
variants - it's the LEVENSHTEIN_LESS_EQUAL case, not the
!LEVENSHTEIN_LESS_EQUAL case that core cares about for the purposes of
building HINTs. In short, I don't know what you mean. What would that
reshuffling actually look like?
5) Do we want hints on system columns as well?
I think it's obvious that the answer must be no. That's going to
frequently result in suggestions of columns that users will complain
aren't even there. If you know about the system columns, you can just
get it right. They're supposed to be hidden for most purposes.
This may sound ridiculous, but I have already found myself mistyping ctid by
tid and cid while working on patches and modules that played with page
format, and needing a couple of minutes to understand what was going on (bad
morning).
I think that it's clearly not worth it, even if it is true that a
minority sometimes make this mistake. Most users don't know that there
are system columns. It's not even close to being worth it to bring
that into this.
--
Peter Geoghegan
Peter Geoghegan wrote:
On Tue, Jul 8, 2014 at 11:25 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
5) Do we want hints on system columns as well?
I think it's obvious that the answer must be no. That's going to
frequently result in suggestions of columns that users will complain
aren't even there. If you know about the system columns, you can just
get it right. They're supposed to be hidden for most purposes.
This may sound ridiculous, but I have already found myself mistyping ctid by
tid and cid while working on patches and modules that played with page
format, and needing a couple of minutes to understand what was going on (bad
morning).
I think that it's clearly not worth it, even if it is true that a
minority sometimes make this mistake. Most users don't know that there
are system columns. It's not even close to being worth it to bring
that into this.
I agree with Peter. This is targeted at regular users.
--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jul 9, 2014 at 2:10 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Wed, Jul 9, 2014 at 5:56 AM, Peter Geoghegan <pg@heroku.com> wrote:
On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan <pg@heroku.com> wrote:
I was worried about the common case where a
column name is misspelled that would otherwise be ambiguous, which is
why that shows a HINT while the single RTE case doesn't.
To be clear - I mean a HINT with two suggestions rather than just one.
If there are 3 or more equally distant suggestions (even if they're
all from different RTEs) we also give no HINT in the proposed patch.
Showing up to 2 hints is fine as it does not pollute the error output with
perhaps unnecessary messages. That's even more protective than for example
git that prints all the equidistant candidates. However I can't understand
why it does not show up hints even if there are two equidistant candidates
from the same RTE. I think it should.
Me, too.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jul 9, 2014 at 7:29 AM, Peter Geoghegan <pg@heroku.com> wrote:
Everyone is going to have an opinion on something like that. I was
showing deference to the general concern about the absolute (as
opposed to relative) quality of the HINTs in the event of equidistant
matches by having no two suggestions come from within a single RTE,
while still covering the case I thought was important by having two
suggestions if there were two equidistant matches across RTEs. I think
that's marginally better than what you propose, because your case
deals with two equidistant though distinct columns, bringing into
question the validity of both would-be suggestions. I'll defer to
whatever the consensus is.
I agree this is bike shedding. But as long as we're bike shedding...
A simple rule is easier for users to understand as well as to code. I
would humbly suggest the following: take all the unqualified column
names, downcase them, check which ones match most closely the
unmatched column. Show the top 3 matches if they're within some
arbitrary distance. If they match exactly except for the case and the
unmatched column is all lower case, add a comment that quoting is
required due to the mixed case.
Honestly the current logic and the previous logic both seemed
reasonable to me. They're not going to be perfect in every case so
anything that comes up with some suggestions is fine.
--
greg
On Wed, Jul 9, 2014 at 8:08 AM, Greg Stark <stark@mit.edu> wrote:
A simple rule is easier for users to understand as well as to code. I
would humbly suggest the following: take all the unqualified column
names, downcase them, check which ones match most closely the
unmatched column. Show the top 3 matches if they're within some
arbitrary distance.
That's harder than it sounds. You need even more translatable strings
for variant ereports(). I don't think that an easy-to-understand rule
is necessarily of much value - I'm already charging half price for
deletion because I found representative errors more useful in certain
cases by doing so. I think we want something that displays the most
useful suggestion as often as is practically possible, and does not
display unhelpful suggestions to the extent that it's practical to
avoid them. Plus, as I mentioned, I'm keen to avoid adding more stuff
to scanRTEForColumn() than I already have.
Honestly the current logic and the previous logic both seemed
reasonable to me. They're not going to be perfect in every case so
anything that comes up with some suggestions is fine.
I think that the most recent revision is somewhat better due to the
feedback of Tom and Robert. I didn't feel as strongly as they did
about erring on the side of not showing a HINT, but I think the most
recent revision is a good compromise. But yes, at this point we're
certainly chasing diminishing returns. There are almost any number of
variants of this basic idea that could be suggested.
--
Peter Geoghegan
On Wed, Jul 9, 2014 at 8:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Showing up to 2 hints is fine as it does not pollute the error output with
perhaps unnecessary messages. That's even more protective than for example
git that prints all the equidistant candidates. However I can't understand
why it does not show up hints even if there are two equidistant candidates
from the same RTE. I think it should.
Me, too.
The idea is that each RTE gets one best suggestion, because if there
are two best suggestions within an RTE they're probably both wrong.
Whereas across RTEs, it's probably just that there is a foreign key
relationship between the two (and the user accidentally failed to
qualify the particular column of interest on top of the misspelling, a
qualification that would be sufficient to have the code prefer the
qualified-but-misspelled column). Clearly if I were to do what you
suggest it would be closer to a wild guess, and Tom has expressed
concerns about that.
Now, I don't actually ensure that the column names of the two columns
(each from separate RTEs) are identical save for their would-be alias,
but that's just a consequence of the implementation. Also, as I've
mentioned, I don't want to put more stuff in scanRTEForColumn() than I
already have, due to your earlier concern about adding clutter.
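The per-RTE rule described here -- each RTE contributes at most its unique
closest column, and contributes nothing at all when it has a tie -- can be
sketched in a few lines. This is a standalone illustration under unit edit
costs, not the patch's actual code; the function name is hypothetical.

```c
#include <limits.h>
#include <stdbool.h>
#include <string.h>

/* Two-row dynamic-programming Levenshtein distance with unit costs.
 * Assumes both names are shorter than 64 bytes. */
static int
levenshtein(const char *s, const char *t)
{
    int m = (int) strlen(s), n = (int) strlen(t);
    int prev[65], curr[65];

    for (int j = 0; j <= n; j++)
        prev[j] = j;                /* j insertions build t's prefix */
    for (int i = 1; i <= m; i++)
    {
        curr[0] = i;                /* i deletions erase s's prefix */
        for (int j = 1; j <= n; j++)
        {
            int best = prev[j - 1] + (s[i - 1] != t[j - 1]);    /* subst */
            if (prev[j] + 1 < best)
                best = prev[j] + 1;                             /* delete */
            if (curr[j - 1] + 1 < best)
                best = curr[j - 1] + 1;                         /* insert */
            curr[j] = best;
        }
        memcpy(prev, curr, sizeof(int) * (n + 1));
    }
    return prev[n];
}

/* One suggestion per RTE: its unique closest column.  A tie within the
 * RTE means both candidates are probably wrong, so return nothing. */
static const char *
rte_best_match(const char *target, const char *const *cols, int ncols)
{
    const char *best = NULL;
    int best_d = INT_MAX;
    bool tied = false;

    for (int i = 0; i < ncols; i++)
    {
        int d = levenshtein(target, cols[i]);

        if (d < best_d)
        {
            best_d = d;
            best = cols[i];
            tied = false;
        }
        else if (d == best_d)
            tied = true;
    }
    return tied ? NULL : best;
}
```

Across RTEs, the per-RTE winners can then still be shown side by side,
which covers the foreign-key case mentioned above.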
I think we're splitting hairs at this point, and frankly I'll do it
that way if it gets the patch closer to being committed. While I
thought it was important to get the unqualified and misspelled case
right (which I did in the first revision, but perhaps at the expense
of Tom's concern about absolute suggestion quality), I don't feel
strongly about this detail either way.
--
Peter Geoghegan
Peter Geoghegan wrote:
On Wed, Jul 9, 2014 at 8:08 AM, Greg Stark <stark@mit.edu> wrote:
A simple rule is easier for users to understand as well as to code. I
would humbly suggest the following: take all the unqualified column
names, downcase them, check which ones match most closely the
unmatched column. Show the top 3 matches if they're within some
arbitrary distance.
That's harder than it sounds. You need even more translatable strings
for variant ereports().
Maybe it is possible to rephrase the message so that the translatable
part doesn't need to concern with how many suggestions there are. For
instance something like "perhaps you meant a name from the following
list: foo, bar, baz". Coupled with the errmsg_plural stuff, you then
don't need to worry too much about providing different strings for 1, 2,
N suggestions.
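The payoff of that phrasing is that the translatable part stays constant no
matter how many names are listed. A rough standalone sketch, using plain
snprintf in place of errhint()/errmsg_plural() and with hypothetical
wording (none of this is from the actual patch):

```c
#include <stdio.h>

/* Build a hint listing any number of suggested names, so the
 * translatable text only needs a singular/plural pair rather than a
 * variant per suggestion count.  Assumes buf is large enough that no
 * truncation occurs. */
static void
format_hint(char *buf, size_t buflen, const char *const *names, int nnames)
{
    size_t off;

    off = (size_t) snprintf(buf, buflen, "%s",
                            nnames == 1
                            ? "Perhaps you meant the column "
                            : "Perhaps you meant one of the columns ");
    for (int i = 0; i < nnames; i++)
        off += (size_t) snprintf(buf + off, buflen - off, "%s\"%s\"",
                                 i > 0 ? ", " : "", names[i]);
    snprintf(buf + off, buflen - off, ".");
}
```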
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jul 9, 2014 at 2:19 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
That's harder than it sounds. You need even more translatable strings
for variant ereports().
Maybe it is possible to rephrase the message so that the translatable
part doesn't need to concern with how many suggestions there are. For
instance something like "perhaps you meant a name from the following
list: foo, bar, baz". Coupled with the errmsg_plural stuff, you then
don't need to worry too much about providing different strings for 1, 2,
N suggestions.
That's not really the problem. I already have a lot of things to test
in each of the two ereport() calls. More importantly, showing the
closest, say, 3 matches under an arbitrary distance does not address
the concern that several equally close matches indicate they're all
bad. It's not like
bash tab completion - if there is one best match, that's probably
because that's what the user meant. Whereas if there are two or more
within a single RTE, that's probably because both are unhelpful. They
both happened to require the same number of substitutions to get to,
while not being quite bad enough matches to be excluded by the final
check against a normalized distance threshold (the final check that
prevents ludicrous suggestions).
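A final check of that shape might look like the following sketch; the
one-half factor is an arbitrary stand-in, since the thread does not spell
out the patch's actual normalization.

```c
#include <stdbool.h>

/* Reject a candidate whose edit distance is large relative to the
 * length of the longer name, so wildly different names are never
 * suggested.  The 1/2 factor is an assumed placeholder, not the
 * patch's real threshold. */
static bool
plausible_suggestion(int distance, int target_len, int cand_len)
{
    int longer = target_len > cand_len ? target_len : cand_len;

    return 2 * distance <= longer;
}
```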
The fact that there were multiple equally plausible candidates (that
are not identically named and just from different RTEs) tells us
plenty, unlike with tab completion. It's not hard for one column to be
a better match than another, and so it doesn't seem unreasonable to
insist upon that within a single RTE where they cannot be identical,
since a conservative approach seems to be what is generally favored.
In any case I'm just trying to weigh everyone's concerns here. I hope
it's actually possible to compromise, but right now I don't know what
I can do to make useful progress.
--
Peter Geoghegan
On Tue, Jul 8, 2014 at 6:58 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
6) Sometimes no hints are returned... Even in simple cases like this one:
=# create table foo (aa int, bb int);
CREATE TABLE
=# select ab from foo;
ERROR: 42703: column "ab" does not exist
LINE 1: select ab from foo;
^
LOCATION: errorMissingColumn, parse_relation.c:3123
In this example, it seems obvious that both "aa" and "bb" should be
suggested when they are not. But what if there were far more columns,
as might be expected in realistic cases (suppose all other columns
have at least 3 characters)? That's another kettle of fish. The
assumption that it's probably one of those two equally distant columns
is now on very shaky ground. After all, the user can only have meant
one particular column. If we apply a limited kind of Turing test to
this second case, how does the most recent revision's algorithm do?
What would a human suggest? I'm pretty sure the answer is that the
human would shrug. Maybe he or she would say "I guess you might have
meant one of either aa or bb, but that really isn't obvious at all".
That doesn't inspire much confidence.
Now, maybe I should be more optimistic about it being one of the two
because there are only two possibilities to begin with. That seems
pretty dubious, though. In general I find it much more plausible based
on what we know that the user should rethink everything. And, as Tom
pointed out, showing nothing conveys something in itself once users
have been trained to expect something.
--
Peter Geoghegan
On Wed, Jul 9, 2014 at 3:56 PM, Peter Geoghegan <pg@heroku.com> wrote:
What would that reshuffling actually look like?
Something like the patch 1 attached...
Btw, re-reading this thread, everybody seems to agree that this is a useful
feature, but we still do not have clear definitions of the circumstances
under which column hints should be produced, except the number (up to two).
So, getting my hands on it and biting the bullet, I have come up with the
two attached patches, making the implementation clearer:
- Patch 1 moves levenshtein functions from fuzzystrmatch to core.
- Patch 2 implements the column hints, largely unchanged from the original
proposal.
Patch 1 does a couple of things:
- fuzzystrmatch is bumped to 1.1, as the Levenshtein functions are no longer
part of it; they are moved to core.
- Removal of the LESS_EQUAL flag that made the original submission patch
harder to understand. All the Levenshtein functions wrap a single common
function.
- Documentation is moved, and regression tests for Levenshtein functions
are added.
- Functions taking cost arguments are renamed with a _with_costs suffix.
After hacking on this feature, I came to the conclusion that it would be
better for the user experience to move all the Levenshtein functions
directly into backend code, instead of moving in only the common wrapper as
Peter did in his original patches. This avoids keeping portions of the same
feature in two different places in the code (the common routine in the
backend, the levenshtein functions in fuzzystrmatch) and concentrates all
the logic in a single place. Now, we may as well consider renaming the
levenshtein functions to smarter names, like str_distance, and keeping
fuzzystrmatch at 1.0, with the levenshtein_* functions simply calling the
str_distance functions.
Having a set of in-core distance functions for strings would serve more
general purposes like other object hinting (constraint names, tables, etc.).
Patch 2 is a rebase of Peter's feature that can be applied on top of
patch 1. The code is largely untouched (I haven't played much with Peter's
changes), well-commented, but I think that this needs more work,
particularly when a query has a single RTE, like in this case where no hints
are proposed to the user (mentioned upthread):
create table foo (aa int, bb int);
select ab from foo; -- no hints
Before doing anything more with patch 2, we still need to define clearly
how hints should be produced, so that's clearly out-of-scope for this CF.
Patch 1, though, prepares the ground for hints of all kinds, so perhaps we
could argue more on that first?
Regards,
--
Michael
Attachments:
0001-Move-Levenshtein-functions-to-core.patch (text/x-diff; charset=US-ASCII)
From 9a70369cf792f23a556944ec40bbf61f26f514a6 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Thu, 17 Jul 2014 21:54:24 +0900
Subject: [PATCH 1/2] Move Levenshtein functions to core
All the functions, part of fuzzystrmatch, able to evaluate distances
between strings are moved into core:
- levenshtein
- levenshtein_less_equal
In order to unify the names of the functions in catalogs, the functions
taking cost arguments are given a *_with_costs suffix.
Documentation, as well as regression tests, are added. fuzzystrmatch is
bumped to 1.1 on the same occasion.
---
contrib/fuzzystrmatch/Makefile | 6 +-
contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql | 9 +
contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql | 44 --
contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql | 28 ++
contrib/fuzzystrmatch/fuzzystrmatch.c | 69 ---
contrib/fuzzystrmatch/fuzzystrmatch.control | 2 +-
contrib/fuzzystrmatch/levenshtein.c | 403 ----------------
doc/src/sgml/func.sgml | 178 +++++--
doc/src/sgml/fuzzystrmatch.sgml | 66 ---
src/backend/utils/adt/Makefile | 4 +-
src/backend/utils/adt/levenshtein.c | 543 ++++++++++++++++++++++
src/include/catalog/pg_proc.h | 10 +
src/include/utils/builtins.h | 6 +
src/include/utils/levenshtein.h | 28 ++
src/test/regress/expected/levenshtein.out | 27 ++
src/test/regress/parallel_schedule | 2 +-
src/test/regress/serial_schedule | 1 +
src/test/regress/sql/levenshtein.sql | 8 +
18 files changed, 796 insertions(+), 638 deletions(-)
create mode 100644 contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql
delete mode 100644 contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql
create mode 100644 contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
delete mode 100644 contrib/fuzzystrmatch/levenshtein.c
create mode 100644 src/backend/utils/adt/levenshtein.c
create mode 100644 src/include/utils/levenshtein.h
create mode 100644 src/test/regress/expected/levenshtein.out
create mode 100644 src/test/regress/sql/levenshtein.sql
diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 024265d..3d3c773 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -4,7 +4,8 @@ MODULE_big = fuzzystrmatch
OBJS = fuzzystrmatch.o dmetaphone.o $(WIN32RES)
EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.0.sql fuzzystrmatch--unpackaged--1.0.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--unpackaged--1.0.sql \
+ fuzzystrmatch--1.0--1.1.sql
PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
ifdef USE_PGXS
@@ -17,6 +18,3 @@ top_builddir = ../..
include $(top_builddir)/src/Makefile.global
include $(top_srcdir)/contrib/contrib-global.mk
endif
-
-# levenshtein.c is #included by fuzzystrmatch.c
-fuzzystrmatch.o: fuzzystrmatch.c levenshtein.c
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql
new file mode 100644
index 0000000..0fca2a6
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql
@@ -0,0 +1,9 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION fuzzystrmatch UPDATE TO '1.1'" to load this file. \quit
+
+DROP FUNCTION levenshtein (text,text);
+DROP FUNCTION levenshtein (text,text,int,int,int);
+DROP FUNCTION levenshtein_less_equal (text,text,int);
+DROP FUNCTION levenshtein_less_equal (text,text,int,int,int,int);
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql
deleted file mode 100644
index 1cf9b61..0000000
--- a/contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql
+++ /dev/null
@@ -1,44 +0,0 @@
-/* contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql */
-
--- complain if script is sourced in psql, rather than via CREATE EXTENSION
-\echo Use "CREATE EXTENSION fuzzystrmatch" to load this file. \quit
-
-CREATE FUNCTION levenshtein (text,text) RETURNS int
-AS 'MODULE_PATHNAME','levenshtein'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION levenshtein (text,text,int,int,int) RETURNS int
-AS 'MODULE_PATHNAME','levenshtein_with_costs'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION levenshtein_less_equal (text,text,int) RETURNS int
-AS 'MODULE_PATHNAME','levenshtein_less_equal'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION levenshtein_less_equal (text,text,int,int,int,int) RETURNS int
-AS 'MODULE_PATHNAME','levenshtein_less_equal_with_costs'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION metaphone (text,int) RETURNS text
-AS 'MODULE_PATHNAME','metaphone'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION soundex(text) RETURNS text
-AS 'MODULE_PATHNAME', 'soundex'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION text_soundex(text) RETURNS text
-AS 'MODULE_PATHNAME', 'soundex'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION difference(text,text) RETURNS int
-AS 'MODULE_PATHNAME', 'difference'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION dmetaphone (text) RETURNS text
-AS 'MODULE_PATHNAME', 'dmetaphone'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION dmetaphone_alt (text) RETURNS text
-AS 'MODULE_PATHNAME', 'dmetaphone_alt'
-LANGUAGE C IMMUTABLE STRICT;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
new file mode 100644
index 0000000..a4861ee
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION fuzzystrmatch" to load this file. \quit
+
+CREATE FUNCTION metaphone (text,int) RETURNS text
+AS 'MODULE_PATHNAME','metaphone'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION soundex(text) RETURNS text
+AS 'MODULE_PATHNAME', 'soundex'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION text_soundex(text) RETURNS text
+AS 'MODULE_PATHNAME', 'soundex'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION difference(text,text) RETURNS int
+AS 'MODULE_PATHNAME', 'difference'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION dmetaphone (text) RETURNS text
+AS 'MODULE_PATHNAME', 'dmetaphone'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION dmetaphone_alt (text) RETURNS text
+AS 'MODULE_PATHNAME', 'dmetaphone_alt'
+LANGUAGE C IMMUTABLE STRICT;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c
index 7a53d8a..9923c17 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.c
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.c
@@ -40,7 +40,6 @@
#include <ctype.h>
-#include "mb/pg_wchar.h"
#include "utils/builtins.h"
PG_MODULE_MAGIC;
@@ -154,74 +153,6 @@ getcode(char c)
/* These prevent GH from becoming F */
#define NOGHTOF(c) (getcode(c) & 16) /* BDH */
-/* Faster than memcmp(), for this use case. */
-static inline bool
-rest_of_char_same(const char *s1, const char *s2, int len)
-{
- while (len > 0)
- {
- len--;
- if (s1[len] != s2[len])
- return false;
- }
- return true;
-}
-
-#include "levenshtein.c"
-#define LEVENSHTEIN_LESS_EQUAL
-#include "levenshtein.c"
-
-PG_FUNCTION_INFO_V1(levenshtein_with_costs);
-Datum
-levenshtein_with_costs(PG_FUNCTION_ARGS)
-{
- text *src = PG_GETARG_TEXT_PP(0);
- text *dst = PG_GETARG_TEXT_PP(1);
- int ins_c = PG_GETARG_INT32(2);
- int del_c = PG_GETARG_INT32(3);
- int sub_c = PG_GETARG_INT32(4);
-
- PG_RETURN_INT32(levenshtein_internal(src, dst, ins_c, del_c, sub_c));
-}
-
-
-PG_FUNCTION_INFO_V1(levenshtein);
-Datum
-levenshtein(PG_FUNCTION_ARGS)
-{
- text *src = PG_GETARG_TEXT_PP(0);
- text *dst = PG_GETARG_TEXT_PP(1);
-
- PG_RETURN_INT32(levenshtein_internal(src, dst, 1, 1, 1));
-}
-
-
-PG_FUNCTION_INFO_V1(levenshtein_less_equal_with_costs);
-Datum
-levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS)
-{
- text *src = PG_GETARG_TEXT_PP(0);
- text *dst = PG_GETARG_TEXT_PP(1);
- int ins_c = PG_GETARG_INT32(2);
- int del_c = PG_GETARG_INT32(3);
- int sub_c = PG_GETARG_INT32(4);
- int max_d = PG_GETARG_INT32(5);
-
- PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, ins_c, del_c, sub_c, max_d));
-}
-
-
-PG_FUNCTION_INFO_V1(levenshtein_less_equal);
-Datum
-levenshtein_less_equal(PG_FUNCTION_ARGS)
-{
- text *src = PG_GETARG_TEXT_PP(0);
- text *dst = PG_GETARG_TEXT_PP(1);
- int max_d = PG_GETARG_INT32(2);
-
- PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, 1, 1, 1, max_d));
-}
-
/*
* Calculates the metaphone of an input string.
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.control b/contrib/fuzzystrmatch/fuzzystrmatch.control
index e257f09..6b2832a 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.control
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.control
@@ -1,5 +1,5 @@
# fuzzystrmatch extension
comment = 'determine similarities and distance between strings'
-default_version = '1.0'
+default_version = '1.1'
module_pathname = '$libdir/fuzzystrmatch'
relocatable = true
diff --git a/contrib/fuzzystrmatch/levenshtein.c b/contrib/fuzzystrmatch/levenshtein.c
deleted file mode 100644
index 4f37a54..0000000
--- a/contrib/fuzzystrmatch/levenshtein.c
+++ /dev/null
@@ -1,403 +0,0 @@
-/*
- * levenshtein.c
- *
- * Functions for "fuzzy" comparison of strings
- *
- * Joe Conway <mail@joeconway.com>
- *
- * Copyright (c) 2001-2014, PostgreSQL Global Development Group
- * ALL RIGHTS RESERVED;
- *
- * levenshtein()
- * -------------
- * Written based on a description of the algorithm by Michael Gilleland
- * found at http://www.merriampark.com/ld.htm
- * Also looked at levenshtein.c in the PHP 4.0.6 distribution for
- * inspiration.
- * Configurable penalty costs extension is introduced by Volkan
- * YAZICI <volkan.yazici@gmail.com>.
- */
-
-/*
- * External declarations for exported functions
- */
-#ifdef LEVENSHTEIN_LESS_EQUAL
-static int levenshtein_less_equal_internal(text *s, text *t,
- int ins_c, int del_c, int sub_c, int max_d);
-#else
-static int levenshtein_internal(text *s, text *t,
- int ins_c, int del_c, int sub_c);
-#endif
-
-#define MAX_LEVENSHTEIN_STRLEN 255
-
-
-/*
- * Calculates Levenshtein distance metric between supplied strings. Generally
- * (1, 1, 1) penalty costs suffices for common cases, but your mileage may
- * vary.
- *
- * One way to compute Levenshtein distance is to incrementally construct
- * an (m+1)x(n+1) matrix where cell (i, j) represents the minimum number
- * of operations required to transform the first i characters of s into
- * the first j characters of t. The last column of the final row is the
- * answer.
- *
- * We use that algorithm here with some modification. In lieu of holding
- * the entire array in memory at once, we'll just use two arrays of size
- * m+1 for storing accumulated values. At each step one array represents
- * the "previous" row and one is the "current" row of the notional large
- * array.
- *
- * If max_d >= 0, we only need to provide an accurate answer when that answer
- * is less than or equal to the bound. From any cell in the matrix, there is
- * theoretical "minimum residual distance" from that cell to the last column
- * of the final row. This minimum residual distance is zero when the
- * untransformed portions of the strings are of equal length (because we might
- * get lucky and find all the remaining characters matching) and is otherwise
- * based on the minimum number of insertions or deletions needed to make them
- * equal length. The residual distance grows as we move toward the upper
- * right or lower left corners of the matrix. When the max_d bound is
- * usefully tight, we can use this property to avoid computing the entirety
- * of each row; instead, we maintain a start_column and stop_column that
- * identify the portion of the matrix close to the diagonal which can still
- * affect the final answer.
- */
-static int
-#ifdef LEVENSHTEIN_LESS_EQUAL
-levenshtein_less_equal_internal(text *s, text *t,
- int ins_c, int del_c, int sub_c, int max_d)
-#else
-levenshtein_internal(text *s, text *t,
- int ins_c, int del_c, int sub_c)
-#endif
-{
- int m,
- n,
- s_bytes,
- t_bytes;
- int *prev;
- int *curr;
- int *s_char_len = NULL;
- int i,
- j;
- const char *s_data;
- const char *t_data;
- const char *y;
-
- /*
- * For levenshtein_less_equal_internal, we have real variables called
- * start_column and stop_column; otherwise it's just short-hand for 0 and
- * m.
- */
-#ifdef LEVENSHTEIN_LESS_EQUAL
- int start_column,
- stop_column;
-
-#undef START_COLUMN
-#undef STOP_COLUMN
-#define START_COLUMN start_column
-#define STOP_COLUMN stop_column
-#else
-#undef START_COLUMN
-#undef STOP_COLUMN
-#define START_COLUMN 0
-#define STOP_COLUMN m
-#endif
-
- /* Extract a pointer to the actual character data. */
- s_data = VARDATA_ANY(s);
- t_data = VARDATA_ANY(t);
-
- /* Determine length of each string in bytes and characters. */
- s_bytes = VARSIZE_ANY_EXHDR(s);
- t_bytes = VARSIZE_ANY_EXHDR(t);
- m = pg_mbstrlen_with_len(s_data, s_bytes);
- n = pg_mbstrlen_with_len(t_data, t_bytes);
-
- /*
- * We can transform an empty s into t with n insertions, or a non-empty t
- * into an empty s with m deletions.
- */
- if (!m)
- return n * ins_c;
- if (!n)
- return m * del_c;
-
- /*
- * For security concerns, restrict excessive CPU+RAM usage. (This
- * implementation uses O(m) memory and has O(mn) complexity.)
- */
- if (m > MAX_LEVENSHTEIN_STRLEN ||
- n > MAX_LEVENSHTEIN_STRLEN)
- ereport(ERROR,
- (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("argument exceeds the maximum length of %d bytes",
- MAX_LEVENSHTEIN_STRLEN)));
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
- /* Initialize start and stop columns. */
- start_column = 0;
- stop_column = m + 1;
-
- /*
- * If max_d >= 0, determine whether the bound is impossibly tight. If so,
- * return max_d + 1 immediately. Otherwise, determine whether it's tight
- * enough to limit the computation we must perform. If so, figure out
- * initial stop column.
- */
- if (max_d >= 0)
- {
- int min_theo_d; /* Theoretical minimum distance. */
- int max_theo_d; /* Theoretical maximum distance. */
- int net_inserts = n - m;
-
- min_theo_d = net_inserts < 0 ?
- -net_inserts * del_c : net_inserts * ins_c;
- if (min_theo_d > max_d)
- return max_d + 1;
- if (ins_c + del_c < sub_c)
- sub_c = ins_c + del_c;
- max_theo_d = min_theo_d + sub_c * Min(m, n);
- if (max_d >= max_theo_d)
- max_d = -1;
- else if (ins_c + del_c > 0)
- {
- /*
- * Figure out how much of the first row of the notional matrix we
- * need to fill in. If the string is growing, the theoretical
- * minimum distance already incorporates the cost of deleting the
- * number of characters necessary to make the two strings equal in
- * length. Each additional deletion forces another insertion, so
- * the best-case total cost increases by ins_c + del_c. If the
- * string is shrinking, the minimum theoretical cost assumes no
- * excess deletions; that is, we're starting no further right than
- * column n - m. If we do start further right, the best-case
- * total cost increases by ins_c + del_c for each move right.
- */
- int slack_d = max_d - min_theo_d;
- int best_column = net_inserts < 0 ? -net_inserts : 0;
-
- stop_column = best_column + (slack_d / (ins_c + del_c)) + 1;
- if (stop_column > m)
- stop_column = m + 1;
- }
- }
-#endif
-
- /*
- * In order to avoid calling pg_mblen() repeatedly on each character in s,
- * we cache all the lengths before starting the main loop -- but if all
- * the characters in both strings are single byte, then we skip this and
- * use a fast-path in the main loop. If only one string contains
- * multi-byte characters, we still build the array, so that the fast-path
- * needn't deal with the case where the array hasn't been initialized.
- */
- if (m != s_bytes || n != t_bytes)
- {
- int i;
- const char *cp = s_data;
-
- s_char_len = (int *) palloc((m + 1) * sizeof(int));
- for (i = 0; i < m; ++i)
- {
- s_char_len[i] = pg_mblen(cp);
- cp += s_char_len[i];
- }
- s_char_len[i] = 0;
- }
-
- /* One more cell for initialization column and row. */
- ++m;
- ++n;
-
- /* Previous and current rows of notional array. */
- prev = (int *) palloc(2 * m * sizeof(int));
- curr = prev + m;
-
- /*
- * To transform the first i characters of s into the first 0 characters of
- * t, we must perform i deletions.
- */
- for (i = START_COLUMN; i < STOP_COLUMN; i++)
- prev[i] = i * del_c;
-
- /* Loop through rows of the notional array */
- for (y = t_data, j = 1; j < n; j++)
- {
- int *temp;
- const char *x = s_data;
- int y_char_len = n != t_bytes + 1 ? pg_mblen(y) : 1;
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
-
- /*
- * In the best case, values percolate down the diagonal unchanged, so
- * we must increment stop_column unless it's already on the right end
- * of the array. The inner loop will read prev[stop_column], so we
- * have to initialize it even though it shouldn't affect the result.
- */
- if (stop_column < m)
- {
- prev[stop_column] = max_d + 1;
- ++stop_column;
- }
-
- /*
- * The main loop fills in curr, but curr[0] needs a special case: to
- * transform the first 0 characters of s into the first j characters
- * of t, we must perform j insertions. However, if start_column > 0,
- * this special case does not apply.
- */
- if (start_column == 0)
- {
- curr[0] = j * ins_c;
- i = 1;
- }
- else
- i = start_column;
-#else
- curr[0] = j * ins_c;
- i = 1;
-#endif
-
- /*
- * This inner loop is critical to performance, so we include a
- * fast-path to handle the (fairly common) case where no multibyte
- * characters are in the mix. The fast-path is entitled to assume
- * that if s_char_len is not initialized then BOTH strings contain
- * only single-byte characters.
- */
- if (s_char_len != NULL)
- {
- for (; i < STOP_COLUMN; i++)
- {
- int ins;
- int del;
- int sub;
- int x_char_len = s_char_len[i - 1];
-
- /*
- * Calculate costs for insertion, deletion, and substitution.
- *
- * When calculating cost for substitution, we compare the last
- * character of each possibly-multibyte character first,
- * because that's enough to rule out most mis-matches. If we
- * get past that test, then we compare the lengths and the
- * remaining bytes.
- */
- ins = prev[i] + ins_c;
- del = curr[i - 1] + del_c;
- if (x[x_char_len - 1] == y[y_char_len - 1]
- && x_char_len == y_char_len &&
- (x_char_len == 1 || rest_of_char_same(x, y, x_char_len)))
- sub = prev[i - 1];
- else
- sub = prev[i - 1] + sub_c;
-
- /* Take the one with minimum cost. */
- curr[i] = Min(ins, del);
- curr[i] = Min(curr[i], sub);
-
- /* Point to next character. */
- x += x_char_len;
- }
- }
- else
- {
- for (; i < STOP_COLUMN; i++)
- {
- int ins;
- int del;
- int sub;
-
- /* Calculate costs for insertion, deletion, and substitution. */
- ins = prev[i] + ins_c;
- del = curr[i - 1] + del_c;
- sub = prev[i - 1] + ((*x == *y) ? 0 : sub_c);
-
- /* Take the one with minimum cost. */
- curr[i] = Min(ins, del);
- curr[i] = Min(curr[i], sub);
-
- /* Point to next character. */
- x++;
- }
- }
-
- /* Swap current row with previous row. */
- temp = curr;
- curr = prev;
- prev = temp;
-
- /* Point to next character. */
- y += y_char_len;
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
-
- /*
- * This chunk of code represents a significant performance hit if used
- * in the case where there is no max_d bound. This is probably not
- * because the max_d >= 0 test itself is expensive, but rather because
- * the possibility of needing to execute this code prevents tight
- * optimization of the loop as a whole.
- */
- if (max_d >= 0)
- {
- /*
- * The "zero point" is the column of the current row where the
- * remaining portions of the strings are of equal length. There
- * are (n - 1) characters in the target string, of which j have
- * been transformed. There are (m - 1) characters in the source
- * string, so we want to find the value for zp where (n - 1) - j =
- * (m - 1) - zp.
- */
- int zp = j - (n - m);
-
- /* Check whether the stop column can slide left. */
- while (stop_column > 0)
- {
- int ii = stop_column - 1;
- int net_inserts = ii - zp;
-
- if (prev[ii] + (net_inserts > 0 ? net_inserts * ins_c :
- -net_inserts * del_c) <= max_d)
- break;
- stop_column--;
- }
-
- /* Check whether the start column can slide right. */
- while (start_column < stop_column)
- {
- int net_inserts = start_column - zp;
-
- if (prev[start_column] +
- (net_inserts > 0 ? net_inserts * ins_c :
- -net_inserts * del_c) <= max_d)
- break;
-
- /*
- * We'll never again update these values, so we must make sure
- * there's nothing here that could confuse any future
- * iteration of the outer loop.
- */
- prev[start_column] = max_d + 1;
- curr[start_column] = max_d + 1;
- if (start_column != 0)
- s_data += (s_char_len != NULL) ? s_char_len[start_column - 1] : 1;
- start_column++;
- }
-
- /* If they cross, we're going to exceed the bound. */
- if (start_column >= stop_column)
- return max_d + 1;
- }
-#endif
- }
-
- /*
- * Because the final value was swapped from the previous row to the
- * current row, that's where we'll find it.
- */
- return prev[m - 1];
-}
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index bf13140..979f87f 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -4029,20 +4029,20 @@ regexp_replace('foobarbaz', 'b(..)', E'X\\1Y', 'g')
Some examples:
<programlisting>
SELECT regexp_matches('foobarbequebaz', '(bar)(beque)');
- regexp_matches
+ regexp_matches
----------------
{bar,beque}
(1 row)
SELECT regexp_matches('foobarbequebazilbarfbonk', '(b[^b]+)(b[^b]+)', 'g');
- regexp_matches
+ regexp_matches
----------------
{bar,beque}
{bazil,barf}
(2 rows)
SELECT regexp_matches('foobarbequebaz', 'barbeque');
- regexp_matches
+ regexp_matches
----------------
{barbeque}
(1 row)
@@ -4089,44 +4089,44 @@ SELECT col1, (SELECT regexp_matches(col2, '(bar)(beque)')) FROM tab;
<programlisting>
SELECT foo FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', E'\\s+') AS foo;
- foo
+ foo
-------
- the
- quick
- brown
- fox
- jumps
- over
- the
- lazy
- dog
+ the
+ quick
+ brown
+ fox
+ jumps
+ over
+ the
+ lazy
+ dog
(9 rows)
SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', E'\\s+');
- regexp_split_to_array
+ regexp_split_to_array
-----------------------------------------------
{the,quick,brown,fox,jumps,over,the,lazy,dog}
(1 row)
SELECT foo FROM regexp_split_to_table('the quick brown fox', E'\\s*') AS foo;
- foo
+ foo
-----
- t
- h
- e
- q
- u
- i
- c
- k
- b
- r
- o
- w
- n
- f
- o
- x
+ t
+ h
+ e
+ q
+ u
+ i
+ c
+ k
+ b
+ r
+ o
+ w
+ n
+ f
+ o
+ x
(16 rows)
</programlisting>
</para>
@@ -5796,7 +5796,7 @@ SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');
Casting does not have this behavior.
</para>
</listitem>
-
+
<listitem>
<para>
Ordinary text is allowed in <function>to_char</function>
@@ -7893,6 +7893,88 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
</para>
</sect1>
+ <sect1 id="functions-levenshtein">
+  <title>Levenshtein Functions</title>
+
+  <para>
+   These functions calculate the Levenshtein distance between two strings,
+   that is, the minimum number of single-character edits required to
+   transform one string into the other.
+  </para>
+
+ <table id="functions-levenshtein-table">
+ <title>Levenshtein Functions</title>
+ <tgroup cols="4">
+ <thead>
+ <row>
+ <entry>Function</entry>
+ <entry>Description</entry>
+ <entry>Example</entry>
+ <entry>Example Result</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>
+ <indexterm>
+ <primary>levenshtein</primary>
+ </indexterm>
+ <literal>levenshtein(text source, text target)</literal>
+ </entry>
+ <entry>Returns the distance between the two given strings</entry>
+ <entry><literal>levenshtein('GUMBO', 'GAMBOL')</literal></entry>
+ <entry><literal>2</literal></entry>
+ </row>
+ <row>
+ <entry>
+ <indexterm>
+ <primary>levenshtein_with_costs</primary>
+ </indexterm>
+ <literal>levenshtein_with_costs(text source, text target, int ins_cost, int del_cost, int sub_cost)</literal>
+ </entry>
+       <entry>Returns the distance between the two given strings, using the specified costs</entry>
+       <entry><literal>levenshtein_with_costs('GUMBO', 'GAMBOL', 2, 1, 1)</literal></entry>
+ <entry><literal>3</literal></entry>
+ </row>
+ <row>
+ <entry>
+ <indexterm>
+ <primary>levenshtein_less_equal</primary>
+ </indexterm>
+ <literal>levenshtein_less_equal(text source, text target, int max_d)</literal>
+ </entry>
+       <entry>Returns the distance between the two given strings, if it does not exceed <literal>max_d</literal>; otherwise some larger value</entry>
+ <entry><literal>levenshtein_less_equal('extensive', 'exhaustive', 2)</literal></entry>
+ <entry><literal>3</literal></entry>
+ </row>
+ <row>
+ <entry>
+ <indexterm>
+ <primary>levenshtein_less_equal_with_costs</primary>
+ </indexterm>
+ <literal>levenshtein_less_equal_with_costs(text source, text target, int ins_cost, int del_cost, int sub_cost, int max_d)</literal>
+ </entry>
+       <entry>Same as <literal>levenshtein_less_equal</literal>, but using the specified costs</entry>
+ <entry><literal>levenshtein_less_equal_with_costs('extensive', 'exhaustive', 1, 1, 1, 4)</literal></entry>
+ <entry><literal>4</literal></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ Both <literal>source</literal> and <literal>target</literal> can be any
+ non-null string, with a maximum of 255 bytes. The cost parameters
+ specify how much to charge for a character insertion, deletion, or
+ substitution, respectively. You can omit the cost parameters, as in
+ the second version of the function; in that case they all default to 1.
+   <literal>levenshtein_less_equal</literal> is an accelerated version of
+   the levenshtein function for use when only small distances are of
+   interest.  If the actual distance is less than or equal to
+   <literal>max_d</literal>, then <literal>levenshtein_less_equal</literal>
+   returns the correct distance; otherwise it returns some value greater
+   than <literal>max_d</literal>.
+ </para>
+ </sect1>
+
<sect1 id="functions-geometry">
<title>Geometric Functions and Operators</title>
@@ -9686,32 +9768,32 @@ SELECT xmlexists('//town[text() = ''Toronto'']' PASSING BY REF '<towns><town>Tor
<screen><![CDATA[
SET xmloption TO DOCUMENT;
SELECT xml_is_well_formed('<>');
- xml_is_well_formed
+ xml_is_well_formed
--------------------
f
(1 row)
SELECT xml_is_well_formed('<abc/>');
- xml_is_well_formed
+ xml_is_well_formed
--------------------
t
(1 row)
SET xmloption TO CONTENT;
SELECT xml_is_well_formed('abc');
- xml_is_well_formed
+ xml_is_well_formed
--------------------
t
(1 row)
SELECT xml_is_well_formed_document('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</pg:foo>');
- xml_is_well_formed_document
+ xml_is_well_formed_document
-----------------------------
t
(1 row)
SELECT xml_is_well_formed_document('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</my:foo>');
- xml_is_well_formed_document
+ xml_is_well_formed_document
-----------------------------
f
(1 row)
@@ -9774,7 +9856,7 @@ SELECT xml_is_well_formed_document('<pg:foo xmlns:pg="http://postgresql.org/stuf
SELECT xpath('/my:a/text()', '<my:a xmlns:my="http://example.com">test</my:a>',
ARRAY[ARRAY['my', 'http://example.com']]);
- xpath
+ xpath
--------
{test}
(1 row)
@@ -9817,7 +9899,7 @@ SELECT xpath('//mydefns:b/text()', '<a xmlns="http://example.com"><b>test</b></a
SELECT xpath_exists('/my:a/text()', '<my:a xmlns:my="http://example.com">test</my:a>',
ARRAY[ARRAY['my', 'http://example.com']]);
- xpath_exists
+ xpath_exists
--------------
t
(1 row)
@@ -14125,7 +14207,7 @@ SELECT current_date + s.a AS dates FROM generate_series(0,14,7) AS s(a);
SELECT * FROM generate_series('2008-03-01 00:00'::timestamp,
'2008-03-04 12:00', '10 hours');
- generate_series
+ generate_series
---------------------
2008-03-01 00:00:00
2008-03-01 10:00:00
@@ -14188,7 +14270,7 @@ SELECT * FROM generate_series('2008-03-01 00:00'::timestamp,
<programlisting>
-- basic usage
SELECT generate_subscripts('{NULL,1,NULL,2}'::int[], 1) AS s;
- s
+ s
---
1
2
@@ -14199,7 +14281,7 @@ SELECT generate_subscripts('{NULL,1,NULL,2}'::int[], 1) AS s;
-- presenting an array, the subscript and the subscripted
-- value requires a subquery
SELECT * FROM arrays;
- a
+ a
--------------------
{-1,-2}
{100,200,300}
@@ -14225,7 +14307,7 @@ select $1[i][j]
$$ LANGUAGE sql IMMUTABLE;
CREATE FUNCTION
SELECT * FROM unnest2(ARRAY[[1,2],[3,4]]);
- unnest2
+ unnest2
---------
1
2
@@ -15619,13 +15701,13 @@ SELECT pg_type_is_visible('myschema.widget'::regtype);
<programlisting>
SELECT pg_typeof(33);
- pg_typeof
+ pg_typeof
-----------
integer
(1 row)
SELECT typlen FROM pg_type WHERE oid = pg_typeof(33);
- typlen
+ typlen
--------
4
(1 row)
@@ -15637,13 +15719,13 @@ SELECT typlen FROM pg_type WHERE oid = pg_typeof(33);
value that is passed to it. Example:
<programlisting>
SELECT collation for (description) FROM pg_description LIMIT 1;
- pg_collation_for
+ pg_collation_for
------------------
"default"
(1 row)
SELECT collation for ('foo' COLLATE "de_DE");
- pg_collation_for
+ pg_collation_for
------------------
"de_DE"
(1 row)
@@ -16313,7 +16395,7 @@ postgres=# select pg_start_backup('label_goes_here');
above functions. For example:
<programlisting>
postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
- file_name | file_offset
+ file_name | file_offset
--------------------------+-------------
00000001000000000000000D | 4039624
(1 row)
diff --git a/doc/src/sgml/fuzzystrmatch.sgml b/doc/src/sgml/fuzzystrmatch.sgml
index f26bd90..f95d5aa 100644
--- a/doc/src/sgml/fuzzystrmatch.sgml
+++ b/doc/src/sgml/fuzzystrmatch.sgml
@@ -83,72 +83,6 @@ SELECT * FROM s WHERE difference(s.nm, 'john') > 2;
</sect2>
<sect2>
- <title>Levenshtein</title>
-
- <para>
- This function calculates the Levenshtein distance between two strings:
- </para>
-
- <indexterm>
- <primary>levenshtein</primary>
- </indexterm>
-
- <indexterm>
- <primary>levenshtein_less_equal</primary>
- </indexterm>
-
-<synopsis>
-levenshtein(text source, text target, int ins_cost, int del_cost, int sub_cost) returns int
-levenshtein(text source, text target) returns int
-levenshtein_less_equal(text source, text target, int ins_cost, int del_cost, int sub_cost, int max_d) returns int
-levenshtein_less_equal(text source, text target, int max_d) returns int
-</synopsis>
-
- <para>
- Both <literal>source</literal> and <literal>target</literal> can be any
- non-null string, with a maximum of 255 bytes. The cost parameters
- specify how much to charge for a character insertion, deletion, or
- substitution, respectively. You can omit the cost parameters, as in
- the second version of the function; in that case they all default to 1.
- <literal>levenshtein_less_equal</literal> is accelerated version of
- levenshtein function for low values of distance. If actual distance
- is less or equal then max_d, then <literal>levenshtein_less_equal</literal>
- returns accurate value of it. Otherwise this function returns value
- which is greater than max_d.
- </para>
-
- <para>
- Examples:
- </para>
-
-<screen>
-test=# SELECT levenshtein('GUMBO', 'GAMBOL');
- levenshtein
--------------
- 2
-(1 row)
-
-test=# SELECT levenshtein('GUMBO', 'GAMBOL', 2,1,1);
- levenshtein
--------------
- 3
-(1 row)
-
-test=# SELECT levenshtein_less_equal('extensive', 'exhaustive',2);
- levenshtein_less_equal
-------------------------
- 3
-(1 row)
-
-test=# SELECT levenshtein_less_equal('extensive', 'exhaustive',4);
- levenshtein_less_equal
-------------------------
- 4
-(1 row)
-</screen>
- </sect2>
-
- <sect2>
<title>Metaphone</title>
<para>
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 7b4391b..7071afe 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,8 +22,8 @@ OBJS = acl.o arrayfuncs.o array_selfuncs.o array_typanalyze.o \
encode.o enum.o float.o format_type.o formatting.o genfile.o \
geo_ops.o geo_selfuncs.o inet_cidr_ntop.o inet_net_pton.o int.o \
int8.o json.o jsonb.o jsonb_gin.o jsonb_op.o jsonb_util.o \
- jsonfuncs.o like.o lockfuncs.o mac.o misc.o nabstime.o name.o \
- network.o network_gist.o network_selfuncs.o \
+ jsonfuncs.o levenshtein.o like.o lockfuncs.o mac.o misc.o nabstime.o \
+ name.o network.o network_gist.o network_selfuncs.o \
numeric.o numutils.o oid.o oracle_compat.o \
orderedsetaggs.o pg_lzcompress.o pg_locale.o pg_lsn.o \
pgstatfuncs.o pseudotypes.o quote.o rangetypes.o rangetypes_gist.o \
diff --git a/src/backend/utils/adt/levenshtein.c b/src/backend/utils/adt/levenshtein.c
new file mode 100644
index 0000000..d7c9c68
--- /dev/null
+++ b/src/backend/utils/adt/levenshtein.c
@@ -0,0 +1,543 @@
+/*-------------------------------------------------------------------------
+ *
+ * levenshtein.c
+ * Levenshtein distance implementation.
+ *
+ * Original author: Joe Conway <mail@joeconway.com>
+ *
+ * This file provides code for (1) Levenshtein distance with custom
+ * costings, and (2) Levenshtein distance with custom costings and a "max"
+ * value above which exact distances are not interesting.  Both variants
+ * are implemented by the common routine levenshtein_common(), with help
+ * from the inline function rest_of_char_same().
+ *
+ * Written based on a description of the algorithm by Michael Gilleland found
+ * at http://www.merriampark.com/ld.htm. Also looked at levenshtein.c in the
+ * PHP 4.0.6 distribution for inspiration. Configurable penalty costs
+ * extension is introduced by Volkan YAZICI <volkan.yazici@gmail.com>.
+ *
+ * Copyright (c) 2001-2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/adt/levenshtein.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "utils/levenshtein.h"
+
+#include "fmgr.h"
+#include "utils/builtins.h"
+
+#include "mb/pg_wchar.h"
+
+#define MAX_LEVENSHTEIN_STRLEN 255
+
+/*
+ * rest_of_char_same
+ *
+ * Helper function to compare the remaining bytes of two possibly-multibyte
+ * characters.  Faster than memcmp(), for this use case.
+ */
+static inline bool
+rest_of_char_same(const char *s1, const char *s2, int len)
+{
+ while (len > 0)
+ {
+ len--;
+ if (s1[len] != s2[len])
+ return false;
+ }
+ return true;
+}
+
+/*
+ * Calculates the Levenshtein distance metric between the supplied strings,
+ * which are not necessarily null-terminated.  Generally (1, 1, 1) penalty
+ * costs suffice for common cases, but your mileage may vary.
+ *
+ * One way to compute Levenshtein distance is to incrementally construct
+ * an (m+1)x(n+1) matrix where cell (i, j) represents the minimum number
+ * of operations required to transform the first i characters of s into
+ * the first j characters of t. The last column of the final row is the
+ * answer.
+ *
+ * We use that algorithm here with some modification. In lieu of holding
+ * the entire array in memory at once, we'll just use two arrays of size
+ * m+1 for storing accumulated values. At each step one array represents
+ * the "previous" row and one is the "current" row of the notional large
+ * array.
+ *
+ * If max_d >= 0, we only need to provide an accurate answer when that answer
+ * is less than or equal to the bound. From any cell in the matrix, there is
+ * theoretical "minimum residual distance" from that cell to the last column
+ * of the final row. This minimum residual distance is zero when the
+ * untransformed portions of the strings are of equal length (because we might
+ * get lucky and find all the remaining characters matching) and is otherwise
+ * based on the minimum number of insertions or deletions needed to make them
+ * equal length. The residual distance grows as we move toward the upper
+ * right or lower left corners of the matrix. When the max_d bound is
+ * usefully tight, we can use this property to avoid computing the entirety
+ * of each row; instead, we maintain a start_column and stop_column that
+ * identify the portion of the matrix close to the diagonal which can still
+ * affect the final answer.
+ */
+/*
+ * levenshtein_common
+ *
+ * Common routine for all Levenshtein functions.
+ */
+static int
+levenshtein_common(const char *source, int slen, const char *target,
+ int tlen, int ins_c, int del_c, int sub_c, int max_d)
+{
+ int m, n;
+ int *prev;
+ int *curr;
+ int *s_char_len = NULL;
+ int i,
+ j;
+ const char *y;
+ int max_init;
+ int start_column,
+ stop_column;
+	int		stop_column_local;
+
+ /* Save value of max_d */
+ max_init = max_d;
+
+ m = pg_mbstrlen_with_len(source, slen);
+ n = pg_mbstrlen_with_len(target, tlen);
+
+ /*
+ * We can transform an empty s into t with n insertions, or a non-empty t
+ * into an empty s with m deletions.
+ */
+ if (!m)
+ return n * ins_c;
+ if (!n)
+ return m * del_c;
+
+ /*
+ * A common use for Levenshtein distance is to match column names.
+	 * Therefore, insist that MAX_LEVENSHTEIN_STRLEN be at least NAMEDATALEN,
+	 * so that this is guaranteed to work.
+ */
+ StaticAssertStmt(NAMEDATALEN <= MAX_LEVENSHTEIN_STRLEN,
+ "Levenshtein hinting mechanism restricts NAMEDATALEN");
+
+ /*
+ * For security concerns, restrict excessive CPU+RAM usage. (This
+ * implementation uses O(m) memory and has O(mn) complexity.)
+ */
+ if (m > MAX_LEVENSHTEIN_STRLEN ||
+ n > MAX_LEVENSHTEIN_STRLEN)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("argument exceeds the maximum length of %d bytes",
+ MAX_LEVENSHTEIN_STRLEN)));
+
+ /*
+	/*
+	 * XXX: This is the first of the code blocks originally guarded by
+	 * LEVENSHTEIN_LESS_EQUAL
+	 */
+ if (max_init >= 0)
+ {
+ /* Initialize start and stop columns. */
+ start_column = 0;
+ stop_column = m + 1;
+
+ /*
+ * If max_d >= 0, determine whether the bound is impossibly tight. If so,
+ * return max_d + 1 immediately. Otherwise, determine whether it's tight
+ * enough to limit the computation we must perform. If so, figure out
+ * initial stop column.
+ */
+ if (max_d >= 0)
+ {
+ int min_theo_d; /* Theoretical minimum distance. */
+ int max_theo_d; /* Theoretical maximum distance. */
+ int net_inserts = n - m;
+
+ min_theo_d = net_inserts < 0 ?
+ -net_inserts * del_c : net_inserts * ins_c;
+ if (min_theo_d > max_d)
+ return max_d + 1;
+ if (ins_c + del_c < sub_c)
+ sub_c = ins_c + del_c;
+ max_theo_d = min_theo_d + sub_c * Min(m, n);
+ if (max_d >= max_theo_d)
+ max_d = -1;
+ else if (ins_c + del_c > 0)
+ {
+ /*
+ * Figure out how much of the first row of the notional matrix we
+ * need to fill in. If the string is growing, the theoretical
+ * minimum distance already incorporates the cost of deleting the
+ * number of characters necessary to make the two strings equal in
+ * length. Each additional deletion forces another insertion, so
+ * the best-case total cost increases by ins_c + del_c. If the
+ * string is shrinking, the minimum theoretical cost assumes no
+ * excess deletions; that is, we're starting no further right than
+ * column n - m. If we do start further right, the best-case
+ * total cost increases by ins_c + del_c for each move right.
+ */
+ int slack_d = max_d - min_theo_d;
+ int best_column = net_inserts < 0 ? -net_inserts : 0;
+
+ stop_column = best_column + (slack_d / (ins_c + del_c)) + 1;
+ if (stop_column > m)
+ stop_column = m + 1;
+ }
+ }
+ }
+ else
+ {
+		/*
+		 * Be sure that the start and stop columns are set correctly in all
+		 * cases.
+		 */
+ start_column = 0;
+ stop_column = m;
+ }
+
+ /*
+ * In order to avoid calling pg_mblen() repeatedly on each character in s,
+ * we cache all the lengths before starting the main loop -- but if all
+ * the characters in both strings are single byte, then we skip this and
+ * use a fast-path in the main loop. If only one string contains
+ * multi-byte characters, we still build the array, so that the fast-path
+ * needn't deal with the case where the array hasn't been initialized.
+ */
+ if (m != slen || n != tlen)
+ {
+ int i;
+ const char *cp = source;
+
+ s_char_len = (int *) palloc((m + 1) * sizeof(int));
+ for (i = 0; i < m; ++i)
+ {
+ s_char_len[i] = pg_mblen(cp);
+ cp += s_char_len[i];
+ }
+ s_char_len[i] = 0;
+ }
+
+ /* One more cell for initialization column and row. */
+ ++m;
+ ++n;
+
+ /* Previous and current rows of notional array. */
+ prev = (int *) palloc(2 * m * sizeof(int));
+ curr = prev + m;
+
+ /*
+ * To transform the first i characters of s into the first 0 characters of
+ * t, we must perform i deletions.
+ */
+ if (max_init >= 0)
+ stop_column_local = stop_column;
+ else
+ stop_column_local = m;
+
+ for (i = 0; i < stop_column_local; i++)
+ prev[i] = i * del_c;
+
+ /* Loop through rows of the notional array */
+ for (y = target, j = 1; j < n; j++)
+ {
+ int *temp;
+ const char *x = source;
+ int y_char_len = n != tlen + 1 ? pg_mblen(y) : 1;
+
+ /*
+		/*
+		 * XXX: This is the second of the code blocks originally guarded by
+		 * LEVENSHTEIN_LESS_EQUAL
+		 */
+ if (max_init >= 0)
+ {
+ /*
+ * In the best case, values percolate down the diagonal unchanged, so
+ * we must increment stop_column unless it's already on the right end
+ * of the array. The inner loop will read prev[stop_column], so we
+ * have to initialize it even though it shouldn't affect the result.
+ */
+ if (stop_column < m)
+ {
+ prev[stop_column] = max_d + 1;
+ ++stop_column;
+ }
+
+ /*
+ * The main loop fills in curr, but curr[0] needs a special case: to
+ * transform the first 0 characters of s into the first j characters
+ * of t, we must perform j insertions. However, if start_column > 0,
+ * this special case does not apply.
+ */
+ if (start_column == 0)
+ {
+ curr[0] = j * ins_c;
+ i = 1;
+ }
+ else
+ i = start_column;
+ }
+ else
+ {
+ curr[0] = j * ins_c;
+ i = 1;
+ }
+
+ /*
+ * This inner loop is critical to performance, so we include a
+ * fast-path to handle the (fairly common) case where no multibyte
+ * characters are in the mix. The fast-path is entitled to assume
+ * that if s_char_len is not initialized then BOTH strings contain
+ * only single-byte characters.
+ */
+ if (s_char_len != NULL)
+ {
+ if (max_init < 0)
+ stop_column_local = m;
+ else
+ stop_column_local = stop_column;
+
+ for (; i < stop_column_local; i++)
+ {
+ int ins;
+ int del;
+ int sub;
+ int x_char_len = s_char_len[i - 1];
+
+ /*
+ * Calculate costs for insertion, deletion, and substitution.
+ *
+ * When calculating cost for substitution, we compare the last
+ * character of each possibly-multibyte character first,
+ * because that's enough to rule out most mis-matches. If we
+ * get past that test, then we compare the lengths and the
+ * remaining bytes.
+ */
+ ins = prev[i] + ins_c;
+ del = curr[i - 1] + del_c;
+ if (x[x_char_len - 1] == y[y_char_len - 1]
+ && x_char_len == y_char_len &&
+ (x_char_len == 1 || rest_of_char_same(x, y, x_char_len)))
+ sub = prev[i - 1];
+ else
+ sub = prev[i - 1] + sub_c;
+
+ /* Take the one with minimum cost. */
+ curr[i] = Min(ins, del);
+ curr[i] = Min(curr[i], sub);
+
+ /* Point to next character. */
+ x += x_char_len;
+ }
+ }
+ else
+ {
+ if (max_init < 0)
+ stop_column_local = m;
+ else
+ stop_column_local = stop_column;
+
+ for (; i < stop_column_local; i++)
+ {
+ int ins;
+ int del;
+ int sub;
+
+ /* Calculate costs for insertion, deletion, and substitution. */
+ ins = prev[i] + ins_c;
+ del = curr[i - 1] + del_c;
+ sub = prev[i - 1] + ((*x == *y) ? 0 : sub_c);
+
+ /* Take the one with minimum cost. */
+ curr[i] = Min(ins, del);
+ curr[i] = Min(curr[i], sub);
+
+ /* Point to next character. */
+ x++;
+ }
+ }
+
+ /* Swap current row with previous row. */
+ temp = curr;
+ curr = prev;
+ prev = temp;
+
+ /* Point to next character. */
+ y += y_char_len;
+
+ /*
+ * This chunk of code represents a significant performance hit if used
+ * in the case where there is no max_d bound. This is probably not
+ * because the max_d >= 0 test itself is expensive, but rather because
+ * the possibility of needing to execute this code prevents tight
+ * optimization of the loop as a whole.
+ */
+ if (max_init >= 0 && max_d >= 0)
+ {
+ /*
+ * The "zero point" is the column of the current row where the
+ * remaining portions of the strings are of equal length. There
+ * are (n - 1) characters in the target string, of which j have
+ * been transformed. There are (m - 1) characters in the source
+ * string, so we want to find the value for zp where (n - 1) - j =
+ * (m - 1) - zp.
+ */
+ int zp = j - (n - m);
+
+ /* Check whether the stop column can slide left. */
+ while (stop_column > 0)
+ {
+ int ii = stop_column - 1;
+ int net_inserts = ii - zp;
+
+ if (prev[ii] + (net_inserts > 0 ? net_inserts * ins_c :
+ -net_inserts * del_c) <= max_d)
+ break;
+ stop_column--;
+ }
+
+ /* Check whether the start column can slide right. */
+ while (start_column < stop_column)
+ {
+ int net_inserts = start_column - zp;
+
+ if (prev[start_column] +
+ (net_inserts > 0 ? net_inserts * ins_c :
+ -net_inserts * del_c) <= max_d)
+ break;
+
+ /*
+ * We'll never again update these values, so we must make sure
+ * there's nothing here that could confuse any future
+ * iteration of the outer loop.
+ */
+ prev[start_column] = max_d + 1;
+ curr[start_column] = max_d + 1;
+ if (start_column != 0)
+ source += (s_char_len != NULL) ? s_char_len[start_column - 1] : 1;
+ start_column++;
+ }
+
+ /* If they cross, we're going to exceed the bound. */
+ if (start_column >= stop_column)
+ return max_d + 1;
+ }
+ }
+
+ /*
+ * Because the final value was swapped from the previous row to the
+ * current row, that's where we'll find it.
+ */
+ return prev[m - 1];
+}
+
+int
+levenshtein_less_equal_internal(const char *source, int slen, const char *target,
+ int tlen, int ins_c, int del_c, int sub_c, int max_d)
+{
+ return levenshtein_common(source, slen, target, tlen, ins_c,
+ del_c, sub_c, max_d);
+}
+
+int
+levenshtein_internal(const char *source, int slen, const char *target, int tlen,
+ int ins_c, int del_c, int sub_c)
+{
+ return levenshtein_common(source, slen, target, tlen, ins_c,
+ del_c, sub_c, -1);
+}
+
+Datum
+levenshtein(PG_FUNCTION_ARGS)
+{
+ text *src = PG_GETARG_TEXT_PP(0);
+ text *dst = PG_GETARG_TEXT_PP(1);
+ const char *s_data;
+ const char *t_data;
+ int s_bytes, t_bytes;
+
+ /* Extract a pointer to the actual character data */
+ s_data = VARDATA_ANY(src);
+ t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes */
+ s_bytes = VARSIZE_ANY_EXHDR(src);
+ t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+ PG_RETURN_INT32(levenshtein_internal(s_data, s_bytes, t_data,
+ t_bytes, 1, 1, 1));
+}
+
+Datum
+levenshtein_with_costs(PG_FUNCTION_ARGS)
+{
+ text *src = PG_GETARG_TEXT_PP(0);
+ text *dst = PG_GETARG_TEXT_PP(1);
+ int ins_c = PG_GETARG_INT32(2);
+ int del_c = PG_GETARG_INT32(3);
+ int sub_c = PG_GETARG_INT32(4);
+ const char *s_data;
+ const char *t_data;
+ int s_bytes, t_bytes;
+
+ /* Extract a pointer to the actual character data */
+ s_data = VARDATA_ANY(src);
+ t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes */
+ s_bytes = VARSIZE_ANY_EXHDR(src);
+ t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+ PG_RETURN_INT32(levenshtein_internal(s_data, s_bytes, t_data,
+ t_bytes, ins_c, del_c, sub_c));
+}
+
+Datum
+levenshtein_less_equal(PG_FUNCTION_ARGS)
+{
+ text *src = PG_GETARG_TEXT_PP(0);
+ text *dst = PG_GETARG_TEXT_PP(1);
+ int max_d = PG_GETARG_INT32(2);
+ const char *s_data;
+ const char *t_data;
+ int s_bytes, t_bytes;
+
+ /* Extract a pointer to the actual character data */
+ s_data = VARDATA_ANY(src);
+ t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes */
+ s_bytes = VARSIZE_ANY_EXHDR(src);
+ t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+ PG_RETURN_INT32(levenshtein_less_equal_internal(s_data, s_bytes,
+ t_data, t_bytes, 1, 1, 1, max_d));
+}
+
+Datum
+levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS)
+{
+ text *src = PG_GETARG_TEXT_PP(0);
+ text *dst = PG_GETARG_TEXT_PP(1);
+ int ins_c = PG_GETARG_INT32(2);
+ int del_c = PG_GETARG_INT32(3);
+ int sub_c = PG_GETARG_INT32(4);
+ int max_d = PG_GETARG_INT32(5);
+ const char *s_data;
+ const char *t_data;
+ int s_bytes, t_bytes;
+
+ /* Extract a pointer to the actual character data */
+ s_data = VARDATA_ANY(src);
+ t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes */
+ s_bytes = VARSIZE_ANY_EXHDR(src);
+ t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+ PG_RETURN_INT32(levenshtein_less_equal_internal(s_data, s_bytes,
+ t_data, t_bytes, ins_c, del_c, sub_c, max_d));
+}
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 0af1248..bcc13cb 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4973,6 +4973,16 @@ DESCR("peek at changes from replication slot");
DATA(insert OID = 3785 ( pg_logical_slot_peek_binary_changes PGNSP PGUID 12 1000 1000 25 0 f f f f f t v 4 0 2249 "19 3220 23 1009" "{19,3220,23,1009,3220,28,17}" "{i,i,i,v,o,o,o}" "{slot_name,upto_lsn,upto_nchanges,options,location,xid,data}" _null_ pg_logical_slot_peek_binary_changes _null_ _null_ _null_ ));
DESCR("peek at binary changes from replication slot");
+/* levenshtein distance */
+DATA(insert OID = 3366 ( levenshtein PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 23 "25 25" _null_ _null_ _null_ _null_ levenshtein _null_ _null_ _null_));
+DESCR("Levenshtein distance between two strings");
+DATA(insert OID = 3367 ( levenshtein_with_costs PGNSP PGUID 12 1 0 0 0 f f f f t f i 5 0 23 "25 25 23 23 23" _null_ _null_ _null_ _null_ levenshtein_with_costs _null_ _null_ _null_));
+DESCR("Levenshtein distance between two strings with costs");
+DATA(insert OID = 3368 ( levenshtein_less_equal PGNSP PGUID 12 1 0 0 0 f f f f t f i 3 0 23 "25 25 23" _null_ _null_ _null_ _null_ levenshtein_less_equal _null_ _null_ _null_));
+DESCR("Less-equal Levenshtein distance between two strings");
+DATA(insert OID = 3369 ( levenshtein_less_equal_with_costs PGNSP PGUID 12 1 0 0 0 f f f f t f i 6 0 23 "25 25 23 23 23 23" _null_ _null_ _null_ _null_ levenshtein_less_equal_with_costs _null_ _null_ _null_));
+DESCR("Less-equal Levenshtein distance between two strings with costs");
+
/* event triggers */
DATA(insert OID = 3566 ( pg_event_trigger_dropped_objects PGNSP PGUID 12 10 100 0 0 f f f f t t s 0 0 2249 "" "{26,26,23,25,25,25,25}" "{o,o,o,o,o,o,o}" "{classid, objid, objsubid, object_type, schema_name, object_name, object_identity}" _null_ pg_event_trigger_dropped_objects _null_ _null_ _null_ ));
DESCR("list objects dropped by the current command");
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index bbb5d39..b468c3c 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -851,6 +851,12 @@ extern Datum cidrecv(PG_FUNCTION_ARGS);
extern Datum cidsend(PG_FUNCTION_ARGS);
extern Datum cideq(PG_FUNCTION_ARGS);
+/* levenshtein.c */
+extern Datum levenshtein(PG_FUNCTION_ARGS);
+extern Datum levenshtein_with_costs(PG_FUNCTION_ARGS);
+extern Datum levenshtein_less_equal(PG_FUNCTION_ARGS);
+extern Datum levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS);
+
/* like.c */
extern Datum namelike(PG_FUNCTION_ARGS);
extern Datum namenlike(PG_FUNCTION_ARGS);
diff --git a/src/include/utils/levenshtein.h b/src/include/utils/levenshtein.h
new file mode 100644
index 0000000..65829e2
--- /dev/null
+++ b/src/include/utils/levenshtein.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * levenshtein.h
+ *	  Declarations for the Levenshtein distance functions.
+ *
+ * Copyright (c) 2007-2014, PostgreSQL Global Development Group
+ *
+ * src/include/utils/levenshtein.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef LEVENSHTEIN_H
+#define LEVENSHTEIN_H
+
+#include "postgres.h"
+
+/* Internal functions */
+extern int levenshtein_less_equal_internal(const char *source, int slen,
+ const char *target, int tlen,
+ int ins_c, int del_c, int sub_c, int max_d);
+
+extern int levenshtein_internal(const char *source, int slen,
+ const char *target, int tlen,
+ int ins_c, int del_c, int sub_c);
+
+#endif
diff --git a/src/test/regress/expected/levenshtein.out b/src/test/regress/expected/levenshtein.out
new file mode 100644
index 0000000..57fd083
--- /dev/null
+++ b/src/test/regress/expected/levenshtein.out
@@ -0,0 +1,27 @@
+--
+-- LEVENSHTEIN
+--
+SELECT levenshtein('GUMBO', 'GAMBOL');
+ levenshtein
+-------------
+ 2
+(1 row)
+
+SELECT levenshtein_with_costs('GUMBO', 'GAMBOL', 2, 1, 1);
+ levenshtein_with_costs
+------------------------
+ 3
+(1 row)
+
+SELECT levenshtein_less_equal('extensive', 'exhaustive', 2);
+ levenshtein_less_equal
+------------------------
+ 3
+(1 row)
+
+SELECT levenshtein_less_equal_with_costs('extensive', 'exhaustive', 1, 1, 1, 4);
+ levenshtein_less_equal_with_costs
+-----------------------------------
+ 4
+(1 row)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index c0416f4..5faf182 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -13,7 +13,7 @@ test: tablespace
# ----------
# The first group of parallel tests
# ----------
-test: boolean char name varchar text int2 int4 int8 oid float4 float8 bit numeric txid uuid enum money rangetypes pg_lsn regproc
+test: boolean char name varchar text int2 int4 int8 oid float4 float8 bit numeric txid uuid enum money rangetypes pg_lsn regproc levenshtein
# Depends on things setup during char, varchar and text
test: strings
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 16a1905..e980619 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -21,6 +21,7 @@ test: money
test: rangetypes
test: pg_lsn
test: regproc
+test: levenshtein
test: strings
test: numerology
test: point
diff --git a/src/test/regress/sql/levenshtein.sql b/src/test/regress/sql/levenshtein.sql
new file mode 100644
index 0000000..ea69a37
--- /dev/null
+++ b/src/test/regress/sql/levenshtein.sql
@@ -0,0 +1,8 @@
+--
+-- LEVENSHTEIN
+--
+
+SELECT levenshtein('GUMBO', 'GAMBOL');
+SELECT levenshtein_with_costs('GUMBO', 'GAMBOL', 2, 1, 1);
+SELECT levenshtein_less_equal('extensive', 'exhaustive', 2);
+SELECT levenshtein_less_equal_with_costs('extensive', 'exhaustive', 1, 1, 1, 4);
--
2.0.1
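For reference, the behavior exercised by the regression tests above can be modeled in a few lines. This is a simplified Python sketch of a cost-weighted Levenshtein distance (not the C implementation from the patch); the argument order mirrors fuzzystrmatch's `levenshtein(source, target, ins_cost, del_cost, sub_cost)`:

```python
def levenshtein(source, target, ins_c=1, del_c=1, sub_c=1):
    """Cost-weighted Levenshtein distance: the cheapest way to turn
    `source` into `target` using weighted insert/delete/substitute."""
    m, n = len(source), len(target)
    prev = [j * ins_c for j in range(n + 1)]  # row for empty source prefix
    for i in range(1, m + 1):
        cur = [i * del_c] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (0 if source[i - 1] == target[j - 1] else sub_c)
            cur[j] = min(prev[j] + del_c,    # delete source[i-1]
                         cur[j - 1] + ins_c, # insert target[j-1]
                         sub)                # substitute (or match)
        prev = cur
    return prev[n]

print(levenshtein('GUMBO', 'GAMBOL'))           # 2, as in levenshtein.out
print(levenshtein('GUMBO', 'GAMBOL', 2, 1, 1))  # 3, as in levenshtein.out
print(levenshtein('extensive', 'exhaustive'))   # 4
```

With the (2, 1, 2) costs used by the patch's distanceName(), a deletion from the actual column name is charged half as much as an insertion or substitution, so `levenshtein('orderid', 'orderids', 2, 1, 2)` is 2 (one insertion) while `levenshtein('orderid', 'orderi', 2, 1, 2)` is 1 (one deletion).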
Attachment: 0002-Support-for-column-hints.patch (text/x-diff)
From 4d0d46bd57b4f4f3a962f5b27634a6174ca2acde Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Thu, 17 Jul 2014 21:58:25 +0900
Subject: [PATCH 2/2] Support for column hints
If an incorrect column name is written in a query, the system tries to
find columns in the existing RTEs whose names are close in distance to
the mistyped one, and returns hints to the user based on that
evaluation.
---
src/backend/parser/parse_expr.c | 9 +-
src/backend/parser/parse_func.c | 2 +-
src/backend/parser/parse_relation.c | 318 ++++++++++++++++++++++++++----
src/include/parser/parse_relation.h | 3 +-
src/test/regress/expected/alter_table.out | 8 +
src/test/regress/expected/join.out | 39 ++++
src/test/regress/expected/plpgsql.out | 1 +
src/test/regress/expected/rowtypes.out | 1 +
src/test/regress/expected/rules.out | 1 +
src/test/regress/expected/without_oid.out | 1 +
src/test/regress/sql/join.sql | 24 +++
11 files changed, 366 insertions(+), 41 deletions(-)
diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 4a8aaf6..9866198 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -621,7 +621,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
colname = strVal(field2);
/* Try to identify as a column of the RTE */
- node = scanRTEForColumn(pstate, rte, colname, cref->location);
+ node = scanRTEForColumn(pstate, rte, colname, cref->location,
+ NULL, NULL);
if (node == NULL)
{
/* Try it as a function call on the whole row */
@@ -666,7 +667,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
colname = strVal(field3);
/* Try to identify as a column of the RTE */
- node = scanRTEForColumn(pstate, rte, colname, cref->location);
+ node = scanRTEForColumn(pstate, rte, colname, cref->location,
+ NULL, NULL);
if (node == NULL)
{
/* Try it as a function call on the whole row */
@@ -724,7 +726,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
colname = strVal(field4);
/* Try to identify as a column of the RTE */
- node = scanRTEForColumn(pstate, rte, colname, cref->location);
+ node = scanRTEForColumn(pstate, rte, colname, cref->location,
+ NULL, NULL);
if (node == NULL)
{
/* Try it as a function call on the whole row */
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 9ebd3fd..e128adf 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -1779,7 +1779,7 @@ ParseComplexProjection(ParseState *pstate, char *funcname, Node *first_arg,
((Var *) first_arg)->varno,
((Var *) first_arg)->varlevelsup);
/* Return a Var if funcname matches a column, else NULL */
- return scanRTEForColumn(pstate, rte, funcname, location);
+ return scanRTEForColumn(pstate, rte, funcname, location, NULL, NULL);
}
/*
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 478584d..2838f89 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -15,6 +15,7 @@
#include "postgres.h"
#include <ctype.h>
+#include <limits.h>
#include "access/htup_details.h"
#include "access/sysattr.h"
@@ -28,6 +29,7 @@
#include "parser/parse_relation.h"
#include "parser/parse_type.h"
#include "utils/builtins.h"
+#include "utils/levenshtein.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
#include "utils/syscache.h"
@@ -520,6 +522,22 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
}
/*
+ * distanceName
+ * Return Levenshtein distance between an actual column name and possible
+ * partial match.
+ */
+static int
+distanceName(const char *actual, const char *match, int max)
+{
+ int len = strlen(actual),
+ match_len = strlen(match);
+
+ /* Charge half as much per deletion as per insertion or per substitution */
+ return levenshtein_less_equal_internal(actual, len, match, match_len,
+ 2, 1, 2, max);
+}
+
+/*
* scanRTEForColumn
* Search the column names of a single RTE for the given name.
* If found, return an appropriate Var node, else return NULL.
@@ -527,10 +545,24 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
*
* Side effect: if we find a match, mark the RTE as requiring read access
* for the column.
+ *
+ * For those callers that will settle for a fuzzy match (for the purposes of
+ * building diagnostic messages), we match the column attribute whose name has
+ * the lowest Levenshtein distance from colname, setting *closest and
+ * *distance. Such callers should not rely on the return value (even when
+ * there is an exact match), nor should they expect the usual side effect
+ * (unless there is an exact match). This hardly matters in practice, since an
+ * error is imminent.
+ *
+ * If there are two or more attributes in the range table entry tied for
+ * closest, accurately report the shortest distance found overall, while not
+ * setting a "closest" attribute on the assumption that only a per-entry single
+ * closest match is useful. Note that we never consider system column names
+ * when performing fuzzy matching.
*/
Node *
scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
- int location)
+ int location, AttrNumber *closest, int *distance)
{
Node *result = NULL;
int attnum = 0;
@@ -548,12 +580,16 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
* Should this somehow go wrong and we try to access a dropped column,
* we'll still catch it by virtue of the checks in
* get_rte_attribute_type(), which is called by make_var(). That routine
- * has to do a cache lookup anyway, so the check there is cheap.
+ * has to do a cache lookup anyway, so the check there is cheap. Callers
+ * interested in finding the match with the shortest distance need to
+ * defend against this directly, though.
*/
foreach(c, rte->eref->colnames)
{
+ const char *attcolname = strVal(lfirst(c));
+
attnum++;
- if (strcmp(strVal(lfirst(c)), colname) == 0)
+ if (strcmp(attcolname, colname) == 0)
{
if (result)
ereport(ERROR,
@@ -566,6 +602,39 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
markVarForSelectPriv(pstate, var, rte);
result = (Node *) var;
}
+
+ if (distance && *distance != 0)
+ {
+ if (result)
+ {
+ /* Exact match just found */
+ *distance = 0;
+ }
+ else
+ {
+ int lowestdistance = *distance;
+ int thisdistance = distanceName(attcolname, colname,
+ lowestdistance);
+
+ if (thisdistance >= lowestdistance)
+ {
+ /*
+ * This match distance may equal a prior match within this
+ * same range table. When that happens, the prior match is
+ * discarded as worthless, since a single best match is
+ * required within a RTE.
+ */
+ if (thisdistance == lowestdistance)
+ *closest = InvalidAttrNumber;
+
+ continue;
+ }
+
+ /* Store new lowest observed distance for this RTE */
+ *distance = thisdistance;
+ }
+ *closest = attnum;
+ }
}
/*
@@ -642,7 +711,8 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
continue;
/* use orig_pstate here to get the right sublevels_up */
- newresult = scanRTEForColumn(orig_pstate, rte, colname, location);
+ newresult = scanRTEForColumn(orig_pstate, rte, colname, location,
+ NULL, NULL);
if (newresult)
{
@@ -668,8 +738,14 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
/*
* searchRangeTableForCol
- * See if any RangeTblEntry could possibly provide the given column name.
- * If so, return a pointer to the RangeTblEntry; else return NULL.
+ * See if any RangeTblEntry could possibly provide the given column name (or
+ * find the best match available). Returns a list of equally likely
+ * candidates, or NIL in the event of no plausible candidate.
+ *
+ * The column name may be matched fuzzily; we provide the closest columns
+ * if there was not an exact match. The caller can rely on the passed
+ * closest array to find the right attribute within the corresponding
+ * (first and second) returned list RTEs. If the closest attributes are
+ * InvalidAttrNumber, that indicates an exact match.
*
* This is different from colNameToVar in that it considers every entry in
* the ParseState's rangetable(s), not only those that are currently visible
@@ -678,26 +754,145 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
* matches, but only one will be returned). This must be used ONLY as a
* heuristic in giving suitable error messages. See errorMissingColumn.
*/
-static RangeTblEntry *
-searchRangeTableForCol(ParseState *pstate, char *colname, int location)
+static List *
+searchRangeTableForCol(ParseState *pstate, const char *alias, char *colname,
+ int location, AttrNumber closest[2])
{
- ParseState *orig_pstate = pstate;
+ ParseState *orig_pstate = pstate;
+ int distance = INT_MAX;
+ List *matchedrte = NIL;
+ ListCell *l;
+ int i;
while (pstate != NULL)
{
- ListCell *l;
-
foreach(l, pstate->p_rtable)
{
- RangeTblEntry *rte = (RangeTblEntry *) lfirst(l);
+ RangeTblEntry *rte = (RangeTblEntry *) lfirst(l);
+ AttrNumber rteclosest = InvalidAttrNumber;
+ int rtdistance = INT_MAX;
+ bool wrongalias;
- if (scanRTEForColumn(orig_pstate, rte, colname, location))
- return rte;
+ /*
+ * Get single best match from each RTE, or no match for RTE if
+ * there is a tie for best match within a given RTE
+ */
+ scanRTEForColumn(orig_pstate, rte, colname, location, &rteclosest,
+ &rtdistance);
+
+ /* Was alias provided by user that does not match entry's alias? */
+ wrongalias = (alias && strcmp(alias, rte->eref->aliasname) != 0);
+
+ if (rtdistance == 0)
+ {
+ /* Exact match (for "wrong alias" or "wrong level" cases) */
+ closest[0] = wrongalias? rteclosest : InvalidAttrNumber;
+
+ /*
+ * Any exact match is always the uncontested best match. It
+ * doesn't seem worth considering the case where there are
+ * multiple exact matches, so we're done.
+ */
+ matchedrte = lappend(NIL, rte);
+ return matchedrte;
+ }
+
+ /*
+ * Charge extra (for inexact matches only) when an alias was
+ * specified that differs from what might have been used to
+ * correctly qualify this RTE's closest column
+ */
+ if (wrongalias)
+ rtdistance += 3;
+
+ if (rteclosest != InvalidAttrNumber)
+ {
+ if (rtdistance >= distance)
+ {
+ /*
+ * Perhaps record this attribute as being just as close in
+ * distance to closest attribute observed so far across
+ * entire range table. Iff this distance is ultimately the
+ * lowest distance observed overall, it may end up as the
+ * second match.
+ */
+ if (rtdistance == distance)
+ {
+ closest[1] = rteclosest;
+ matchedrte = lappend(matchedrte, rte);
+ }
+
+ continue;
+ }
+
+ /*
+ * One best match (better than any others in previous RTEs) was
+ * found within this RTE
+ */
+ distance = rtdistance;
+ /* New uncontested best match */
+ matchedrte = lappend(NIL, rte);
+ closest[0] = rteclosest;
+ }
+ else
+ {
+ /*
+ * Even though there were perhaps multiple joint-best matches
+ * within this RTE (implying that there can be no attribute
+ * suggestion from it), the shortest distance should still
+ * serve as the distance for later RTEs to beat (but naturally
+ * only if it happens to be the lowest so far across the entire
+ * range table).
+ */
+ distance = Min(distance, rtdistance);
+ }
}
pstate = pstate->parentParseState;
}
- return NULL;
+
+ /*
+ * Too many equally close partial matches found?
+ *
+ * It's useful to provide two matches for the common case where two range
+ * tables each have one equally distant candidate column, as when an
+ * unqualified (and therefore would-be ambiguous) column name is specified
+ * which is also misspelled by the user. It seems unhelpful to show no
+ * hint when this occurs, since in practice one attribute probably
+ * references the other in a foreign key relationship. However, when there
+ * are more than 2 range tables with equally distant matches that's
+ * probably because the matches are not useful, so don't suggest anything.
+ */
+ if (list_length(matchedrte) > 2)
+ return NIL;
+
+ /*
+ * Handle dropped columns, which can appear here as empty colnames per
+ * remarks within scanRTEForColumn(). If either the first or second
+ * suggested attributes are dropped, do not provide any suggestion.
+ */
+ i = 0;
+ foreach(l, matchedrte)
+ {
+ RangeTblEntry *rte = (RangeTblEntry *) lfirst(l);
+ char *closestcol;
+
+ closestcol = strVal(list_nth(rte->eref->colnames, closest[i++] - 1));
+
+ if (strcmp(closestcol, "") == 0)
+ return NIL;
+ }
+
+ /*
+ * Distance must be less than a normalized threshold in order to avoid
+ * completely ludicrous suggestions. Note that a distance of 6 will be
+ * seen when 6 deletions are required against actual attribute name, or 3
+ * insertions/substitutions.
+ */
+ if (distance > 6 && distance > strlen(colname) * 2 / 2)
+ return NIL;
+
+ return matchedrte;
}
/*
@@ -2855,41 +3050,92 @@ errorMissingRTE(ParseState *pstate, RangeVar *relation)
/*
* Generate a suitable error about a missing column.
*
- * Since this is a very common type of error, we work rather hard to
- * produce a helpful message.
+ * Since this is a very common type of error, we work rather hard to produce a
+ * helpful message, going so far as to guess user's intent when a missing
+ * column name is probably intended to reference one of two would-be ambiguous
+ * attributes (when no alias/qualification was provided).
*/
void
errorMissingColumn(ParseState *pstate,
char *relname, char *colname, int location)
{
- RangeTblEntry *rte;
+ List *matchedrte;
+ AttrNumber closest[2];
+ RangeTblEntry *rte1 = NULL,
+ *rte2 = NULL;
+ char *closestcol1;
+ char *closestcol2;
/*
- * If relname was given, just play dumb and report it. (In practice, a
- * bad qualification name should end up at errorMissingRTE, not here, so
- * no need to work hard on this case.)
+ * closest[0] will remain InvalidAttrNumber in the event of an exact
+ * match, in which case there is only ever one suggestion
*/
- if (relname)
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_COLUMN),
- errmsg("column %s.%s does not exist", relname, colname),
- parser_errposition(pstate, location)));
+ closest[0] = closest[1] = InvalidAttrNumber;
/*
- * Otherwise, search the entire rtable looking for possible matches. If
- * we find one, emit a hint about it.
+ * Search the entire rtable looking for possible matches. If we find one,
+ * emit a hint about it.
*
* TODO: improve this code (and also errorMissingRTE) to mention using
* LATERAL if appropriate.
*/
- rte = searchRangeTableForCol(pstate, colname, location);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_COLUMN),
- errmsg("column \"%s\" does not exist", colname),
- rte ? errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
- colname, rte->eref->aliasname) : 0,
- parser_errposition(pstate, location)));
+ matchedrte = searchRangeTableForCol(pstate, relname, colname, location,
+ closest);
+
+ /*
+ * In practice a bad qualification name should end up at errorMissingRTE,
+ * not here, so no need to work hard on this case.
+ *
+ * Extract RTEs for best match, if any, and joint best match, if any.
+ */
+ if (matchedrte)
+ {
+ rte1 = (RangeTblEntry *) lfirst(list_head(matchedrte));
+
+ if (list_length(matchedrte) > 1)
+ rte2 = (RangeTblEntry *) lsecond(matchedrte);
+
+ if (rte1 && closest[0] != InvalidAttrNumber)
+ closestcol1 = strVal(list_nth(rte1->eref->colnames, closest[0] - 1));
+
+ if (rte2 && closest[1] != InvalidAttrNumber)
+ closestcol2 = strVal(list_nth(rte2->eref->colnames, closest[1] - 1));
+ }
+
+ if (!rte2)
+ {
+ /*
+ * Handle cases where there are zero or one column suggestions to hint,
+ * including exact matches referenced but not visible.
+ *
+ * Infer an exact match referenced despite not being visible from the
+ * fact that an attribute number was not passed back.
+ */
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_COLUMN),
+ relname?
+ errmsg("column %s.%s does not exist", relname, colname):
+ errmsg("column \"%s\" does not exist", colname),
+ rte1? closest[0] != InvalidAttrNumber?
+ errhint("Perhaps you meant to reference the column \"%s\".\"%s\".",
+ rte1->eref->aliasname, closestcol1):
+ errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
+ colname, rte1->eref->aliasname): 0,
+ parser_errposition(pstate, location)));
+ }
+ else
+ {
+ /* Handle case where there are two equally useful column hints */
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_COLUMN),
+ relname?
+ errmsg("column %s.%s does not exist", relname, colname):
+ errmsg("column \"%s\" does not exist", colname),
+ errhint("Perhaps you meant to reference the column \"%s\".\"%s\" or the column \"%s\".\"%s\".",
+ rte1->eref->aliasname, closestcol1,
+ rte2->eref->aliasname, closestcol2),
+ parser_errposition(pstate, location)));
+ }
}
diff --git a/src/include/parser/parse_relation.h b/src/include/parser/parse_relation.h
index d8b9493..c18157a 100644
--- a/src/include/parser/parse_relation.h
+++ b/src/include/parser/parse_relation.h
@@ -35,7 +35,8 @@ extern RangeTblEntry *GetRTEByRangeTablePosn(ParseState *pstate,
extern CommonTableExpr *GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte,
int rtelevelsup);
extern Node *scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte,
- char *colname, int location);
+ char *colname, int location, AttrNumber *matchedatt,
+ int *distance);
extern Node *colNameToVar(ParseState *pstate, char *colname, bool localonly,
int location);
extern void markVarForSelectPriv(ParseState *pstate, Var *var,
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 9b89e58..77829dc 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -536,6 +536,7 @@ create table atacc1 ( test int );
-- add a check constraint (fails)
alter table atacc1 add constraint atacc_test1 check (test1>3);
ERROR: column "test1" does not exist
+HINT: Perhaps you meant to reference the column "atacc1"."test".
drop table atacc1;
-- something a little more complicated
create table atacc1 ( test int, test2 int, test3 int);
@@ -1342,6 +1343,7 @@ select f1 from c1;
ERROR: column "f1" does not exist
LINE 1: select f1 from c1;
^
+HINT: Perhaps you meant to reference the column "c1"."f2".
drop table p1 cascade;
NOTICE: drop cascades to table c1
create table p1 (f1 int, f2 int);
@@ -1355,6 +1357,7 @@ select f1 from c1;
ERROR: column "f1" does not exist
LINE 1: select f1 from c1;
^
+HINT: Perhaps you meant to reference the column "c1"."f2".
drop table p1 cascade;
NOTICE: drop cascades to table c1
create table p1 (f1 int, f2 int);
@@ -1479,6 +1482,7 @@ select oid > 0, * from altstartwith; -- fails
ERROR: column "oid" does not exist
LINE 1: select oid > 0, * from altstartwith;
^
+HINT: Perhaps you meant to reference the column "altstartwith"."col".
select * from altstartwith;
col
-----
@@ -1515,10 +1519,12 @@ select oid > 0, * from altwithoid; -- fails
ERROR: column "oid" does not exist
LINE 1: select oid > 0, * from altwithoid;
^
+HINT: Perhaps you meant to reference the column "altwithoid"."col".
select oid > 0, * from altinhoid; -- fails
ERROR: column "oid" does not exist
LINE 1: select oid > 0, * from altinhoid;
^
+HINT: Perhaps you meant to reference the column "altinhoid"."col".
select * from altwithoid;
col
-----
@@ -1554,6 +1560,7 @@ select oid > 0, * from altwithoid; -- fails
ERROR: column "oid" does not exist
LINE 1: select oid > 0, * from altwithoid;
^
+HINT: Perhaps you meant to reference the column "altwithoid"."col".
select oid > 0, * from altinhoid;
?column? | col
----------+-----
@@ -1580,6 +1587,7 @@ select oid > 0, * from altwithoid; -- fails
ERROR: column "oid" does not exist
LINE 1: select oid > 0, * from altwithoid;
^
+HINT: Perhaps you meant to reference the column "altwithoid"."col".
select oid > 0, * from altinhoid;
?column? | col
----------+-----
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 1cb1c51..f4edcbe 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2222,6 +2222,12 @@ select * from t1 left join t2 on (t1.a = t2.a);
200 | 1000 | 200 | 2001
(5 rows)
+-- Test matching of column name with wrong alias
+select t1.x from t1 join t3 on (t1.a = t3.x);
+ERROR: column t1.x does not exist
+LINE 1: select t1.x from t1 join t3 on (t1.a = t3.x);
+ ^
+HINT: Perhaps you meant to reference the column "t3"."x".
--
-- regression test for 8.1 merge right join bug
--
@@ -3388,6 +3394,39 @@ select * from
(0 rows)
--
+-- Test hints given on incorrect column references are useful
+--
+select t1.uunique1 from
+ tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestion
+ERROR: column t1.uunique1 does not exist
+LINE 1: select t1.uunique1 from
+ ^
+HINT: Perhaps you meant to reference the column "t1"."unique1".
+select t2.uunique1 from
+ tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+ERROR: column t2.uunique1 does not exist
+LINE 1: select t2.uunique1 from
+ ^
+HINT: Perhaps you meant to reference the column "t2"."unique1".
+select uunique1 from
+ tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+ERROR: column "uunique1" does not exist
+LINE 1: select uunique1 from
+ ^
+HINT: Perhaps you meant to reference the column "t1"."unique1" or the column "t2"."unique1".
+--
+-- Take care to reference the correct RTE
+--
+select atts.relid::regclass, s.* from pg_stats s join
+ pg_attribute a on s.attname = a.attname and s.tablename =
+ a.attrelid::regclass::text join (select unnest(indkey) attnum,
+ indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+ schemaname != 'pg_catalog';
+ERROR: column atts.relid does not exist
+LINE 1: select atts.relid::regclass, s.* from pg_stats s join
+ ^
+HINT: Perhaps you meant to reference the column "atts"."indexrelid".
+--
-- Test LATERAL
--
select unique2, x.*
diff --git a/src/test/regress/expected/plpgsql.out b/src/test/regress/expected/plpgsql.out
index 8892bb4..2cb4aa1 100644
--- a/src/test/regress/expected/plpgsql.out
+++ b/src/test/regress/expected/plpgsql.out
@@ -4771,6 +4771,7 @@ END$$;
ERROR: column "foo" does not exist
LINE 1: SELECT rtrim(roomno) AS roomno, foo FROM Room ORDER BY roomn...
^
+HINT: Perhaps you meant to reference the column "room"."roomno".
QUERY: SELECT rtrim(roomno) AS roomno, foo FROM Room ORDER BY roomno
CONTEXT: PL/pgSQL function inline_code_block line 4 at FOR over SELECT rows
-- Check handling of errors thrown from/into anonymous code blocks.
diff --git a/src/test/regress/expected/rowtypes.out b/src/test/regress/expected/rowtypes.out
index 88e7bfa..19a6e98 100644
--- a/src/test/regress/expected/rowtypes.out
+++ b/src/test/regress/expected/rowtypes.out
@@ -452,6 +452,7 @@ select fullname.text from fullname; -- error
ERROR: column fullname.text does not exist
LINE 1: select fullname.text from fullname;
^
+HINT: Perhaps you meant to reference the column "fullname"."last".
-- same, but RECORD instead of named composite type:
select cast (row('Jim', 'Beam') as text);
row
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ca56b47..48c75fd 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2368,6 +2368,7 @@ select xmin, * from fooview; -- fail, views don't have such a column
ERROR: column "xmin" does not exist
LINE 1: select xmin, * from fooview;
^
+HINT: Perhaps you meant to reference the column "fooview"."x".
select reltoastrelid, relkind, relfrozenxid
from pg_class where oid = 'fooview'::regclass;
reltoastrelid | relkind | relfrozenxid
diff --git a/src/test/regress/expected/without_oid.out b/src/test/regress/expected/without_oid.out
index cb2c0c0..fbff011 100644
--- a/src/test/regress/expected/without_oid.out
+++ b/src/test/regress/expected/without_oid.out
@@ -46,6 +46,7 @@ SELECT count(oid) FROM wo;
ERROR: column "oid" does not exist
LINE 1: SELECT count(oid) FROM wo;
^
+HINT: Perhaps you meant to reference the column "wo"."i".
VACUUM ANALYZE wi;
VACUUM ANALYZE wo;
SELECT min(relpages) < max(relpages), min(reltuples) - max(reltuples)
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index fa3e068..4d60f9e 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -397,6 +397,10 @@ insert into t2a values (200, 2001);
select * from t1 left join t2 on (t1.a = t2.a);
+-- Test matching of column name with wrong alias
+
+select t1.x from t1 join t3 on (t1.a = t3.x);
+
--
-- regression test for 8.1 merge right join bug
--
@@ -1047,6 +1051,26 @@ select * from
int8_tbl x join (int4_tbl x cross join int4_tbl y(ff)) j on q1 = f1; -- ok
--
+-- Test hints given on incorrect column references are useful
+--
+
+select t1.uunique1 from
+ tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestion
+select t2.uunique1 from
+ tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+select uunique1 from
+ tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+
+--
+-- Take care to reference the correct RTE
+--
+
+select atts.relid::regclass, s.* from pg_stats s join
+ pg_attribute a on s.attname = a.attname and s.tablename =
+ a.attrelid::regclass::text join (select unnest(indkey) attnum,
+ indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+ schemaname != 'pg_catalog';
+--
-- Test LATERAL
--
--
2.0.1
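To make the hint-selection heuristic in searchRangeTableForCol() easier to follow, here is a simplified Python model of it. The RTE representation and all names are invented for illustration, and the exact-match short-circuit and dropped-column checks of the real code are omitted:

```python
def weighted_distance(actual, typed):
    """distanceName(): charge half as much per deletion (1) as per
    insertion or substitution (2) when turning `actual` into `typed`."""
    m, n = len(actual), len(typed)
    prev = [j * 2 for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (0 if actual[i - 1] == typed[j - 1] else 2)
            cur[j] = min(prev[j] + 1, cur[j - 1] + 2, sub)
        prev = cur
    return prev[n]

def suggest_columns(rtes, colname, alias=None):
    """rtes: list of (aliasname, [column names]). Return up to two
    (alias, column) suggestions, or [] when nothing is plausible."""
    best, distance = [], float('inf')
    for aliasname, cols in rtes:
        # Single best match per RTE; a tie within one RTE yields none,
        # though its distance still counts for later RTEs to beat.
        rte_best, rte_dist = None, float('inf')
        for col in cols:
            d = weighted_distance(col, colname)
            if d < rte_dist:
                rte_best, rte_dist = col, d
            elif d == rte_dist:
                rte_best = None
        # Charge extra when the user's qualifier names a different RTE.
        if alias is not None and alias != aliasname:
            rte_dist += 3
        if rte_best is None:
            distance = min(distance, rte_dist)
            continue
        if rte_dist < distance:
            best, distance = [(aliasname, rte_best)], rte_dist
        elif rte_dist == distance:
            best.append((aliasname, rte_best))
    if len(best) > 2:       # three or more ties are probably noise
        return []
    if distance > 6 and distance > len(colname):  # normalized threshold
        return []
    return best

# e.g. suggest_columns([('t1', ['unique1']), ('t2', ['unique1'])], 'uunique1')
# suggests both "t1"."unique1" and "t2"."unique1", as in the join.out tests.
```

This is only a sketch under those stated simplifications; the actual patch also returns immediately on an exact match and suppresses the hint entirely when a suggested column turns out to be dropped.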
I am not opposed to moving the contrib code into core in the manner
that you propose. I don't feel strongly either way.
I noticed in passing that your revision says this *within* levenshtein.c:
+ * Guaranteed to work with Name datatype's cstrings.
+ * For full details see levenshtein.c.
On Thu, Jul 17, 2014 at 6:34 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Patch 2 is a rebase of the feature of Peter that can be applied on top of
patch 1. The code is rather untouched (haven't much played with Peter's
thingies), well-commented, but I think that this needs more work,
particularly when a query has a single RTE like in this case where no hints
are proposed to the user (mentioned upthread):
The only source of disagreement that I am aware of at this point is
the question of whether or not we should accept two candidates from
the same RTE. I lean slightly towards "no", as already explained [1],
with the approach of looking for only a single best candidate per RTE
taken in deference to the concerns of others [2].
I imagined that when a committer picked this up, an executive decision
would be made one way or the other. I am quite willing to revise the
patch to alter this behavior at the request of a committer.
[1]: /messages/by-id/CAM3SWZTrm4PmqMmL9=eYx-8f-Vx-ha7DmE4KOmS2vCOMOzGHrw@mail.gmail.com
[2]: /messages/by-id/CAM3SWZS6kiQEqJz4pV3Fkp6cgw1wS26exOQTjb_XMW3zE5b6mA@mail.gmail.com
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Jul 18, 2014 at 3:54 AM, Peter Geoghegan <pg@heroku.com> wrote:
I am not opposed to moving the contrib code into core in the manner
that you propose. I don't feel strongly either way.
I noticed in passing that your revision says this *within* levenshtein.c:
+ * Guaranteed to work with Name datatype's cstrings.
+ * For full details see levenshtein.c.
Yeah, I looked again at what I produced last night and came across a
couple of similar things :) I reworked a few things in the attached
version, mainly wordsmithing and adding comments here and there, as
well as making the names of the Levenshtein functions in core match
the ones in fuzzystrmatch 1.0.
I imagined that when a committer picked this up, an executive decision
would be made one way or the other. I am quite willing to revise the
patch to alter this behavior at the request of a committer.
Fine for me. I'll move this patch to the next stage then.
--
Michael
Attachments:
0001-Move-Levenshtein-functions-to-core.patch (text/x-diff)
From 65b66309444767129c81c6aee8df33a214bcf4c5 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 18 Jul 2014 16:44:28 +0900
Subject: [PATCH 1/2] Move Levenshtein functions to core
All the fuzzystrmatch functions that evaluate distances between
strings are moved into core:
- levenshtein
- levenshtein_less_equal
In order to unify the function names in the catalogs, the variants
taking costs are given the suffix *_with_costs.
Documentation as well as regression tests are added. fuzzystrmatch is
bumped to 1.1 on this occasion.
---
contrib/fuzzystrmatch/Makefile | 6 +-
contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql | 9 +
contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql | 44 --
contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql | 28 ++
contrib/fuzzystrmatch/fuzzystrmatch.c | 69 ---
contrib/fuzzystrmatch/fuzzystrmatch.control | 2 +-
contrib/fuzzystrmatch/levenshtein.c | 403 ---------------
doc/src/sgml/func.sgml | 68 +++
doc/src/sgml/fuzzystrmatch.sgml | 66 ---
src/backend/utils/adt/Makefile | 4 +-
src/backend/utils/adt/levenshtein.c | 565 ++++++++++++++++++++++
src/include/catalog/pg_proc.h | 10 +
src/include/utils/builtins.h | 6 +
src/include/utils/levenshtein.h | 27 ++
src/test/regress/expected/levenshtein.out | 27 ++
src/test/regress/parallel_schedule | 2 +-
src/test/regress/serial_schedule | 1 +
src/test/regress/sql/levenshtein.sql | 8 +
18 files changed, 755 insertions(+), 590 deletions(-)
create mode 100644 contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql
delete mode 100644 contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql
create mode 100644 contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
delete mode 100644 contrib/fuzzystrmatch/levenshtein.c
create mode 100644 src/backend/utils/adt/levenshtein.c
create mode 100644 src/include/utils/levenshtein.h
create mode 100644 src/test/regress/expected/levenshtein.out
create mode 100644 src/test/regress/sql/levenshtein.sql
diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 024265d..3d3c773 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -4,7 +4,8 @@ MODULE_big = fuzzystrmatch
OBJS = fuzzystrmatch.o dmetaphone.o $(WIN32RES)
EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.0.sql fuzzystrmatch--unpackaged--1.0.sql
+DATA = fuzzystrmatch--1.0.sql fuzzystrmatch--unpackaged--1.0.sql \
+ fuzzystrmatch--1.0--1.1.sql
PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
ifdef USE_PGXS
@@ -17,6 +18,3 @@ top_builddir = ../..
include $(top_builddir)/src/Makefile.global
include $(top_srcdir)/contrib/contrib-global.mk
endif
-
-# levenshtein.c is #included by fuzzystrmatch.c
-fuzzystrmatch.o: fuzzystrmatch.c levenshtein.c
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql
new file mode 100644
index 0000000..0fca2a6
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql
@@ -0,0 +1,9 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION fuzzystrmatch UPDATE TO '1.1'" to load this file. \quit
+
+DROP FUNCTION levenshtein (text,text);
+DROP FUNCTION levenshtein (text,text,int,int,int);
+DROP FUNCTION levenshtein_less_equal (text,text,int);
+DROP FUNCTION levenshtein_less_equal (text,text,int,int,int,int);
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql
deleted file mode 100644
index 1cf9b61..0000000
--- a/contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql
+++ /dev/null
@@ -1,44 +0,0 @@
-/* contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql */
-
--- complain if script is sourced in psql, rather than via CREATE EXTENSION
-\echo Use "CREATE EXTENSION fuzzystrmatch" to load this file. \quit
-
-CREATE FUNCTION levenshtein (text,text) RETURNS int
-AS 'MODULE_PATHNAME','levenshtein'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION levenshtein (text,text,int,int,int) RETURNS int
-AS 'MODULE_PATHNAME','levenshtein_with_costs'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION levenshtein_less_equal (text,text,int) RETURNS int
-AS 'MODULE_PATHNAME','levenshtein_less_equal'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION levenshtein_less_equal (text,text,int,int,int,int) RETURNS int
-AS 'MODULE_PATHNAME','levenshtein_less_equal_with_costs'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION metaphone (text,int) RETURNS text
-AS 'MODULE_PATHNAME','metaphone'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION soundex(text) RETURNS text
-AS 'MODULE_PATHNAME', 'soundex'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION text_soundex(text) RETURNS text
-AS 'MODULE_PATHNAME', 'soundex'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION difference(text,text) RETURNS int
-AS 'MODULE_PATHNAME', 'difference'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION dmetaphone (text) RETURNS text
-AS 'MODULE_PATHNAME', 'dmetaphone'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION dmetaphone_alt (text) RETURNS text
-AS 'MODULE_PATHNAME', 'dmetaphone_alt'
-LANGUAGE C IMMUTABLE STRICT;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
new file mode 100644
index 0000000..a4861ee
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION fuzzystrmatch" to load this file. \quit
+
+CREATE FUNCTION metaphone (text,int) RETURNS text
+AS 'MODULE_PATHNAME','metaphone'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION soundex(text) RETURNS text
+AS 'MODULE_PATHNAME', 'soundex'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION text_soundex(text) RETURNS text
+AS 'MODULE_PATHNAME', 'soundex'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION difference(text,text) RETURNS int
+AS 'MODULE_PATHNAME', 'difference'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION dmetaphone (text) RETURNS text
+AS 'MODULE_PATHNAME', 'dmetaphone'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION dmetaphone_alt (text) RETURNS text
+AS 'MODULE_PATHNAME', 'dmetaphone_alt'
+LANGUAGE C IMMUTABLE STRICT;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c
index 7a53d8a..9923c17 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.c
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.c
@@ -40,7 +40,6 @@
#include <ctype.h>
-#include "mb/pg_wchar.h"
#include "utils/builtins.h"
PG_MODULE_MAGIC;
@@ -154,74 +153,6 @@ getcode(char c)
/* These prevent GH from becoming F */
#define NOGHTOF(c) (getcode(c) & 16) /* BDH */
-/* Faster than memcmp(), for this use case. */
-static inline bool
-rest_of_char_same(const char *s1, const char *s2, int len)
-{
- while (len > 0)
- {
- len--;
- if (s1[len] != s2[len])
- return false;
- }
- return true;
-}
-
-#include "levenshtein.c"
-#define LEVENSHTEIN_LESS_EQUAL
-#include "levenshtein.c"
-
-PG_FUNCTION_INFO_V1(levenshtein_with_costs);
-Datum
-levenshtein_with_costs(PG_FUNCTION_ARGS)
-{
- text *src = PG_GETARG_TEXT_PP(0);
- text *dst = PG_GETARG_TEXT_PP(1);
- int ins_c = PG_GETARG_INT32(2);
- int del_c = PG_GETARG_INT32(3);
- int sub_c = PG_GETARG_INT32(4);
-
- PG_RETURN_INT32(levenshtein_internal(src, dst, ins_c, del_c, sub_c));
-}
-
-
-PG_FUNCTION_INFO_V1(levenshtein);
-Datum
-levenshtein(PG_FUNCTION_ARGS)
-{
- text *src = PG_GETARG_TEXT_PP(0);
- text *dst = PG_GETARG_TEXT_PP(1);
-
- PG_RETURN_INT32(levenshtein_internal(src, dst, 1, 1, 1));
-}
-
-
-PG_FUNCTION_INFO_V1(levenshtein_less_equal_with_costs);
-Datum
-levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS)
-{
- text *src = PG_GETARG_TEXT_PP(0);
- text *dst = PG_GETARG_TEXT_PP(1);
- int ins_c = PG_GETARG_INT32(2);
- int del_c = PG_GETARG_INT32(3);
- int sub_c = PG_GETARG_INT32(4);
- int max_d = PG_GETARG_INT32(5);
-
- PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, ins_c, del_c, sub_c, max_d));
-}
-
-
-PG_FUNCTION_INFO_V1(levenshtein_less_equal);
-Datum
-levenshtein_less_equal(PG_FUNCTION_ARGS)
-{
- text *src = PG_GETARG_TEXT_PP(0);
- text *dst = PG_GETARG_TEXT_PP(1);
- int max_d = PG_GETARG_INT32(2);
-
- PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, 1, 1, 1, max_d));
-}
-
/*
* Calculates the metaphone of an input string.
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.control b/contrib/fuzzystrmatch/fuzzystrmatch.control
index e257f09..6b2832a 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.control
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.control
@@ -1,5 +1,5 @@
# fuzzystrmatch extension
comment = 'determine similarities and distance between strings'
-default_version = '1.0'
+default_version = '1.1'
module_pathname = '$libdir/fuzzystrmatch'
relocatable = true
diff --git a/contrib/fuzzystrmatch/levenshtein.c b/contrib/fuzzystrmatch/levenshtein.c
deleted file mode 100644
index 4f37a54..0000000
--- a/contrib/fuzzystrmatch/levenshtein.c
+++ /dev/null
@@ -1,403 +0,0 @@
-/*
- * levenshtein.c
- *
- * Functions for "fuzzy" comparison of strings
- *
- * Joe Conway <mail@joeconway.com>
- *
- * Copyright (c) 2001-2014, PostgreSQL Global Development Group
- * ALL RIGHTS RESERVED;
- *
- * levenshtein()
- * -------------
- * Written based on a description of the algorithm by Michael Gilleland
- * found at http://www.merriampark.com/ld.htm
- * Also looked at levenshtein.c in the PHP 4.0.6 distribution for
- * inspiration.
- * Configurable penalty costs extension is introduced by Volkan
- * YAZICI <volkan.yazici@gmail.com>.
- */
-
-/*
- * External declarations for exported functions
- */
-#ifdef LEVENSHTEIN_LESS_EQUAL
-static int levenshtein_less_equal_internal(text *s, text *t,
- int ins_c, int del_c, int sub_c, int max_d);
-#else
-static int levenshtein_internal(text *s, text *t,
- int ins_c, int del_c, int sub_c);
-#endif
-
-#define MAX_LEVENSHTEIN_STRLEN 255
-
-
-/*
- * Calculates Levenshtein distance metric between supplied strings. Generally
- * (1, 1, 1) penalty costs suffices for common cases, but your mileage may
- * vary.
- *
- * One way to compute Levenshtein distance is to incrementally construct
- * an (m+1)x(n+1) matrix where cell (i, j) represents the minimum number
- * of operations required to transform the first i characters of s into
- * the first j characters of t. The last column of the final row is the
- * answer.
- *
- * We use that algorithm here with some modification. In lieu of holding
- * the entire array in memory at once, we'll just use two arrays of size
- * m+1 for storing accumulated values. At each step one array represents
- * the "previous" row and one is the "current" row of the notional large
- * array.
- *
- * If max_d >= 0, we only need to provide an accurate answer when that answer
- * is less than or equal to the bound. From any cell in the matrix, there is
- * theoretical "minimum residual distance" from that cell to the last column
- * of the final row. This minimum residual distance is zero when the
- * untransformed portions of the strings are of equal length (because we might
- * get lucky and find all the remaining characters matching) and is otherwise
- * based on the minimum number of insertions or deletions needed to make them
- * equal length. The residual distance grows as we move toward the upper
- * right or lower left corners of the matrix. When the max_d bound is
- * usefully tight, we can use this property to avoid computing the entirety
- * of each row; instead, we maintain a start_column and stop_column that
- * identify the portion of the matrix close to the diagonal which can still
- * affect the final answer.
- */
-static int
-#ifdef LEVENSHTEIN_LESS_EQUAL
-levenshtein_less_equal_internal(text *s, text *t,
- int ins_c, int del_c, int sub_c, int max_d)
-#else
-levenshtein_internal(text *s, text *t,
- int ins_c, int del_c, int sub_c)
-#endif
-{
- int m,
- n,
- s_bytes,
- t_bytes;
- int *prev;
- int *curr;
- int *s_char_len = NULL;
- int i,
- j;
- const char *s_data;
- const char *t_data;
- const char *y;
-
- /*
- * For levenshtein_less_equal_internal, we have real variables called
- * start_column and stop_column; otherwise it's just short-hand for 0 and
- * m.
- */
-#ifdef LEVENSHTEIN_LESS_EQUAL
- int start_column,
- stop_column;
-
-#undef START_COLUMN
-#undef STOP_COLUMN
-#define START_COLUMN start_column
-#define STOP_COLUMN stop_column
-#else
-#undef START_COLUMN
-#undef STOP_COLUMN
-#define START_COLUMN 0
-#define STOP_COLUMN m
-#endif
-
- /* Extract a pointer to the actual character data. */
- s_data = VARDATA_ANY(s);
- t_data = VARDATA_ANY(t);
-
- /* Determine length of each string in bytes and characters. */
- s_bytes = VARSIZE_ANY_EXHDR(s);
- t_bytes = VARSIZE_ANY_EXHDR(t);
- m = pg_mbstrlen_with_len(s_data, s_bytes);
- n = pg_mbstrlen_with_len(t_data, t_bytes);
-
- /*
- * We can transform an empty s into t with n insertions, or a non-empty t
- * into an empty s with m deletions.
- */
- if (!m)
- return n * ins_c;
- if (!n)
- return m * del_c;
-
- /*
- * For security concerns, restrict excessive CPU+RAM usage. (This
- * implementation uses O(m) memory and has O(mn) complexity.)
- */
- if (m > MAX_LEVENSHTEIN_STRLEN ||
- n > MAX_LEVENSHTEIN_STRLEN)
- ereport(ERROR,
- (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("argument exceeds the maximum length of %d bytes",
- MAX_LEVENSHTEIN_STRLEN)));
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
- /* Initialize start and stop columns. */
- start_column = 0;
- stop_column = m + 1;
-
- /*
- * If max_d >= 0, determine whether the bound is impossibly tight. If so,
- * return max_d + 1 immediately. Otherwise, determine whether it's tight
- * enough to limit the computation we must perform. If so, figure out
- * initial stop column.
- */
- if (max_d >= 0)
- {
- int min_theo_d; /* Theoretical minimum distance. */
- int max_theo_d; /* Theoretical maximum distance. */
- int net_inserts = n - m;
-
- min_theo_d = net_inserts < 0 ?
- -net_inserts * del_c : net_inserts * ins_c;
- if (min_theo_d > max_d)
- return max_d + 1;
- if (ins_c + del_c < sub_c)
- sub_c = ins_c + del_c;
- max_theo_d = min_theo_d + sub_c * Min(m, n);
- if (max_d >= max_theo_d)
- max_d = -1;
- else if (ins_c + del_c > 0)
- {
- /*
- * Figure out how much of the first row of the notional matrix we
- * need to fill in. If the string is growing, the theoretical
- * minimum distance already incorporates the cost of deleting the
- * number of characters necessary to make the two strings equal in
- * length. Each additional deletion forces another insertion, so
- * the best-case total cost increases by ins_c + del_c. If the
- * string is shrinking, the minimum theoretical cost assumes no
- * excess deletions; that is, we're starting no further right than
- * column n - m. If we do start further right, the best-case
- * total cost increases by ins_c + del_c for each move right.
- */
- int slack_d = max_d - min_theo_d;
- int best_column = net_inserts < 0 ? -net_inserts : 0;
-
- stop_column = best_column + (slack_d / (ins_c + del_c)) + 1;
- if (stop_column > m)
- stop_column = m + 1;
- }
- }
-#endif
-
- /*
- * In order to avoid calling pg_mblen() repeatedly on each character in s,
- * we cache all the lengths before starting the main loop -- but if all
- * the characters in both strings are single byte, then we skip this and
- * use a fast-path in the main loop. If only one string contains
- * multi-byte characters, we still build the array, so that the fast-path
- * needn't deal with the case where the array hasn't been initialized.
- */
- if (m != s_bytes || n != t_bytes)
- {
- int i;
- const char *cp = s_data;
-
- s_char_len = (int *) palloc((m + 1) * sizeof(int));
- for (i = 0; i < m; ++i)
- {
- s_char_len[i] = pg_mblen(cp);
- cp += s_char_len[i];
- }
- s_char_len[i] = 0;
- }
-
- /* One more cell for initialization column and row. */
- ++m;
- ++n;
-
- /* Previous and current rows of notional array. */
- prev = (int *) palloc(2 * m * sizeof(int));
- curr = prev + m;
-
- /*
- * To transform the first i characters of s into the first 0 characters of
- * t, we must perform i deletions.
- */
- for (i = START_COLUMN; i < STOP_COLUMN; i++)
- prev[i] = i * del_c;
-
- /* Loop through rows of the notional array */
- for (y = t_data, j = 1; j < n; j++)
- {
- int *temp;
- const char *x = s_data;
- int y_char_len = n != t_bytes + 1 ? pg_mblen(y) : 1;
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
-
- /*
- * In the best case, values percolate down the diagonal unchanged, so
- * we must increment stop_column unless it's already on the right end
- * of the array. The inner loop will read prev[stop_column], so we
- * have to initialize it even though it shouldn't affect the result.
- */
- if (stop_column < m)
- {
- prev[stop_column] = max_d + 1;
- ++stop_column;
- }
-
- /*
- * The main loop fills in curr, but curr[0] needs a special case: to
- * transform the first 0 characters of s into the first j characters
- * of t, we must perform j insertions. However, if start_column > 0,
- * this special case does not apply.
- */
- if (start_column == 0)
- {
- curr[0] = j * ins_c;
- i = 1;
- }
- else
- i = start_column;
-#else
- curr[0] = j * ins_c;
- i = 1;
-#endif
-
- /*
- * This inner loop is critical to performance, so we include a
- * fast-path to handle the (fairly common) case where no multibyte
- * characters are in the mix. The fast-path is entitled to assume
- * that if s_char_len is not initialized then BOTH strings contain
- * only single-byte characters.
- */
- if (s_char_len != NULL)
- {
- for (; i < STOP_COLUMN; i++)
- {
- int ins;
- int del;
- int sub;
- int x_char_len = s_char_len[i - 1];
-
- /*
- * Calculate costs for insertion, deletion, and substitution.
- *
- * When calculating cost for substitution, we compare the last
- * character of each possibly-multibyte character first,
- * because that's enough to rule out most mis-matches. If we
- * get past that test, then we compare the lengths and the
- * remaining bytes.
- */
- ins = prev[i] + ins_c;
- del = curr[i - 1] + del_c;
- if (x[x_char_len - 1] == y[y_char_len - 1]
- && x_char_len == y_char_len &&
- (x_char_len == 1 || rest_of_char_same(x, y, x_char_len)))
- sub = prev[i - 1];
- else
- sub = prev[i - 1] + sub_c;
-
- /* Take the one with minimum cost. */
- curr[i] = Min(ins, del);
- curr[i] = Min(curr[i], sub);
-
- /* Point to next character. */
- x += x_char_len;
- }
- }
- else
- {
- for (; i < STOP_COLUMN; i++)
- {
- int ins;
- int del;
- int sub;
-
- /* Calculate costs for insertion, deletion, and substitution. */
- ins = prev[i] + ins_c;
- del = curr[i - 1] + del_c;
- sub = prev[i - 1] + ((*x == *y) ? 0 : sub_c);
-
- /* Take the one with minimum cost. */
- curr[i] = Min(ins, del);
- curr[i] = Min(curr[i], sub);
-
- /* Point to next character. */
- x++;
- }
- }
-
- /* Swap current row with previous row. */
- temp = curr;
- curr = prev;
- prev = temp;
-
- /* Point to next character. */
- y += y_char_len;
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
-
- /*
- * This chunk of code represents a significant performance hit if used
- * in the case where there is no max_d bound. This is probably not
- * because the max_d >= 0 test itself is expensive, but rather because
- * the possibility of needing to execute this code prevents tight
- * optimization of the loop as a whole.
- */
- if (max_d >= 0)
- {
- /*
- * The "zero point" is the column of the current row where the
- * remaining portions of the strings are of equal length. There
- * are (n - 1) characters in the target string, of which j have
- * been transformed. There are (m - 1) characters in the source
- * string, so we want to find the value for zp where (n - 1) - j =
- * (m - 1) - zp.
- */
- int zp = j - (n - m);
-
- /* Check whether the stop column can slide left. */
- while (stop_column > 0)
- {
- int ii = stop_column - 1;
- int net_inserts = ii - zp;
-
- if (prev[ii] + (net_inserts > 0 ? net_inserts * ins_c :
- -net_inserts * del_c) <= max_d)
- break;
- stop_column--;
- }
-
- /* Check whether the start column can slide right. */
- while (start_column < stop_column)
- {
- int net_inserts = start_column - zp;
-
- if (prev[start_column] +
- (net_inserts > 0 ? net_inserts * ins_c :
- -net_inserts * del_c) <= max_d)
- break;
-
- /*
- * We'll never again update these values, so we must make sure
- * there's nothing here that could confuse any future
- * iteration of the outer loop.
- */
- prev[start_column] = max_d + 1;
- curr[start_column] = max_d + 1;
- if (start_column != 0)
- s_data += (s_char_len != NULL) ? s_char_len[start_column - 1] : 1;
- start_column++;
- }
-
- /* If they cross, we're going to exceed the bound. */
- if (start_column >= stop_column)
- return max_d + 1;
- }
-#endif
- }
-
- /*
- * Because the final value was swapped from the previous row to the
- * current row, that's where we'll find it.
- */
- return prev[m - 1];
-}
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index bf13140..84ae29c 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -7893,6 +7893,74 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
</para>
</sect1>
+ <sect1 id="functions-levenshtein">
+ <title>Levenshtein Functions</title>
+
+ <para>
+ The Levenshtein functions calculate the edit distance between two
+ strings.
+ </para>
+
+ <table id="functions-levenshtein-table">
+ <title>Levenshtein Functions</title>
+ <tgroup cols="4">
+ <thead>
+ <row>
+ <entry>Function</entry>
+ <entry>Description</entry>
+ <entry>Example</entry>
+ <entry>Example Result</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>
+ <indexterm>
+ <primary>levenshtein</primary>
+ </indexterm>
+ <literal>levenshtein(text source, text target [, int ins_cost, int del_cost, int sub_cost])</literal>
+ </entry>
+ <entry>
+ Returns the distance between the two given strings <literal>source</>
+ and <literal>target</>.
+ </entry>
+ <entry><literal>levenshtein('GUMBO', 'GAMBOL')</literal></entry>
+ <entry><literal>2</literal></entry>
+ </row>
+ <row>
+ <entry>
+ <indexterm>
+ <primary>levenshtein_less_equal</primary>
+ </indexterm>
+ <literal>levenshtein_less_equal(text source, text target [, int ins_cost, int del_cost, int sub_cost], int max_d)</literal>
+ </entry>
+ <entry>
+ Returns the distance between the two strings, if it does not exceed
+ <literal>max_d</>; otherwise a value greater than <literal>max_d</>.
+ </entry>
+ <entry><literal>levenshtein_less_equal('extensive', 'exhaustive', 2)</literal></entry>
+ <entry><literal>3</literal></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ Both <literal>source</literal> and <literal>target</literal> can be any
+ non-null string, with a maximum of 255 bytes. The cost parameters
+ <literal>ins_cost</>, <literal>del_cost</> and <literal>sub_cost</>
+ specify how much to charge for a character insertion, deletion, or
+ substitution, respectively. You can omit the cost parameters, as in
+ the second version of the function; in that case they all default to 1.
+ <literal>levenshtein_less_equal</literal> is an accelerated version of
+ the levenshtein function, for use when only small distances are of
+ interest. If the actual distance is less than or equal to
+ <literal>max_d</>, then <literal>levenshtein_less_equal</literal>
+ returns the exact distance; otherwise it returns some value greater
+ than <literal>max_d</>.
+ </para>
+ </sect1>
+
<sect1 id="functions-geometry">
<title>Geometric Functions and Operators</title>
diff --git a/doc/src/sgml/fuzzystrmatch.sgml b/doc/src/sgml/fuzzystrmatch.sgml
index f26bd90..f95d5aa 100644
--- a/doc/src/sgml/fuzzystrmatch.sgml
+++ b/doc/src/sgml/fuzzystrmatch.sgml
@@ -83,72 +83,6 @@ SELECT * FROM s WHERE difference(s.nm, 'john') > 2;
</sect2>
<sect2>
- <title>Levenshtein</title>
-
- <para>
- This function calculates the Levenshtein distance between two strings:
- </para>
-
- <indexterm>
- <primary>levenshtein</primary>
- </indexterm>
-
- <indexterm>
- <primary>levenshtein_less_equal</primary>
- </indexterm>
-
-<synopsis>
-levenshtein(text source, text target, int ins_cost, int del_cost, int sub_cost) returns int
-levenshtein(text source, text target) returns int
-levenshtein_less_equal(text source, text target, int ins_cost, int del_cost, int sub_cost, int max_d) returns int
-levenshtein_less_equal(text source, text target, int max_d) returns int
-</synopsis>
-
- <para>
- Both <literal>source</literal> and <literal>target</literal> can be any
- non-null string, with a maximum of 255 bytes. The cost parameters
- specify how much to charge for a character insertion, deletion, or
- substitution, respectively. You can omit the cost parameters, as in
- the second version of the function; in that case they all default to 1.
- <literal>levenshtein_less_equal</literal> is accelerated version of
- levenshtein function for low values of distance. If actual distance
- is less or equal then max_d, then <literal>levenshtein_less_equal</literal>
- returns accurate value of it. Otherwise this function returns value
- which is greater than max_d.
- </para>
-
- <para>
- Examples:
- </para>
-
-<screen>
-test=# SELECT levenshtein('GUMBO', 'GAMBOL');
- levenshtein
--------------
- 2
-(1 row)
-
-test=# SELECT levenshtein('GUMBO', 'GAMBOL', 2,1,1);
- levenshtein
--------------
- 3
-(1 row)
-
-test=# SELECT levenshtein_less_equal('extensive', 'exhaustive',2);
- levenshtein_less_equal
-------------------------
- 3
-(1 row)
-
-test=# SELECT levenshtein_less_equal('extensive', 'exhaustive',4);
- levenshtein_less_equal
-------------------------
- 4
-(1 row)
-</screen>
- </sect2>
-
- <sect2>
<title>Metaphone</title>
<para>
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 7b4391b..7071afe 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,8 +22,8 @@ OBJS = acl.o arrayfuncs.o array_selfuncs.o array_typanalyze.o \
encode.o enum.o float.o format_type.o formatting.o genfile.o \
geo_ops.o geo_selfuncs.o inet_cidr_ntop.o inet_net_pton.o int.o \
int8.o json.o jsonb.o jsonb_gin.o jsonb_op.o jsonb_util.o \
- jsonfuncs.o like.o lockfuncs.o mac.o misc.o nabstime.o name.o \
- network.o network_gist.o network_selfuncs.o \
+ jsonfuncs.o levenshtein.o like.o lockfuncs.o mac.o misc.o nabstime.o \
+ name.o network.o network_gist.o network_selfuncs.o \
numeric.o numutils.o oid.o oracle_compat.o \
orderedsetaggs.o pg_lzcompress.o pg_locale.o pg_lsn.o \
pgstatfuncs.o pseudotypes.o quote.o rangetypes.o rangetypes_gist.o \
diff --git a/src/backend/utils/adt/levenshtein.c b/src/backend/utils/adt/levenshtein.c
new file mode 100644
index 0000000..e405ad8
--- /dev/null
+++ b/src/backend/utils/adt/levenshtein.c
@@ -0,0 +1,565 @@
+/*-------------------------------------------------------------------------
+ *
+ * levenshtein.c
+ * Levenshtein distance implementation.
+ *
+ * Original author: Joe Conway <mail@joeconway.com>
+ *
+ * This file provides a common routine, levenshtein_common(), which computes
+ * (1) Levenshtein distance with custom costs, and (2) Levenshtein distance
+ * with custom costs and a "max" value above which exact distances are not
+ * interesting. The inline helper function rest_of_char_same() is defined
+ * here as well.
+ *
+ * All string arguments must satisfy strlen(s) <= MAX_LEVENSHTEIN_STRLEN.
+ *
+ * Written based on a description of the algorithm by Michael Gilleland found
+ * at http://www.merriampark.com/ld.htm. Also looked at levenshtein.c in the
+ * PHP 4.0.6 distribution for inspiration. Configurable penalty costs
+ * extension is introduced by Volkan YAZICI <volkan.yazici@gmail.com>.
+ *
+ * Copyright (c) 2001-2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/adt/levenshtein.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "utils/levenshtein.h"
+
+#include "fmgr.h"
+#include "utils/builtins.h"
+
+#include "mb/pg_wchar.h"
+
+/*
+ * Maximum length of strings authorized for distance calculation.
+ */
+#define MAX_LEVENSHTEIN_STRLEN 255
+
+/*
+ * Helper function. Faster than memcmp(), for this use case.
+ */
+static inline bool
+rest_of_char_same(const char *s1, const char *s2, int len)
+{
+ while (len > 0)
+ {
+ len--;
+ if (s1[len] != s2[len])
+ return false;
+ }
+ return true;
+}
+
+/*
+ * Calculates Levenshtein distance metric between supplied cstrings, which are
+ * not necessarily null-terminated. Generally (1, 1, 1) penalty costs suffices
+ * for common cases, but your mileage may vary.
+ *
+ * One way to compute Levenshtein distance is to incrementally construct
+ * an (m+1)x(n+1) matrix where cell (i, j) represents the minimum number
+ * of operations required to transform the first i characters of s into
+ * the first j characters of t. The last column of the final row is the
+ * answer.
+ *
+ * We use that algorithm here with some modification. In lieu of holding
+ * the entire array in memory at once, we'll just use two arrays of size
+ * m+1 for storing accumulated values. At each step one array represents
+ * the "previous" row and one is the "current" row of the notional large
+ * array.
+ *
+ * If max_d >= 0, we only need to provide an accurate answer when that answer
+ * is less than or equal to the bound. From any cell in the matrix, there is
+ * theoretical "minimum residual distance" from that cell to the last column
+ * of the final row. This minimum residual distance is zero when the
+ * untransformed portions of the strings are of equal length (because we might
+ * get lucky and find all the remaining characters matching) and is otherwise
+ * based on the minimum number of insertions or deletions needed to make them
+ * equal length. The residual distance grows as we move toward the upper
+ * right or lower left corners of the matrix. When the max_d bound is
+ * usefully tight, we can use this property to avoid computing the entirety
+ * of each row; instead, we maintain a start_column and stop_column that
+ * identify the portion of the matrix close to the diagonal which can still
+ * affect the final answer.
+ */
+
+/*
+ * levenshtein_common
+ *
+ * Common routine for all Levenshtein functions.
+ */
+static int
+levenshtein_common(const char *source, int slen, const char *target,
+ int tlen, int ins_c, int del_c, int sub_c, int max_d)
+{
+ int m, n;
+ int *prev;
+ int *curr;
+ int *s_char_len = NULL;
+ int i,
+ j;
+ const char *y;
+ int max_init;
+ int start_column,
+ stop_column;
+ int start_column_local, stop_column_local;
+
+ /* Save value of max_d */
+ max_init = max_d;
+
+ m = pg_mbstrlen_with_len(source, slen);
+ n = pg_mbstrlen_with_len(target, tlen);
+
+ /*
+ * We can transform an empty s into t with n insertions, or a non-empty t
+ * into an empty s with m deletions.
+ */
+ if (!m)
+ return n * ins_c;
+ if (!n)
+ return m * del_c;
+
+ /*
+ * A common use for Levenshtein distance is to match column names.
+ * Therefore, MAX_LEVENSHTEIN_STRLEN must be at least NAMEDATALEN, so that
+ * this use case is guaranteed to work.
+ */
+ StaticAssertStmt(NAMEDATALEN <= MAX_LEVENSHTEIN_STRLEN,
+ "Levenshtein hinting mechanism restricts NAMEDATALEN");
+
+ /*
+ * For security concerns, restrict excessive CPU+RAM usage. (This
+ * implementation uses O(m) memory and has O(mn) complexity.)
+ */
+ if (m > MAX_LEVENSHTEIN_STRLEN ||
+ n > MAX_LEVENSHTEIN_STRLEN)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("argument exceeds the maximum length of %d bytes",
+ MAX_LEVENSHTEIN_STRLEN)));
+
+ /* This optimization is done only for less-equal calculation */
+ if (max_init >= 0)
+ {
+ /* Initialize start and stop columns. */
+ start_column = 0;
+ stop_column = m + 1;
+
+ /*
+ * If max_d >= 0, determine whether the bound is impossibly tight. If so,
+ * return max_d + 1 immediately. Otherwise, determine whether it's tight
+ * enough to limit the computation we must perform. If so, figure out
+ * initial stop column.
+ */
+ if (max_d >= 0)
+ {
+ int min_theo_d; /* Theoretical minimum distance. */
+ int max_theo_d; /* Theoretical maximum distance. */
+ int net_inserts = n - m;
+
+ min_theo_d = net_inserts < 0 ?
+ -net_inserts * del_c : net_inserts * ins_c;
+ if (min_theo_d > max_d)
+ return max_d + 1;
+ if (ins_c + del_c < sub_c)
+ sub_c = ins_c + del_c;
+ max_theo_d = min_theo_d + sub_c * Min(m, n);
+ if (max_d >= max_theo_d)
+ max_d = -1;
+ else if (ins_c + del_c > 0)
+ {
+ /*
+ * Figure out how much of the first row of the notional matrix we
+ * need to fill in. If the string is growing, the theoretical
+ * minimum distance already incorporates the cost of deleting the
+ * number of characters necessary to make the two strings equal in
+ * length. Each additional deletion forces another insertion, so
+ * the best-case total cost increases by ins_c + del_c. If the
+ * string is shrinking, the minimum theoretical cost assumes no
+ * excess deletions; that is, we're starting no further right than
+ * column n - m. If we do start further right, the best-case
+ * total cost increases by ins_c + del_c for each move right.
+ */
+ int slack_d = max_d - min_theo_d;
+ int best_column = net_inserts < 0 ? -net_inserts : 0;
+
+ stop_column = best_column + (slack_d / (ins_c + del_c)) + 1;
+ if (stop_column > m)
+ stop_column = m + 1;
+ }
+ }
+ }
+ else
+ {
+ /*
+ * Make sure start and stop columns are set correctly in all cases.
+ */
+ start_column = 0;
+ stop_column = m;
+ }
+
+ /*
+ * In order to avoid calling pg_mblen() repeatedly on each character in s,
+ * we cache all the lengths before starting the main loop -- but if all
+ * the characters in both strings are single byte, then we skip this and
+ * use a fast-path in the main loop. If only one string contains
+ * multi-byte characters, we still build the array, so that the fast-path
+ * needn't deal with the case where the array hasn't been initialized.
+ */
+ if (m != slen || n != tlen)
+ {
+ int i;
+ const char *cp = source;
+
+ s_char_len = (int *) palloc((m + 1) * sizeof(int));
+ for (i = 0; i < m; ++i)
+ {
+ s_char_len[i] = pg_mblen(cp);
+ cp += s_char_len[i];
+ }
+ s_char_len[i] = 0;
+ }
+
+ /* One more cell for initialization column and row. */
+ ++m;
+ ++n;
+
+ /* Previous and current rows of notional array. */
+ prev = (int *) palloc(2 * m * sizeof(int));
+ curr = prev + m;
+
+ /*
+ * To transform the first i characters of s into the first 0 characters of
+ * t, we must perform i deletions.
+ */
+ if (max_init >= 0)
+ stop_column_local = stop_column;
+ else
+ stop_column_local = m;
+
+ for (i = 0; i < stop_column_local; i++)
+ prev[i] = i * del_c;
+
+ /* Loop through rows of the notional array */
+ for (y = target, j = 1; j < n; j++)
+ {
+ int *temp;
+ const char *x = source;
+ int y_char_len = n != tlen + 1 ? pg_mblen(y) : 1;
+
+ /* This optimization is done only for less-equal calculation */
+ if (max_init >= 0)
+ {
+ /*
+ * In the best case, values percolate down the diagonal unchanged, so
+ * we must increment stop_column unless it's already on the right end
+ * of the array. The inner loop will read prev[stop_column], so we
+ * have to initialize it even though it shouldn't affect the result.
+ */
+ if (stop_column < m)
+ {
+ prev[stop_column] = max_d + 1;
+ ++stop_column;
+ }
+
+ /*
+ * The main loop fills in curr, but curr[0] needs a special case: to
+ * transform the first 0 characters of s into the first j characters
+ * of t, we must perform j insertions. However, if start_column > 0,
+ * this special case does not apply.
+ */
+ if (start_column == 0)
+ {
+ curr[0] = j * ins_c;
+ i = 1;
+ }
+ else
+ i = start_column;
+ }
+ else
+ {
+ curr[0] = j * ins_c;
+ i = 1;
+ }
+
+ /*
+ * This inner loop is critical to performance, so we include a
+ * fast-path to handle the (fairly common) case where no multibyte
+ * characters are in the mix. The fast-path is entitled to assume
+ * that if s_char_len is not initialized then BOTH strings contain
+ * only single-byte characters.
+ */
+ if (s_char_len != NULL)
+ {
+ if (max_init < 0)
+ stop_column_local = m;
+ else
+ stop_column_local = stop_column;
+
+ for (; i < stop_column_local; i++)
+ {
+ int ins;
+ int del;
+ int sub;
+ int x_char_len = s_char_len[i - 1];
+
+ /*
+ * Calculate costs for insertion, deletion, and substitution.
+ *
+ * When calculating cost for substitution, we compare the last
+ * character of each possibly-multibyte character first,
+ * because that's enough to rule out most mis-matches. If we
+ * get past that test, then we compare the lengths and the
+ * remaining bytes.
+ */
+ ins = prev[i] + ins_c;
+ del = curr[i - 1] + del_c;
+ if (x[x_char_len - 1] == y[y_char_len - 1]
+ && x_char_len == y_char_len &&
+ (x_char_len == 1 || rest_of_char_same(x, y, x_char_len)))
+ sub = prev[i - 1];
+ else
+ sub = prev[i - 1] + sub_c;
+
+ /* Take the one with minimum cost. */
+ curr[i] = Min(ins, del);
+ curr[i] = Min(curr[i], sub);
+
+ /* Point to next character. */
+ x += x_char_len;
+ }
+ }
+ else
+ {
+ if (max_init < 0)
+ stop_column_local = m;
+ else
+ stop_column_local = stop_column;
+
+ for (; i < stop_column_local; i++)
+ {
+ int ins;
+ int del;
+ int sub;
+
+ /* Calculate costs for insertion, deletion, and substitution. */
+ ins = prev[i] + ins_c;
+ del = curr[i - 1] + del_c;
+ sub = prev[i - 1] + ((*x == *y) ? 0 : sub_c);
+
+ /* Take the one with minimum cost. */
+ curr[i] = Min(ins, del);
+ curr[i] = Min(curr[i], sub);
+
+ /* Point to next character. */
+ x++;
+ }
+ }
+
+ /* Swap current row with previous row. */
+ temp = curr;
+ curr = prev;
+ prev = temp;
+
+ /* Point to next character. */
+ y += y_char_len;
+
+ /*
+ * This chunk of code represents a significant performance hit if used
+ * in the case where there is no max_d bound. This is probably not
+ * because the max_d >= 0 test itself is expensive, but rather because
+ * the possibility of needing to execute this code prevents tight
+ * optimization of the loop as a whole.
+ */
+ if (max_init >= 0 && max_d >= 0)
+ {
+ /*
+ * The "zero point" is the column of the current row where the
+ * remaining portions of the strings are of equal length. There
+ * are (n - 1) characters in the target string, of which j have
+ * been transformed. There are (m - 1) characters in the source
+ * string, so we want to find the value for zp where (n - 1) - j =
+ * (m - 1) - zp.
+ */
+ int zp = j - (n - m);
+
+ /* Check whether the stop column can slide left. */
+ while (stop_column > 0)
+ {
+ int ii = stop_column - 1;
+ int net_inserts = ii - zp;
+
+ if (prev[ii] + (net_inserts > 0 ? net_inserts * ins_c :
+ -net_inserts * del_c) <= max_d)
+ break;
+ stop_column--;
+ }
+
+ /* Check whether the start column can slide right. */
+ while (start_column < stop_column)
+ {
+ int net_inserts = start_column - zp;
+
+ if (prev[start_column] +
+ (net_inserts > 0 ? net_inserts * ins_c :
+ -net_inserts * del_c) <= max_d)
+ break;
+
+ /*
+ * We'll never again update these values, so we must make sure
+ * there's nothing here that could confuse any future
+ * iteration of the outer loop.
+ */
+ prev[start_column] = max_d + 1;
+ curr[start_column] = max_d + 1;
+ if (start_column != 0)
+ source += (s_char_len != NULL) ? s_char_len[start_column - 1] : 1;
+ start_column++;
+ }
+
+ /* If they cross, we're going to exceed the bound. */
+ if (start_column >= stop_column)
+ return max_d + 1;
+ }
+ }
+
+ /*
+ * Because the final value was swapped from the previous row to the
+ * current row, that's where we'll find it.
+ */
+ return prev[m - 1];
+}
+
+/*
+ * levenshtein_internal
+ * levenshtein_less_equal_internal
+ *
+ * Internal procedures for Levenshtein distance calculation.
+ */
+int
+levenshtein_less_equal_internal(const char *source, int slen, const char *target,
+ int tlen, int ins_c, int del_c, int sub_c, int max_d)
+{
+ return levenshtein_common(source, slen, target, tlen, ins_c,
+ del_c, sub_c, max_d);
+}
+
+int
+levenshtein_internal(const char *source, int slen, const char *target, int tlen,
+ int ins_c, int del_c, int sub_c)
+{
+ return levenshtein_common(source, slen, target, tlen, ins_c,
+ del_c, sub_c, -1);
+}
+
+/*
+ * levenshtein
+ *
+ * Calculate the Levenshtein distance of two strings.
+ */
+Datum
+levenshtein(PG_FUNCTION_ARGS)
+{
+ text *src = PG_GETARG_TEXT_PP(0);
+ text *dst = PG_GETARG_TEXT_PP(1);
+ const char *s_data;
+ const char *t_data;
+ int s_bytes, t_bytes;
+
+ /* Extract a pointer to the actual character data */
+ s_data = VARDATA_ANY(src);
+ t_data = VARDATA_ANY(dst);
+ /* Determine length of each string in bytes and characters */
+ s_bytes = VARSIZE_ANY_EXHDR(src);
+ t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+ PG_RETURN_INT32(levenshtein_internal(s_data, s_bytes, t_data,
+ t_bytes, 1, 1, 1));
+}
+
+/*
+ * levenshtein_with_costs
+ *
+ * Improved version of levenshtein with costs for character insertion,
+ * deletion and substitution.
+ */
+Datum
+levenshtein_with_costs(PG_FUNCTION_ARGS)
+{
+ text *src = PG_GETARG_TEXT_PP(0);
+ text *dst = PG_GETARG_TEXT_PP(1);
+ int ins_c = PG_GETARG_INT32(2);
+ int del_c = PG_GETARG_INT32(3);
+ int sub_c = PG_GETARG_INT32(4);
+ const char *s_data;
+ const char *t_data;
+ int s_bytes, t_bytes;
+
+ /* Extract a pointer to the actual character data */
+ s_data = VARDATA_ANY(src);
+ t_data = VARDATA_ANY(dst);
+ /* Determine length of each string in bytes and characters */
+ s_bytes = VARSIZE_ANY_EXHDR(src);
+ t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+ PG_RETURN_INT32(levenshtein_internal(s_data, s_bytes, t_data,
+ t_bytes, ins_c, del_c, sub_c));
+}
+
+/*
+ * levenshtein_less_equal
+ *
+ * Accelerated version of levenshtein for low distances.
+ */
+Datum
+levenshtein_less_equal(PG_FUNCTION_ARGS)
+{
+ text *src = PG_GETARG_TEXT_PP(0);
+ text *dst = PG_GETARG_TEXT_PP(1);
+ int max_d = PG_GETARG_INT32(2);
+ const char *s_data;
+ const char *t_data;
+ int s_bytes, t_bytes;
+
+ /* Extract a pointer to the actual character data */
+ s_data = VARDATA_ANY(src);
+ t_data = VARDATA_ANY(dst);
+ /* Determine length of each string in bytes and characters */
+ s_bytes = VARSIZE_ANY_EXHDR(src);
+ t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+ PG_RETURN_INT32(levenshtein_less_equal_internal(s_data, s_bytes,
+ t_data, t_bytes, 1, 1, 1, max_d));
+}
+
+/*
+ * levenshtein_less_equal_with_costs
+ *
+ * Accelerated version of levenshtein for low distances with costs for
+ * character insertion, deletion and substitution.
+ */
+Datum
+levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS)
+{
+ text *src = PG_GETARG_TEXT_PP(0);
+ text *dst = PG_GETARG_TEXT_PP(1);
+ int ins_c = PG_GETARG_INT32(2);
+ int del_c = PG_GETARG_INT32(3);
+ int sub_c = PG_GETARG_INT32(4);
+ int max_d = PG_GETARG_INT32(5);
+ const char *s_data;
+ const char *t_data;
+ int s_bytes, t_bytes;
+
+ /* Extract a pointer to the actual character data */
+ s_data = VARDATA_ANY(src);
+ t_data = VARDATA_ANY(dst);
+ /* Determine length of each string in bytes and characters */
+ s_bytes = VARSIZE_ANY_EXHDR(src);
+ t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+ PG_RETURN_INT32(levenshtein_less_equal_internal(s_data, s_bytes,
+ t_data, t_bytes, ins_c, del_c, sub_c, max_d));
+}
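For anyone reading along without applying the patch, the core of levenshtein_common() boils down to a classic two-row dynamic program. Here is a minimal standalone sketch (my own toy code, not the patched backend: single-byte characters only, no early-exit band, but the same configurable costs), which reproduces the results in the regression tests below:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MIN2(a, b) ((a) < (b) ? (a) : (b))

/*
 * Plain two-row Levenshtein distance from s to t; ins_c/del_c/sub_c
 * mirror the costs exposed at the SQL level by levenshtein_with_costs().
 */
static int
toy_levenshtein(const char *s, const char *t, int ins_c, int del_c, int sub_c)
{
	int			m = (int) strlen(s) + 1;	/* columns: prefixes of s */
	int			n = (int) strlen(t) + 1;	/* rows: prefixes of t */
	int		   *prev = malloc(2 * m * sizeof(int));
	int		   *curr = prev + m;
	int			i, j, result;

	/* Transforming the first i chars of s into "" costs i deletions. */
	for (i = 0; i < m; i++)
		prev[i] = i * del_c;

	for (j = 1; j < n; j++)
	{
		int		   *tmp;

		/* Transforming "" into the first j chars of t costs j insertions. */
		curr[0] = j * ins_c;
		for (i = 1; i < m; i++)
		{
			int			ins = prev[i] + ins_c;
			int			del = curr[i - 1] + del_c;
			int			sub = prev[i - 1] +
				(s[i - 1] == t[j - 1] ? 0 : sub_c);

			curr[i] = MIN2(MIN2(ins, del), sub);
		}
		/* Swap rows so only two are ever live. */
		tmp = curr;
		curr = prev;
		prev = tmp;
	}
	result = prev[m - 1];
	free(prev < curr ? prev : curr);
	return result;
}
```

The banded max_d variant in the patch is the same recurrence, plus bookkeeping that slides start_column/stop_column inward once a row can no longer beat the bound.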
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 0af1248..e7bd6ff 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4973,6 +4973,16 @@ DESCR("peek at changes from replication slot");
DATA(insert OID = 3785 ( pg_logical_slot_peek_binary_changes PGNSP PGUID 12 1000 1000 25 0 f f f f f t v 4 0 2249 "19 3220 23 1009" "{19,3220,23,1009,3220,28,17}" "{i,i,i,v,o,o,o}" "{slot_name,upto_lsn,upto_nchanges,options,location,xid,data}" _null_ pg_logical_slot_peek_binary_changes _null_ _null_ _null_ ));
DESCR("peek at binary changes from replication slot");
+/* levenshtein distance */
+DATA(insert OID = 3366 ( levenshtein PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 23 "25 25" _null_ _null_ _null_ _null_ levenshtein _null_ _null_ _null_));
+DESCR("Levenshtein distance between two strings");
+DATA(insert OID = 3367 ( levenshtein PGNSP PGUID 12 1 0 0 0 f f f f t f i 5 0 23 "25 25 23 23 23" _null_ _null_ _null_ _null_ levenshtein_with_costs _null_ _null_ _null_));
+DESCR("Levenshtein distance between two strings with costs");
+DATA(insert OID = 3368 ( levenshtein_less_equal PGNSP PGUID 12 1 0 0 0 f f f f t f i 3 0 23 "25 25 23" _null_ _null_ _null_ _null_ levenshtein_less_equal _null_ _null_ _null_));
+DESCR("Less-equal Levenshtein distance between two strings");
+DATA(insert OID = 3369 ( levenshtein_less_equal PGNSP PGUID 12 1 0 0 0 f f f f t f i 6 0 23 "25 25 23 23 23 23" _null_ _null_ _null_ _null_ levenshtein_less_equal_with_costs _null_ _null_ _null_));
+DESCR("Less-equal Levenshtein distance between two strings with costs");
+
/* event triggers */
DATA(insert OID = 3566 ( pg_event_trigger_dropped_objects PGNSP PGUID 12 10 100 0 0 f f f f t t s 0 0 2249 "" "{26,26,23,25,25,25,25}" "{o,o,o,o,o,o,o}" "{classid, objid, objsubid, object_type, schema_name, object_name, object_identity}" _null_ pg_event_trigger_dropped_objects _null_ _null_ _null_ ));
DESCR("list objects dropped by the current command");
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index bbb5d39..b468c3c 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -851,6 +851,12 @@ extern Datum cidrecv(PG_FUNCTION_ARGS);
extern Datum cidsend(PG_FUNCTION_ARGS);
extern Datum cideq(PG_FUNCTION_ARGS);
+/* levenshtein.c */
+extern Datum levenshtein(PG_FUNCTION_ARGS);
+extern Datum levenshtein_with_costs(PG_FUNCTION_ARGS);
+extern Datum levenshtein_less_equal(PG_FUNCTION_ARGS);
+extern Datum levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS);
+
/* like.c */
extern Datum namelike(PG_FUNCTION_ARGS);
extern Datum namenlike(PG_FUNCTION_ARGS);
diff --git a/src/include/utils/levenshtein.h b/src/include/utils/levenshtein.h
new file mode 100644
index 0000000..6fa01d0
--- /dev/null
+++ b/src/include/utils/levenshtein.h
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * levenshtein.h
+ * Header file for the Levenshtein distance functions.
+ *
+ * Copyright (c) 2001-2014, PostgreSQL Global Development Group
+ *
+ * src/include/utils/levenshtein.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef LEVENSHTEIN_H
+#define LEVENSHTEIN_H
+
+#include "postgres.h"
+
+/* Internal functions */
+extern int levenshtein_less_equal_internal(const char *source, int slen,
+ const char *target, int tlen,
+ int ins_c, int del_c, int sub_c, int max_d);
+
+extern int levenshtein_internal(const char *source, int slen,
+ const char *target, int tlen,
+ int ins_c, int del_c, int sub_c);
+
+#endif
diff --git a/src/test/regress/expected/levenshtein.out b/src/test/regress/expected/levenshtein.out
new file mode 100644
index 0000000..e5a77c3
--- /dev/null
+++ b/src/test/regress/expected/levenshtein.out
@@ -0,0 +1,27 @@
+--
+-- LEVENSHTEIN
+--
+SELECT levenshtein('GUMBO', 'GAMBOL');
+ levenshtein
+-------------
+ 2
+(1 row)
+
+SELECT levenshtein('GUMBO', 'GAMBOL', 2, 1, 1);
+ levenshtein
+-------------
+ 3
+(1 row)
+
+SELECT levenshtein_less_equal('extensive', 'exhaustive', 2);
+ levenshtein_less_equal
+------------------------
+ 3
+(1 row)
+
+SELECT levenshtein_less_equal('extensive', 'exhaustive', 1, 1, 1, 4);
+ levenshtein_less_equal
+------------------------
+ 4
+(1 row)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index c0416f4..5faf182 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -13,7 +13,7 @@ test: tablespace
# ----------
# The first group of parallel tests
# ----------
-test: boolean char name varchar text int2 int4 int8 oid float4 float8 bit numeric txid uuid enum money rangetypes pg_lsn regproc
+test: boolean char name varchar text int2 int4 int8 oid float4 float8 bit numeric txid uuid enum money rangetypes pg_lsn regproc levenshtein
# Depends on things setup during char, varchar and text
test: strings
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 16a1905..e980619 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -21,6 +21,7 @@ test: money
test: rangetypes
test: pg_lsn
test: regproc
+test: levenshtein
test: strings
test: numerology
test: point
diff --git a/src/test/regress/sql/levenshtein.sql b/src/test/regress/sql/levenshtein.sql
new file mode 100644
index 0000000..7806bde
--- /dev/null
+++ b/src/test/regress/sql/levenshtein.sql
@@ -0,0 +1,8 @@
+--
+-- LEVENSHTEIN
+--
+
+SELECT levenshtein('GUMBO', 'GAMBOL');
+SELECT levenshtein('GUMBO', 'GAMBOL', 2, 1, 1);
+SELECT levenshtein_less_equal('extensive', 'exhaustive', 2);
+SELECT levenshtein_less_equal('extensive', 'exhaustive', 1, 1, 1, 4);
--
2.0.1
0002-Support-for-column-hints.patch (text/x-diff)
From 7574945acced9af579e32f5e216483a1a50e0218 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Thu, 17 Jul 2014 21:58:25 +0900
Subject: [PATCH 2/2] Support for column hints
If an incorrect column name is written in a query, the system checks
whether any columns of the existing RTEs are close in distance to the
mistaken name, and returns hints to the user based on that evaluation.
---
src/backend/parser/parse_expr.c | 9 +-
src/backend/parser/parse_func.c | 2 +-
src/backend/parser/parse_relation.c | 318 ++++++++++++++++++++++++++----
src/include/parser/parse_relation.h | 3 +-
src/test/regress/expected/alter_table.out | 8 +
src/test/regress/expected/join.out | 39 ++++
src/test/regress/expected/plpgsql.out | 1 +
src/test/regress/expected/rowtypes.out | 1 +
src/test/regress/expected/rules.out | 1 +
src/test/regress/expected/without_oid.out | 1 +
src/test/regress/sql/join.sql | 24 +++
11 files changed, 366 insertions(+), 41 deletions(-)
diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 4a8aaf6..9866198 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -621,7 +621,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
colname = strVal(field2);
/* Try to identify as a column of the RTE */
- node = scanRTEForColumn(pstate, rte, colname, cref->location);
+ node = scanRTEForColumn(pstate, rte, colname, cref->location,
+ NULL, NULL);
if (node == NULL)
{
/* Try it as a function call on the whole row */
@@ -666,7 +667,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
colname = strVal(field3);
/* Try to identify as a column of the RTE */
- node = scanRTEForColumn(pstate, rte, colname, cref->location);
+ node = scanRTEForColumn(pstate, rte, colname, cref->location,
+ NULL, NULL);
if (node == NULL)
{
/* Try it as a function call on the whole row */
@@ -724,7 +726,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
colname = strVal(field4);
/* Try to identify as a column of the RTE */
- node = scanRTEForColumn(pstate, rte, colname, cref->location);
+ node = scanRTEForColumn(pstate, rte, colname, cref->location,
+ NULL, NULL);
if (node == NULL)
{
/* Try it as a function call on the whole row */
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 9ebd3fd..e128adf 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -1779,7 +1779,7 @@ ParseComplexProjection(ParseState *pstate, char *funcname, Node *first_arg,
((Var *) first_arg)->varno,
((Var *) first_arg)->varlevelsup);
/* Return a Var if funcname matches a column, else NULL */
- return scanRTEForColumn(pstate, rte, funcname, location);
+ return scanRTEForColumn(pstate, rte, funcname, location, NULL, NULL);
}
/*
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 478584d..2838f89 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -15,6 +15,7 @@
#include "postgres.h"
#include <ctype.h>
+#include <limits.h>
#include "access/htup_details.h"
#include "access/sysattr.h"
@@ -28,6 +29,7 @@
#include "parser/parse_relation.h"
#include "parser/parse_type.h"
#include "utils/builtins.h"
+#include "utils/levenshtein.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
#include "utils/syscache.h"
@@ -520,6 +522,22 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
}
/*
+ * distanceName
+ * Return Levenshtein distance between an actual column name and possible
+ * partial match.
+ */
+static int
+distanceName(const char *actual, const char *match, int max)
+{
+ int len = strlen(actual),
+ match_len = strlen(match);
+
+ /* Charge half as much per deletion as per insertion or per substitution */
+ return levenshtein_less_equal_internal(actual, len, match, match_len,
+ 2, 1, 2, max);
+}
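To illustrate what the asymmetric costs in distanceName() buy, here is a toy reimplementation (hypothetical names; a full-matrix DP on single-byte strings rather than the backend's banded version): dropping a character from the actual column name costs 1, while an extra or wrong character costs 2, so a user who merely omitted a letter scores as closer than one who added or swapped one.

```c
#include <assert.h>
#include <string.h>

/* Full-matrix weighted Levenshtein; toy version, strings under 64 bytes. */
static int
toy_weighted_lev(const char *src, const char *dst,
				 int ins_c, int del_c, int sub_c)
{
	int			m = (int) strlen(src);
	int			n = (int) strlen(dst);
	int			d[64][64];
	int			i, j;

	for (i = 0; i <= m; i++)
		d[i][0] = i * del_c;
	for (j = 0; j <= n; j++)
		d[0][j] = j * ins_c;

	for (i = 1; i <= m; i++)
		for (j = 1; j <= n; j++)
		{
			int			del = d[i - 1][j] + del_c;
			int			ins = d[i][j - 1] + ins_c;
			int			sub = d[i - 1][j - 1] +
				(src[i - 1] == dst[j - 1] ? 0 : sub_c);
			int			best = del < ins ? del : ins;

			d[i][j] = sub < best ? sub : best;
		}
	return d[m][n];
}

/*
 * Mirrors distanceName()'s cost choice: deletions from the actual column
 * name are charged half as much as insertions or substitutions.
 */
static int
toy_distance_name(const char *actual, const char *typed)
{
	return toy_weighted_lev(actual, typed, 2, 1, 2);
}
```

So typing "ordrid" for "orderid" scores 1 (one cheap omission), while "orderids" or "orderit" each score 2.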
+
+/*
* scanRTEForColumn
* Search the column names of a single RTE for the given name.
* If found, return an appropriate Var node, else return NULL.
@@ -527,10 +545,24 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
*
* Side effect: if we find a match, mark the RTE as requiring read access
* for the column.
+ *
+ * For those callers that will settle for a fuzzy match (for the purposes of
+ * building diagnostic messages), we match the column attribute whose name has
+ * the lowest Levenshtein distance from colname, setting *closest and
+ * *distance. Such callers should not rely on the return value (even when
+ * there is an exact match), nor should they expect the usual side effect
+ * (unless there is an exact match). This hardly matters in practice, since an
+ * error is imminent.
+ *
+ * If there are two or more attributes in the range table entry tied for
+ * closest, accurately report the shortest distance found overall, while not
+ * setting a "closest" attribute on the assumption that only a per-entry single
+ * closest match is useful. Note that we never consider system column names
+ * when performing fuzzy matching.
*/
Node *
scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
- int location)
+ int location, AttrNumber *closest, int *distance)
{
Node *result = NULL;
int attnum = 0;
@@ -548,12 +580,16 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
* Should this somehow go wrong and we try to access a dropped column,
* we'll still catch it by virtue of the checks in
* get_rte_attribute_type(), which is called by make_var(). That routine
- * has to do a cache lookup anyway, so the check there is cheap.
+ * has to do a cache lookup anyway, so the check there is cheap. Callers
+ * interested in finding the match with the shortest distance need to
+ * defend against this directly, though.
*/
foreach(c, rte->eref->colnames)
{
+ const char *attcolname = strVal(lfirst(c));
+
attnum++;
- if (strcmp(strVal(lfirst(c)), colname) == 0)
+ if (strcmp(attcolname, colname) == 0)
{
if (result)
ereport(ERROR,
@@ -566,6 +602,39 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
markVarForSelectPriv(pstate, var, rte);
result = (Node *) var;
}
+
+ if (distance && *distance != 0)
+ {
+ if (result)
+ {
+ /* Exact match just found */
+ *distance = 0;
+ }
+ else
+ {
+ int lowestdistance = *distance;
+ int thisdistance = distanceName(attcolname, colname,
+ lowestdistance);
+
+ if (thisdistance >= lowestdistance)
+ {
+ /*
+ * This match distance may equal a prior match within this
+ * same range table. When that happens, the prior match is
+ * discarded as worthless, since a single best match is
+ * required within a RTE.
+ */
+ if (thisdistance == lowestdistance)
+ *closest = InvalidAttrNumber;
+
+ continue;
+ }
+
+ /* Store new lowest observed distance for this RTE */
+ *distance = thisdistance;
+ }
+ *closest = attnum;
+ }
}
/*
@@ -642,7 +711,8 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
continue;
/* use orig_pstate here to get the right sublevels_up */
- newresult = scanRTEForColumn(orig_pstate, rte, colname, location);
+ newresult = scanRTEForColumn(orig_pstate, rte, colname, location,
+ NULL, NULL);
if (newresult)
{
@@ -668,8 +738,14 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
/*
* searchRangeTableForCol
- * See if any RangeTblEntry could possibly provide the given column name.
- * If so, return a pointer to the RangeTblEntry; else return NULL.
+ * See if any RangeTblEntry could possibly provide the given column name (or
+ * find the best match available). Returns a list of equally likely
+ * candidates, or NIL in the event of no plausible candidate.
+ *
+ * The column name may be matched fuzzily; we provide the closest columns if
+ * there was not an exact match. Caller can rely on the passed closest array
+ * to find the right attribute within the corresponding (first and second)
+ * RTEs of the returned list. InvalidAttrNumber there indicates an exact match.
*
* This is different from colNameToVar in that it considers every entry in
* the ParseState's rangetable(s), not only those that are currently visible
@@ -678,26 +754,145 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
* matches, but only one will be returned). This must be used ONLY as a
* heuristic in giving suitable error messages. See errorMissingColumn.
*/
-static RangeTblEntry *
-searchRangeTableForCol(ParseState *pstate, char *colname, int location)
+static List *
+searchRangeTableForCol(ParseState *pstate, const char *alias, char *colname,
+ int location, AttrNumber closest[2])
{
- ParseState *orig_pstate = pstate;
+ ParseState *orig_pstate = pstate;
+ int distance = INT_MAX;
+ List *matchedrte = NIL;
+ ListCell *l;
+ int i;
while (pstate != NULL)
{
- ListCell *l;
-
foreach(l, pstate->p_rtable)
{
- RangeTblEntry *rte = (RangeTblEntry *) lfirst(l);
+ RangeTblEntry *rte = (RangeTblEntry *) lfirst(l);
+ AttrNumber rteclosest = InvalidAttrNumber;
+ int rtdistance = INT_MAX;
+ bool wrongalias;
- if (scanRTEForColumn(orig_pstate, rte, colname, location))
- return rte;
+ /*
+ * Get single best match from each RTE, or no match for RTE if
+ * there is a tie for best match within a given RTE
+ */
+ scanRTEForColumn(orig_pstate, rte, colname, location, &rteclosest,
+ &rtdistance);
+
+ /* Was alias provided by user that does not match entry's alias? */
+ wrongalias = (alias && strcmp(alias, rte->eref->aliasname) != 0);
+
+ if (rtdistance == 0)
+ {
+ /* Exact match (for "wrong alias" or "wrong level" cases) */
+ closest[0] = wrongalias? rteclosest : InvalidAttrNumber;
+
+ /*
+ * Any exact match is always the uncontested best match. It
+ * doesn't seem worth considering the case where there are
+ * multiple exact matches, so we're done.
+ */
+ matchedrte = lappend(NIL, rte);
+ return matchedrte;
+ }
+
+ /*
+ * Charge extra (for inexact matches only) when an alias was
+ * specified that differs from what might have been used to
+ * correctly qualify this RTE's closest column
+ */
+ if (wrongalias)
+ rtdistance += 3;
+
+ if (rteclosest != InvalidAttrNumber)
+ {
+ if (rtdistance >= distance)
+ {
+ /*
+ * Perhaps record this attribute as being just as close in
+ * distance to closest attribute observed so far across
+ * entire range table. Iff this distance is ultimately the
+ * lowest distance observed overall, it may end up as the
+ * second match.
+ */
+ if (rtdistance == distance)
+ {
+ closest[1] = rteclosest;
+ matchedrte = lappend(matchedrte, rte);
+ }
+
+ continue;
+ }
+
+ /*
+ * One best match (better than any others in previous RTEs) was
+ * found within this RTE
+ */
+ distance = rtdistance;
+ /* New uncontested best match */
+ matchedrte = lappend(NIL, rte);
+ closest[0] = rteclosest;
+ }
+ else
+ {
+ /*
+ * Even though there were perhaps multiple joint-best matches
+ * within this RTE (implying that there can be no attribute
+ * suggestion from it), the shortest distance should still
+ * serve as the distance for later RTEs to beat (but naturally
+ * only if it happens to be the lowest so far across the entire
+ * range table).
+ */
+ distance = Min(distance, rtdistance);
+ }
}
pstate = pstate->parentParseState;
}
- return NULL;
+
+ /*
+ * Too many equally close partial matches found?
+ *
+ * It's useful to provide two matches for the common case where two range
+ * tables each have one equally distant candidate column, as when an
+ * unqualified (and therefore would-be ambiguous) column name is specified
+ * which is also misspelled by the user. It seems unhelpful to show no
+ * hint when this occurs, since in practice one attribute probably
+ * references the other in a foreign key relationship. However, when there
+ * are more than 2 range tables with equally distant matches that's
+ * probably because the matches are not useful, so don't suggest anything.
+ */
+ if (list_length(matchedrte) > 2)
+ return NIL;
+
+ /*
+ * Handle dropped columns, which can appear here as empty colnames per
+ * remarks within scanRTEForColumn(). If either the first or second
+ * suggested attributes are dropped, do not provide any suggestion.
+ */
+ i = 0;
+ foreach(l, matchedrte)
+ {
+ RangeTblEntry *rte = (RangeTblEntry *) lfirst(l);
+ char *closestcol;
+
+ closestcol = strVal(list_nth(rte->eref->colnames, closest[i++] - 1));
+
+ if (strcmp(closestcol, "") == 0)
+ return NIL;
+ }
+
+ /*
+ * Distance must be less than a normalized threshold in order to avoid
+ * completely ludicrous suggestions. Note that a distance of 6 will be
+ * seen when 6 deletions are required against actual attribute name, or 3
+ * insertions/substitutions.
+ */
+ if (distance > 6 && distance > strlen(colname))
+ return NIL;
+
+ return matchedrte;
}
/*
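The cross-RTE selection policy above (keep the single closest candidate, tolerate a two-way tie, give up on three or more) can be condensed into a toy routine over plain distance arrays (hypothetical names, no parser state; within-RTE ties are assumed to have already been resolved by scanRTEForColumn):

```c
#include <assert.h>
#include <limits.h>

/*
 * Given one best distance per RTE, fill out[0..1] with the indices of the
 * (joint-)closest RTEs and return how many hints to offer: 1 for a unique
 * winner, 2 for a two-way tie, 0 for a tie among three or more.
 */
static int
toy_pick_candidates(const int *dist, int nrte, int out[2])
{
	int			best = INT_MAX;
	int			nties = 0;
	int			i;

	for (i = 0; i < nrte; i++)
	{
		if (dist[i] < best)
		{
			/* New uncontested best match */
			best = dist[i];
			out[0] = i;
			nties = 1;
		}
		else if (dist[i] == best)
		{
			/* Joint best; remember it only while a two-way tie is useful */
			if (nties < 2)
				out[1] = i;
			nties++;
		}
	}
	return nties <= 2 ? nties : 0;
}
```

The wrong-alias surcharge and the dropped-column and threshold checks in the patch layer on top of this core.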
@@ -2855,41 +3050,92 @@ errorMissingRTE(ParseState *pstate, RangeVar *relation)
/*
* Generate a suitable error about a missing column.
*
- * Since this is a very common type of error, we work rather hard to
- * produce a helpful message.
+ * Since this is a very common type of error, we work rather hard to produce a
+ * helpful message, going so far as to guess user's intent when a missing
+ * column name is probably intended to reference one of two would-be ambiguous
+ * attributes (when no alias/qualification was provided).
*/
void
errorMissingColumn(ParseState *pstate,
char *relname, char *colname, int location)
{
- RangeTblEntry *rte;
+ List *matchedrte;
+ AttrNumber closest[2];
+ RangeTblEntry *rte1 = NULL,
+ *rte2 = NULL;
+ char *closestcol1;
+ char *closestcol2;
/*
- * If relname was given, just play dumb and report it. (In practice, a
- * bad qualification name should end up at errorMissingRTE, not here, so
- * no need to work hard on this case.)
+ * closest[0] will remain InvalidAttrNumber in event of exact match, and in
+ * the event of an exact match there is only ever one suggestion
*/
- if (relname)
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_COLUMN),
- errmsg("column %s.%s does not exist", relname, colname),
- parser_errposition(pstate, location)));
+ closest[0] = closest[1] = InvalidAttrNumber;
/*
- * Otherwise, search the entire rtable looking for possible matches. If
- * we find one, emit a hint about it.
+ * Search the entire rtable looking for possible matches. If we find one,
+ * emit a hint about it.
*
* TODO: improve this code (and also errorMissingRTE) to mention using
* LATERAL if appropriate.
*/
- rte = searchRangeTableForCol(pstate, colname, location);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_COLUMN),
- errmsg("column \"%s\" does not exist", colname),
- rte ? errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
- colname, rte->eref->aliasname) : 0,
- parser_errposition(pstate, location)));
+ matchedrte = searchRangeTableForCol(pstate, relname, colname, location,
+ closest);
+
+ /*
+ * In practice a bad qualification name should end up at errorMissingRTE,
+ * not here, so no need to work hard on this case.
+ *
+ * Extract RTEs for best match, if any, and joint best match, if any.
+ */
+ if (matchedrte)
+ {
+ rte1 = (RangeTblEntry *) lfirst(list_head(matchedrte));
+
+ if (list_length(matchedrte) > 1)
+ rte2 = (RangeTblEntry *) lsecond(matchedrte);
+
+ if (rte1 && closest[0] != InvalidAttrNumber)
+ closestcol1 = strVal(list_nth(rte1->eref->colnames, closest[0] - 1));
+
+ if (rte2 && closest[1] != InvalidAttrNumber)
+ closestcol2 = strVal(list_nth(rte2->eref->colnames, closest[1] - 1));
+ }
+
+ if (!rte2)
+ {
+ /*
+ * Handle case where there is zero or one column suggestions to hint,
+ * including exact matches referenced but not visible.
+ *
+ * Infer an exact match referenced despite not being visible from the
+ * fact that an attribute number was not passed back.
+ */
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_COLUMN),
+ relname?
+ errmsg("column %s.%s does not exist", relname, colname):
+ errmsg("column \"%s\" does not exist", colname),
+ rte1? closest[0] != InvalidAttrNumber?
+ errhint("Perhaps you meant to reference the column \"%s\".\"%s\".",
+ rte1->eref->aliasname, closestcol1):
+ errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
+ colname, rte1->eref->aliasname): 0,
+ parser_errposition(pstate, location)));
+ }
+ else
+ {
+ /* Handle case where there are two equally useful column hints */
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_COLUMN),
+ relname?
+ errmsg("column %s.%s does not exist", relname, colname):
+ errmsg("column \"%s\" does not exist", colname),
+ errhint("Perhaps you meant to reference the column \"%s\".\"%s\" or the column \"%s\".\"%s\".",
+ rte1->eref->aliasname, closestcol1,
+ rte2->eref->aliasname, closestcol2),
+ parser_errposition(pstate, location)));
+ }
}
diff --git a/src/include/parser/parse_relation.h b/src/include/parser/parse_relation.h
index d8b9493..c18157a 100644
--- a/src/include/parser/parse_relation.h
+++ b/src/include/parser/parse_relation.h
@@ -35,7 +35,8 @@ extern RangeTblEntry *GetRTEByRangeTablePosn(ParseState *pstate,
extern CommonTableExpr *GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte,
int rtelevelsup);
extern Node *scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte,
- char *colname, int location);
+ char *colname, int location, AttrNumber *matchedatt,
+ int *distance);
extern Node *colNameToVar(ParseState *pstate, char *colname, bool localonly,
int location);
extern void markVarForSelectPriv(ParseState *pstate, Var *var,
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 9b89e58..77829dc 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -536,6 +536,7 @@ create table atacc1 ( test int );
-- add a check constraint (fails)
alter table atacc1 add constraint atacc_test1 check (test1>3);
ERROR: column "test1" does not exist
+HINT: Perhaps you meant to reference the column "atacc1"."test".
drop table atacc1;
-- something a little more complicated
create table atacc1 ( test int, test2 int, test3 int);
@@ -1342,6 +1343,7 @@ select f1 from c1;
ERROR: column "f1" does not exist
LINE 1: select f1 from c1;
^
+HINT: Perhaps you meant to reference the column "c1"."f2".
drop table p1 cascade;
NOTICE: drop cascades to table c1
create table p1 (f1 int, f2 int);
@@ -1355,6 +1357,7 @@ select f1 from c1;
ERROR: column "f1" does not exist
LINE 1: select f1 from c1;
^
+HINT: Perhaps you meant to reference the column "c1"."f2".
drop table p1 cascade;
NOTICE: drop cascades to table c1
create table p1 (f1 int, f2 int);
@@ -1479,6 +1482,7 @@ select oid > 0, * from altstartwith; -- fails
ERROR: column "oid" does not exist
LINE 1: select oid > 0, * from altstartwith;
^
+HINT: Perhaps you meant to reference the column "altstartwith"."col".
select * from altstartwith;
col
-----
@@ -1515,10 +1519,12 @@ select oid > 0, * from altwithoid; -- fails
ERROR: column "oid" does not exist
LINE 1: select oid > 0, * from altwithoid;
^
+HINT: Perhaps you meant to reference the column "altwithoid"."col".
select oid > 0, * from altinhoid; -- fails
ERROR: column "oid" does not exist
LINE 1: select oid > 0, * from altinhoid;
^
+HINT: Perhaps you meant to reference the column "altinhoid"."col".
select * from altwithoid;
col
-----
@@ -1554,6 +1560,7 @@ select oid > 0, * from altwithoid; -- fails
ERROR: column "oid" does not exist
LINE 1: select oid > 0, * from altwithoid;
^
+HINT: Perhaps you meant to reference the column "altwithoid"."col".
select oid > 0, * from altinhoid;
?column? | col
----------+-----
@@ -1580,6 +1587,7 @@ select oid > 0, * from altwithoid; -- fails
ERROR: column "oid" does not exist
LINE 1: select oid > 0, * from altwithoid;
^
+HINT: Perhaps you meant to reference the column "altwithoid"."col".
select oid > 0, * from altinhoid;
?column? | col
----------+-----
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 1cb1c51..f4edcbe 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2222,6 +2222,12 @@ select * from t1 left join t2 on (t1.a = t2.a);
200 | 1000 | 200 | 2001
(5 rows)
+-- Test matching of column name with wrong alias
+select t1.x from t1 join t3 on (t1.a = t3.x);
+ERROR: column t1.x does not exist
+LINE 1: select t1.x from t1 join t3 on (t1.a = t3.x);
+ ^
+HINT: Perhaps you meant to reference the column "t3"."x".
--
-- regression test for 8.1 merge right join bug
--
@@ -3388,6 +3394,39 @@ select * from
(0 rows)
--
+-- Test hints given on incorrect column references are useful
+--
+select t1.uunique1 from
+ tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestion
+ERROR: column t1.uunique1 does not exist
+LINE 1: select t1.uunique1 from
+ ^
+HINT: Perhaps you meant to reference the column "t1"."unique1".
+select t2.uunique1 from
+ tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+ERROR: column t2.uunique1 does not exist
+LINE 1: select t2.uunique1 from
+ ^
+HINT: Perhaps you meant to reference the column "t2"."unique1".
+select uunique1 from
+ tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+ERROR: column "uunique1" does not exist
+LINE 1: select uunique1 from
+ ^
+HINT: Perhaps you meant to reference the column "t1"."unique1" or the column "t2"."unique1".
+--
+-- Take care to reference the correct RTE
+--
+select atts.relid::regclass, s.* from pg_stats s join
+ pg_attribute a on s.attname = a.attname and s.tablename =
+ a.attrelid::regclass::text join (select unnest(indkey) attnum,
+ indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+ schemaname != 'pg_catalog';
+ERROR: column atts.relid does not exist
+LINE 1: select atts.relid::regclass, s.* from pg_stats s join
+ ^
+HINT: Perhaps you meant to reference the column "atts"."indexrelid".
+--
-- Test LATERAL
--
select unique2, x.*
diff --git a/src/test/regress/expected/plpgsql.out b/src/test/regress/expected/plpgsql.out
index 8892bb4..2cb4aa1 100644
--- a/src/test/regress/expected/plpgsql.out
+++ b/src/test/regress/expected/plpgsql.out
@@ -4771,6 +4771,7 @@ END$$;
ERROR: column "foo" does not exist
LINE 1: SELECT rtrim(roomno) AS roomno, foo FROM Room ORDER BY roomn...
^
+HINT: Perhaps you meant to reference the column "room"."roomno".
QUERY: SELECT rtrim(roomno) AS roomno, foo FROM Room ORDER BY roomno
CONTEXT: PL/pgSQL function inline_code_block line 4 at FOR over SELECT rows
-- Check handling of errors thrown from/into anonymous code blocks.
diff --git a/src/test/regress/expected/rowtypes.out b/src/test/regress/expected/rowtypes.out
index 88e7bfa..19a6e98 100644
--- a/src/test/regress/expected/rowtypes.out
+++ b/src/test/regress/expected/rowtypes.out
@@ -452,6 +452,7 @@ select fullname.text from fullname; -- error
ERROR: column fullname.text does not exist
LINE 1: select fullname.text from fullname;
^
+HINT: Perhaps you meant to reference the column "fullname"."last".
-- same, but RECORD instead of named composite type:
select cast (row('Jim', 'Beam') as text);
row
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ca56b47..48c75fd 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2368,6 +2368,7 @@ select xmin, * from fooview; -- fail, views don't have such a column
ERROR: column "xmin" does not exist
LINE 1: select xmin, * from fooview;
^
+HINT: Perhaps you meant to reference the column "fooview"."x".
select reltoastrelid, relkind, relfrozenxid
from pg_class where oid = 'fooview'::regclass;
reltoastrelid | relkind | relfrozenxid
diff --git a/src/test/regress/expected/without_oid.out b/src/test/regress/expected/without_oid.out
index cb2c0c0..fbff011 100644
--- a/src/test/regress/expected/without_oid.out
+++ b/src/test/regress/expected/without_oid.out
@@ -46,6 +46,7 @@ SELECT count(oid) FROM wo;
ERROR: column "oid" does not exist
LINE 1: SELECT count(oid) FROM wo;
^
+HINT: Perhaps you meant to reference the column "wo"."i".
VACUUM ANALYZE wi;
VACUUM ANALYZE wo;
SELECT min(relpages) < max(relpages), min(reltuples) - max(reltuples)
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index fa3e068..4d60f9e 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -397,6 +397,10 @@ insert into t2a values (200, 2001);
select * from t1 left join t2 on (t1.a = t2.a);
+-- Test matching of column name with wrong alias
+
+select t1.x from t1 join t3 on (t1.a = t3.x);
+
--
-- regression test for 8.1 merge right join bug
--
@@ -1047,6 +1051,26 @@ select * from
int8_tbl x join (int4_tbl x cross join int4_tbl y(ff)) j on q1 = f1; -- ok
--
+-- Test hints given on incorrect column references are useful
+--
+
+select t1.uunique1 from
+ tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestion
+select t2.uunique1 from
+ tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+select uunique1 from
+ tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+
+--
+-- Take care to reference the correct RTE
+--
+
+select atts.relid::regclass, s.* from pg_stats s join
+ pg_attribute a on s.attname = a.attname and s.tablename =
+ a.attrelid::regclass::text join (select unnest(indkey) attnum,
+ indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+ schemaname != 'pg_catalog';
+--
-- Test LATERAL
--
--
2.0.1
On Thu, Jul 17, 2014 at 9:34 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Patch 1 does a couple of things:
- fuzzystrmatch is bumped to 1.1, as Levenshtein functions are not part of
it anymore, and moved to core.
- Removal of the LESS_EQUAL flag that made the original submission patch
harder to understand. All the Levenshtein functions wrap a single common
function.
- Documentation is moved, and regression tests for Levenshtein functions are
added.
- Functions taking costs are renamed with a "_with_costs" suffix.
After hacking this feature, I came up with the conclusion that it would be
better for the user experience to move directly into backend code all the
Levenshtein functions, instead of only moving in the common wrapper as Peter
did in his original patches. This is done this way to avoid keeping portions
of the same feature in two different places of the code (backend with common
routine, fuzzystrmatch with levenshtein functions) and concentrate all the
logic in a single place. Now, we may as well consider renaming the
levenshtein functions into smarter names, like str_distance, and keep
fuzzystrmatch at 1.0, having the functions levenshtein_* call only the
str_distance functions.
This is not cool. Anyone who is running a 9.4 or earlier database
using fuzzystrmatch and upgrades, either via dump-and-restore or
pg_upgrade, to a version with this patch applied will have a broken
database. They will still have the catalog entries for the 1.0
definitions, but those definitions won't be resolvable inside the new
cluster's .so file. The user will get a fairly-unfriendly error
message that won't go away until they upgrade the extension, which may
involve dealing with dependency hell since the new definitions are in
a different place than the old definitions, and there may be
dependencies on the old definitions. One of the great advantages of
extension packaging is that this kind of problem is quite easily
avoidable, so let's avoid it.
There are several possible methods of doing that, but I think the best
one is just to leave the SQL-callable C functions in fuzzystrmatch and
move only the underlying code that supports it into core. Then, the
whole thing will be completely transparent to users. They won't need
to upgrade their fuzzystrmatch definitions at all, and everything will
just work; under the covers, the fuzzystrmatch code will now be
calling into core code rather than to code located in that same
module, but the user doesn't need to know or care about that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
There are several possible methods of doing that, but I think the best
one is just to leave the SQL-callable C functions in fuzzystrmatch and
move only the underlying code that supports it into core.
I hadn't been paying close attention to this thread, but I'd just assumed
that that would be the approach.
It might be worth introducing new differently-named pg_proc entries for
the same functions in core, but only if we can agree that there are better
names for them than what the extension uses.
regards, tom lane
On Wed, Jul 23, 2014 at 8:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:
There are several possible methods of doing that, but I think the best
one is just to leave the SQL-callable C functions in fuzzystrmatch and
move only the underlying code that supports it into core.
For some reason I thought that that was what Michael was proposing - a
more comprehensive move of code into core than the structuring that I
proposed. I actually thought about a Levenshtein distance operator at
one point months ago, before I entirely gave up on that. The
MAX_LEVENSHTEIN_STRLEN limitation made me think that the Levenshtein
distance functions are not suitable for core as is (although that
doesn't matter for my purposes, since all I need is something that
accommodates NAMEDATALEN sized strings). MAX_LEVENSHTEIN_STRLEN is a
considerable limitation for an in-core feature. I didn't get around to
forming an opinion on how and if that should be fixed.
--
Peter Geoghegan
Peter Geoghegan wrote:
For some reason I thought that that was what Michael was proposing - a
more comprehensive move of code into core than the structuring that I
proposed. I actually thought about a Levenshtein distance operator at
one point months ago, before I entirely gave up on that. The
MAX_LEVENSHTEIN_STRLEN limitation made me think that the Levenshtein
distance functions are not suitable for core as is (although that
doesn't matter for my purposes, since all I need is something that
accommodates NAMEDATALEN sized strings). MAX_LEVENSHTEIN_STRLEN is a
considerable limitation for an in-core feature. I didn't get around to
forming an opinion on how and if that should be fixed.
I had two thoughts:
1. Should we consider making levenshtein available to frontend programs
as well as backend?
2. Would it provide better matching to use Damerau-Levenshtein[1] instead
of raw Levenshtein?
.oO(Would anyone be so bold as to attempt to implement bitap[2] using
bitmapsets ...)
[1]: http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
[2]: http://en.wikipedia.org/wiki/Bitap_algorithm
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jul 23, 2014 at 1:10 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
I had two thoughts:
1. Should we consider making levenshtein available to frontend programs
as well as backend?
I don't think so. Why would that be useful?
2. Would it provide better matching to use Damerau-Levenshtein[1] instead
of raw Levenshtein?
Maybe that would be marginally better than classic Levenshtein
distance, but I doubt it would pay for itself. It's just more code to
maintain. Are we really expecting to not get the best possible
suggestion due to some number of transposition errors very frequently?
You still have to have a worse suggestion spuriously get ahead of
yours, and typically there just aren't that many to begin with. I'm
not targeting spelling errors so much as thinkos around plurals and
whether or not an underscore was used. Damerau-Levenshtein seems like
an algorithm with fairly specialized applications.
--
Peter Geoghegan
On Thu, Jul 24, 2014 at 1:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
There are several possible methods of doing that, but I think the best
one is just to leave the SQL-callable C functions in fuzzystrmatch and
move only the underlying code that supports it into core.
I hadn't been paying close attention to this thread, but I'd just assumed
that that would be the approach.
It might be worth introducing new differently-named pg_proc entries for
the same functions in core, but only if we can agree that there are better
names for them than what the extension uses.
Yes, that's a point I raised upthread as well. What about renaming those
functions as string_distance and string_distance_less_than? Then have only
fuzzystrmatch do some DirectFunctionCall using the in-core functions?
--
Michael
Peter Geoghegan wrote:
Maybe that would be marginally better than classic Levenshtein
distance, but I doubt it would pay for itself. It's just more code to
maintain. Are we really expecting to not get the best possible
suggestion due to some number of transposition errors very frequently?
You still have to have a worse suggestion spuriously get ahead of
yours, and typically there just aren't that many to begin with. I'm
not targeting spelling errors so much as thinkos around plurals and
whether or not an underscore was used. Damerau-Levenshtein seems like
an algorithm with fairly specialized applications.
Yes, it's for typos. I guess it's an infrequent scenario to have both a
typoed column and a column that's missing the plural declension, which
is the case in which Damerau-Levenshtein would be a win.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 07/18/2014 10:47 AM, Michael Paquier wrote:
On Fri, Jul 18, 2014 at 3:54 AM, Peter Geoghegan <pg@heroku.com> wrote:
I am not opposed to moving the contrib code into core in the manner
that you oppose. I don't feel strongly either way.
I noticed in passing that your revision says this *within* levenshtein.c:
+ * Guaranteed to work with Name datatype's cstrings.
+ * For full details see levenshtein.c.
Yeah, I looked at what I produced yesterday night again and came
across a couple of similar things :) And reworked a couple of things
in the version attached, mainly wordsmithing and adding comments here
and there, as well as making the naming of the Levenshtein functions
in core the same as the ones in fuzzystrmatch 1.0.
I imagined that when a committer picked this up, an executive decision
would be made one way or the other. I am quite willing to revise the
patch to alter this behavior at the request of a committer.
Fine for me. I'll move this patch to the next stage then.
There are a bunch of compiler warnings:
parse_relation.c: In function ‘errorMissingColumn’:
parse_relation.c:3114:447: warning: ‘closestcol1’ may be used
uninitialized in this function [-Wmaybe-uninitialized]
parse_relation.c:3066:8: note: ‘closestcol1’ was declared here
parse_relation.c:3129:29: warning: ‘closestcol2’ may be used
uninitialized in this function [-Wmaybe-uninitialized]
parse_relation.c:3067:8: note: ‘closestcol2’ was declared here
levenshtein.c: In function ‘levenshtein_common’:
levenshtein.c:107:6: warning: unused variable ‘start_column_local’
[-Wunused-variable]
- Heikki
On Mon, Oct 6, 2014 at 3:09 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 07/18/2014 10:47 AM, Michael Paquier wrote:
On Fri, Jul 18, 2014 at 3:54 AM, Peter Geoghegan <pg@heroku.com> wrote:
I am not opposed to moving the contrib code into core in the manner
that you oppose. I don't feel strongly either way.
I noticed in passing that your revision says this *within* levenshtein.c:
+ * Guaranteed to work with Name datatype's cstrings.
+ * For full details see levenshtein.c.
Yeah, I looked at what I produced yesterday night again and came
across a couple of similar things :) And reworked a couple of things
in the version attached, mainly wordsmithing and adding comments here
and there, as well as making the naming of the Levenshtein functions
in core the same as the ones in fuzzystrmatch 1.0.
I imagined that when a committer picked this up, an executive decision
would be made one way or the other. I am quite willing to revise the
patch to alter this behavior at the request of a committer.
Fine for me. I'll move this patch to the next stage then.
There are a bunch of compiler warnings:
parse_relation.c: In function ‘errorMissingColumn’:
parse_relation.c:3114:447: warning: ‘closestcol1’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
parse_relation.c:3066:8: note: ‘closestcol1’ was declared here
parse_relation.c:3129:29: warning: ‘closestcol2’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
parse_relation.c:3067:8: note: ‘closestcol2’ was declared here
levenshtein.c: In function ‘levenshtein_common’:
levenshtein.c:107:6: warning: unused variable ‘start_column_local’
[-Wunused-variable]
Based on this review from a month ago, I'm going to mark this Waiting
on Author. If nobody updates the patch in a few days, I'll mark it
Returned with Feedback. Thanks.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Based on this review from a month ago, I'm going to mark this Waiting
on Author. If nobody updates the patch in a few days, I'll mark it
Returned with Feedback. Thanks.
Attached revision fixes the compiler warning that Heikki complained
about. I maintain SQL-callable stub functions from within contrib,
rather than follow Michael's approach. In other words, very little has
changed from my revision from July last [1].
Reminder: I maintain a slight preference for only offering one
suggestion per relation RTE, which is what this revision does (so no
change there). If a committer who picks this up wants me to alter
that, I don't mind doing so; since only Michael spoke up on this, I've
kept things my way.
This is not a completion mechanism; it is supposed to work on
*complete* column references with slight misspellings (e.g. incorrect
use of plurals, or column references with an omitted underscore
character). Weighing Tom's concerns about suggestions that are of
absolute low quality is what makes me conclude that this is the thing
to do.
[1]: /messages/by-id/CAM3SWZTzQO=OY4jmfB-65ieFie8iHUkDErK-0oLJETm8dSrSpw@mail.gmail.com
--
Peter Geoghegan
Attachments:
0001-Levenshtein-distance-column-HINT.patch (text/x-patch; charset=US-ASCII)
From 830bf9f668972ba6b531df5d4fcbd73db3472434 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@heroku.com>
Date: Sat, 30 Nov 2013 23:15:00 -0800
Subject: [PATCH] Levenshtein distance column HINT
Add a new HINT -- a guess as to what column the user might have intended
to reference, to be shown in various contexts where an
ERRCODE_UNDEFINED_COLUMN error is raised. The user will see this HINT
when he or she fat-fingers a column reference in his or her ad-hoc SQL
query, or incorrectly pluralizes or fails to pluralize a column
reference.
The HINT suggests a column in the range table with the lowest
Levenshtein distance, or the tied-for-best pair of matching columns in
the event of there being exactly two equally likely candidates (iff each
candidate column comes from a separate RTE). Limiting the cases where
multiple equally likely suggestions are all offered at once is a measure
against suggestions that are of low quality in an absolute sense.
A further, final measure is taken against suggestions that are of low
absolute quality: If the distance exceeds a normalized distance
threshold, no suggestion is given.
The contrib Levenshtein distance implementation is moved from /contrib
to core. However, the SQL-callable functions may only be used with the
fuzzystrmatch extension installed, just as before -- the fuzzystrmatch
definitions become mere forwarding stubs.
---
contrib/fuzzystrmatch/Makefile | 3 -
contrib/fuzzystrmatch/fuzzystrmatch.c | 81 ++++--
contrib/fuzzystrmatch/levenshtein.c | 403 ------------------------------
src/backend/parser/parse_expr.c | 9 +-
src/backend/parser/parse_func.c | 2 +-
src/backend/parser/parse_relation.c | 319 ++++++++++++++++++++---
src/backend/utils/adt/Makefile | 2 +
src/backend/utils/adt/levenshtein.c | 393 +++++++++++++++++++++++++++++
src/backend/utils/adt/varlena.c | 25 ++
src/include/parser/parse_relation.h | 3 +-
src/include/utils/builtins.h | 5 +
src/test/regress/expected/alter_table.out | 8 +
src/test/regress/expected/join.out | 39 +++
src/test/regress/expected/plpgsql.out | 1 +
src/test/regress/expected/rowtypes.out | 1 +
src/test/regress/expected/rules.out | 1 +
src/test/regress/expected/without_oid.out | 1 +
src/test/regress/sql/join.sql | 24 ++
18 files changed, 849 insertions(+), 471 deletions(-)
delete mode 100644 contrib/fuzzystrmatch/levenshtein.c
create mode 100644 src/backend/utils/adt/levenshtein.c
diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 024265d..0327d95 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -17,6 +17,3 @@ top_builddir = ../..
include $(top_builddir)/src/Makefile.global
include $(top_srcdir)/contrib/contrib-global.mk
endif
-
-# levenshtein.c is #included by fuzzystrmatch.c
-fuzzystrmatch.o: fuzzystrmatch.c levenshtein.c
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c
index 7a53d8a..62e650f 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.c
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.c
@@ -154,23 +154,6 @@ getcode(char c)
/* These prevent GH from becoming F */
#define NOGHTOF(c) (getcode(c) & 16) /* BDH */
-/* Faster than memcmp(), for this use case. */
-static inline bool
-rest_of_char_same(const char *s1, const char *s2, int len)
-{
- while (len > 0)
- {
- len--;
- if (s1[len] != s2[len])
- return false;
- }
- return true;
-}
-
-#include "levenshtein.c"
-#define LEVENSHTEIN_LESS_EQUAL
-#include "levenshtein.c"
-
PG_FUNCTION_INFO_V1(levenshtein_with_costs);
Datum
levenshtein_with_costs(PG_FUNCTION_ARGS)
@@ -180,8 +163,20 @@ levenshtein_with_costs(PG_FUNCTION_ARGS)
int ins_c = PG_GETARG_INT32(2);
int del_c = PG_GETARG_INT32(3);
int sub_c = PG_GETARG_INT32(4);
-
- PG_RETURN_INT32(levenshtein_internal(src, dst, ins_c, del_c, sub_c));
+ const char *s_data;
+ const char *t_data;
+ int s_bytes,
+ t_bytes;
+
+ /* Extract a pointer to the actual character data */
+ s_data = VARDATA_ANY(src);
+ t_data = VARDATA_ANY(dst);
+ /* Determine length of each string in bytes and characters */
+ s_bytes = VARSIZE_ANY_EXHDR(src);
+ t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+ PG_RETURN_INT32(varstr_leven(s_data, s_bytes, t_data, t_bytes, ins_c,
+ del_c, sub_c));
}
@@ -191,8 +186,20 @@ levenshtein(PG_FUNCTION_ARGS)
{
text *src = PG_GETARG_TEXT_PP(0);
text *dst = PG_GETARG_TEXT_PP(1);
-
- PG_RETURN_INT32(levenshtein_internal(src, dst, 1, 1, 1));
+ const char *s_data;
+ const char *t_data;
+ int s_bytes,
+ t_bytes;
+
+ /* Extract a pointer to the actual character data */
+ s_data = VARDATA_ANY(src);
+ t_data = VARDATA_ANY(dst);
+ /* Determine length of each string in bytes and characters */
+ s_bytes = VARSIZE_ANY_EXHDR(src);
+ t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+ PG_RETURN_INT32(varstr_leven(s_data, s_bytes, t_data, t_bytes, 1, 1,
+ 1));
}
@@ -206,8 +213,20 @@ levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS)
int del_c = PG_GETARG_INT32(3);
int sub_c = PG_GETARG_INT32(4);
int max_d = PG_GETARG_INT32(5);
-
- PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, ins_c, del_c, sub_c, max_d));
+ const char *s_data;
+ const char *t_data;
+ int s_bytes,
+ t_bytes;
+
+ /* Extract a pointer to the actual character data */
+ s_data = VARDATA_ANY(src);
+ t_data = VARDATA_ANY(dst);
+ /* Determine length of each string in bytes and characters */
+ s_bytes = VARSIZE_ANY_EXHDR(src);
+ t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+ PG_RETURN_INT32(varstr_leven_less_equal(s_data, s_bytes, t_data, t_bytes,
+ ins_c, del_c, sub_c, max_d));
}
@@ -218,8 +237,20 @@ levenshtein_less_equal(PG_FUNCTION_ARGS)
text *src = PG_GETARG_TEXT_PP(0);
text *dst = PG_GETARG_TEXT_PP(1);
int max_d = PG_GETARG_INT32(2);
-
- PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, 1, 1, 1, max_d));
+ const char *s_data;
+ const char *t_data;
+ int s_bytes,
+ t_bytes;
+
+ /* Extract a pointer to the actual character data */
+ s_data = VARDATA_ANY(src);
+ t_data = VARDATA_ANY(dst);
+ /* Determine length of each string in bytes and characters */
+ s_bytes = VARSIZE_ANY_EXHDR(src);
+ t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+ PG_RETURN_INT32(varstr_leven_less_equal(s_data, s_bytes, t_data, t_bytes,
+ 1, 1, 1, max_d));
}
diff --git a/contrib/fuzzystrmatch/levenshtein.c b/contrib/fuzzystrmatch/levenshtein.c
deleted file mode 100644
index 4f37a54..0000000
--- a/contrib/fuzzystrmatch/levenshtein.c
+++ /dev/null
@@ -1,403 +0,0 @@
-/*
- * levenshtein.c
- *
- * Functions for "fuzzy" comparison of strings
- *
- * Joe Conway <mail@joeconway.com>
- *
- * Copyright (c) 2001-2014, PostgreSQL Global Development Group
- * ALL RIGHTS RESERVED;
- *
- * levenshtein()
- * -------------
- * Written based on a description of the algorithm by Michael Gilleland
- * found at http://www.merriampark.com/ld.htm
- * Also looked at levenshtein.c in the PHP 4.0.6 distribution for
- * inspiration.
- * Configurable penalty costs extension is introduced by Volkan
- * YAZICI <volkan.yazici@gmail.com>.
- */
-
-/*
- * External declarations for exported functions
- */
-#ifdef LEVENSHTEIN_LESS_EQUAL
-static int levenshtein_less_equal_internal(text *s, text *t,
- int ins_c, int del_c, int sub_c, int max_d);
-#else
-static int levenshtein_internal(text *s, text *t,
- int ins_c, int del_c, int sub_c);
-#endif
-
-#define MAX_LEVENSHTEIN_STRLEN 255
-
-
-/*
- * Calculates Levenshtein distance metric between supplied strings. Generally
- * (1, 1, 1) penalty costs suffices for common cases, but your mileage may
- * vary.
- *
- * One way to compute Levenshtein distance is to incrementally construct
- * an (m+1)x(n+1) matrix where cell (i, j) represents the minimum number
- * of operations required to transform the first i characters of s into
- * the first j characters of t. The last column of the final row is the
- * answer.
- *
- * We use that algorithm here with some modification. In lieu of holding
- * the entire array in memory at once, we'll just use two arrays of size
- * m+1 for storing accumulated values. At each step one array represents
- * the "previous" row and one is the "current" row of the notional large
- * array.
- *
- * If max_d >= 0, we only need to provide an accurate answer when that answer
- * is less than or equal to the bound. From any cell in the matrix, there is
- * theoretical "minimum residual distance" from that cell to the last column
- * of the final row. This minimum residual distance is zero when the
- * untransformed portions of the strings are of equal length (because we might
- * get lucky and find all the remaining characters matching) and is otherwise
- * based on the minimum number of insertions or deletions needed to make them
- * equal length. The residual distance grows as we move toward the upper
- * right or lower left corners of the matrix. When the max_d bound is
- * usefully tight, we can use this property to avoid computing the entirety
- * of each row; instead, we maintain a start_column and stop_column that
- * identify the portion of the matrix close to the diagonal which can still
- * affect the final answer.
- */
-static int
-#ifdef LEVENSHTEIN_LESS_EQUAL
-levenshtein_less_equal_internal(text *s, text *t,
- int ins_c, int del_c, int sub_c, int max_d)
-#else
-levenshtein_internal(text *s, text *t,
- int ins_c, int del_c, int sub_c)
-#endif
-{
- int m,
- n,
- s_bytes,
- t_bytes;
- int *prev;
- int *curr;
- int *s_char_len = NULL;
- int i,
- j;
- const char *s_data;
- const char *t_data;
- const char *y;
-
- /*
- * For levenshtein_less_equal_internal, we have real variables called
- * start_column and stop_column; otherwise it's just short-hand for 0 and
- * m.
- */
-#ifdef LEVENSHTEIN_LESS_EQUAL
- int start_column,
- stop_column;
-
-#undef START_COLUMN
-#undef STOP_COLUMN
-#define START_COLUMN start_column
-#define STOP_COLUMN stop_column
-#else
-#undef START_COLUMN
-#undef STOP_COLUMN
-#define START_COLUMN 0
-#define STOP_COLUMN m
-#endif
-
- /* Extract a pointer to the actual character data. */
- s_data = VARDATA_ANY(s);
- t_data = VARDATA_ANY(t);
-
- /* Determine length of each string in bytes and characters. */
- s_bytes = VARSIZE_ANY_EXHDR(s);
- t_bytes = VARSIZE_ANY_EXHDR(t);
- m = pg_mbstrlen_with_len(s_data, s_bytes);
- n = pg_mbstrlen_with_len(t_data, t_bytes);
-
- /*
- * We can transform an empty s into t with n insertions, or a non-empty t
- * into an empty s with m deletions.
- */
- if (!m)
- return n * ins_c;
- if (!n)
- return m * del_c;
-
- /*
- * For security concerns, restrict excessive CPU+RAM usage. (This
- * implementation uses O(m) memory and has O(mn) complexity.)
- */
- if (m > MAX_LEVENSHTEIN_STRLEN ||
- n > MAX_LEVENSHTEIN_STRLEN)
- ereport(ERROR,
- (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("argument exceeds the maximum length of %d bytes",
- MAX_LEVENSHTEIN_STRLEN)));
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
- /* Initialize start and stop columns. */
- start_column = 0;
- stop_column = m + 1;
-
- /*
- * If max_d >= 0, determine whether the bound is impossibly tight. If so,
- * return max_d + 1 immediately. Otherwise, determine whether it's tight
- * enough to limit the computation we must perform. If so, figure out
- * initial stop column.
- */
- if (max_d >= 0)
- {
- int min_theo_d; /* Theoretical minimum distance. */
- int max_theo_d; /* Theoretical maximum distance. */
- int net_inserts = n - m;
-
- min_theo_d = net_inserts < 0 ?
- -net_inserts * del_c : net_inserts * ins_c;
- if (min_theo_d > max_d)
- return max_d + 1;
- if (ins_c + del_c < sub_c)
- sub_c = ins_c + del_c;
- max_theo_d = min_theo_d + sub_c * Min(m, n);
- if (max_d >= max_theo_d)
- max_d = -1;
- else if (ins_c + del_c > 0)
- {
- /*
- * Figure out how much of the first row of the notional matrix we
- * need to fill in. If the string is growing, the theoretical
- * minimum distance already incorporates the cost of deleting the
- * number of characters necessary to make the two strings equal in
- * length. Each additional deletion forces another insertion, so
- * the best-case total cost increases by ins_c + del_c. If the
- * string is shrinking, the minimum theoretical cost assumes no
- * excess deletions; that is, we're starting no further right than
- * column n - m. If we do start further right, the best-case
- * total cost increases by ins_c + del_c for each move right.
- */
- int slack_d = max_d - min_theo_d;
- int best_column = net_inserts < 0 ? -net_inserts : 0;
-
- stop_column = best_column + (slack_d / (ins_c + del_c)) + 1;
- if (stop_column > m)
- stop_column = m + 1;
- }
- }
-#endif
-
- /*
- * In order to avoid calling pg_mblen() repeatedly on each character in s,
- * we cache all the lengths before starting the main loop -- but if all
- * the characters in both strings are single byte, then we skip this and
- * use a fast-path in the main loop. If only one string contains
- * multi-byte characters, we still build the array, so that the fast-path
- * needn't deal with the case where the array hasn't been initialized.
- */
- if (m != s_bytes || n != t_bytes)
- {
- int i;
- const char *cp = s_data;
-
- s_char_len = (int *) palloc((m + 1) * sizeof(int));
- for (i = 0; i < m; ++i)
- {
- s_char_len[i] = pg_mblen(cp);
- cp += s_char_len[i];
- }
- s_char_len[i] = 0;
- }
-
- /* One more cell for initialization column and row. */
- ++m;
- ++n;
-
- /* Previous and current rows of notional array. */
- prev = (int *) palloc(2 * m * sizeof(int));
- curr = prev + m;
-
- /*
- * To transform the first i characters of s into the first 0 characters of
- * t, we must perform i deletions.
- */
- for (i = START_COLUMN; i < STOP_COLUMN; i++)
- prev[i] = i * del_c;
-
- /* Loop through rows of the notional array */
- for (y = t_data, j = 1; j < n; j++)
- {
- int *temp;
- const char *x = s_data;
- int y_char_len = n != t_bytes + 1 ? pg_mblen(y) : 1;
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
-
- /*
- * In the best case, values percolate down the diagonal unchanged, so
- * we must increment stop_column unless it's already on the right end
- * of the array. The inner loop will read prev[stop_column], so we
- * have to initialize it even though it shouldn't affect the result.
- */
- if (stop_column < m)
- {
- prev[stop_column] = max_d + 1;
- ++stop_column;
- }
-
- /*
- * The main loop fills in curr, but curr[0] needs a special case: to
- * transform the first 0 characters of s into the first j characters
- * of t, we must perform j insertions. However, if start_column > 0,
- * this special case does not apply.
- */
- if (start_column == 0)
- {
- curr[0] = j * ins_c;
- i = 1;
- }
- else
- i = start_column;
-#else
- curr[0] = j * ins_c;
- i = 1;
-#endif
-
- /*
- * This inner loop is critical to performance, so we include a
- * fast-path to handle the (fairly common) case where no multibyte
- * characters are in the mix. The fast-path is entitled to assume
- * that if s_char_len is not initialized then BOTH strings contain
- * only single-byte characters.
- */
- if (s_char_len != NULL)
- {
- for (; i < STOP_COLUMN; i++)
- {
- int ins;
- int del;
- int sub;
- int x_char_len = s_char_len[i - 1];
-
- /*
- * Calculate costs for insertion, deletion, and substitution.
- *
- * When calculating cost for substitution, we compare the last
- * character of each possibly-multibyte character first,
- * because that's enough to rule out most mis-matches. If we
- * get past that test, then we compare the lengths and the
- * remaining bytes.
- */
- ins = prev[i] + ins_c;
- del = curr[i - 1] + del_c;
- if (x[x_char_len - 1] == y[y_char_len - 1]
- && x_char_len == y_char_len &&
- (x_char_len == 1 || rest_of_char_same(x, y, x_char_len)))
- sub = prev[i - 1];
- else
- sub = prev[i - 1] + sub_c;
-
- /* Take the one with minimum cost. */
- curr[i] = Min(ins, del);
- curr[i] = Min(curr[i], sub);
-
- /* Point to next character. */
- x += x_char_len;
- }
- }
- else
- {
- for (; i < STOP_COLUMN; i++)
- {
- int ins;
- int del;
- int sub;
-
- /* Calculate costs for insertion, deletion, and substitution. */
- ins = prev[i] + ins_c;
- del = curr[i - 1] + del_c;
- sub = prev[i - 1] + ((*x == *y) ? 0 : sub_c);
-
- /* Take the one with minimum cost. */
- curr[i] = Min(ins, del);
- curr[i] = Min(curr[i], sub);
-
- /* Point to next character. */
- x++;
- }
- }
-
- /* Swap current row with previous row. */
- temp = curr;
- curr = prev;
- prev = temp;
-
- /* Point to next character. */
- y += y_char_len;
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
-
- /*
- * This chunk of code represents a significant performance hit if used
- * in the case where there is no max_d bound. This is probably not
- * because the max_d >= 0 test itself is expensive, but rather because
- * the possibility of needing to execute this code prevents tight
- * optimization of the loop as a whole.
- */
- if (max_d >= 0)
- {
- /*
- * The "zero point" is the column of the current row where the
- * remaining portions of the strings are of equal length. There
- * are (n - 1) characters in the target string, of which j have
- * been transformed. There are (m - 1) characters in the source
- * string, so we want to find the value for zp where (n - 1) - j =
- * (m - 1) - zp.
- */
- int zp = j - (n - m);
-
- /* Check whether the stop column can slide left. */
- while (stop_column > 0)
- {
- int ii = stop_column - 1;
- int net_inserts = ii - zp;
-
- if (prev[ii] + (net_inserts > 0 ? net_inserts * ins_c :
- -net_inserts * del_c) <= max_d)
- break;
- stop_column--;
- }
-
- /* Check whether the start column can slide right. */
- while (start_column < stop_column)
- {
- int net_inserts = start_column - zp;
-
- if (prev[start_column] +
- (net_inserts > 0 ? net_inserts * ins_c :
- -net_inserts * del_c) <= max_d)
- break;
-
- /*
- * We'll never again update these values, so we must make sure
- * there's nothing here that could confuse any future
- * iteration of the outer loop.
- */
- prev[start_column] = max_d + 1;
- curr[start_column] = max_d + 1;
- if (start_column != 0)
- s_data += (s_char_len != NULL) ? s_char_len[start_column - 1] : 1;
- start_column++;
- }
-
- /* If they cross, we're going to exceed the bound. */
- if (start_column >= stop_column)
- return max_d + 1;
- }
-#endif
- }
-
- /*
- * Because the final value was swapped from the previous row to the
- * current row, that's where we'll find it.
- */
- return prev[m - 1];
-}
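[For reviewers following along: the two-row dynamic programming scheme described in the comment block being moved above can be sketched in miniature as follows. This is an illustrative, single-byte, unit-cost reduction of the algorithm, not the patched code itself; lev_distance is a hypothetical name.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Minimal two-row Levenshtein distance with unit costs; handles
 * single-byte characters only (no multibyte or max_d handling).
 */
static int
lev_distance(const char *s, const char *t)
{
	int			m = (int) strlen(s);
	int			n = (int) strlen(t);
	int		   *prev = malloc((m + 1) * sizeof(int));
	int		   *curr = malloc((m + 1) * sizeof(int));
	int			i,
				j,
				result;

	/* Transforming the first i characters of s into "" takes i deletions */
	for (i = 0; i <= m; i++)
		prev[i] = i;

	for (j = 1; j <= n; j++)
	{
		int		   *temp;

		/* Transforming "" into the first j characters of t takes j insertions */
		curr[0] = j;
		for (i = 1; i <= m; i++)
		{
			int			ins = prev[i] + 1;
			int			del = curr[i - 1] + 1;
			int			sub = prev[i - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);
			int			best = ins < del ? ins : del;

			curr[i] = best < sub ? best : sub;
		}
		/* Swap current row with previous row */
		temp = curr;
		curr = prev;
		prev = temp;
	}
	/* The answer was swapped into prev on the final iteration */
	result = prev[m];
	free(prev);
	free(curr);
	return result;
}
```

[As in the real implementation, the answer ends up in prev after the final swap; lev_distance("orderid", "orderids") is 1, which is exactly the kind of near-miss the hint mechanism targets.]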
diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 4a8aaf6..9866198 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -621,7 +621,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
colname = strVal(field2);
/* Try to identify as a column of the RTE */
- node = scanRTEForColumn(pstate, rte, colname, cref->location);
+ node = scanRTEForColumn(pstate, rte, colname, cref->location,
+ NULL, NULL);
if (node == NULL)
{
/* Try it as a function call on the whole row */
@@ -666,7 +667,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
colname = strVal(field3);
/* Try to identify as a column of the RTE */
- node = scanRTEForColumn(pstate, rte, colname, cref->location);
+ node = scanRTEForColumn(pstate, rte, colname, cref->location,
+ NULL, NULL);
if (node == NULL)
{
/* Try it as a function call on the whole row */
@@ -724,7 +726,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
colname = strVal(field4);
/* Try to identify as a column of the RTE */
- node = scanRTEForColumn(pstate, rte, colname, cref->location);
+ node = scanRTEForColumn(pstate, rte, colname, cref->location,
+ NULL, NULL);
if (node == NULL)
{
/* Try it as a function call on the whole row */
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 9ebd3fd..e128adf 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -1779,7 +1779,7 @@ ParseComplexProjection(ParseState *pstate, char *funcname, Node *first_arg,
((Var *) first_arg)->varno,
((Var *) first_arg)->varlevelsup);
/* Return a Var if funcname matches a column, else NULL */
- return scanRTEForColumn(pstate, rte, funcname, location);
+ return scanRTEForColumn(pstate, rte, funcname, location, NULL, NULL);
}
/*
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 478584d..1697b77 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -15,6 +15,7 @@
#include "postgres.h"
#include <ctype.h>
+#include <limits.h>
#include "access/htup_details.h"
#include "access/sysattr.h"
@@ -520,6 +521,22 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
}
/*
+ * distanceName
+ * Return Levenshtein distance between an actual column name and possible
+ * partial match.
+ */
+static int
+distanceName(const char *actual, const char *match, int max)
+{
+ int len = strlen(actual),
+ match_len = strlen(match);
+
+ /* Charge half as much per deletion as per insertion or per substitution */
+ return varstr_leven_less_equal(actual, len, match, match_len,
+ 2, 1, 2, max);
+}
+
+/*
* scanRTEForColumn
* Search the column names of a single RTE for the given name.
* If found, return an appropriate Var node, else return NULL.
@@ -527,10 +544,24 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
*
* Side effect: if we find a match, mark the RTE as requiring read access
* for the column.
+ *
+ * For those callers that will settle for a fuzzy match (for the purposes of
+ * building diagnostic messages), we match the column attribute whose name has
+ * the lowest Levenshtein distance from colname, setting *closest and
+ * *distance. Such callers should not rely on the return value (even when
+ * there is an exact match), nor should they expect the usual side effect
+ * (unless there is an exact match). This hardly matters in practice, since an
+ * error is imminent.
+ *
+ * If there are two or more attributes in the range table entry tied for
+ * closest, accurately report the shortest distance found overall, while not
+ * setting a "closest" attribute on the assumption that only a per-entry single
+ * closest match is useful. Note that we never consider system column names
+ * when performing fuzzy matching.
*/
Node *
scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
- int location)
+ int location, AttrNumber *closest, int *distance)
{
Node *result = NULL;
int attnum = 0;
@@ -548,12 +579,16 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
* Should this somehow go wrong and we try to access a dropped column,
* we'll still catch it by virtue of the checks in
* get_rte_attribute_type(), which is called by make_var(). That routine
- * has to do a cache lookup anyway, so the check there is cheap.
+ * has to do a cache lookup anyway, so the check there is cheap. Callers
+ * interested in finding the match with the shortest distance need to
+ * defend against this directly, though.
*/
foreach(c, rte->eref->colnames)
{
+ const char *attcolname = strVal(lfirst(c));
+
attnum++;
- if (strcmp(strVal(lfirst(c)), colname) == 0)
+ if (strcmp(attcolname, colname) == 0)
{
if (result)
ereport(ERROR,
@@ -566,6 +601,39 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
markVarForSelectPriv(pstate, var, rte);
result = (Node *) var;
}
+
+ if (distance && *distance != 0)
+ {
+ if (result)
+ {
+ /* Exact match just found */
+ *distance = 0;
+ }
+ else
+ {
+ int lowestdistance = *distance;
+ int thisdistance = distanceName(attcolname, colname,
+ lowestdistance);
+
+ if (thisdistance >= lowestdistance)
+ {
+ /*
+ * This match distance may equal a prior match within this
+ * same range table. When that happens, the prior match is
+ * discarded as worthless, since a single best match is
+ * required within a RTE.
+ */
+ if (thisdistance == lowestdistance)
+ *closest = InvalidAttrNumber;
+
+ continue;
+ }
+
+ /* Store new lowest observed distance for this RTE */
+ *distance = thisdistance;
+ }
+ *closest = attnum;
+ }
}
/*
@@ -642,7 +710,8 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
continue;
/* use orig_pstate here to get the right sublevels_up */
- newresult = scanRTEForColumn(orig_pstate, rte, colname, location);
+ newresult = scanRTEForColumn(orig_pstate, rte, colname, location,
+ NULL, NULL);
if (newresult)
{
@@ -668,8 +737,14 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
/*
* searchRangeTableForCol
- * See if any RangeTblEntry could possibly provide the given column name.
- * If so, return a pointer to the RangeTblEntry; else return NULL.
+ * See if any RangeTblEntry could possibly provide the given column name (or
+ * find the best match available). Returns a list of equally likely
+ * candidates, or NIL in the event of no plausible candidate.
+ *
+ * Column name may be matched fuzzily; we provide the closest columns if
+ * there was not an exact match. The caller can rely on the passed closest
+ * array to find the right attribute within the corresponding (first and
+ * second) returned list RTEs. InvalidAttrNumber there indicates an exact match.
*
* This is different from colNameToVar in that it considers every entry in
* the ParseState's rangetable(s), not only those that are currently visible
@@ -678,26 +753,145 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
* matches, but only one will be returned). This must be used ONLY as a
* heuristic in giving suitable error messages. See errorMissingColumn.
*/
-static RangeTblEntry *
-searchRangeTableForCol(ParseState *pstate, char *colname, int location)
+static List *
+searchRangeTableForCol(ParseState *pstate, const char *alias, char *colname,
+ int location, AttrNumber closest[2])
{
- ParseState *orig_pstate = pstate;
+ ParseState *orig_pstate = pstate;
+ int distance = INT_MAX;
+ List *matchedrte = NIL;
+ ListCell *l;
+ int i;
while (pstate != NULL)
{
- ListCell *l;
-
foreach(l, pstate->p_rtable)
{
- RangeTblEntry *rte = (RangeTblEntry *) lfirst(l);
+ RangeTblEntry *rte = (RangeTblEntry *) lfirst(l);
+ AttrNumber rteclosest = InvalidAttrNumber;
+ int rtdistance = INT_MAX;
+ bool wrongalias;
- if (scanRTEForColumn(orig_pstate, rte, colname, location))
- return rte;
+ /*
+ * Get single best match from each RTE, or no match for RTE if
+ * there is a tie for best match within a given RTE
+ */
+ scanRTEForColumn(orig_pstate, rte, colname, location, &rteclosest,
+ &rtdistance);
+
+ /* Was alias provided by user that does not match entry's alias? */
+ wrongalias = (alias && strcmp(alias, rte->eref->aliasname) != 0);
+
+ if (rtdistance == 0)
+ {
+ /* Exact match (for "wrong alias" or "wrong level" cases) */
+ closest[0] = wrongalias? rteclosest : InvalidAttrNumber;
+
+ /*
+ * Any exact match is always the uncontested best match. It
+ * doesn't seem worth considering the case where there are
+ * multiple exact matches, so we're done.
+ */
+ matchedrte = lappend(NIL, rte);
+ return matchedrte;
+ }
+
+ /*
+ * Charge extra (for inexact matches only) when an alias was
+ * specified that differs from what might have been used to
+ * correctly qualify this RTE's closest column
+ */
+ if (wrongalias)
+ rtdistance += 3;
+
+ if (rteclosest != InvalidAttrNumber)
+ {
+ if (rtdistance >= distance)
+ {
+ /*
+ * Perhaps record this attribute as being just as close in
+ * distance to closest attribute observed so far across
+ * entire range table. Iff this distance is ultimately the
+ * lowest distance observed overall, it may end up as the
+ * second match.
+ */
+ if (rtdistance == distance)
+ {
+ closest[1] = rteclosest;
+ matchedrte = lappend(matchedrte, rte);
+ }
+
+ continue;
+ }
+
+ /*
+ * One best match (better than any others in previous RTEs) was
+ * found within this RTE
+ */
+ distance = rtdistance;
+ /* New uncontested best match */
+ matchedrte = lappend(NIL, rte);
+ closest[0] = rteclosest;
+ }
+ else
+ {
+ /*
+ * Even though there were perhaps multiple joint-best matches
+ * within this RTE (implying that there can be no attribute
+ * suggestion from it), the shortest distance should still
+ * serve as the distance for later RTEs to beat (but naturally
+ * only if it happens to be the lowest so far across the entire
+ * range table).
+ */
+ distance = Min(distance, rtdistance);
+ }
}
pstate = pstate->parentParseState;
}
- return NULL;
+
+ /*
+ * Too many equally close partial matches found?
+ *
+ * It's useful to provide two matches for the common case where two range
+ * tables each have one equally distant candidate column, as when an
+ * unqualified (and therefore would-be ambiguous) column name is specified
+ * which is also misspelled by the user. It seems unhelpful to show no
+ * hint when this occurs, since in practice one attribute probably
+ * references the other in a foreign key relationship. However, when there
+ * are more than 2 range tables with equally distant matches that's
+ * probably because the matches are not useful, so don't suggest anything.
+ */
+ if (list_length(matchedrte) > 2)
+ return NIL;
+
+ /*
+ * Handle dropped columns, which can appear here as empty colnames per
+ * remarks within scanRTEForColumn(). If either the first or second
+ * suggested attributes are dropped, do not provide any suggestion.
+ */
+ i = 0;
+ foreach(l, matchedrte)
+ {
+ RangeTblEntry *rte = (RangeTblEntry *) lfirst(l);
+ char *closestcol;
+
+ closestcol = strVal(list_nth(rte->eref->colnames, closest[i++] - 1));
+
+ if (strcmp(closestcol, "") == 0)
+ return NIL;
+ }
+
+ /*
+ * Distance must be less than a normalized threshold in order to avoid
+ * completely ludicrous suggestions. Note that a distance of 6 will be
+ * seen when 6 deletions are required against actual attribute name, or 3
+ * insertions/substitutions.
+ */
+ if (distance > 6 && distance > strlen(colname) / 2)
+ return NIL;
+
+ return matchedrte;
}
/*
@@ -2856,40 +3050,95 @@ errorMissingRTE(ParseState *pstate, RangeVar *relation)
* Generate a suitable error about a missing column.
*
* Since this is a very common type of error, we work rather hard to
- * produce a helpful message.
+ * produce a helpful message, going so far as to guess user's intent
+ * when a missing column name is probably intended to reference one of
+ * two would-be ambiguous attributes (when no alias/qualification was
+ * provided).
*/
void
errorMissingColumn(ParseState *pstate,
char *relname, char *colname, int location)
{
- RangeTblEntry *rte;
+ List *matchedrte;
+ AttrNumber closest[2];
+ RangeTblEntry *rte1 = NULL,
+ *rte2 = NULL;
+ char *closestcol1 = NULL;
+ char *closestcol2 = NULL;
/*
- * If relname was given, just play dumb and report it. (In practice, a
- * bad qualification name should end up at errorMissingRTE, not here, so
- * no need to work hard on this case.)
+ * closest[0] will remain InvalidAttrNumber in the event of an exact match,
+ * in which case there is only ever one suggestion
*/
- if (relname)
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_COLUMN),
- errmsg("column %s.%s does not exist", relname, colname),
- parser_errposition(pstate, location)));
+ closest[0] = closest[1] = InvalidAttrNumber;
/*
- * Otherwise, search the entire rtable looking for possible matches. If
- * we find one, emit a hint about it.
+ * Search the entire rtable looking for possible matches. If we find one,
+ * emit a hint about it.
*
* TODO: improve this code (and also errorMissingRTE) to mention using
* LATERAL if appropriate.
*/
- rte = searchRangeTableForCol(pstate, colname, location);
-
- ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_COLUMN),
- errmsg("column \"%s\" does not exist", colname),
- rte ? errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
- colname, rte->eref->aliasname) : 0,
- parser_errposition(pstate, location)));
+ matchedrte = searchRangeTableForCol(pstate, relname, colname, location,
+ closest);
+
+ /*
+ * In practice a bad qualification name should end up at errorMissingRTE,
+ * not here, so no need to work hard on this case.
+ *
+ * Extract RTEs for best match, if any, and joint best match, if any.
+ */
+ if (matchedrte)
+ {
+ rte1 = (RangeTblEntry *) lfirst(list_head(matchedrte));
+
+ if (list_length(matchedrte) > 1)
+ rte2 = (RangeTblEntry *) lsecond(matchedrte);
+
+ if (rte1 && closest[0] != InvalidAttrNumber)
+ closestcol1 = strVal(list_nth(rte1->eref->colnames, closest[0] - 1));
+
+ if (rte2 && closest[1] != InvalidAttrNumber)
+ closestcol2 = strVal(list_nth(rte2->eref->colnames, closest[1] - 1));
+ }
+
+ if (!rte2)
+ {
+ /*
+ * Handle the case where there are zero or one column suggestions to
+ * hint, including exact matches referenced but not visible.
+ *
+ * Infer that an exact match was referenced despite not being visible
+ * from the fact that no attribute number was passed back.
+ */
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_COLUMN),
+ relname?
+ errmsg("column %s.%s does not exist", relname, colname):
+ errmsg("column \"%s\" does not exist", colname),
+ rte1? closest[0] != InvalidAttrNumber?
+ errhint("Perhaps you meant to reference the column \"%s\".\"%s\".",
+ rte1->eref->aliasname, closestcol1):
+ errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
+ colname, rte1->eref->aliasname): 0,
+ parser_errposition(pstate, location)));
+ }
+ else
+ {
+ /*
+ * Handle case where there are two equally useful column hints, each
+ * from a different RTE
+ */
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_COLUMN),
+ relname?
+ errmsg("column %s.%s does not exist", relname, colname):
+ errmsg("column \"%s\" does not exist", colname),
+ errhint("Perhaps you meant to reference the column \"%s\".\"%s\" or the column \"%s\".\"%s\".",
+ rte1->eref->aliasname, closestcol1,
+ rte2->eref->aliasname, closestcol2),
+ parser_errposition(pstate, location)));
+ }
}
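[To illustrate the asymmetric costing distanceName() passes to the core function above — insertions and substitutions at cost 2, deletions from the actual column name at cost 1 — here is a small weighted variant. This is a simplified single-byte sketch under those assumptions, not the patch's varstr_leven_less_equal(); weighted_distance is a hypothetical name.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Weighted single-byte edit distance: cost of transforming "actual"
 * into "match" given per-operation penalties, mirroring the cost
 * scheme used by distanceName().
 */
static int
weighted_distance(const char *actual, const char *match,
				  int ins_c, int del_c, int sub_c)
{
	int			m = (int) strlen(actual);
	int			n = (int) strlen(match);
	int		   *prev = malloc((m + 1) * sizeof(int));
	int		   *curr = malloc((m + 1) * sizeof(int));
	int			i,
				j,
				result;

	/* Deleting i characters of the actual name costs i * del_c */
	for (i = 0; i <= m; i++)
		prev[i] = i * del_c;

	for (j = 1; j <= n; j++)
	{
		int		   *temp;

		/* Inserting j characters into the actual name costs j * ins_c */
		curr[0] = j * ins_c;
		for (i = 1; i <= m; i++)
		{
			int			ins = prev[i] + ins_c;
			int			del = curr[i - 1] + del_c;
			int			sub = prev[i - 1] +
			(actual[i - 1] == match[j - 1] ? 0 : sub_c);
			int			best = ins < del ? ins : del;

			curr[i] = best < sub ? best : sub;
		}
		temp = curr;
		curr = prev;
		prev = temp;
	}
	result = prev[m];
	free(prev);
	free(curr);
	return result;
}
```

[With (2, 1, 2) costs, a character the user typed that the actual column lacks (an insertion) costs 2, while a character the user omitted (a deletion from the actual name) costs only 1, biasing suggestions toward columns whose names extend what was typed.]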
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 7b4391b..3ea9bf4 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -38,4 +38,6 @@ OBJS = acl.o arrayfuncs.o array_selfuncs.o array_typanalyze.o \
like.o: like.c like_match.c
+varlena.o: varlena.c levenshtein.c
+
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/utils/adt/levenshtein.c b/src/backend/utils/adt/levenshtein.c
new file mode 100644
index 0000000..bb8b7bf
--- /dev/null
+++ b/src/backend/utils/adt/levenshtein.c
@@ -0,0 +1,393 @@
+/*-------------------------------------------------------------------------
+ *
+ * levenshtein.c
+ * Levenshtein distance implementation.
+ *
+ * Original author: Joe Conway <mail@joeconway.com>
+ *
+ * This file is included by varlena.c twice, to provide matching code for (1)
+ * Levenshtein distance with custom costings, and (2) Levenshtein distance with
+ * custom costings and a "max" value above which exact distances are not
+ * interesting. Before the inclusion, we rely on the presence of the inline
+ * function rest_of_char_same().
+ *
+ * Written based on a description of the algorithm by Michael Gilleland found
+ * at http://www.merriampark.com/ld.htm. Also looked at levenshtein.c in the
+ * PHP 4.0.6 distribution for inspiration. Configurable penalty costs
+ * extension was introduced by Volkan YAZICI <volkan.yazici@gmail.com>.
+ *
+ * Copyright (c) 2001-2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/utils/adt/levenshtein.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#define MAX_LEVENSHTEIN_STRLEN 255
+
+/*
+ * Calculates Levenshtein distance metric between supplied strings, which are
+ * not necessarily null-terminated. Generally (1, 1, 1) penalty costs suffice
+ * for common cases, but your mileage may vary.
+ *
+ * One way to compute Levenshtein distance is to incrementally construct
+ * an (m+1)x(n+1) matrix where cell (i, j) represents the minimum number
+ * of operations required to transform the first i characters of s into
+ * the first j characters of t. The last column of the final row is the
+ * answer.
+ *
+ * We use that algorithm here with some modification. In lieu of holding
+ * the entire array in memory at once, we'll just use two arrays of size
+ * m+1 for storing accumulated values. At each step one array represents
+ * the "previous" row and one is the "current" row of the notional large
+ * array.
+ *
+ * If max_d >= 0, we only need to provide an accurate answer when that answer
+ * is less than or equal to the bound. From any cell in the matrix, there is
+ * theoretical "minimum residual distance" from that cell to the last column
+ * of the final row. This minimum residual distance is zero when the
+ * untransformed portions of the strings are of equal length (because we might
+ * get lucky and find all the remaining characters matching) and is otherwise
+ * based on the minimum number of insertions or deletions needed to make them
+ * equal length. The residual distance grows as we move toward the upper
+ * right or lower left corners of the matrix. When the max_d bound is
+ * usefully tight, we can use this property to avoid computing the entirety
+ * of each row; instead, we maintain a start_column and stop_column that
+ * identify the portion of the matrix close to the diagonal which can still
+ * affect the final answer.
+ */
+int
+#ifdef LEVENSHTEIN_LESS_EQUAL
+varstr_leven_less_equal(const char *source, int slen, const char *target,
+ int tlen, int ins_c, int del_c, int sub_c, int max_d)
+#else
+varstr_leven(const char *source, int slen, const char *target, int tlen,
+ int ins_c, int del_c, int sub_c)
+#endif
+{
+ int m,
+ n;
+ int *prev;
+ int *curr;
+ int *s_char_len = NULL;
+ int i,
+ j;
+ const char *y;
+
+ /*
+ * For varstr_leven_less_equal, we have real variables called
+ * start_column and stop_column; otherwise it's just short-hand for 0 and
+ * m.
+ */
+#ifdef LEVENSHTEIN_LESS_EQUAL
+ int start_column,
+ stop_column;
+
+#undef START_COLUMN
+#undef STOP_COLUMN
+#define START_COLUMN start_column
+#define STOP_COLUMN stop_column
+#else
+#undef START_COLUMN
+#undef STOP_COLUMN
+#define START_COLUMN 0
+#define STOP_COLUMN m
+#endif
+
+ m = pg_mbstrlen_with_len(source, slen);
+ n = pg_mbstrlen_with_len(target, tlen);
+
+ /*
+ * We can transform an empty s into t with n insertions, or a non-empty t
+ * into an empty s with m deletions.
+ */
+ if (!m)
+ return n * ins_c;
+ if (!n)
+ return m * del_c;
+
+ /*
+ * A common use for Levenshtein distance is to match column names.
+ * Therefore, restrict the size of MAX_LEVENSHTEIN_STRLEN such that this is
+ * guaranteed to work.
+ */
+ StaticAssertStmt(NAMEDATALEN <= MAX_LEVENSHTEIN_STRLEN,
+ "Levenshtein hinting mechanism restricts NAMEDATALEN");
+
+ /*
+ * For security concerns, restrict excessive CPU+RAM usage. (This
+ * implementation uses O(m) memory and has O(mn) complexity.)
+ */
+ if (m > MAX_LEVENSHTEIN_STRLEN ||
+ n > MAX_LEVENSHTEIN_STRLEN)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("argument exceeds the maximum length of %d bytes",
+ MAX_LEVENSHTEIN_STRLEN)));
+
+#ifdef LEVENSHTEIN_LESS_EQUAL
+ /* Initialize start and stop columns. */
+ start_column = 0;
+ stop_column = m + 1;
+
+ /*
+ * If max_d >= 0, determine whether the bound is impossibly tight. If so,
+ * return max_d + 1 immediately. Otherwise, determine whether it's tight
+ * enough to limit the computation we must perform. If so, figure out
+ * initial stop column.
+ */
+ if (max_d >= 0)
+ {
+ int min_theo_d; /* Theoretical minimum distance. */
+ int max_theo_d; /* Theoretical maximum distance. */
+ int net_inserts = n - m;
+
+ min_theo_d = net_inserts < 0 ?
+ -net_inserts * del_c : net_inserts * ins_c;
+ if (min_theo_d > max_d)
+ return max_d + 1;
+ if (ins_c + del_c < sub_c)
+ sub_c = ins_c + del_c;
+ max_theo_d = min_theo_d + sub_c * Min(m, n);
+ if (max_d >= max_theo_d)
+ max_d = -1;
+ else if (ins_c + del_c > 0)
+ {
+ /*
+ * Figure out how much of the first row of the notional matrix we
+ * need to fill in. If the string is growing, the theoretical
+ * minimum distance already incorporates the cost of deleting the
+ * number of characters necessary to make the two strings equal in
+ * length. Each additional deletion forces another insertion, so
+ * the best-case total cost increases by ins_c + del_c. If the
+ * string is shrinking, the minimum theoretical cost assumes no
+ * excess deletions; that is, we're starting no further right than
+ * column n - m. If we do start further right, the best-case
+ * total cost increases by ins_c + del_c for each move right.
+ */
+ int slack_d = max_d - min_theo_d;
+ int best_column = net_inserts < 0 ? -net_inserts : 0;
+
+ stop_column = best_column + (slack_d / (ins_c + del_c)) + 1;
+ if (stop_column > m)
+ stop_column = m + 1;
+ }
+ }
+#endif
+
+ /*
+ * In order to avoid calling pg_mblen() repeatedly on each character in s,
+ * we cache all the lengths before starting the main loop -- but if all
+ * the characters in both strings are single byte, then we skip this and
+ * use a fast-path in the main loop. If only one string contains
+ * multi-byte characters, we still build the array, so that the fast-path
+ * needn't deal with the case where the array hasn't been initialized.
+ */
+ if (m != slen || n != tlen)
+ {
+ int i;
+ const char *cp = source;
+
+ s_char_len = (int *) palloc((m + 1) * sizeof(int));
+ for (i = 0; i < m; ++i)
+ {
+ s_char_len[i] = pg_mblen(cp);
+ cp += s_char_len[i];
+ }
+ s_char_len[i] = 0;
+ }
+
+ /* One more cell for initialization column and row. */
+ ++m;
+ ++n;
+
+ /* Previous and current rows of notional array. */
+ prev = (int *) palloc(2 * m * sizeof(int));
+ curr = prev + m;
+
+ /*
+ * To transform the first i characters of s into the first 0 characters of
+ * t, we must perform i deletions.
+ */
+ for (i = START_COLUMN; i < STOP_COLUMN; i++)
+ prev[i] = i * del_c;
+
+ /* Loop through rows of the notional array */
+ for (y = target, j = 1; j < n; j++)
+ {
+ int *temp;
+ const char *x = source;
+ int y_char_len = n != tlen + 1 ? pg_mblen(y) : 1;
+
+#ifdef LEVENSHTEIN_LESS_EQUAL
+
+ /*
+ * In the best case, values percolate down the diagonal unchanged, so
+ * we must increment stop_column unless it's already on the right end
+ * of the array. The inner loop will read prev[stop_column], so we
+ * have to initialize it even though it shouldn't affect the result.
+ */
+ if (stop_column < m)
+ {
+ prev[stop_column] = max_d + 1;
+ ++stop_column;
+ }
+
+ /*
+ * The main loop fills in curr, but curr[0] needs a special case: to
+ * transform the first 0 characters of s into the first j characters
+ * of t, we must perform j insertions. However, if start_column > 0,
+ * this special case does not apply.
+ */
+ if (start_column == 0)
+ {
+ curr[0] = j * ins_c;
+ i = 1;
+ }
+ else
+ i = start_column;
+#else
+ curr[0] = j * ins_c;
+ i = 1;
+#endif
+
+ /*
+ * This inner loop is critical to performance, so we include a
+ * fast-path to handle the (fairly common) case where no multibyte
+ * characters are in the mix. The fast-path is entitled to assume
+ * that if s_char_len is not initialized then BOTH strings contain
+ * only single-byte characters.
+ */
+ if (s_char_len != NULL)
+ {
+ for (; i < STOP_COLUMN; i++)
+ {
+ int ins;
+ int del;
+ int sub;
+ int x_char_len = s_char_len[i - 1];
+
+ /*
+ * Calculate costs for insertion, deletion, and substitution.
+ *
+ * When calculating cost for substitution, we compare the last
+ * character of each possibly-multibyte character first,
+ * because that's enough to rule out most mis-matches. If we
+ * get past that test, then we compare the lengths and the
+ * remaining bytes.
+ */
+ ins = prev[i] + ins_c;
+ del = curr[i - 1] + del_c;
+ if (x[x_char_len - 1] == y[y_char_len - 1]
+ && x_char_len == y_char_len &&
+ (x_char_len == 1 || rest_of_char_same(x, y, x_char_len)))
+ sub = prev[i - 1];
+ else
+ sub = prev[i - 1] + sub_c;
+
+ /* Take the one with minimum cost. */
+ curr[i] = Min(ins, del);
+ curr[i] = Min(curr[i], sub);
+
+ /* Point to next character. */
+ x += x_char_len;
+ }
+ }
+ else
+ {
+ for (; i < STOP_COLUMN; i++)
+ {
+ int ins;
+ int del;
+ int sub;
+
+ /* Calculate costs for insertion, deletion, and substitution. */
+ ins = prev[i] + ins_c;
+ del = curr[i - 1] + del_c;
+ sub = prev[i - 1] + ((*x == *y) ? 0 : sub_c);
+
+ /* Take the one with minimum cost. */
+ curr[i] = Min(ins, del);
+ curr[i] = Min(curr[i], sub);
+
+ /* Point to next character. */
+ x++;
+ }
+ }
+
+ /* Swap current row with previous row. */
+ temp = curr;
+ curr = prev;
+ prev = temp;
+
+ /* Point to next character. */
+ y += y_char_len;
+
+#ifdef LEVENSHTEIN_LESS_EQUAL
+
+ /*
+ * This chunk of code represents a significant performance hit if used
+ * in the case where there is no max_d bound. This is probably not
+ * because the max_d >= 0 test itself is expensive, but rather because
+ * the possibility of needing to execute this code prevents tight
+ * optimization of the loop as a whole.
+ */
+ if (max_d >= 0)
+ {
+ /*
+ * The "zero point" is the column of the current row where the
+ * remaining portions of the strings are of equal length. There
+ * are (n - 1) characters in the target string, of which j have
+ * been transformed. There are (m - 1) characters in the source
+ * string, so we want to find the value for zp where (n - 1) - j =
+ * (m - 1) - zp.
+ */
+ int zp = j - (n - m);
+
+ /* Check whether the stop column can slide left. */
+ while (stop_column > 0)
+ {
+ int ii = stop_column - 1;
+ int net_inserts = ii - zp;
+
+ if (prev[ii] + (net_inserts > 0 ? net_inserts * ins_c :
+ -net_inserts * del_c) <= max_d)
+ break;
+ stop_column--;
+ }
+
+ /* Check whether the start column can slide right. */
+ while (start_column < stop_column)
+ {
+ int net_inserts = start_column - zp;
+
+ if (prev[start_column] +
+ (net_inserts > 0 ? net_inserts * ins_c :
+ -net_inserts * del_c) <= max_d)
+ break;
+
+ /*
+ * We'll never again update these values, so we must make sure
+ * there's nothing here that could confuse any future
+ * iteration of the outer loop.
+ */
+ prev[start_column] = max_d + 1;
+ curr[start_column] = max_d + 1;
+ if (start_column != 0)
+ source += (s_char_len != NULL) ? s_char_len[start_column - 1] : 1;
+ start_column++;
+ }
+
+ /* If they cross, we're going to exceed the bound. */
+ if (start_column >= stop_column)
+ return max_d + 1;
+ }
+#endif
+ }
+
+ /*
+ * Because the final value was swapped from the previous row to the
+ * current row, that's where we'll find it.
+ */
+ return prev[m - 1];
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index c3171b5..4b9e62a 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1546,6 +1546,31 @@ varstr_cmp(char *arg1, int len1, char *arg2, int len2, Oid collid)
return result;
}
+/*
+ * varstr_leven()
+ * varstr_leven_less_equal()
+ * Levenshtein distance functions. All argument strings should satisfy
+ * strlen(s) <= 255; guaranteed to work with Name datatype's cstrings.
+ * For full details see levenshtein.c.
+ *
+ * rest_of_char_same() is a helper, faster than memcmp() for this use case.
+ */
+static inline bool
+rest_of_char_same(const char *s1, const char *s2, int len)
+{
+ while (len > 0)
+ {
+ len--;
+ if (s1[len] != s2[len])
+ return false;
+ }
+ return true;
+}
+/* Expand each Levenshtein distance variant */
+#include "levenshtein.c"
+#define LEVENSHTEIN_LESS_EQUAL
+#include "levenshtein.c"
+#undef LEVENSHTEIN_LESS_EQUAL
/* text_cmp()
* Internal comparison function for text strings.
diff --git a/src/include/parser/parse_relation.h b/src/include/parser/parse_relation.h
index d8b9493..c18157a 100644
--- a/src/include/parser/parse_relation.h
+++ b/src/include/parser/parse_relation.h
@@ -35,7 +35,8 @@ extern RangeTblEntry *GetRTEByRangeTablePosn(ParseState *pstate,
extern CommonTableExpr *GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte,
int rtelevelsup);
extern Node *scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte,
- char *colname, int location);
+ char *colname, int location, AttrNumber *matchedatt,
+ int *distance);
extern Node *colNameToVar(ParseState *pstate, char *colname, bool localonly,
int location);
extern void markVarForSelectPriv(ParseState *pstate, Var *var,
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 4e74d85..0abe9bf 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -786,6 +786,11 @@ extern Datum textoverlay_no_len(PG_FUNCTION_ARGS);
extern Datum name_text(PG_FUNCTION_ARGS);
extern Datum text_name(PG_FUNCTION_ARGS);
extern int varstr_cmp(char *arg1, int len1, char *arg2, int len2, Oid collid);
+extern int varstr_leven(const char *source, int slen, const char *target,
+ int tlen, int ins_c, int del_c, int sub_c);
+extern int varstr_leven_less_equal(const char *source, int slen,
+ const char *target, int tlen, int ins_c,
+ int del_c, int sub_c, int max_d);
extern List *textToQualifiedNameList(text *textval);
extern bool SplitIdentifierString(char *rawstring, char separator,
List **namelist);
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index d233710..b24fa43 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -536,6 +536,7 @@ create table atacc1 ( test int );
-- add a check constraint (fails)
alter table atacc1 add constraint atacc_test1 check (test1>3);
ERROR: column "test1" does not exist
+HINT: Perhaps you meant to reference the column "atacc1"."test".
drop table atacc1;
-- something a little more complicated
create table atacc1 ( test int, test2 int, test3 int);
@@ -1342,6 +1343,7 @@ select f1 from c1;
ERROR: column "f1" does not exist
LINE 1: select f1 from c1;
^
+HINT: Perhaps you meant to reference the column "c1"."f2".
drop table p1 cascade;
NOTICE: drop cascades to table c1
create table p1 (f1 int, f2 int);
@@ -1355,6 +1357,7 @@ select f1 from c1;
ERROR: column "f1" does not exist
LINE 1: select f1 from c1;
^
+HINT: Perhaps you meant to reference the column "c1"."f2".
drop table p1 cascade;
NOTICE: drop cascades to table c1
create table p1 (f1 int, f2 int);
@@ -1479,6 +1482,7 @@ select oid > 0, * from altstartwith; -- fails
ERROR: column "oid" does not exist
LINE 1: select oid > 0, * from altstartwith;
^
+HINT: Perhaps you meant to reference the column "altstartwith"."col".
select * from altstartwith;
col
-----
@@ -1515,10 +1519,12 @@ select oid > 0, * from altwithoid; -- fails
ERROR: column "oid" does not exist
LINE 1: select oid > 0, * from altwithoid;
^
+HINT: Perhaps you meant to reference the column "altwithoid"."col".
select oid > 0, * from altinhoid; -- fails
ERROR: column "oid" does not exist
LINE 1: select oid > 0, * from altinhoid;
^
+HINT: Perhaps you meant to reference the column "altinhoid"."col".
select * from altwithoid;
col
-----
@@ -1554,6 +1560,7 @@ select oid > 0, * from altwithoid; -- fails
ERROR: column "oid" does not exist
LINE 1: select oid > 0, * from altwithoid;
^
+HINT: Perhaps you meant to reference the column "altwithoid"."col".
select oid > 0, * from altinhoid;
?column? | col
----------+-----
@@ -1580,6 +1587,7 @@ select oid > 0, * from altwithoid; -- fails
ERROR: column "oid" does not exist
LINE 1: select oid > 0, * from altwithoid;
^
+HINT: Perhaps you meant to reference the column "altwithoid"."col".
select oid > 0, * from altinhoid;
?column? | col
----------+-----
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 2501184..3ef5580 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2222,6 +2222,12 @@ select * from t1 left join t2 on (t1.a = t2.a);
200 | 1000 | 200 | 2001
(5 rows)
+-- Test matching of column name with wrong alias
+select t1.x from t1 join t3 on (t1.a = t3.x);
+ERROR: column t1.x does not exist
+LINE 1: select t1.x from t1 join t3 on (t1.a = t3.x);
+ ^
+HINT: Perhaps you meant to reference the column "t3"."x".
--
-- regression test for 8.1 merge right join bug
--
@@ -3415,6 +3421,39 @@ select * from
(0 rows)
--
+-- Test hints given on incorrect column references are useful
+--
+select t1.uunique1 from
+ tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestion
+ERROR: column t1.uunique1 does not exist
+LINE 1: select t1.uunique1 from
+ ^
+HINT: Perhaps you meant to reference the column "t1"."unique1".
+select t2.uunique1 from
+ tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+ERROR: column t2.uunique1 does not exist
+LINE 1: select t2.uunique1 from
+ ^
+HINT: Perhaps you meant to reference the column "t2"."unique1".
+select uunique1 from
+ tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+ERROR: column "uunique1" does not exist
+LINE 1: select uunique1 from
+ ^
+HINT: Perhaps you meant to reference the column "t1"."unique1" or the column "t2"."unique1".
+--
+-- Take care to reference the correct RTE
+--
+select atts.relid::regclass, s.* from pg_stats s join
+ pg_attribute a on s.attname = a.attname and s.tablename =
+ a.attrelid::regclass::text join (select unnest(indkey) attnum,
+ indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+ schemaname != 'pg_catalog';
+ERROR: column atts.relid does not exist
+LINE 1: select atts.relid::regclass, s.* from pg_stats s join
+ ^
+HINT: Perhaps you meant to reference the column "atts"."indexrelid".
+--
-- Test LATERAL
--
select unique2, x.*
diff --git a/src/test/regress/expected/plpgsql.out b/src/test/regress/expected/plpgsql.out
index 983f1b8..fb4abe6 100644
--- a/src/test/regress/expected/plpgsql.out
+++ b/src/test/regress/expected/plpgsql.out
@@ -4782,6 +4782,7 @@ END$$;
ERROR: column "foo" does not exist
LINE 1: SELECT rtrim(roomno) AS roomno, foo FROM Room ORDER BY roomn...
^
+HINT: Perhaps you meant to reference the column "room"."roomno".
QUERY: SELECT rtrim(roomno) AS roomno, foo FROM Room ORDER BY roomno
CONTEXT: PL/pgSQL function inline_code_block line 4 at FOR over SELECT rows
-- Check handling of errors thrown from/into anonymous code blocks.
diff --git a/src/test/regress/expected/rowtypes.out b/src/test/regress/expected/rowtypes.out
index 88e7bfa..19a6e98 100644
--- a/src/test/regress/expected/rowtypes.out
+++ b/src/test/regress/expected/rowtypes.out
@@ -452,6 +452,7 @@ select fullname.text from fullname; -- error
ERROR: column fullname.text does not exist
LINE 1: select fullname.text from fullname;
^
+HINT: Perhaps you meant to reference the column "fullname"."last".
-- same, but RECORD instead of named composite type:
select cast (row('Jim', 'Beam') as text);
row
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c79b45c..01c80af 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2396,6 +2396,7 @@ select xmin, * from fooview; -- fail, views don't have such a column
ERROR: column "xmin" does not exist
LINE 1: select xmin, * from fooview;
^
+HINT: Perhaps you meant to reference the column "fooview"."x".
select reltoastrelid, relkind, relfrozenxid
from pg_class where oid = 'fooview'::regclass;
reltoastrelid | relkind | relfrozenxid
diff --git a/src/test/regress/expected/without_oid.out b/src/test/regress/expected/without_oid.out
index cb2c0c0..fbff011 100644
--- a/src/test/regress/expected/without_oid.out
+++ b/src/test/regress/expected/without_oid.out
@@ -46,6 +46,7 @@ SELECT count(oid) FROM wo;
ERROR: column "oid" does not exist
LINE 1: SELECT count(oid) FROM wo;
^
+HINT: Perhaps you meant to reference the column "wo"."i".
VACUUM ANALYZE wi;
VACUUM ANALYZE wo;
SELECT min(relpages) < max(relpages), min(reltuples) - max(reltuples)
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 718e1d9..ca7f966 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -397,6 +397,10 @@ insert into t2a values (200, 2001);
select * from t1 left join t2 on (t1.a = t2.a);
+-- Test matching of column name with wrong alias
+
+select t1.x from t1 join t3 on (t1.a = t3.x);
+
--
-- regression test for 8.1 merge right join bug
--
@@ -1051,6 +1055,26 @@ select * from
int8_tbl x join (int4_tbl x cross join int4_tbl y(ff)) j on q1 = f1; -- ok
--
+-- Test hints given on incorrect column references are useful
+--
+
+select t1.uunique1 from
+ tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestion
+select t2.uunique1 from
+ tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+select uunique1 from
+ tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+
+--
+-- Take care to reference the correct RTE
+--
+
+select atts.relid::regclass, s.* from pg_stats s join
+ pg_attribute a on s.attname = a.attname and s.tablename =
+ a.attrelid::regclass::text join (select unnest(indkey) attnum,
+ indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+ schemaname != 'pg_catalog';
+--
-- Test LATERAL
--
--
1.9.1
On Mon, Nov 10, 2014 at 1:48 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas <robertmhaas@gmail.com>
wrote:
Based on this review from a month ago, I'm going to mark this Waiting
on Author. If nobody updates the patch in a few days, I'll mark it
Returned with Feedback. Thanks.
Attached revision fixes the compiler warning that Heikki complained
about. I maintain SQL-callable stub functions from within contrib,
rather than follow Michael's approach. In other words, very little has
changed from my revision from July last [1].
FWIW, I still find this bit of code that this patch adds in varlena.c ugly:
+#include "levenshtein.c"
+#define LEVENSHTEIN_LESS_EQUAL
+#include "levenshtein.c"
+#undef LEVENSHTEIN_LESS_EQUAL
--
Michael
On Sun, Nov 9, 2014 at 8:56 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
FWIW, I still find this bit of code that this patch adds in varlena.c ugly:
+#include "levenshtein.c"
+#define LEVENSHTEIN_LESS_EQUAL
+#include "levenshtein.c"
+#undef LEVENSHTEIN_LESS_EQUAL
Okay, but this is the coding that currently appears within contrib's
fuzzystrmatch.c, more or less unchanged. The "#undef
LEVENSHTEIN_LESS_EQUAL" line that I added ought to be unnecessary.
I'll give the final word on that to whoever picks this up.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Nov 10, 2014 at 1:48 PM, Peter Geoghegan <pg@heroku.com> wrote:
Reminder: I maintain a slight preference for only offering one
suggestion per relation RTE, which is what this revision does (so no
change there). If a committer who picks this up wants me to alter
that, I don't mind doing so; since only Michael spoke up on this, I've
kept things my way.
Hm. The last version of this patch has not really changed since my
first review, and I have no more feedback to provide about it except what I
already mentioned. I honestly don't think that this patch is ready for
committer as-is... If someone wants to review it further, well extra
opinions I am sure are welcome.
--
Michael
On Mon, Nov 10, 2014 at 8:13 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Hm. The last version of this patch has not really changed since my
first review, and I have no more feedback to provide about it except what I
already mentioned. I honestly don't think that this patch is ready for
committer as-is... If someone wants to review it further, well extra
opinions I am sure are welcome.
Why not?
You've already said that you're happy to defer to whatever committer
picks this up with regard to whether or not more than a single
suggestion can come from an RTE. I agreed with this (i.e. I said I'd
defer to their opinion too), and once again drew particular attention
to this state of affairs alongside my most recent revision.
What does that leave?
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Nov 10, 2014 at 10:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
Why not?
You've already said that you're happy to defer to whatever committer
picks this up with regard to whether or not more than a single
suggestion can come from an RTE. I agreed with this (i.e. I said I'd
defer to their opinion too), and once again drew particular attention
to this state of affairs alongside my most recent revision.
What does that leave?
I see you've marked this "Needs Review", even though you previously
marked it "Ready for Committer" a few months back (Robert marked it
"Waiting on Author" very recently because of the compiler warning, and
then I marked it back to "Ready for Committer" once that was
addressed, before you finally marked it back to "Needs Review" and
removed yourself as the reviewer just now).
I'm pretty puzzled by this. Other than our "agree to disagree and
defer to committer" position on the question of whether or not more
than one suggestion can come from a single RTE, which you were fine
with before [1], I have only restored the core/contrib separation to a
state recently suggested by Robert as the best and simplest all around
[2].
Did I miss something else?
[1]: /messages/by-id/CAB7nPqQObEeQ298F0Rb5+vrgex5_r=j-BVqzgP0qA1Y_xDC_1g@mail.gmail.com
[2]: /messages/by-id/CA+TgmoYKiiq8MC0UJ5i5XfkTYBg1qqfN4YRCkZ60YDUnumkzzQ@mail.gmail.com
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Nov 9, 2014 at 11:48 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Based on this review from a month ago, I'm going to mark this Waiting
on Author. If nobody updates the patch in a few days, I'll mark it
Returned with Feedback. Thanks.
Attached revision fixes the compiler warning that Heikki complained
about. I maintain SQL-callable stub functions from within contrib,
rather than follow Michael's approach. In other words, very little has
changed from my revision from July last [1].
I agree with your proposed approach to moving Levenshtein into core.
However, I think this should be separated into two patches, one of
them moving the Levenshtein functionality into core, and the other
adding the new treatment for missing column errors. If you can do
that relatively soon, I'll make an effort to get the refactoring patch
committed in the near future. Once that's done, we can focus in on
the interesting part of the patch, which is the actual machinery for
suggesting alternatives.
On that topic, I think there's unanimous consensus against the design
where equally-distant matches are treated differently based on whether
they are in the same RTE or different RTEs. I think you need to
change that if you want to get anywhere with this. On a related note,
the use of the additional parameter AttrNumber closest[2] to
searchRangeTableForCol() and of the additional parameters AttrNumber
*matchedatt and int *distance to scanRTEForColumn() is less than
self-documenting. I suggest creating a structure called something
like FuzzyAttrMatchState and passing a pointer to it down to both
functions.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I agree with your proposed approach to moving Levenshtein into core.
However, I think this should be separated into two patches, one of
them moving the Levenshtein functionality into core, and the other
adding the new treatment for missing column errors. If you can do
that relatively soon, I'll make an effort to get the refactoring patch
committed in the near future. Once that's done, we can focus in on
the interesting part of the patch, which is the actual machinery for
suggesting alternatives.
Okay, thanks. I think I can do that fairly soon.
On that topic, I think there's unanimous consensus against the design
where equally-distant matches are treated differently based on whether
they are in the same RTE or different RTEs. I think you need to
change that if you want to get anywhere with this.
Alright. It wasn't as if I felt very strongly about it either way.
On a related note,
the use of the additional parameter AttrNumber closest[2] to
searchRangeTableForCol() and of the additional parameters AttrNumber
*matchedatt and int *distance to scanRTEForColumn() is less than
self-documenting. I suggest creating a structure called something
like FuzzyAttrMatchState and passing a pointer to it down to both
functions.
Sure.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Nov 12, 2014 at 1:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I agree with your proposed approach to moving Levenshtein into core.
However, I think this should be separated into two patches, one of
them moving the Levenshtein functionality into core, and the other
adding the new treatment for missing column errors. If you can do
that relatively soon, I'll make an effort to get the refactoring patch
committed in the near future. Once that's done, we can focus in on
the interesting part of the patch, which is the actual machinery for
suggesting alternatives.Okay, thanks. I think I can do that fairly soon.
Attached patch moves the Levenshtein distance implementation into core.
You're missing patch 2 of 2 here, because I have yet to incorporate
your feedback on the HINT itself -- when I've done that, I'll post a
newly rebased patch 2/2, with those items taken care of. As you
pointed out, there is no reason to wait for that.
--
Peter Geoghegan
Attachments:
.0001-Move-Levenshtein-distance-implementation-into-core.patch.swp (application/octet-stream)