Doing better at HINTing an appropriate column within errorMissingColumn()

Started by Peter Geoghegan almost 12 years ago, 147 messages
#1Peter Geoghegan
pg@heroku.com
1 attachment(s)

With the addition of LATERAL subqueries, Tom fixed up the mechanism
for keeping track of which relations are visible for column references
while the FROM clause is being scanned. That allowed
errorMissingColumn() to give a more useful error than the one produced
by the prior coding of that mechanism, with an errhint sometimes
proffering: 'There is a column named "foo" in table "bar", but it
cannot be referenced from this part of the query'.

I wondered how much further this could be taken. Attached patch
modifies contrib/fuzzystrmatch, moving its Levenshtein distance code
into core without actually moving the relevant SQL functions too. That
change allowed me to modify errorMissingColumn() to make more useful
suggestions as to what might have been intended under other
circumstances, like when someone fat-fingers a column name. psql tab
completion is good, but not so good that this doesn't happen all the
time. It's good practice to consistently name columns and tables such
that it's possible to intuit the names of columns from the names of
tables and so on, but it's still pretty common to forget if a column
name from the table "orders" is "order_id", "orderid", or "ordersid",
particularly if you're someone who regularly interacts with many
databases. This problem is annoying in a low intensity kind of way.

Consider the following sample sessions of mine, made with the
dellstore2 sample database:

[local]/postgres=# select * from orders o join orderlines ol on
o.orderid = ol.orderids limit 1;
ERROR: 42703: column ol.orderids does not exist
LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid...
^
HINT: Perhaps you meant to reference the column "ol"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:2989
[local]/postgres=# select * from orders o join orderlines ol on
o.orderid = ol.orderid limit 1;
 orderid | orderdate  | customerid | netamount |  tax  | totalamount | orderlineid | orderid | prod_id | quantity | orderdate
---------+------------+------------+-----------+-------+-------------+-------------+---------+---------+----------+------------
       1 | 2004-01-27 |       7888 |    313.24 | 25.84 |      339.08 |           1 |       1 |    9117 |        1 | 2004-01-27
(1 row)

[local]/postgres=# select ordersid from orders o join orderlines ol on
o.orderid = ol.orderid limit 1;
ERROR: 42703: column "ordersid" does not exist
LINE 1: select ordersid from orders o join orderlines ol on o.orderi...
^
HINT: Perhaps you meant to reference the column "o"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:2999
[local]/postgres=# select ol.ordersid from orders o join orderlines ol
on o.orderid = ol.orderid limit 1;
ERROR: 42703: column ol.ordersid does not exist
LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord...
^
HINT: Perhaps you meant to reference the column "ol"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:2989

We try to give the most useful possible HINT here, charging extra for
a non-matching alias, and going through the range table in order and
preferring the first column observed to any subsequent column whose
name is of the same distance as an earlier Var. The fuzzy string
matching works well enough that it seems possible in practice to
successfully have the parser make the right suggestion, even when the
user's original guess was fairly far off. I've found it works best to
charge half as much for a character deletion, so that's what is
charged.
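
To make the selection rule above concrete, here is a minimal Python sketch. It is not the patch's C code. The cost values (insertion 2, deletion 1, substitution 2), the ALIAS_PENALTY constant, and the names weighted_levenshtein and suggest_column are all assumptions made up for illustration. The only details taken from the description are that a deletion is charged half as much, that a non-matching alias is charged extra, and that the first column seen wins ties. Treating the name the user typed as the string that characters are deleted from is also an assumption.

```python
def weighted_levenshtein(source, target, ins_cost=2, del_cost=1, sub_cost=2):
    """Cost of turning source into target with per-operation costs.

    The default costs are hypothetical; they only encode "a deletion
    costs half as much as the other operations".
    """
    # prev[j] = cost of turning the empty prefix of source into target[:j]
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_ch in enumerate(source, start=1):
        curr = [i * del_cost]
        for j, t_ch in enumerate(target, start=1):
            curr.append(min(
                prev[j] + del_cost,                               # delete s_ch
                curr[j - 1] + ins_cost,                           # insert t_ch
                prev[j - 1] + (0 if s_ch == t_ch else sub_cost),  # match/substitute
            ))
        prev = curr
    return prev[-1]


ALIAS_PENALTY = 3  # hypothetical extra charge for a non-matching alias


def suggest_column(typed_alias, typed_col, range_table):
    """Pick the closest column from range_table.

    range_table is a list of (alias, [column names]) pairs in FROM-clause
    order. typed_alias is None for an unqualified column reference.
    """
    best, best_cost = None, None
    for alias, columns in range_table:
        for col in columns:
            cost = weighted_levenshtein(typed_col, col)
            if typed_alias is not None and typed_alias != alias:
                cost += ALIAS_PENALTY
            # Strict comparison: an earlier column keeps its place on a tie.
            if best_cost is None or cost < best_cost:
                best, best_cost = (alias, col), cost
    return best


# Column lists as they appear in the dellstore2 session above.
DELLSTORE_RANGE_TABLE = [
    ("o", ["orderid", "orderdate", "customerid", "netamount", "tax",
           "totalamount"]),
    ("ol", ["orderlineid", "orderid", "prod_id", "quantity", "orderdate"]),
]
```

With these assumed costs, suggest_column(None, "ordersid", DELLSTORE_RANGE_TABLE) picks "o"."orderid" rather than "ol"."orderid", because both are one deletion away and the earlier range table entry wins the tie. Qualifying the reference with "ol" flips the choice, since the alias penalty then applies to "o".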

I have some outstanding concerns about the proposed patch:

* It may be the case that dense logosyllabic or morphographic writing
systems, for example Kanji, might consistently present, say, Japanese
users with a suggestion that just isn't very useful, to the point of
being annoying. Perhaps some Japanese hackers can comment on the
actual risks here.

* Perhaps I should have moved the Levenshtein distance functions into
core and been done with it. I thought that given the present restriction
that the implementation imposes on source and target string lengths,
it would be best to leave the user-facing SQL functions in contrib.
That restriction is not relevant to the internal use of Levenshtein
distance added here, though.

Thoughts?
--
Peter Geoghegan

Attachments:

levenshtein_column_hint.v1.2014_03_27.patch.gz (application/x-gzip)
#2Pavel Stehule
pavel.stehule@gmail.com
In reply to: Peter Geoghegan (#1)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Hello

I see only one risk: it could slow down exception processing.

Sometimes you have code like this:

BEGIN
  WHILE ..
  LOOP
    BEGIN
      INSERT INTO ...
    EXCEPTION WHEN .. THEN
      NULL; /* ignore this error */
    END;
  END LOOP;
END;

Apart from this risk, the proposed feature is nice, but it should be fast.

Regards

Pavel

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Peter Geoghegan
pg@heroku.com
In reply to: Pavel Stehule (#2)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Fri, Mar 28, 2014 at 1:00 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

I see only one risk: it could slow down exception processing.

I think it's unlikely that you'd see ERRCODE_UNDEFINED_COLUMN in
procedural code like that in practice. In any case it's worth noting
that I continually pass back a "max" to the Levenshtein distance
implementation, which is the current shortest distance observed. The
implementation is therefore not obliged to exhaustively find a
distance that is already known to be of no use. See commit 604ab0.
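
The effect of passing back a "max" can be sketched in Python as follows. This is only an illustration of the general bounding idea, not the code from the patch or from the commit mentioned above. The names bounded_levenshtein and closest are hypothetical, and unit costs are assumed for simplicity.

```python
def bounded_levenshtein(source, target, max_d):
    """Unit-cost edit distance with an upper bound.

    Returns the true distance if it is <= max_d, otherwise max_d + 1,
    giving up as soon as the bound is certain to be exceeded.
    """
    # The length difference alone is a lower bound on the distance.
    if abs(len(source) - len(target)) > max_d:
        return max_d + 1
    prev = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        curr = [i]
        for j, t_ch in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1,                    # deletion
                curr[j - 1] + 1,                # insertion
                prev[j - 1] + (s_ch != t_ch),   # match or substitution
            ))
        # The final distance can never be below the smallest entry of any
        # row, so once a whole row exceeds the bound we can stop.
        if min(curr) > max_d:
            return max_d + 1
        prev = curr
    return min(prev[-1], max_d + 1)


def closest(typed, candidates):
    """Return (candidate, distance) for the closest candidate.

    Each call is bounded by the best distance seen so far, so hopeless
    candidates are abandoned early. Earlier candidates win ties.
    """
    best, best_d = None, None
    for cand in candidates:
        # Only a strictly smaller distance is of any use from now on.
        limit = len(typed) + len(cand) if best_d is None else best_d - 1
        d = bounded_levenshtein(typed, cand, limit)
        if d <= limit:
            best, best_d = cand, d
    return best, best_d
```

Once a close match such as "orderid" has been seen, every later candidate is compared against a very small bound, so most of them are rejected after looking at only a row or two of the matrix.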

--
Peter Geoghegan

#4Pavel Stehule
pavel.stehule@gmail.com
In reply to: Peter Geoghegan (#3)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

2014-03-28 9:22 GMT+01:00 Peter Geoghegan <pg@heroku.com>:

On Fri, Mar 28, 2014 at 1:00 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

I see only one risk: it could slow down exception processing.

I think it's unlikely that you'd see ERRCODE_UNDEFINED_COLUMN in
procedural code like that in practice. In any case it's worth noting
that I continually pass back a "max" to the Levenshtein distance
implementation, which is the current shortest distance observed. The
implementation is therefore not obliged to exhaustively find a
distance that is already known to be of no use. See commit 604ab0.

If it is limited to ERRCODE_UNDEFINED_COLUMN, then it should be OK (from a
performance perspective).

But a second issue could be usage from plpgsql, where SQL identifiers and
plpgsql variables are mixed.

Pavel


#5Oleg Bartunov
obartunov@gmail.com
In reply to: Peter Geoghegan (#1)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Very interesting idea. I'd think about optionally adding similarity hinting
support to psql tab completion. With, say, an 80% similarity threshold, it
shouldn't be very annoying. For interactive usage there is no risk of
slowdown.

#6Albe Laurenz
laurenz.albe@wien.gv.at
In reply to: Peter Geoghegan (#1)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Peter Geoghegan wrote:

With the addition of LATERAL subqueries, Tom fixed up the mechanism
for keeping track of which relations are visible for column references
while the FROM clause is being scanned. That allowed
errorMissingColumn() to give a more useful error than the one produced
by the prior coding of that mechanism, with an errhint sometimes
proffering: 'There is a column named "foo" in table "bar", but it
cannot be referenced from this part of the query'.

I wondered how much further this could be taken. Attached patch
modifies contrib/fuzzystrmatch, moving its Levenshtein distance code
into core without actually moving the relevant SQL functions too. That
change allowed me to modify errorMissingColumn() to make more useful
suggestions as to what might have been intended under other
circumstances, like when someone fat-fingers a column name.

[local]/postgres=# select * from orders o join orderlines ol on o.orderid = ol.orderids limit 1;
ERROR: 42703: column ol.orderids does not exist
LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid...
^
HINT: Perhaps you meant to reference the column "ol"."orderid".

This sounds like a mild version of DWIM:
http://www.jargondb.org/glossary/dwim

Maybe it is just me, but I get uncomfortable when a program tries
to second-guess what I really want.

Yours,
Laurenz Albe

#7Christoph Berg
cb@df7cb.de
In reply to: Albe Laurenz (#6)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Re: Albe Laurenz 2014-03-28 <A737B7A37273E048B164557ADEF4A58B17CE8DEA@ntex2010i.host.magwien.gv.at>

ERROR: 42703: column ol.orderids does not exist
LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid...
^
HINT: Perhaps you meant to reference the column "ol"."orderid".

This sounds like a mild version of DWIM:
http://www.jargondb.org/glossary/dwim

Maybe it is just me, but I get uncomfortable when a program tries
to second-guess what I really want.

I find it very annoying when zsh asks me "did you mean foo [y/n]" and
I need to confirm that, but I'd find a mere HINT that I can easily
ignore a very useful feature. +1 for the idea.

Christoph
--
cb@df7cb.de | http://www.df7cb.de/

#8Michael Paquier
michael.paquier@gmail.com
In reply to: Peter Geoghegan (#1)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Fri, Mar 28, 2014 at 4:10 AM, Peter Geoghegan <pg@heroku.com> wrote:

We try to give the most useful possible HINT here, charging extra for
a non-matching alias, and going through the range table in order and
preferring the first column observed to any subsequent column whose
name is of the same distance as an earlier Var. The fuzzy string
matching works well enough that it seems possible in practice to
successfully have the parser make the right suggestion, even when the
user's original guess was fairly far off. I've found it works best to
charge half as much for a character deletion, so that's what is
charged.

What about the overhead that this processing creates if error
processing needs to scan a schema with let's say hundreds of tables?

* It may be the case that dense logosyllabic or morphographic writing
systems, for example Kanji, might consistently present, say, Japanese
users with a suggestion that just isn't very useful, to the point of
being annoying. Perhaps some Japanese hackers can comment on the
actual risks here.

As long as Hiragana-only words (the basic alphabet for Japanese words),
and more particularly Katakana-only words (used to write foreign words
phonetically), are compared with each other (or even Kanji-only words
with each other), Levenshtein can play its role pretty well. But once a
comparison is made between two words written in different alphabets,
Levenshtein is not going to work well. A simple example is 'ramen'
(Japanese noodles), which you can find written sometimes in Hiragana and
sometimes in Katakana, and here Levenshtein performs poorly:
=# select levenshtein('ラーメン', 'らあめん');
levenshtein
-------------
4
(1 row)
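
The result of 4 follows from the two spellings sharing no characters at all, so every position needs a substitution. The Python sketch below reproduces that with a plain unit-cost Levenshtein distance. It also shows, purely as an illustration and not as anything proposed in this thread, what happens if Katakana is first folded onto Hiragana. The helper katakana_to_hiragana is hypothetical and relies on the Unicode layout in which Katakana U+30A1..U+30F6 sits exactly 0x60 above the corresponding Hiragana.

```python
def levenshtein(a, b):
    """Plain unit-cost Levenshtein distance over code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # match or substitution
            ))
        prev = curr
    return prev[-1]


def katakana_to_hiragana(text):
    """Hypothetical helper: map Katakana U+30A1..U+30F6 to Hiragana.

    Characters outside that range, including the prolonged sound mark
    (U+30FC), are left unchanged.
    """
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


katakana = "ラーメン"
hiragana = "らあめん"

# No character is shared, so all four positions differ.
raw_distance = levenshtein(katakana, hiragana)

# After folding, only the prolonged sound mark still differs.
folded_distance = levenshtein(katakana_to_hiragana(katakana), hiragana)
```

Here raw_distance is 4, matching the query above, while folded_distance is 1. Whether any such folding would be appropriate for identifiers is a separate question.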
Regards,
--
Michael

#9Peter Geoghegan
pg@heroku.com
In reply to: Michael Paquier (#8)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Fri, Mar 28, 2014 at 5:57 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

What about the overhead that this processing creates if error
processing needs to scan a schema with let's say hundreds of tables?

It doesn't work that way. I've extended searchRangeTableForCol() so
that when it calls scanRTEForColumn(), it considers Levenshtein
distance, and not just plain string equality, which is what happens
today. The code only looks through ParseState.

--
Peter Geoghegan

#10Robert Haas
robertmhaas@gmail.com
In reply to: Albe Laurenz (#6)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Fri, Mar 28, 2014 at 4:47 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:

This sounds like a mild version of DWIM:
http://www.jargondb.org/glossary/dwim

Maybe it is just me, but I get uncomfortable when a program tries
to second-guess what I really want.

It's not really DWIM, because the backend is still throwing an error.
It's just trying to help you sort out the error, along the way.
Still, I share some of your discomfort. I see Peter's patch as an
example of a broader class of things that we could do - but I'm not
altogether sure that we want to do them. There's a risk of adding not
only CPU cycles but also clutter. If we do things that encourage
people to crank the log verbosity down, I think that's going to be bad
more often than it's good. It strains credulity to think that this
patch alone would have that effect, but there might be quite a few
similar improvements that are possible. So I think it would be good
to consider how far we want to go in this direction and where we think
we might want to stop. That's not to say, let's not ever do this,
just, let's think carefully about where we want to end up.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#10)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Apr 1, 2014 at 7:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:

There's a risk of adding not
only CPU cycles but also clutter. If we do things that encourage
people to crank the log verbosity down, I think that's going to be bad
more often than it's good.

While I share your concern here, I think that this is something that
is only likely to be seen in an interactive psql session, where it is
seen quite frequently. I am reasonably confident that it's highly
unusual to see ERRCODE_UNDEFINED_COLUMN in other settings. Not having
to do a mental context switch when writing an ad-hoc query has
considerable value. Even C compilers like Clang have this kind of
feedback. This is a patch that was written out of personal
frustration with the experience of interacting with many different
databases. Things like the Python REPL don't do so much of this kind
of thing, but presumably that's because of Python's dynamic typing.
This is a HINT that can be given with fairly high confidence that
it'll be helpful - there just won't be that many things that the user
could have meant to choose from. I think it's even useful when the
suggested column is distant from the original suggestion (i.e.
errorMissingColumn() offers only what is clearly a "wild guess"),
because then the user knows that he or she has got it quite wrong.
Frequently, this will be because the wrong synonym for what should
have been written was used.

It strains credulity to think that this
patch alone would have that effect, but there might be quite a few
similar improvements that are possible. So I think it would be good
to consider how far we want to go in this direction and where we think
we might want to stop. That's not to say, let's not ever do this,
just, let's think carefully about where we want to end up.

Fair enough.

--
Peter Geoghegan

#12Jim Nasby
jim@nasby.net
In reply to: Peter Geoghegan (#11)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On 4/1/14, 1:04 PM, Peter Geoghegan wrote:

It strains credulity to think that this
patch alone would have that effect, but there might be quite a few
similar improvements that are possible. So I think it would be good
to consider how far we want to go in this direction and where we think
we might want to stop. That's not to say, let's not ever do this,
just, let's think carefully about where we want to end up.

Fair enough.

I agree with the concern, but also have to say that I can't count how many times I could have used this. A big +1, at least in this case.
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

#13Greg Stark
stark@mit.edu
In reply to: Jim Nasby (#12)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Normally I'm not for adding gucs that just gate new features. But I think a
simple guc to turn this on or off would be fine and alleviate any concerns.
I think users would appreciate it quite a lot

It would even have a positive effect of helping raise awareness of the
feature. I often scan the list of config options to get an idea of new
features when I'm installing new software or upgrading.

--
greg

#14Andres Freund
andres@2ndquadrant.com
In reply to: Greg Stark (#13)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On 2014-04-02 21:08:47 +0100, Greg Stark wrote:

Normally I'm not for adding gucs that just gate new features. But I think a
simple guc to turn this on or off would be fine and alleviate any concerns.
I think users would appreciate it quite a lot

I don't have strong feelings about the feature, but introducing a guc
for it feels entirely ridiculous to me. This is a minor detail in an
error message, not more.

It would even have a positive effect of helping raise awareness of the
feature. I often scan the list of config options to get an idea of new
features when I'm installing new software or upgrading.

Really? Should we now add GUCs for every feature then?

Greetings,

Andres Freund

PS: Could you please start to properly quote again? You seem to have
stopped doing that entirely in the last few months.

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#15Peter Geoghegan
pg@heroku.com
In reply to: Andres Freund (#14)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Apr 2, 2014 at 4:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:

I don't have strong feelings about the feature, but introducing a guc
for it feels entirely ridiculous to me. This is a minor detail in an
error message, not more.

I agree. It's just a HINT. It's quite helpful in certain particular
contexts, but in the grand scheme of things isn't all that important.
I am being quite conservative in trying to anticipate cases where on
balance it'll actually hurt more than it will help. I doubt that there
actually are any.

--
Peter Geoghegan


#16Greg Stark
stark@mit.edu
In reply to: Andres Freund (#14)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Apr 2, 2014 at 4:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:

PS: Could you please start to properly quote again? You seem to have
stopped doing that entirely in the last few months.

I've been responding a lot from the phone. Unfortunately the Gmail client
on the phone makes it nearly impossible to format messages well. I'm
beginning to think it would be better to just not quote at all any more.
I'm normally not doing a point-by-point response anyways.

--
greg

#17Andres Freund
andres@2ndquadrant.com
In reply to: Greg Stark (#16)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On 2014-04-03 00:48:12 -0400, Greg Stark wrote:

On Wed, Apr 2, 2014 at 4:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:

PS: Could you please start to properly quote again? You seem to have
stopped doing that entirely in the last few months.

I've been responding a lot from the phone. Unfortunately the Gmail client
on the phone makes it nearly impossible to format messages well. I'm
beginning to think it would be better to just not quote at all any more.
I'm normally not doing a point-by-point response anyways.

I really don't care where you're answering from TBH. It's unreadable,
misses context and that's it. If $device doesn't work for you, don't use
it.
I don't mind an occasional quick answer that's badly formatted, but for
other things it's really annoying.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#18Josh Berkus
josh@agliodbs.com
In reply to: Peter Geoghegan (#1)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On 04/02/2014 01:16 PM, Andres Freund wrote:

On 2014-04-02 21:08:47 +0100, Greg Stark wrote:

Normally I'm not for adding gucs that just gate new features. But I think a
simple guc to turn this on or off would be fine and alleviate any concerns.
I think users would appreciate it quite a lot

I don't have strong feelings about the feature, but introducing a guc
for it feels entirely ridiculous to me. This is a minor detail in an
error message, not more.

It would even have a positive effect of helping raise awareness of the
feature. I often scan the list of config options to get an idea of new
features when I'm installing new software or upgrading.

Really? Should we now add GUCs for every feature then?

-1 for having a GUC for this.

+1 on the feature.

Review with functional test coming up.

Question: How should we handle the issues with East Asian languages
(e.g. Japanese, Chinese) and this Hint? Should we just avoid hinting
for a selected list of languages which don't work well with levenshtein?
If so, how do we get that list?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#19Peter Geoghegan
pg@heroku.com
In reply to: Josh Berkus (#18)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Mon, Jun 16, 2014 at 4:04 PM, Josh Berkus <josh@agliodbs.com> wrote:

Question: How should we handle the issues with East Asian languages
(e.g. Japanese, Chinese) and this Hint? Should we just avoid hinting
for a selected list of languages which don't work well with levenshtein?
If so, how do we get that list?

I think that how useful Levenshtein distance is for users based in
east Asia generally, and how useful this patch is to those users are
two distinct questions. I have no idea how common it is for Japanese
users to just use Roman characters as table and attribute names. Since
they're very probably already writing application code that uses Roman
characters (except in the comments, user strings and so on), it might
make sense to do the same in the database. I would welcome further
input on that question. I don't know what the trends are in the real
world.

Also note that the patch scans the range table parse state to pick the
most probable candidate among all Vars/columns that already appear
there. The query would raise an error at an earlier point if a
non-existent relation was referenced, for example. We're only choosing
from a minimal list of possibilities, and pick one that is very
probably what was intended. Even if Levenshtein distance works badly
with Kanji (which is not obviously the case, at least to me), it might
not matter here.

--
Peter Geoghegan


#20Ian Barwick
ian@2ndquadrant.com
In reply to: Peter Geoghegan (#19)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On 14/06/17 8:31, Peter Geoghegan wrote:

On Mon, Jun 16, 2014 at 4:04 PM, Josh Berkus <josh@agliodbs.com> wrote:

Question: How should we handle the issues with East Asian languages
(e.g. Japanese, Chinese) and this Hint? Should we just avoid hinting
for a selected list of languages which don't work well with levenshtein?
If so, how do we get that list?

I think that how useful Levenshtein distance is for users based in
east Asia generally, and how useful this patch is to those users are
two distinct questions. I have no idea how common it is for Japanese
users to just use Roman characters as table and attribute names. Since
they're very probably already writing application code that uses Roman
characters (except in the comments, user strings and so on), it might
make sense to do the same in the database. I would welcome further
input on that question. I don't know what the trends are in the real
world.

From what I've seen in the wild in Japan, Roman/ASCII characters are
widely used for object/attribute names, as generally it's much less
hassle than switching between input methods, dealing with different
encodings etc. The only place where I've seen Japanese characters widely
used is in tutorials, examples etc. However that's only my personal
observation for one particular non-Roman language.

Regards

Ian Barwick

--
Ian Barwick http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#21Michael Paquier
michael.paquier@gmail.com
In reply to: Ian Barwick (#20)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Jun 17, 2014 at 9:30 AM, Ian Barwick <ian@2ndquadrant.com> wrote:

From what I've seen in the wild in Japan, Roman/ASCII characters are
widely used for object/attribute names, as generally it's much less
hassle than switching between input methods, dealing with different
encodings etc. The only place where I've seen Japanese characters widely
used is in tutorials, examples etc. However that's only my personal
observation for one particular non-Roman language.

And I agree with this remark; it's a PITA to manage database object
names with Japanese characters directly. I have seen some applications
define objects that way in the past, though not *that* many, I concur.
--
Michael


#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#21)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Michael Paquier <michael.paquier@gmail.com> writes:

On Tue, Jun 17, 2014 at 9:30 AM, Ian Barwick <ian@2ndquadrant.com> wrote:

From what I've seen in the wild in Japan, Roman/ASCII characters are
widely used for object/attribute names, as generally it's much less
hassle than switching between input methods, dealing with different
encodings etc. The only place where I've seen Japanese characters widely
used is in tutorials, examples etc. However that's only my personal
observation for one particular non-Roman language.

And I agree with this remark; it's a PITA to manage database object
names with Japanese characters directly. I have seen some applications
define objects that way in the past, though not *that* many, I concur.

What exactly is the rationale for thinking that Levenshtein distance is
useless in non-Roman alphabets? AFAIK it just counts insertions and
deletions of characters, which seems like a concept rather independent
of what those characters are.
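
For reference, a minimal sketch of the metric under discussion. It is written in Python rather than C and is not the fuzzystrmatch code; the sample names are made up. Note that the textbook Levenshtein distance counts single-character substitutions as well as insertions and deletions. The function only ever asks whether two characters are equal, which is the alphabet-independence being pointed at here.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance: insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = cur
    return prev[-1]


# The comparison is per character (Python code point), whatever the script.
print(levenshtein("orderids", "orderid"))  # one deletion
print(levenshtein("注文番号", "注文番"))      # one deletion
```

Python strings are sequences of code points, so a multibyte character counts as one unit in this sketch; a byte-oriented implementation would need to handle that explicitly.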

regards, tom lane


#23Ian Barwick
ian@2ndquadrant.com
In reply to: Tom Lane (#22)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On 14/06/17 9:53, Tom Lane wrote:

Michael Paquier <michael.paquier@gmail.com> writes:

On Tue, Jun 17, 2014 at 9:30 AM, Ian Barwick <ian@2ndquadrant.com> wrote:

From what I've seen in the wild in Japan, Roman/ASCII characters are
widely used for object/attribute names, as generally it's much less
hassle than switching between input methods, dealing with different
encodings etc. The only place where I've seen Japanese characters widely
used is in tutorials, examples etc. However that's only my personal
observation for one particular non-Roman language.

And I agree with this remark; it's a PITA to manage database object
names with Japanese characters directly. I have seen some applications
define objects that way in the past, though not *that* many, I concur.

What exactly is the rationale for thinking that Levenshtein distance is
useless in non-Roman alphabets? AFAIK it just counts insertions and
deletions of characters, which seems like a concept rather independent
of what those characters are.

With Japanese (which doesn't have an alphabet, but two syllabaries and
a bunch of logographic characters), Levenshtein distance is pretty useless
for examining similarities with words which can be written in either
syllabary (Michael's "ramen" example earlier in the thread); and when
catching "typos" caused by erroneous conversion from phonetic input to
characters - e.g. intending to input "成長" (seichou, growth) but
accidentally selecting "清聴" (seichou, courteous attention).

However in this particular use case, as long as it doesn't produce false
positives (I haven't looked at the patch) I don't think it would cause
any problems (of the kind which would require actively excluding certain
languages/character sets), it just wouldn't be quite as useful.
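
To make the two cases above concrete, here is a small sketch (Python, with a plain textbook Levenshtein function; the strings are illustrations chosen for this note, not taken from the patch or its tests). Homophones picked by mistake during input conversion share no characters, and a word written in the other syllabary shares almost none, so the distance is close to the full string length even though a human reader sees a near match.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


# Same reading (seichou), different kanji: both characters differ.
homophones = levenshtein("成長", "清聴")

# Same word in katakana and hiragana: only the long-vowel mark is shared.
syllabaries = levenshtein("ラーメン", "らーめん")

print(homophones, "of", len("成長"), "characters differ")
print(syllabaries, "of", len("ラーメン"), "characters differ")
```

In both cases the metric sees the strings as almost entirely unrelated, which matches the point that the hint would be less useful here rather than actively harmful.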

Regards

Ian Barwick

--
Ian Barwick http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#24Peter Geoghegan
pg@heroku.com
In reply to: Ian Barwick (#23)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Mon, Jun 16, 2014 at 7:09 PM, Ian Barwick <ian@2ndquadrant.com> wrote:

However in this particular use case, as long as it doesn't produce false
positives (I haven't looked at the patch) I don't think it would cause
any problems (of the kind which would require actively excluding certain
languages/character sets), it just wouldn't be quite as useful.

I'm not sure what you mean by false positives. The patch just shows a
HINT, where before there was none. It's possible for any number of
reasons that it isn't the most useful possible suggestion, since
Levenshtein distance is used as opposed to any other scheme that might
be better sometimes. I think that the hint given is a generally useful
piece of information in the event of an ERRCODE_UNDEFINED_COLUMN
error. Obviously I think the patch is worthwhile, but fundamentally
the HINT given is just a guess, as with the existing HINTs.

--
Peter Geoghegan


#25Ian Barwick
ian@2ndquadrant.com
In reply to: Peter Geoghegan (#24)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On 14/06/17 11:57, Peter Geoghegan wrote:

On Mon, Jun 16, 2014 at 7:09 PM, Ian Barwick <ian@2ndquadrant.com> wrote:

However in this particular use case, as long as it doesn't produce false
positives (I haven't looked at the patch) I don't think it would cause
any problems (of the kind which would require actively excluding certain
languages/character sets), it just wouldn't be quite as useful.

I'm not sure what you mean by false positives. The patch just shows a
HINT, where before there was none. It's possible for any number of
reasons that it isn't the most useful possible suggestion, since
Levenshtein distance is used as opposed to any other scheme that might
be better sometimes. I think that the hint given is a generally useful
piece of information in the event of an ERRCODE_UNDEFINED_COLUMN
error. Obviously I think the patch is worthwhile, but fundamentally
the HINT given is just a guess, as with the existing HINTs.

I mean, does it come up with a suggestion in every case, even if there is
no remotely similar column? E.g. would

SELECT foo FROM some_table

bring up column "bar" as a suggestion if "bar" is the only column in
the table?

Anyway, is there an up-to-date version of the patch available? The one from
March doesn't seem to apply cleanly to HEAD.

Thanks

Ian Barwick

--
Ian Barwick http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#24)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Peter Geoghegan <pg@heroku.com> writes:

On Mon, Jun 16, 2014 at 7:09 PM, Ian Barwick <ian@2ndquadrant.com> wrote:

However in this particular use case, as long as it doesn't produce false
positives (I haven't looked at the patch) I don't think it would cause
any problems (of the kind which would require actively excluding certain
languages/character sets), it just wouldn't be quite as useful.

I'm not sure what you mean by false positives. The patch just shows a
HINT, where before there was none. It's possible for any number of
reasons that it isn't the most useful possible suggestion, since
Levenshtein distance is used as opposed to any other scheme that might
be better sometimes. I think that the hint given is a generally useful
piece of information in the event of an ERRCODE_UNDEFINED_COLUMN
error. Obviously I think the patch is worthwhile, but fundamentally
the HINT given is just a guess, as with the existing HINTs.

Not having looked at the patch, but: I think the probability of
useless-noise HINTs could be substantially reduced if the code prints a
HINT only when there is a single available alternative that is clearly
better than the others in Levenshtein distance. I'm not sure how much
better is "clearly better", but I exclude "zero" from that. I see that
the original description of the patch says that it will arbitrarily
choose one alternative when there are several with equal Levenshtein
distance, and I'd say that's a bad idea.

You could possibly answer this objection by making the HINT list *all*
the alternatives meeting the minimum Levenshtein distance. But I think
that's probably overcomplicated and of uncertain value anyhow. I'd rather
have a rule that "we print only the choice that is at least K units better
than any other choice", where K remains to be determined exactly.
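
A rough sketch of that rule, in Python for brevity. The function name clearly_best, the (alias, column) tuples and the default K of 1 are all assumptions made for illustration; nothing here reflects the patch's actual code, and the value of K is left open above.

```python
from typing import Optional, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


Column = Tuple[str, str]  # (range table alias, column name); hypothetical shape


def clearly_best(typed: str, candidates: Sequence[Column],
                 k: int = 1) -> Optional[Column]:
    """Return a candidate only if it beats every other one by at least k edits."""
    scored = sorted((levenshtein(typed, col), alias, col)
                    for alias, col in candidates)
    if not scored:
        return None
    if len(scored) == 1:
        # No competitor. A separate cap on the distance would still be needed.
        return scored[0][1], scored[0][2]
    best, runner_up = scored[0], scored[1]
    if runner_up[0] - best[0] >= k:
        return best[1], best[2]
    return None  # a tie or near-tie: stay silent rather than pick arbitrarily


# One column is clearly closest, so it would be hinted.
print(clearly_best("orderids", [("ol", "orderid"), ("ol", "orderlineid")]))
# Two aliases expose the same column name: a tie, so no hint.
print(clearly_best("order_id", [("o", "orderid"), ("ol", "orderid")]))
```

With k=1 any exact tie suppresses the hint, which is the "exclude zero" condition; larger values of k demand a wider margin.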

regards, tom lane


#27Peter Geoghegan
pg@heroku.com
In reply to: Ian Barwick (#25)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Mon, Jun 16, 2014 at 8:38 PM, Ian Barwick <ian@2ndquadrant.com> wrote:

I mean, does it come up with a suggestion in every case, even if there is
no remotely similar column? E.g. would

SELECT foo FROM some_table

bring up column "bar" as a suggestion if "bar" is the only column in
the table?

Yes, it would, but I think that's the correct behavior.

Anyway, is there an up-to-date version of the patch available? The one from
March doesn't seem to apply cleanly to HEAD.

Are you sure? I think it might just be that patch is confused about
the deleted file contrib/fuzzystrmatch/levenshtein.c.

--
Peter Geoghegan


#28Peter Geoghegan
pg@heroku.com
In reply to: Tom Lane (#26)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Mon, Jun 16, 2014 at 8:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Not having looked at the patch, but: I think the probability of
useless-noise HINTs could be substantially reduced if the code prints a
HINT only when there is a single available alternative that is clearly
better than the others in Levenshtein distance. I'm not sure how much
better is "clearly better", but I exclude "zero" from that. I see that
the original description of the patch says that it will arbitrarily
choose one alternative when there are several with equal Levenshtein
distance, and I'd say that's a bad idea.

I disagree. I happen to think that making some guess is better than no
guess at all here, given the fact that there aren't too many
possibilities to choose from. I think that it might be particularly
annoying to not show some suggestion in the event of a would-be
ambiguous column reference where the column name is itself wrong,
since both mistakes are common. For example, "order_id" was specified
instead of one of either "o.orderid" or "ol.orderid", as in my
original examples. If some correct alias was specified, that would
make the new code prefer the appropriate Var, but it might not be, and
that should be okay in my view.

I'm not trying to remove the need for human judgement here. We've all
heard stories about people who did things like input "Portland" into
their GPS only to end up in Maine rather than Oregon, but I think in
general you can only go so far in worrying about those cases.

--
Peter Geoghegan


#29Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#28)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Jun 17, 2014 at 12:51 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Mon, Jun 16, 2014 at 8:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Not having looked at the patch, but: I think the probability of
useless-noise HINTs could be substantially reduced if the code prints a
HINT only when there is a single available alternative that is clearly
better than the others in Levenshtein distance. I'm not sure how much
better is "clearly better", but I exclude "zero" from that. I see that
the original description of the patch says that it will arbitrarily
choose one alternative when there are several with equal Levenshtein
distance, and I'd say that's a bad idea.

I disagree. I happen to think that making some guess is better than no
guess at all here, given the fact that there aren't too many
possibilities to choose from. I think that it might be particularly
annoying to not show some suggestion in the event of a would-be
ambiguous column reference where the column name is itself wrong,
since both mistakes are common. For example, "order_id" was specified
instead of one of either "o.orderid" or "ol.orderid", as in my
original examples. If some correct alias was specified, that would
make the new code prefer the appropriate Var, but it might not be, and
that should be okay in my view.

I'm not trying to remove the need for human judgement here. We've all
heard stories about people who did things like input "Portland" into
their GPS only to end up in Maine rather than Oregon, but I think in
general you can only go so far in worrying about those cases.

Emitting a suggestion with a large distance seems like it could be
rather irritating. If the user types in SELECT prodct_id FROM orders,
and that column does not exist, suggesting "product_id", if such a
column exists, will likely be well-received. Suggesting a column
named, say, "price", however, will likely make at least some users say
"no I didn't mean that you stupid @%!#" - because probably the issue
there is that the user selected from the completely wrong table,
rather than getting 6 of the 9 characters they typed incorrect.

One existing tool that does something along these lines is 'git',
which seems to have some kind of a heuristic to know when to give up:

[rhaas pgsql]$ git gorp
git: 'gorp' is not a git command. See 'git --help'.

Did you mean this?
grep
[rhaas pgsql]$ git goop
git: 'goop' is not a git command. See 'git --help'.

Did you mean this?
grep
[rhaas pgsql]$ git good
git: 'good' is not a git command. See 'git --help'.
[rhaas pgsql]$ git puma
git: 'puma' is not a git command. See 'git --help'.

Did you mean one of these?
pull
push

I suspect that the maximum useful distance is a function of the string
length. Certainly, if the distance is greater than or equal to the
length of one of the strings involved, it's just a totally unrelated
string and thus not worth suggesting. A useful heuristic might be
something like "distance at most 3, or at most half the string length,
whichever is less".
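
The heuristic just described could be sketched as follows (Python for brevity). Two details are assumptions of this sketch, since the text leaves them open: "the string length" is taken to be the length of what the user typed, and "half" is not rounded. The function name worth_suggesting is made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def worth_suggesting(typed: str, candidate: str) -> bool:
    """Distance at most 3, or at most half the typed length, whichever is less."""
    distance = levenshtein(typed, candidate)
    # 2 * distance <= len(typed) is "at most half" without any rounding.
    return distance <= 3 and 2 * distance <= len(typed)


for typed, candidate in [("prodct_id", "product_id"),
                         ("prodct_id", "price"),
                         ("gorp", "grep"),
                         ("good", "grep"),
                         ("foo", "bar")]:
    verdict = "suggest" if worth_suggesting(typed, candidate) else "stay silent"
    print(f"{typed!r} -> {candidate!r}: {verdict}")
```

Under these assumptions "product_id" is offered for "prodct_id" while "price" is not, and "grep" is offered for "gorp" but not for "good". That happens to line up with the git session above, though this says nothing about what git actually does internally.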

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#29)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Jun 17, 2014 at 12:51 AM, Peter Geoghegan <pg@heroku.com> wrote:

I disagree. I happen to think that making some guess is better than no
guess at all here, given the fact that there aren't too many
possibilities to choose from.

Emitting a suggestion with a large distance seems like it could be
rather irritating. If the user types in SELECT prodct_id FROM orders,
and that column does not exist, suggesting "product_id", if such a
column exists, will likely be well-received. Suggesting a column
named, say, "price", however, will likely make at least some users say
"no I didn't mean that you stupid @%!#" - because probably the issue
there is that the user selected from the completely wrong table,
rather than getting 6 of the 9 characters they typed incorrect.

Yeah, that's my point exactly. There's no very good reason to assume that
the intended answer is in fact among the set of column names we can see;
and if it *is* there, the Levenshtein distance to it isn't going to be
all that large. I think that suggesting "foobar" when the user typed
"glorp" is not only not helpful, but makes us look like idiots.

One existing tool that does something along these lines is 'git',
which seems to have some kind of a heuristic to know when to give up:

I wouldn't necessarily hold up git as a model of user interface
engineering ;-) ... but still, it might be interesting to take a look
at exactly what heuristics they used here. I'm sure there are other
precedents we could look at, too.

regards, tom lane


#31Kevin Grittner
kgrittn@ymail.com
In reply to: Tom Lane (#30)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wouldn't necessarily hold up git as a model of user interface
engineering ;-) ... but still, it might be interesting to take a
look at exactly what heuristics they used here.  I'm sure there
are other precedents we could look at, too.

On my Ubuntu machine, bash does something similar.  A few examples
chosen completely arbitrarily:

kgrittn@Kevin-Desktop:~$ got
No command 'got' found, did you mean:
 Command 'go' from package 'golang-go' (universe)
 Command 'gout' from package 'scotch' (universe)
 Command 'jot' from package 'athena-jot' (universe)
 Command 'go2' from package 'go2' (universe)
 Command 'git' from package 'git' (main)
 Command 'gpt' from package 'gpt' (universe)
 Command 'gom' from package 'gom' (universe)
 Command 'goo' from package 'goo' (universe)
 Command 'gst' from package 'gnu-smalltalk' (universe)
 Command 'dot' from package 'graphviz' (main)
 Command 'god' from package 'god' (universe)
 Command 'god' from package 'ruby-god' (universe)
got: command not found
kgrittn@Kevin-Desktop:~$ groupad
No command 'groupad' found, did you mean:
 Command 'groupadd' from package 'passwd' (main)
 Command 'groupd' from package 'cman' (main)
groupad: command not found
kgrittn@Kevin-Desktop:~$ asdf
No command 'asdf' found, did you mean:
 Command 'asdfg' from package 'aoeui' (universe)
 Command 'sadf' from package 'sysstat' (main)
 Command 'sdf' from package 'sdf' (universe)
asdf: command not found
kgrittn@Kevin-Desktop:~$ zxcv
zxcv: command not found

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#32Josh Berkus
josh@agliodbs.com
In reply to: Peter Geoghegan (#1)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On 06/17/2014 01:59 PM, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Emitting a suggestion with a large distance seems like it could be
rather irritating. If the user types in SELECT prodct_id FROM orders,
and that column does not exist, suggesting "product_id", if such a
column exists, will likely be well-received. Suggesting a column
named, say, "price", however, will likely make at least some users say
"no I didn't mean that you stupid @%!#" - because probably the issue
there is that the user selected from the completely wrong table,
rather than getting 6 of the 9 characters they typed incorrect.

Yeah, that's my point exactly. There's no very good reason to assume that
the intended answer is in fact among the set of column names we can see;
and if it *is* there, the Levenshtein distance to it isn't going to be
all that large. I think that suggesting "foobar" when the user typed
"glorp" is not only not helpful, but makes us look like idiots.

Well, there are two different issues:

(1) offering a suggestion which is too different from what the user
typed. This is easily limited by having a max distance (most likely a
distance/length ratio, with a max of say, 0.5). The only drawback of
this would be the extra cpu cycles to calculate it, and some arguments
about what the max distance should be. But for the sake of the
children, let's not have a GUC for it.

(2) If there are multiple columns with the same Levenshtein distance,
which one do you suggest? The current code picks a random one, which
I'm OK with. The other option would be to list all of the columns.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#33Peter Geoghegan
pg@heroku.com
In reply to: Tom Lane (#30)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Jun 17, 2014 at 1:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah, that's my point exactly. There's no very good reason to assume that
the intended answer is in fact among the set of column names we can see;
and if it *is* there, the Levenshtein distance to it isn't going to be
all that large. I think that suggesting "foobar" when the user typed
"glorp" is not only not helpful, but makes us look like idiots.

Maybe that's just a matter of phrasing the message appropriately. A
more guarded message, that suggests that "foobar" is the *best* match
is correct at least on its own terms (terms that are self evident).
This does pretty effectively communicate to the user that they should
totally rethink not just the column name, but perhaps the entire
query. On the other hand, showing nothing communicates nothing.

--
Peter Geoghegan


#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#32)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Josh Berkus <josh@agliodbs.com> writes:

(2) If there are multiple columns with the same Levenshtein distance,
which one do you suggest? The current code picks a random one, which
I'm OK with. The other option would be to list all of the columns.

I objected to that upthread. I don't think that picking a random one is
sane at all. Listing them all might be OK (I notice that that seems to be
what both bash and git do).

Another issue is whether to print only those having exactly the minimum
observed Levenshtein distance, or to print everything less than some
cutoff. The former approach seems to me to be placing a great deal of
faith in something that's only a heuristic.
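
The two policies can be compared side by side with a small sketch (Python; the function names are invented, and the cutoff is treated as inclusive, which is an assumption). The sample names are borrowed from the bash output earlier in the thread purely as test strings.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def minimum_matches(typed: str, names: Sequence[str]) -> List[str]:
    """Policy 1: every name that shares the smallest observed distance."""
    scored = [(levenshtein(typed, name), name) for name in names]
    if not scored:
        return []
    best = min(distance for distance, _ in scored)
    return [name for distance, name in scored if distance == best]


def matches_within(typed: str, names: Sequence[str], cutoff: int) -> List[str]:
    """Policy 2: every name whose distance does not exceed the cutoff."""
    return [name for name in names if levenshtein(typed, name) <= cutoff]


names = ["go", "git", "gout", "dot", "grep"]
print(minimum_matches("got", names))    # the four names one edit away
print(matches_within("got", names, 3))  # also admits "grep" at distance 3
```

The first policy drops anything even one edit worse than the best match, however close it is; the second keeps it but grows quickly as the cutoff rises.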

regards, tom lane


#35Josh Berkus
josh@agliodbs.com
In reply to: Peter Geoghegan (#1)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On 06/17/2014 02:36 PM, Tom Lane wrote:

Josh Berkus <josh@agliodbs.com> writes:

(2) If there are multiple columns with the same Levenshtein distance,
which one do you suggest? The current code picks a random one, which
I'm OK with. The other option would be to list all of the columns.

I objected to that upthread. I don't think that picking a random one is
sane at all. Listing them all might be OK (I notice that that seems to be
what both bash and git do).

Another issue is whether to print only those having exactly the minimum
observed Levenshtein distance, or to print everything less than some
cutoff. The former approach seems to me to be placing a great deal of
faith in something that's only a heuristic.

Well, that depends on what the cutoff is. If it's high, like 0.5, that
could be a LOT of columns. Like, I plan to test this feature with a
3-table join that has a combined 300 columns. I can completely imagine
coming up with a string which is within 0.5 or even 0.3 of 40 column names.

So if we want to list everything below a cutoff, we'd need to make that
cutoff fairly narrow, like 0.2. But that means we'd miss a lot of
potential matches on short column names.

I really think we're overthinking this: it is just a HINT, and we can
improve it in future PostgreSQL versions, and most of our users will
ignore it anyway because they'll be using a client which doesn't display
HINTs.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#33)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Peter Geoghegan <pg@heroku.com> writes:

Maybe that's just a matter of phrasing the message appropriately. A
more guarded message, that suggests that "foobar" is the *best* match
is correct at least on its own terms (terms that are self evident).
This does pretty effectively communicate to the user that they should
totally rethink not just the column name, but perhaps the entire
query. On the other hand, showing nothing communicates nothing.

I don't especially buy that argument. As soon as the user's gotten used
to hints of this sort, the absence of a hint communicates plenty.

In any case, people have now cited two different systems with suggestion
capability, and neither of them behaves as you're arguing for. The lack
of precedent should give you pause, unless you can point to widely-used
systems that do what you have in mind.

regards, tom lane


#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#35)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Josh Berkus <josh@agliodbs.com> writes:

On 06/17/2014 02:36 PM, Tom Lane wrote:

Another issue is whether to print only those having exactly the minimum
observed Levenshtein distance, or to print everything less than some
cutoff. The former approach seems to me to be placing a great deal of
faith in something that's only a heuristic.

Well, that depends on what the cutoff is. If it's high, like 0.5, that
could be a LOT of columns. Like, I plan to test this feature with a
3-table join that has a combined 300 columns. I can completely imagine
coming up with a string which is within 0.5 or even 0.3 of 40 column names.

I think Levenshtein distances are integers, though that's just a minor
point.

So if we want to list everything below a cutoff, we'd need to make that
cutoff fairly narrow, like 0.2. But that means we'd miss a lot of
potential matches on short column names.

I'm not proposing an immutable cutoff. Something that scales with the
string length might be a good idea, or we could make it a multiple of
the minimum observed distance, or probably there are a dozen other things
we could do. I'm just saying that if we have an alternative at distance
3, and another one at distance 4, it's not clear to me that we should
assume that the first one is certainly what the user had in mind.
Especially not if all the other alternatives are distance 10 or more.
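
A minimal sketch of the "multiple of the minimum observed distance" idea
mentioned above, written in Python rather than the patch's C. The function
name and the factor of 1.5 are assumptions made up for illustration; nothing
in the thread proposes specific values.

```python
def within_relative_cutoff(distances, factor=1.5):
    """Keep every candidate whose distance is at most factor * the minimum.

    distances maps a column name to its (integer) Levenshtein distance.
    factor is a hypothetical tuning knob, not a value from the thread.
    """
    if not distances:
        return []
    cutoff = min(distances.values()) * factor
    return sorted(name for name, d in distances.items() if d <= cutoff)


# Alternatives at distance 3 and 4 both survive; those at 10 or more do not.
candidates = {"alpha": 3, "beta": 4, "gamma": 10, "delta": 12}
print(within_relative_cutoff(candidates))
```

With a rule of this shape the candidate at distance 4 is not discarded merely
because another one happens to sit at distance 3.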

I really think we're overthinking this: it is just a HINT, and we can
improve it in future PostgreSQL versions, and most of our users will
ignore it anyway because they'll be using a client which doesn't display
HINTs.

Agreed that we can make it better later. But whether it prints exactly
one suggestion, and whether it does that no matter how silly the
suggestion is, are rather fundamental decisions.

regards, tom lane


#38Josh Berkus
josh@agliodbs.com
In reply to: Peter Geoghegan (#1)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On 06/17/2014 02:53 PM, Tom Lane wrote:

Josh Berkus <josh@agliodbs.com> writes:

On 06/17/2014 02:36 PM, Tom Lane wrote:

Another issue is whether to print only those having exactly the minimum
observed Levenshtein distance, or to print everything less than some
cutoff. The former approach seems to me to be placing a great deal of
faith in something that's only a heuristic.

Well, that depends on what the cutoff is. If it's high, like 0.5, that
could be a LOT of columns. Like, I plan to test this feature with a
3-table join that has a combined 300 columns. I can completely imagine
coming up with a string which is within 0.5 or even 0.3 of 40 column names.

I think Levenshtein distances are integers, though that's just a minor
point.

I was giving distance/length ratios. That is, 0.5 would mean that up to
50% of the characters could be replaced/changed. 0.2 would mean that
only one character could be changed at lengths of five characters. Etc.

The problem with these ratios is that they behave differently with long
strings than short ones. I think realistically we'd need a double
threshold, i.e. ( distance <= 2 OR ratio <= 0.4 ). Otherwise the
obvious case, getting two characters wrong in a 4-character column name
(or one in a two character name), doesn't get a HINT.
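
A rough Python sketch of such a double threshold, reading the absolute test
as "a distance of at most 2", which is what the two examples require. The
function name is hypothetical, and which string's length to divide by is an
assumption (the thread does not say); the comparison uses integer arithmetic
so that a ratio of exactly 0.4 is not lost to float rounding.

```python
def passes_double_threshold(distance, length, max_distance=2, max_ratio_pct=40):
    """Accept a candidate that is close in absolute OR relative terms.

    distance: integer Levenshtein distance to the candidate.
    length:   length of the name being compared (which name is an assumption).
    """
    if length <= 0:
        return False
    close_absolute = distance <= max_distance
    # Same as distance / length <= 0.40, but without floating point.
    close_relative = distance * 100 <= length * max_ratio_pct
    return close_absolute or close_relative
```

Two wrong characters in a 4-character name and one wrong character in a
2-character name both pass on the absolute test, while a distance of 4 in a
10-character name passes on the ratio alone.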

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#39Peter Geoghegan
pg@heroku.com
In reply to: Tom Lane (#37)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Jun 17, 2014 at 2:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm not proposing an immutable cutoff. Something that scales with the
string length might be a good idea, or we could make it a multiple of
the minimum observed distance, or probably there are a dozen other things
we could do. I'm just saying that if we have an alternative at distance
3, and another one at distance 4, it's not clear to me that we should
assume that the first one is certainly what the user had in mind.
Especially not if all the other alternatives are distance 10 or more.

The patch just looks for the match with the lowest distance, passing
the lowest observed distance so far as a "max" to the distance
calculation function. That could have some value in certain cases.
People have already raised general concerns about added cycles and/or
clutter.
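
To make the "max" idea concrete, here is a small Python sketch, not the
patch's C code. bounded_levenshtein() and closest_column() are hypothetical
names; the bounded function follows the contract documented for
fuzzystrmatch's levenshtein_less_equal (exact distance when it is within the
bound, otherwise some value above the bound), but the early-exit strategy
shown is a simplification.

```python
def bounded_levenshtein(source, target, max_d):
    """Return the edit distance if it is <= max_d, otherwise max_d + 1."""
    # The length difference is a lower bound on the distance.
    if abs(len(source) - len(target)) > max_d:
        return max_d + 1
    prev = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        curr = [i]
        for j, t_ch in enumerate(target, start=1):
            cost = 0 if s_ch == t_ch else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        # The final distance can never be lower than the smallest value in a row.
        if min(curr) > max_d:
            return max_d + 1
        prev = curr
    return prev[-1] if prev[-1] <= max_d else max_d + 1


def closest_column(typed, columns):
    """Scan columns, passing the best distance so far as the bound."""
    best_name = None
    # Start above any possible distance so the first column always qualifies.
    best_d = max([len(typed)] + [len(c) for c in columns]) + 1
    for col in columns:
        d = bounded_levenshtein(typed, col, best_d)
        if d < best_d:
            best_name, best_d = col, d
    return best_name, (best_d if best_name is not None else None)
```

Once a candidate at distance 1 has been seen, every later comparison can be
abandoned as soon as it is clear the distance exceeds 1, which is where the
saving in cycles would come from.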

--
Peter Geoghegan


#40Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Peter Geoghegan (#39)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On 18/06/14 10:05, Peter Geoghegan wrote:

On Tue, Jun 17, 2014 at 2:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm not proposing an immutable cutoff. Something that scales with the
string length might be a good idea, or we could make it a multiple of
the minimum observed distance, or probably there are a dozen other things
we could do. I'm just saying that if we have an alternative at distance
3, and another one at distance 4, it's not clear to me that we should
assume that the first one is certainly what the user had in mind.
Especially not if all the other alternatives are distance 10 or more.

The patch just looks for the match with the lowest distance, passing
the lowest observed distance so far as a "max" to the distance
calculation function. That could have some value in certain cases.
People have already raised general concerns about added cycles and/or
clutter.

How about a list of misspellings and their likely targets, a
(grop, grap, ...) ==> (grep, grape, grope, ...)
type of thing? Possibly with some kind of adaptive learning algorithm.

I suspect that while this might be a useful research project, it is out
of scope for the current discussion!

Cheers,
Gavin


#41Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#34)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Jun 17, 2014 at 5:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Josh Berkus <josh@agliodbs.com> writes:

(2) If there are multiple columns with the same Levenshtein distance,
which one do you suggest? The current code picks a random one, which
I'm OK with. The other option would be to list all of the columns.

I objected to that upthread. I don't think that picking a random one is
sane at all. Listing them all might be OK (I notice that that seems to be
what both bash and git do).

What bash does is annoying and stupid, and any time I find a system
with that obnoxious behavior enabled I immediately disable it, so I
don't consider that a good precedent for anything. I think what the
bash algorithm demonstrates is that while it may be sane to list more
than one option, listing 10 or 20 or 150 is unbearably obnoxious.
Filling the user's *entire terminal window* with a list of suggestions
when they make a minor typo is more like a punishment than an aid.
git's behavior of limiting itself to one or two options, while
somewhat useless, is at least not annoying.

Another issue is whether to print only those having exactly the minimum
observed Levenshtein distance, or to print everything less than some
cutoff. The former approach seems to me to be placing a great deal of
faith in something that's only a heuristic.

Well, we've got lots of heuristics. Many of them serve us quite well.
I might do something like this:

(1) Set the maximum levenshtein distance to half the length of the
string, rounded down, but not more than 3.
(2) If there are more than 2 matches, reduce the maximum distance by 1
and repeat this step.
(3) If there are no remaining matches, print no hint; else print the 1
or 2 matching items.
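
The three steps above can be sketched in Python as follows. This is only an
illustration under stated assumptions: "the string" is taken to be the name
the user typed, and levenshtein() and suggest() are hypothetical helper names
rather than anything from the patch.

```python
def levenshtein(a, b):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def suggest(typed, columns):
    """Return 0, 1 or 2 column names to mention in the HINT."""
    # Step 1: half the length, rounded down, but not more than 3.
    max_d = min(len(typed) // 2, 3)
    distances = [(col, levenshtein(typed, col)) for col in columns]
    while True:
        matches = [col for col, d in distances if d <= max_d]
        # Step 3: with at most 2 matches we are done (possibly with none).
        if len(matches) <= 2:
            return matches
        # Step 2: too many matches, so tighten the bound and retry.
        max_d -= 1
```

Under this rule three equally distant candidates cancel each other out: the
bound shrinks until none of them qualifies and no hint is printed.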

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#42Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#41)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Jun 17, 2014 at 5:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:

What bash does is annoying and stupid, and any time I find a system
with that obnoxious behavior enabled I immediately disable it, so I
don't consider that a good precedent for anything.

I happen to totally agree with you here. Bash sometimes does awful
things with its completion.

Another issue is whether to print only those having exactly the minimum
observed Levenshtein distance, or to print everything less than some
cutoff. The former approach seems to me to be placing a great deal of
faith in something that's only a heuristic.

Well, we've got lots of heuristics. Many of them serve us quite well.
I might do something like this:

(1) Set the maximum levenshtein distance to half the length of the
string, rounded down, but not more than 3.
(2) If there are more than 2 matches, reduce the maximum distance by 1
and repeat this step.
(3) If there are no remaining matches, print no hint; else print the 1
or 2 matching items.

I could do that. I can prepare a revision if others feel that's
acceptable. My only concern with this is that a more sophisticated
scheme implies more clutter in the parser, although it should not
imply wasted cycles.

What I particularly wanted to avoid in our choice of completion scheme
is doing nothing because there is an ambiguity about what is best,
which Tom suggested. In practice, that ambiguity will frequently be
something that our users will not care about, and not really see as an
ambiguity, as in my "o.orderid or ol.orderid?" example. However, if
there are 3 equally distant Vars, and not just 2, that's very probably
because none are useful, and so we really ought to show nothing. This
seems most sensible.

--
Peter Geoghegan


#43Abhijit Menon-Sen
ams@2ndQuadrant.com
In reply to: Peter Geoghegan (#42)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

So, what's the status of this patch?

There's been quite a lot of discussion (though only about the approach;
no formal code/usage review has yet been posted), but as far as I can
tell, it just tapered off without any particular consensus.

-- Abhijit


#44Tom Lane
tgl@sss.pgh.pa.us
In reply to: Abhijit Menon-Sen (#43)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Abhijit Menon-Sen <ams@2ndQuadrant.com> writes:

So, what's the status of this patch?
There's been quite a lot of discussion (though only about the approach;
no formal code/usage review has yet been posted), but as far as I can
tell, it just tapered off without any particular consensus.

AFAICT, people generally agree that this would probably be useful,
but there's not consensus on how far the code should be willing to
"reach" for a match, nor on what to do when there are multiple
roughly-equally-plausible candidates.

Although printing all candidates seems to be what's preferred by
existing systems with similar facilities, I can see the point that
constructing the message in a translatable fashion might be difficult.
So personally I'd be willing to abandon insistence on that. I still
think though that printing candidates with very large distances
would be unhelpful.

regards, tom lane


#45Peter Geoghegan
pg@heroku.com
In reply to: Tom Lane (#44)
1 attachment(s)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Sun, Jun 29, 2014 at 7:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Although printing all candidates seems to be what's preferred by
existing systems with similar facilities, I can see the point that
constructing the message in a translatable fashion might be difficult.
So personally I'd be willing to abandon insistence on that. I still
think though that printing candidates with very large distances
would be unhelpful.

Attached revision factors in everyone's concerns here, I think.

I've addressed your concern about the closeness of the match proposed
in the HINT - the absolute as opposed to relative quality of the
match. There is a normalized distance threshold that must always be
exceeded to prevent ludicrous suggestions. This works along similar
lines to those sketched by Robert. Furthermore, I've made it
occasionally possible to see 2 suggestions, when they're equally
distant and when each suggestion comes from a different range table
entry. However, if the two best suggestions (overall or within an RTE)
come from within the same RTE, then that RTE is ignored for the
purposes of picking a suggestion (although the lowest observed
distance from an ignored RTE may still be used as the distance for
later RTEs to beat to get their attributes suggested in the HINT).

The idea here is that this quality-bar for suggestions doesn't come at
the cost of ignoring my concern about the presumably somewhat common
case where there is an unqualified and therefore ambiguous column
reference that happens to also be misspelled. An ambiguous column
reference and an incorrectly spelled column name are both very common,
and so it seems likely that momentary lapses where the user gets both
things wrong at once are also common. We do all this without going
overboard, since as outlined by Robert, when there are 3 or more
equally distant candidates (even if they all come from different
RTEs), we give no HINT at all. The big picture here is to make mental
context switches cheap when writing ad-hoc queries in psql.

A lot of the HINTs that popped up in the regression tests that seemed
kind of questionable no longer appear. These new measures make the
coding somewhat more complex than that of the initial version,
although overall the parser code added by this patch is almost
entirely confined to code paths concerned only with producing
diagnostic messages to help users.

--
Peter Geoghegan

Attachments:

levenshtein_column_hint.v2.2014_07_02.patch.gz (application/x-gzip)

#46Abhijit Menon-Sen
ams@2ndQuadrant.com
In reply to: Peter Geoghegan (#45)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

At 2014-07-02 15:51:08 -0700, pg@heroku.com wrote:

Attached revision factors in everyone's concerns here, I think.

Is anyone planning to review Peter's revised patch?

These new measures make the coding somewhat more complex than that of
the initial version, although overall the parser code added by this
patch is almost entirely confined to code paths concerned only with
producing diagnostic messages to help users.

Yes, the new patch looks quite a bit more involved than earlier, but if
that's what it takes to provide a useful HINT, I guess it's not too bad.

-- Abhijit


#47Michael Paquier
michael.paquier@gmail.com
In reply to: Abhijit Menon-Sen (#46)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Sat, Jul 5, 2014 at 12:46 AM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:

At 2014-07-02 15:51:08 -0700, pg@heroku.com wrote:

Attached revision factors in everyone's concerns here, I think.

Is anyone planning to review Peter's revised patch?

I have been doing some functional tests, and looked quickly at the
code to understand what it does:
1) Compiles without warnings, passes regression tests
2) The checking process goes through all the existing columns of a
relation even if a difference of 1 with some other column(s) has already
been found. As we try to limit the number of hints returned, this
seems like a waste of resources.
3) distanceName could be improved, for example by checking the string
lengths of the target and source columns and immediately rejecting the
match if, for example, the source string is double or half the length
of the target.
4) This is not nice; would it be possible to remove this from varlena.c?
+/* Expand each Levenshtein distance variant */
+#include "levenshtein.c"
+#define LEVENSHTEIN_LESS_EQUAL
+#include "levenshtein.c"
+#undef LEVENSHTEIN_LESS_EQUAL
Part of the same comment: only varstr_leven_less_equal is used to
calculate the distance, so should we really move varstr_leven to core?
This clearly needs to be reworked rather than just copy-pasted from
fuzzystrmatch.
The flag LEVENSHTEIN_LESS_EQUAL should be left within fuzzystrmatch, I think.
5) Do we want hints on system columns as well? For example, here we
could get tableoid as a column hint:
=# select tablepid from foo;
ERROR:  42703: column "tablepid" does not exist
LINE 1: select tablepid from foo;
               ^
LOCATION:  errorMissingColumn, parse_relation.c:3123
Time: 0.425 ms
6) Sometimes no hints are returned... Even in simple cases like this one:
=# create table foo (aa int, bb int);
CREATE TABLE
=# select ab from foo;
ERROR:  42703: column "ab" does not exist
LINE 1: select ab from foo;
               ^
LOCATION:  errorMissingColumn, parse_relation.c:3123
7) Performance penalty with a table with 1600 columns:
=# CREATE FUNCTION create_long_table(tabname text, columns int)
RETURNS void
LANGUAGE plpgsql
as $$
declare
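  query text;
begin
  -- Columns are named col0 .. col<columns>; the loop bounds are inclusive,
  -- so passing 1599 creates 1600 columns. The loop variable "count" is
  -- declared implicitly by the FOR statement.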
  query := 'CREATE TABLE ' || tabname || ' (';
  for count in 0..columns loop
    query := query || 'col' || count ||  ' int';
    if count <> columns then
      query := query || ', ';
    end if;
  end loop;
  query := query || ')';
  execute query;
end;
$$;
=# SELECT create_long_table('aa', 1599);
 create_long_table
-------------------

(1 row)
I then tested queries like this one: SELECT col888a FROM aa;
Patched version: 2.100 ms~2.200 ms
master branch (6048896): 0.956 ms~0.990 ms
So the performance impact seems limited.

Regards,
--
Michael


#48Peter Geoghegan
pg@heroku.com
In reply to: Michael Paquier (#47)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Hi,

On Tue, Jul 8, 2014 at 6:58 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

2) The checking process goes through all the existing columns of a
relation even if a difference of 1 with some other column(s) has already
been found. As we try to limit the number of hints returned, this
seems like a waste of resources.

In general it's possible that an exact match will later be found
within the RTE, and exact matches don't have to pay the "wrong alias"
penalty, and are immediately returned. It is therefore not a waste of
resources, but even if it was that would be pretty inconsequential as
your benchmark shows.

3) distanceName could be improved, for example by checking the string
lengths of the target and source columns and immediately rejecting the
match if, for example, the source string is double or half the length
of the target.

I don't think it's a good idea to tie distanceName() to the ultimate
behavior of errorMissingColumn() hinting, since there may be other
callers in the future. Besides, that isn't going to help much.
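
For reference, the length check suggested in point 3 could be sketched like
this in Python. length_prefilter() is a hypothetical name and the "double or
half" ratio is taken literally from the suggestion; the only fact relied on
is that the difference in lengths is a lower bound on the Levenshtein
distance.

```python
def length_prefilter(source, target):
    """Return True if the pair is worth a full distance computation.

    A pair is rejected when one name is at least twice as long as the
    other, since the distance is then at least half the longer length.
    """
    s, t = len(source), len(target)
    if s == 0 or t == 0:
        return False
    return s < 2 * t and t < 2 * s
```

One consequence of a pure ratio test is that very short names are rejected
eagerly: "a" against "ab" fails the check even though the distance is 1.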

4) This is not nice; would it be possible to remove this from varlena.c?
+/* Expand each Levenshtein distance variant */
+#include "levenshtein.c"
+#define LEVENSHTEIN_LESS_EQUAL
+#include "levenshtein.c"
+#undef LEVENSHTEIN_LESS_EQUAL
Part of the same comment: only varstr_leven_less_equal is used to
calculate the distance, so should we really move varstr_leven to core?
This clearly needs to be reworked rather than just copy-pasted from
fuzzystrmatch.
The flag LEVENSHTEIN_LESS_EQUAL should be left within fuzzystrmatch, I think.

So there'd be one variant within core and one within
contrib/fuzzystrmatch? I don't think that's an improvement.

5) Do we want hints on system columns as well?

I think it's obvious that the answer must be no. That's going to
frequently result in suggestions of columns that users will complain
aren't even there. If you know about the system columns, you can just
get it right. They're supposed to be hidden for most purposes.

6) Sometimes no hints are returned... Even in simple cases like this one:
=# create table foo (aa int, bb int);
CREATE TABLE
=# select ab from foo;
ERROR: 42703: column "ab" does not exist
LINE 1: select ab from foo;
^
LOCATION: errorMissingColumn, parse_relation.c:3123

That's because those two candidates come from a single RTE and have an
equal distance -- you'd see both suggestions if you joined two tables
with each candidate, assuming that each table being joined didn't
individually have the same issue. I think that that's probably
considered the correct behavior by most.
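
The tie in this example is easy to verify with a short Python sketch; the
helper names are hypothetical and the code only computes distances, it does
not reproduce the patch's RTE handling.

```python
def levenshtein(a, b):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def best_candidates(typed, columns):
    """Return (lowest distance, all columns at that distance)."""
    if not columns:
        return None, []
    distances = {col: levenshtein(typed, col) for col in columns}
    best = min(distances.values())
    return best, [col for col in columns if distances[col] == best]


# "ab" is one substitution away from both "aa" and "bb".
print(best_candidates("ab", ["aa", "bb"]))
```

Both columns of foo come back at distance 1, which is the single-RTE tie
that leads to no HINT being shown.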

--
Peter Geoghegan


#49Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Geoghegan (#48)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Peter Geoghegan wrote:

6) Sometimes no hints are returned... Even in simple cases like this one:
=# create table foo (aa int, bb int);
CREATE TABLE
=# select ab from foo;
ERROR: 42703: column "ab" does not exist
LINE 1: select ab from foo;
^
LOCATION: errorMissingColumn, parse_relation.c:3123

That's because those two candidates come from a single RTE and have an
equal distance -- you'd see both suggestions if you joined two tables
with each candidate, assuming that each table being joined didn't
individually have the same issue. I think that that's probably
considered the correct behavior by most.

It seems pretty silly to me actually. Was this designed by a committee?
I agree with the general principle that showing a large number of
candidates (a la bash) is a bad idea, but failing to show two of them ...

Words fail me.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#50Peter Geoghegan
pg@heroku.com
In reply to: Alvaro Herrera (#49)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Jul 8, 2014 at 1:42 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

That's because those two candidates come from a single RTE and have an
equal distance -- you'd see both suggestions if you joined two tables
with each candidate, assuming that each table being joined didn't
individually have the same issue. I think that that's probably
considered the correct behavior by most.

It seems pretty silly to me actually. Was this designed by a committee?
I agree with the general principle that showing a large number of
candidates (a la bash) is a bad idea, but failing to show two of them ...

I guess it was designed by a committee. But we don't fail to show both
because they're equally distant. Rather, it's because they're equally
distant and from the same RTE. This is a contrived example, but
typically showing equally distant columns is useful when they're in a
foreign-key relationship - I was worried about the common case where a
column name is misspelled that would otherwise be ambiguous, which is
why that shows a HINT while the single RTE case doesn't. I think that
in most realistic cases it wouldn't be all that useful to show two
columns from the same table when they're equally distant. It's easy to
imagine that reflecting that no match is good in absolute terms, and
we're somewhat conservative about showing any match. While I think
this general behavior is defensible, I must admit that it did suit me
to write it that way because to do otherwise would have necessitated
more invasive code in the existing general purpose scanRTEForColumn()
function.

--
Peter Geoghegan


#51Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#50)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan <pg@heroku.com> wrote:

I was worried about the common case where a
column name is misspelled that would otherwise be ambiguous, which is
why that shows a HINT while the single RTE case doesn't

To be clear - I mean a HINT with two suggestions rather than just one.
If there are 3 or more equally distant suggestions (even if they're
all from different RTEs) we also give no HINT in the proposed patch.

--
Peter Geoghegan


#52Michael Paquier
michael.paquier@gmail.com
In reply to: Peter Geoghegan (#51)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Jul 9, 2014 at 5:56 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan <pg@heroku.com> wrote:

I was worried about the common case where a
column name is misspelled that would otherwise be ambiguous, which is
why that shows a HINT while the single RTE case doesn't

To be clear - I mean a HINT with two suggestions rather than just one.
If there are 3 or more equally distant suggestions (even if they're
all from different RTEs) we also give no HINT in the proposed patch.

Showing up to 2 hints is fine, as it does not pollute the error output with
perhaps unnecessary messages. That's even more conservative than, for example,
git, which prints all the equidistant candidates. However, I can't understand
why it does not show hints even when there are two equidistant candidates
from the same RTE. I think it should.
--
Michael

#53Michael Paquier
michael.paquier@gmail.com
In reply to: Peter Geoghegan (#48)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Jul 9, 2014 at 1:49 AM, Peter Geoghegan <pg@heroku.com> wrote:

4) This is not nice; would it be possible to remove this from varlena.c?
+/* Expand each Levenshtein distance variant */
+#include "levenshtein.c"
+#define LEVENSHTEIN_LESS_EQUAL
+#include "levenshtein.c"
+#undef LEVENSHTEIN_LESS_EQUAL
Part of the same comment: only varstr_leven_less_equal is used to
calculate the distance, so should we really move varstr_leven to core?
This clearly needs to be reworked rather than just copy-pasted from
fuzzystrmatch.
The flag LEVENSHTEIN_LESS_EQUAL should be left within fuzzystrmatch, I think.

So there'd be one variant within core and one within
contrib/fuzzystrmatch? I don't think that's an improvement.

No. The main difference between varstr_leven_less_equal and varstr_leven is
the use of the extra argument max_d in the former. My argument here is that,
instead of blindly cut-and-pasting into core the code you need to
evaluate the string distances, you should refactor it into a single function
and leave the business with LEVENSHTEIN_LESS_EQUAL within
contrib/fuzzystrmatch. This will require some reshuffling of the distance
function, but looking at this patch I get the feeling that this
is necessary, and that it should even be split out into a first patch for
fuzzystrmatch that would facilitate its integration into core.
Also, why is rest_of_char_same within varlena.c?

5) Do we want hints on system columns as well?
I think it's obvious that the answer must be no. That's going to
frequently result in suggestions of columns that users will complain
aren't even there. If you know about the system columns, you can just
get it right. They're supposed to be hidden for most purposes.

This may sound ridiculous, but I have already found myself mistyping ctid
as tid and cid while working on patches and modules that played with the page
format, and needing a couple of minutes to understand what was going on
(bad morning). I would have welcomed such hints in those cases.
--
Michael

#54Peter Geoghegan
pg@heroku.com
In reply to: Michael Paquier (#52)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Jul 8, 2014 at 11:10 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Showing up to 2 hints is fine as it does not pollute the error output with
perhaps unnecessary messages. That's even more protective than for example
git, which prints all the equidistant candidates. However, I can't understand
why it does not show hints when there are two equidistant candidates
from the same RTE. I think it should.

Everyone is going to have an opinion on something like that. I was
showing deference to the general concern about the absolute (as
opposed to relative) quality of the HINTs in the event of equidistant
matches by having no two suggestions come from within a single RTE,
while still covering the case I thought was important by having two
suggestions if there were two equidistant matches across RTEs. I think
that's marginally better than what you propose, because your case
deals with two equidistant though distinct columns, bringing into
question the validity of both would-be suggestions. I'll defer to
whatever the consensus is.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#55Peter Geoghegan
pg@heroku.com
In reply to: Michael Paquier (#53)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Jul 8, 2014 at 11:25 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

So there'd be one variant within core and one within
contrib/fuzzystrmatch? I don't think that's an improvement.

No. The main difference between varstr_leven_less_equal and varstr_leven is
the extra argument max_d in the former. My argument is that, instead of
blindly cut-and-pasting into core the code you need to evaluate string
distances, it should be refactored into a single function, leaving the
LEVENSHTEIN_LESS_EQUAL business within contrib/fuzzystrmatch. This will
require some reshuffling of the distance function, but from looking at this
patch I get the feeling that it is necessary, and that it should even be
split out as a first patch for fuzzystrmatch that would facilitate its
integration into core.
Also why is rest_of_char_same within varlena.c?

Just as before, rest_of_char_same() exists for the express purpose of
being called by the two variants varstr_leven_less_equal() and
varstr_leven(). Why wouldn't I copy it over too along with those two?
Where do you propose to put it?

Obviously the existing macro hacks (that I haven't changed) that build
the two variants are not terribly pretty, but they're not arbitrary
either. They reflect the fact that there is no natural way to add
callbacks or something like that. If you pretended that the core code
didn't have to care about one case or the other, and that contrib was
somehow obligated to hook in its own handler for the
!LEVENSHTEIN_LESS_EQUAL case that it now only cares about, then you'd
end up with an even bigger mess. Besides, with the patch the core code
is calling varstr_leven_less_equal(), which is the bigger of the two
variants - it's the LEVENSHTEIN_LESS_EQUAL case, not the
!LEVENSHTEIN_LESS_EQUAL case that core cares about for the purposes of
building HINTs. In short, I don't know what you mean. What would that
reshuffling actually look like?

5) Do we want hints on system columns as well?

I think it's obvious that the answer must be no. That's going to
frequently result in suggestions of columns that users will complain
aren't even there. If you know about the system columns, you can just
get it right. They're supposed to be hidden for most purposes.

This may sound ridiculous, but I have already found myself mistyping ctid as
tid or cid while working on patches and modules that played with page
format, and needing a couple of minutes to understand what was going on (bad
morning).

I think that it's clearly not worth it, even if it is true that a
minority sometimes make this mistake. Most users don't know that there
are system columns. It's not even close to being worth it to bring
that into this.

--
Peter Geoghegan


#56Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Geoghegan (#55)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Peter Geoghegan wrote:

On Tue, Jul 8, 2014 at 11:25 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

5) Do we want hints on system columns as well?

I think it's obvious that the answer must be no. That's going to
frequently result in suggestions of columns that users will complain
aren't even there. If you know about the system columns, you can just
get it right. They're supposed to be hidden for most purposes.

This may sound ridiculous, but I have already found myself mistyping ctid as
tid or cid while working on patches and modules that played with page
format, and needing a couple of minutes to understand what was going on (bad
morning).

I think that it's clearly not worth it, even if it is true that a
minority sometimes make this mistake. Most users don't know that there
are system columns. It's not even close to being worth it to bring
that into this.

I agree with Peter. This is targeted at regular users.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#57Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#52)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Jul 9, 2014 at 2:10 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Jul 9, 2014 at 5:56 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Tue, Jul 8, 2014 at 1:55 PM, Peter Geoghegan <pg@heroku.com> wrote:

I was worried about the common case where a
column name is misspelled that would otherwise be ambiguous, which is
why that shows a HINT while the single RTE case doesn't

To be clear - I mean a HINT with two suggestions rather than just one.
If there are 3 or more equally distant suggestions (even if they're
all from different RTEs) we also give no HINT in the proposed patch.

Showing up to 2 hints is fine as it does not pollute the error output with
perhaps unnecessary messages. That's even more protective than for example
git, which prints all the equidistant candidates. However, I can't understand
why it does not show hints when there are two equidistant candidates
from the same RTE. I think it should.

Me, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#58Greg Stark
stark@mit.edu
In reply to: Peter Geoghegan (#54)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Jul 9, 2014 at 7:29 AM, Peter Geoghegan <pg@heroku.com> wrote:

Everyone is going to have an opinion on something like that. I was
showing deference to the general concern about the absolute (as
opposed to relative) quality of the HINTs in the event of equidistant
matches by having no two suggestions come from within a single RTE,
while still covering the case I thought was important by having two
suggestions if there were two equidistant matches across RTEs. I think
that's marginally better than what you propose, because your case
deals with two equidistant though distinct columns, bringing into
question the validity of both would-be suggestions. I'll defer to
whatever the consensus is.

I agree this is bike shedding. But as long as we're bike shedding...

A simple rule is easier for users to understand as well as to code. I
would humbly suggest the following: take all the unqualified column
names, downcase them, check which ones match most closely the
unmatched column. Show the top 3 matches if they're within some
arbitrary distance. If they match exactly except for the case and the
unmatched column is all lower case add a comment that quoting is
required due to the mixed case.
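
The last part of that rule (a match that differs only in letter case) can be sketched in isolation. This is only an illustration: differs_only_by_case() is a hypothetical helper, not something from the patch, and it assumes plain ASCII identifiers where the typed name has already been folded to lower case.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: true when "typed" and "actual" have the same length
 * and the same letters, but differ in the case of at least one letter.
 * Such a column can only be referenced with a quoted identifier.
 */
bool
differs_only_by_case(const char *typed, const char *actual)
{
	bool		case_differs = false;

	assert(typed != NULL && actual != NULL);

	for (; *typed != '\0' && *actual != '\0'; typed++, actual++)
	{
		unsigned char a = (unsigned char) *typed;
		unsigned char b = (unsigned char) *actual;

		if (a == b)
			continue;
		if (tolower(a) != tolower(b))
			return false;		/* a real spelling difference */
		case_differs = true;
	}

	/* Both strings must end together, otherwise the lengths differ. */
	return case_differs && *typed == '\0' && *actual == '\0';
}
```

With this check, a reference to orderid against a column created as "OrderId" would qualify for the extra note about quoting, while orderid against orderids would not.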

Honestly the current logic and the previous logic both seemed
reasonable to me. They're not going to be perfect in every case so
anything that comes up with some suggestions is fine.

--
greg


#59Peter Geoghegan
pg@heroku.com
In reply to: Greg Stark (#58)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Jul 9, 2014 at 8:08 AM, Greg Stark <stark@mit.edu> wrote:

A simple rule is easier for users to understand as well as to code. I
would humbly suggest the following: take all the unqualified column
names, downcase them, check which ones match most closely the
unmatched column. Show the top 3 matches if they're within some
arbitrary distance.

That's harder than it sounds. You need even more translatable strings
for variant ereports(). I don't think that an easy to understand rule
is necessarily of much value - I'm already charging half price for
deletion because I found representative errors more useful in certain
cases by doing so. I think we want something that displays the most
useful suggestion as often as is practically possible, and does not
display unhelpful suggestions to the extent that it's practical to
avoid them. Plus, as I mentioned, I'm keen to avoid adding more stuff
to scanRTEForColumn() than I already have.

Honestly the current logic and the previous logic both seemed
reasonable to me. They're not going to be perfect in every case so
anything that comes up with some suggestions is fine.

I think that the most recent revision is somewhat better due to the
feedback of Tom and Robert. I didn't feel as strongly as they did
about erring on the side of not showing a HINT, but I think the most
recent revision is a good compromise. But yes, at this point we're
certainly chasing diminishing returns. There are almost any number of
variants of this basic idea that could be suggested.

--
Peter Geoghegan


#60Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#57)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Jul 9, 2014 at 8:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Showing up to 2 hints is fine as it does not pollute the error output with
perhaps unnecessary messages. That's even more protective than for example
git, which prints all the equidistant candidates. However, I can't understand
why it does not show hints when there are two equidistant candidates
from the same RTE. I think it should.

Me, too.

The idea is that each RTE gets one best suggestion, because if there
are two best suggestions within an RTE they're probably both wrong.
Whereas across RTEs, it's probably just that there is a foreign key
relationship between the two (and the user accidentally failed to
qualify the particular column of interest on top of the misspelling, a
qualification that would be sufficient to have the code prefer the
qualified-but-misspelled column). Clearly if I was to do what you
suggest it would be closer to a wild guess, and Tom has expressed
concerns about that.

Now, I don't actually ensure that the column names of the two columns
(each from separate RTEs) are identical save for their would-be alias,
but that's just a consequence of the implementation. Also, as I've
mentioned, I don't want to put more stuff in scanRTEForColumn() than I
already have, due to your earlier concern about adding clutter.

I think we're splitting hairs at this point, and frankly I'll do it
that way if it gets the patch closer to being committed. While I
thought it was important to get the unqualified and misspelled case
right (which I did in the first revision, but perhaps at the expense
of Tom's concern about absolute suggestion quality), I don't feel
strongly about this detail either way.

--
Peter Geoghegan


#61Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Geoghegan (#59)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Peter Geoghegan wrote:

On Wed, Jul 9, 2014 at 8:08 AM, Greg Stark <stark@mit.edu> wrote:

A simple rule is easier for users to understand as well as to code. I
would humbly suggest the following: take all the unqualified column
names, downcase them, check which ones match most closely the
unmatched column. Show the top 3 matches if they're within some
arbitrary distance.

That's harder than it sounds. You need even more translatable strings
for variant ereports().

Maybe it is possible to rephrase the message so that the translatable
part doesn't need to be concerned with how many suggestions there are. For
instance something like "perhaps you meant a name from the following
list: foo, bar, baz". Coupled with the errmsg_plural stuff, you then
don't need to worry too much about providing different strings for 1, 2,
N suggestions.
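
A rough sketch of that idea follows, outside of any backend infrastructure. The names join_names() and format_hint() are made up for the illustration, the message wording is taken from the example above, and a real implementation would go through ereport() and the translation machinery rather than snprintf().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: join n names into buf as "foo, bar, baz".
 * Names that do not fit whole are dropped.  Returns the length written.
 */
static size_t
join_names(char *buf, size_t buflen, const char *const *names, int n)
{
	size_t		used = 0;
	int			i;

	assert(buflen > 0);
	buf[0] = '\0';

	for (i = 0; i < n; i++)
	{
		int			w = snprintf(buf + used, buflen - used, "%s%s",
								 (i > 0) ? ", " : "", names[i]);

		if (w < 0 || (size_t) w >= buflen - used)
		{
			buf[used] = '\0';	/* discard the partially written name */
			break;
		}
		used += (size_t) w;
	}
	return used;
}

/* One message template, whatever the number of suggestions. */
int
format_hint(char *out, size_t outlen, const char *const *names, int n)
{
	char		list[256];

	join_names(list, sizeof(list), names, n);
	return snprintf(out, outlen,
					"Perhaps you meant a name from the following list: %s.",
					list);
}
```

Only the template string would need translating here; the list itself is built from identifiers and stays as-is.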

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#62Peter Geoghegan
pg@heroku.com
In reply to: Alvaro Herrera (#61)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Jul 9, 2014 at 2:19 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

That's harder than it sounds. You need even more translatable strings
for variant ereports().

Maybe it is possible to rephrase the message so that the translatable
part doesn't need to be concerned with how many suggestions there are. For
instance something like "perhaps you meant a name from the following
list: foo, bar, baz". Coupled with the errmsg_plural stuff, you then
don't need to worry too much about providing different strings for 1, 2,
N suggestions.

That's not really the problem. I already have a lot of things to test
in each of the two ereport() calls. More importantly, showing the
closest, say, 3 matches under an arbitrary distance does not weigh
concerns about that indicating that they're all bad. It's not like
bash tab completion - if there is one best match, that's probably
because that's what the user meant. Whereas if there are two or more
within a single RTE, that's probably because both are unhelpful. They
both happened to require the same number of substitutions to get to,
while not being quite bad enough matches to be excluded by the final
check against a normalized distance threshold (the final check that
prevents ludicrous suggestions).

The fact that there were multiple equally plausible candidates (that
are not identically named and just from different RTEs) tells us
plenty, unlike with tab completion. It's not hard for one column to be
a better match than another, and so it doesn't seem unreasonable to
insist upon that within a single RTE where they cannot be identical,
since a conservative approach seems to be what is generally favored.
In any case I'm just trying to weigh everyone's concerns here. I hope
it's actually possible to compromise, but right now I don't know what
I can do to make useful progress.
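
To make the rule described above concrete, here is a minimal sketch of "one best suggestion per RTE, or nothing". It is not the code from the patch: best_column_in_rte() and edit_distance() are hypothetical names, the distance uses plain unit costs over bytes, and the cutoff (at most half of the typed name may differ) is an assumed stand-in for the normalized distance threshold.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAME_LEN 63			/* assumption: names fit in 63 bytes */

/* Plain unit-cost Levenshtein distance over bytes, two-row version. */
static int
edit_distance(const char *s, const char *t)
{
	int			prev[MAX_NAME_LEN + 1];
	int			curr[MAX_NAME_LEN + 1];
	int			m = (int) strlen(s);
	int			n = (int) strlen(t);
	int			i,
				j;

	assert(m <= MAX_NAME_LEN && n <= MAX_NAME_LEN);

	for (i = 0; i <= m; i++)
		prev[i] = i;
	for (j = 1; j <= n; j++)
	{
		curr[0] = j;
		for (i = 1; i <= m; i++)
		{
			int			best = prev[i - 1] + (s[i - 1] != t[j - 1]);

			if (prev[i] + 1 < best)
				best = prev[i] + 1;
			if (curr[i - 1] + 1 < best)
				best = curr[i - 1] + 1;
			curr[i] = best;
		}
		memcpy(prev, curr, (m + 1) * sizeof(int));
	}
	return prev[m];
}

/*
 * Return the single closest column of one RTE, or NULL when the best
 * distance is tied or too large relative to the misspelled name.
 */
const char *
best_column_in_rte(const char *typed, const char *const *cols, int ncols)
{
	const char *best = NULL;
	int			best_d = -1;
	int			ties = 0;
	int			i;

	for (i = 0; i < ncols; i++)
	{
		int			d = edit_distance(typed, cols[i]);

		if (best_d < 0 || d < best_d)
		{
			best = cols[i];
			best_d = d;
			ties = 1;
		}
		else if (d == best_d)
			ties++;
	}

	/* Assumed threshold: at most half of the typed name may differ. */
	if (ties != 1 || best_d * 2 > (int) strlen(typed))
		return NULL;
	return best;
}
```

Under these assumptions, "ab" against columns (aa, bb) yields no suggestion because both are at distance 1, whereas "orderids" against (orderid, prod_id, quantity) yields orderid.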

--
Peter Geoghegan


#63Peter Geoghegan
pg@heroku.com
In reply to: Michael Paquier (#47)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Jul 8, 2014 at 6:58 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

6) Sometimes no hints are returned... Even in simple cases like this one:
=# create table foo (aa int, bb int);
CREATE TABLE
=# select ab from foo;
ERROR: 42703: column "ab" does not exist
LINE 1: select ab from foo;
^
LOCATION: errorMissingColumn, parse_relation.c:3123

In this example, it seems obvious that both "aa" and "bb" should be
suggested when they are not. But what if there were far more columns,
as might be expected in realistic cases (suppose all other columns
have at least 3 characters)? That's another kettle of fish. The
assumption that it's probably one of those two equally distant columns
is now on very shaky ground. After all, the user can only have meant
one particular column. If we apply a limited kind of Turing test to
this second case, how does the most recent revision's algorithm do?
What would a human suggest? I'm pretty sure the answer is that the
human would shrug. Maybe he or she would say "I guess you might have
meant one of either aa or bb, but that really isn't obvious at all".
That doesn't inspire much confidence.

Now, maybe I should be more optimistic about it being one of the two
because there are only two possibilities to begin with. That seems
pretty dubious, though. In general I find it much more plausible based
on what we know that the user should rethink everything. And, as Tom
pointed out, showing nothing conveys something in itself once users
have been trained to expect something.

--
Peter Geoghegan


#64Michael Paquier
michael.paquier@gmail.com
In reply to: Peter Geoghegan (#55)
2 attachment(s)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Jul 9, 2014 at 3:56 PM, Peter Geoghegan <pg@heroku.com> wrote:

What would that reshuffling actually look like?

Something like the patch 1 attached...

Btw, re-reading this thread, everybody seems to agree that this is a useful
feature, but we still do not have a clear definition of the circumstances
under which column hints should be produced, except for their number (up to
two). So, biting the bullet and getting my hands on it, I have ended up with
the two attached patches, which make the implementation clearer:
- Patch 1 moves levenshtein functions from fuzzystrmatch to core.
- Patch 2 implements the column hints, rather unchanged from original
proposition.

Patch 1 does a couple of things:
- fuzzystrmatch is bumped to 1.1, as the Levenshtein functions are no longer
part of it and have moved to core.
- Removal of the LESS_EQUAL flag that made the originally submitted patch
harder to understand. All the Levenshtein functions wrap a single common
function.
- Documentation is moved, and regression tests for the Levenshtein functions
are added.
- Functions taking costs are renamed with a _with_costs suffix.
After hacking on this feature, I came to the conclusion that it would be
better for the user experience to move all the Levenshtein functions
directly into backend code, instead of moving only the common wrapper as
Peter did in his original patches. Doing it this way avoids keeping
portions of the same feature in two different places in the code (backend
with the common routine, fuzzystrmatch with the levenshtein functions) and
concentrates all the logic in a single place. Now, we may as well consider
renaming the levenshtein functions to smarter names, like str_distance,
and keeping fuzzystrmatch at 1.0, with the levenshtein_* functions calling
only the str_distance functions.
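
As a rough picture of what "a single common function" could look like, here is a simplified sketch. It is not the code of the attached patch: str_distance_internal() is an illustrative name, it works on plain C strings byte by byte rather than on multibyte text values, it assumes non-negative costs, and its bound check is a simple per-row cutoff instead of the start_column/stop_column window of the existing levenshtein.c. The convention that a negative max_d means "no bound" follows the "If max_d >= 0" wording of the existing comment.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Smallest of three ints. */
static int
min3(int a, int b, int c)
{
	int			m = (a < b) ? a : b;

	return (m < c) ? m : c;
}

/*
 * Single common routine (illustrative).  With max_d < 0 the exact distance
 * is returned; with max_d >= 0 the result is exact only when it is
 * <= max_d, otherwise max_d + 1 is returned.
 */
static int
str_distance_internal(const char *s, const char *t,
					  int ins_c, int del_c, int sub_c, int max_d)
{
	int			m = (int) strlen(s);
	int			n = (int) strlen(t);
	int		   *prev = malloc((m + 1) * sizeof(int));
	int		   *curr = malloc((m + 1) * sizeof(int));
	int			i,
				j,
				result;

	assert(prev != NULL && curr != NULL);

	/* Row 0: reducing the first i characters of s to "" takes i deletions. */
	for (i = 0; i <= m; i++)
		prev[i] = i * del_c;

	for (j = 1; j <= n; j++)
	{
		int		   *tmp;
		int			row_min;

		curr[0] = j * ins_c;
		row_min = curr[0];
		for (i = 1; i <= m; i++)
		{
			int			sub = prev[i - 1] + ((s[i - 1] == t[j - 1]) ? 0 : sub_c);

			curr[i] = min3(prev[i] + ins_c, curr[i - 1] + del_c, sub);
			if (curr[i] < row_min)
				row_min = curr[i];
		}

		/* Costs only accumulate, so a row above the bound cannot recover. */
		if (max_d >= 0 && row_min > max_d)
		{
			free(prev);
			free(curr);
			return max_d + 1;
		}

		tmp = prev;
		prev = curr;
		curr = tmp;
	}

	result = prev[m];
	free(prev);
	free(curr);
	if (max_d >= 0 && result > max_d)
		result = max_d + 1;
	return result;
}

/* The user-visible variants become thin wrappers. */
int
levenshtein(const char *s, const char *t)
{
	return str_distance_internal(s, t, 1, 1, 1, -1);
}

int
levenshtein_less_equal(const char *s, const char *t, int max_d)
{
	return str_distance_internal(s, t, 1, 1, 1, max_d);
}
```

The point of the sketch is only the shape: one routine carrying the max_d argument, and no preprocessor flag selecting between two compiled variants.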

Having a set of in-core distance functions for strings would serve more
general purposes like other object hinting (constraint names, tables, etc.).

Patch 2 is a rebase of Peter's feature that can be applied on top of
patch 1. The code is largely untouched (I haven't played much with Peter's
work) and well-commented, but I think that it needs more work,
particularly when a query has a single RTE, as in this case where no hints
are proposed to the user (mentioned upthread):
create table foo (aa int, bb int);
select ab from foo; -- no hints

Before doing anything more with patch 2, we still need to define clearly
how hints should be produced, so that's clearly out-of-scope for this CF.
Patch 1, though, prepares the field for hints of all kinds, so perhaps we
could argue more on that first?

Regards,
--
Michael

Attachments:

0001-Move-Levenshtein-functions-to-core.patch (text/x-diff; charset=US-ASCII)
From 9a70369cf792f23a556944ec40bbf61f26f514a6 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Thu, 17 Jul 2014 21:54:24 +0900
Subject: [PATCH 1/2] Move Levenshtein functions to core

All the functions, part of fuzzystrmatch, able to evaluate distances
between strings are moved into core:
- levenshtein
- levenshtein_less_equal
In order to unify the names of the functions in catalogs, the functions
with costs are given the suffix *_with_costs.

Documentation as well as regression tests are added. fuzzystrmatch is
bumped to 1.1 on the same occasion.
---
 contrib/fuzzystrmatch/Makefile                    |   6 +-
 contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql |   9 +
 contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql      |  44 --
 contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql      |  28 ++
 contrib/fuzzystrmatch/fuzzystrmatch.c             |  69 ---
 contrib/fuzzystrmatch/fuzzystrmatch.control       |   2 +-
 contrib/fuzzystrmatch/levenshtein.c               | 403 ----------------
 doc/src/sgml/func.sgml                            | 178 +++++--
 doc/src/sgml/fuzzystrmatch.sgml                   |  66 ---
 src/backend/utils/adt/Makefile                    |   4 +-
 src/backend/utils/adt/levenshtein.c               | 543 ++++++++++++++++++++++
 src/include/catalog/pg_proc.h                     |  10 +
 src/include/utils/builtins.h                      |   6 +
 src/include/utils/levenshtein.h                   |  28 ++
 src/test/regress/expected/levenshtein.out         |  27 ++
 src/test/regress/parallel_schedule                |   2 +-
 src/test/regress/serial_schedule                  |   1 +
 src/test/regress/sql/levenshtein.sql              |   8 +
 18 files changed, 796 insertions(+), 638 deletions(-)
 create mode 100644 contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql
 delete mode 100644 contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql
 create mode 100644 contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
 delete mode 100644 contrib/fuzzystrmatch/levenshtein.c
 create mode 100644 src/backend/utils/adt/levenshtein.c
 create mode 100644 src/include/utils/levenshtein.h
 create mode 100644 src/test/regress/expected/levenshtein.out
 create mode 100644 src/test/regress/sql/levenshtein.sql

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 024265d..3d3c773 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -4,7 +4,8 @@ MODULE_big = fuzzystrmatch
 OBJS = fuzzystrmatch.o dmetaphone.o $(WIN32RES)
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.0.sql fuzzystrmatch--unpackaged--1.0.sql
+DATA =	fuzzystrmatch--1.1.sql fuzzystrmatch--unpackaged--1.0.sql \
+	fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
 ifdef USE_PGXS
@@ -17,6 +18,3 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
-
-# levenshtein.c is #included by fuzzystrmatch.c
-fuzzystrmatch.o: fuzzystrmatch.c levenshtein.c
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql
new file mode 100644
index 0000000..0fca2a6
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql
@@ -0,0 +1,9 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION fuzzystrmatch UPDATE TO 1.1" to load this file. \quit
+
+DROP FUNCTION levenshtein (text,text);
+DROP FUNCTION levenshtein (text,text,int,int,int);
+DROP FUNCTION levenshtein_less_equal (text,text,int);
+DROP FUNCTION levenshtein_less_equal (text,text,int,int,int,int);
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql
deleted file mode 100644
index 1cf9b61..0000000
--- a/contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql
+++ /dev/null
@@ -1,44 +0,0 @@
-/* contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql */
-
--- complain if script is sourced in psql, rather than via CREATE EXTENSION
-\echo Use "CREATE EXTENSION fuzzystrmatch" to load this file. \quit
-
-CREATE FUNCTION levenshtein (text,text) RETURNS int
-AS 'MODULE_PATHNAME','levenshtein'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION levenshtein (text,text,int,int,int) RETURNS int
-AS 'MODULE_PATHNAME','levenshtein_with_costs'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION levenshtein_less_equal (text,text,int) RETURNS int
-AS 'MODULE_PATHNAME','levenshtein_less_equal'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION levenshtein_less_equal (text,text,int,int,int,int) RETURNS int
-AS 'MODULE_PATHNAME','levenshtein_less_equal_with_costs'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION metaphone (text,int) RETURNS text
-AS 'MODULE_PATHNAME','metaphone'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION soundex(text) RETURNS text
-AS 'MODULE_PATHNAME', 'soundex'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION text_soundex(text) RETURNS text
-AS 'MODULE_PATHNAME', 'soundex'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION difference(text,text) RETURNS int
-AS 'MODULE_PATHNAME', 'difference'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION dmetaphone (text) RETURNS text
-AS 'MODULE_PATHNAME', 'dmetaphone'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION dmetaphone_alt (text) RETURNS text
-AS 'MODULE_PATHNAME', 'dmetaphone_alt'
-LANGUAGE C IMMUTABLE STRICT;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
new file mode 100644
index 0000000..a4861ee
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION fuzzystrmatch" to load this file. \quit
+
+CREATE FUNCTION metaphone (text,int) RETURNS text
+AS 'MODULE_PATHNAME','metaphone'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION soundex(text) RETURNS text
+AS 'MODULE_PATHNAME', 'soundex'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION text_soundex(text) RETURNS text
+AS 'MODULE_PATHNAME', 'soundex'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION difference(text,text) RETURNS int
+AS 'MODULE_PATHNAME', 'difference'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION dmetaphone (text) RETURNS text
+AS 'MODULE_PATHNAME', 'dmetaphone'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION dmetaphone_alt (text) RETURNS text
+AS 'MODULE_PATHNAME', 'dmetaphone_alt'
+LANGUAGE C IMMUTABLE STRICT;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c
index 7a53d8a..9923c17 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.c
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.c
@@ -40,7 +40,6 @@
 
 #include <ctype.h>
 
-#include "mb/pg_wchar.h"
 #include "utils/builtins.h"
 
 PG_MODULE_MAGIC;
@@ -154,74 +153,6 @@ getcode(char c)
 /* These prevent GH from becoming F */
 #define NOGHTOF(c)	(getcode(c) & 16)	/* BDH */
 
-/* Faster than memcmp(), for this use case. */
-static inline bool
-rest_of_char_same(const char *s1, const char *s2, int len)
-{
-	while (len > 0)
-	{
-		len--;
-		if (s1[len] != s2[len])
-			return false;
-	}
-	return true;
-}
-
-#include "levenshtein.c"
-#define LEVENSHTEIN_LESS_EQUAL
-#include "levenshtein.c"
-
-PG_FUNCTION_INFO_V1(levenshtein_with_costs);
-Datum
-levenshtein_with_costs(PG_FUNCTION_ARGS)
-{
-	text	   *src = PG_GETARG_TEXT_PP(0);
-	text	   *dst = PG_GETARG_TEXT_PP(1);
-	int			ins_c = PG_GETARG_INT32(2);
-	int			del_c = PG_GETARG_INT32(3);
-	int			sub_c = PG_GETARG_INT32(4);
-
-	PG_RETURN_INT32(levenshtein_internal(src, dst, ins_c, del_c, sub_c));
-}
-
-
-PG_FUNCTION_INFO_V1(levenshtein);
-Datum
-levenshtein(PG_FUNCTION_ARGS)
-{
-	text	   *src = PG_GETARG_TEXT_PP(0);
-	text	   *dst = PG_GETARG_TEXT_PP(1);
-
-	PG_RETURN_INT32(levenshtein_internal(src, dst, 1, 1, 1));
-}
-
-
-PG_FUNCTION_INFO_V1(levenshtein_less_equal_with_costs);
-Datum
-levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS)
-{
-	text	   *src = PG_GETARG_TEXT_PP(0);
-	text	   *dst = PG_GETARG_TEXT_PP(1);
-	int			ins_c = PG_GETARG_INT32(2);
-	int			del_c = PG_GETARG_INT32(3);
-	int			sub_c = PG_GETARG_INT32(4);
-	int			max_d = PG_GETARG_INT32(5);
-
-	PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, ins_c, del_c, sub_c, max_d));
-}
-
-
-PG_FUNCTION_INFO_V1(levenshtein_less_equal);
-Datum
-levenshtein_less_equal(PG_FUNCTION_ARGS)
-{
-	text	   *src = PG_GETARG_TEXT_PP(0);
-	text	   *dst = PG_GETARG_TEXT_PP(1);
-	int			max_d = PG_GETARG_INT32(2);
-
-	PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, 1, 1, 1, max_d));
-}
-
 
 /*
  * Calculates the metaphone of an input string.
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.control b/contrib/fuzzystrmatch/fuzzystrmatch.control
index e257f09..6b2832a 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.control
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.control
@@ -1,5 +1,5 @@
 # fuzzystrmatch extension
 comment = 'determine similarities and distance between strings'
-default_version = '1.0'
+default_version = '1.1'
 module_pathname = '$libdir/fuzzystrmatch'
 relocatable = true
diff --git a/contrib/fuzzystrmatch/levenshtein.c b/contrib/fuzzystrmatch/levenshtein.c
deleted file mode 100644
index 4f37a54..0000000
--- a/contrib/fuzzystrmatch/levenshtein.c
+++ /dev/null
@@ -1,403 +0,0 @@
-/*
- * levenshtein.c
- *
- * Functions for "fuzzy" comparison of strings
- *
- * Joe Conway <mail@joeconway.com>
- *
- * Copyright (c) 2001-2014, PostgreSQL Global Development Group
- * ALL RIGHTS RESERVED;
- *
- * levenshtein()
- * -------------
- * Written based on a description of the algorithm by Michael Gilleland
- * found at http://www.merriampark.com/ld.htm
- * Also looked at levenshtein.c in the PHP 4.0.6 distribution for
- * inspiration.
- * Configurable penalty costs extension is introduced by Volkan
- * YAZICI <volkan.yazici@gmail.com>.
- */
-
-/*
- * External declarations for exported functions
- */
-#ifdef LEVENSHTEIN_LESS_EQUAL
-static int levenshtein_less_equal_internal(text *s, text *t,
-								int ins_c, int del_c, int sub_c, int max_d);
-#else
-static int levenshtein_internal(text *s, text *t,
-					 int ins_c, int del_c, int sub_c);
-#endif
-
-#define MAX_LEVENSHTEIN_STRLEN		255
-
-
-/*
- * Calculates Levenshtein distance metric between supplied strings. Generally
- * (1, 1, 1) penalty costs suffices for common cases, but your mileage may
- * vary.
- *
- * One way to compute Levenshtein distance is to incrementally construct
- * an (m+1)x(n+1) matrix where cell (i, j) represents the minimum number
- * of operations required to transform the first i characters of s into
- * the first j characters of t.  The last column of the final row is the
- * answer.
- *
- * We use that algorithm here with some modification.  In lieu of holding
- * the entire array in memory at once, we'll just use two arrays of size
- * m+1 for storing accumulated values. At each step one array represents
- * the "previous" row and one is the "current" row of the notional large
- * array.
- *
- * If max_d >= 0, we only need to provide an accurate answer when that answer
- * is less than or equal to the bound.  From any cell in the matrix, there is
- * theoretical "minimum residual distance" from that cell to the last column
- * of the final row.  This minimum residual distance is zero when the
- * untransformed portions of the strings are of equal length (because we might
- * get lucky and find all the remaining characters matching) and is otherwise
- * based on the minimum number of insertions or deletions needed to make them
- * equal length.  The residual distance grows as we move toward the upper
- * right or lower left corners of the matrix.  When the max_d bound is
- * usefully tight, we can use this property to avoid computing the entirety
- * of each row; instead, we maintain a start_column and stop_column that
- * identify the portion of the matrix close to the diagonal which can still
- * affect the final answer.
- */
-static int
-#ifdef LEVENSHTEIN_LESS_EQUAL
-levenshtein_less_equal_internal(text *s, text *t,
-								int ins_c, int del_c, int sub_c, int max_d)
-#else
-levenshtein_internal(text *s, text *t,
-					 int ins_c, int del_c, int sub_c)
-#endif
-{
-	int			m,
-				n,
-				s_bytes,
-				t_bytes;
-	int		   *prev;
-	int		   *curr;
-	int		   *s_char_len = NULL;
-	int			i,
-				j;
-	const char *s_data;
-	const char *t_data;
-	const char *y;
-
-	/*
-	 * For levenshtein_less_equal_internal, we have real variables called
-	 * start_column and stop_column; otherwise it's just short-hand for 0 and
-	 * m.
-	 */
-#ifdef LEVENSHTEIN_LESS_EQUAL
-	int			start_column,
-				stop_column;
-
-#undef START_COLUMN
-#undef STOP_COLUMN
-#define START_COLUMN start_column
-#define STOP_COLUMN stop_column
-#else
-#undef START_COLUMN
-#undef STOP_COLUMN
-#define START_COLUMN 0
-#define STOP_COLUMN m
-#endif
-
-	/* Extract a pointer to the actual character data. */
-	s_data = VARDATA_ANY(s);
-	t_data = VARDATA_ANY(t);
-
-	/* Determine length of each string in bytes and characters. */
-	s_bytes = VARSIZE_ANY_EXHDR(s);
-	t_bytes = VARSIZE_ANY_EXHDR(t);
-	m = pg_mbstrlen_with_len(s_data, s_bytes);
-	n = pg_mbstrlen_with_len(t_data, t_bytes);
-
-	/*
-	 * We can transform an empty s into t with n insertions, or a non-empty t
-	 * into an empty s with m deletions.
-	 */
-	if (!m)
-		return n * ins_c;
-	if (!n)
-		return m * del_c;
-
-	/*
-	 * For security concerns, restrict excessive CPU+RAM usage. (This
-	 * implementation uses O(m) memory and has O(mn) complexity.)
-	 */
-	if (m > MAX_LEVENSHTEIN_STRLEN ||
-		n > MAX_LEVENSHTEIN_STRLEN)
-		ereport(ERROR,
-				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-				 errmsg("argument exceeds the maximum length of %d bytes",
-						MAX_LEVENSHTEIN_STRLEN)));
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
-	/* Initialize start and stop columns. */
-	start_column = 0;
-	stop_column = m + 1;
-
-	/*
-	 * If max_d >= 0, determine whether the bound is impossibly tight.  If so,
-	 * return max_d + 1 immediately.  Otherwise, determine whether it's tight
-	 * enough to limit the computation we must perform.  If so, figure out
-	 * initial stop column.
-	 */
-	if (max_d >= 0)
-	{
-		int			min_theo_d; /* Theoretical minimum distance. */
-		int			max_theo_d; /* Theoretical maximum distance. */
-		int			net_inserts = n - m;
-
-		min_theo_d = net_inserts < 0 ?
-			-net_inserts * del_c : net_inserts * ins_c;
-		if (min_theo_d > max_d)
-			return max_d + 1;
-		if (ins_c + del_c < sub_c)
-			sub_c = ins_c + del_c;
-		max_theo_d = min_theo_d + sub_c * Min(m, n);
-		if (max_d >= max_theo_d)
-			max_d = -1;
-		else if (ins_c + del_c > 0)
-		{
-			/*
-			 * Figure out how much of the first row of the notional matrix we
-			 * need to fill in.  If the string is growing, the theoretical
-			 * minimum distance already incorporates the cost of deleting the
-			 * number of characters necessary to make the two strings equal in
-			 * length.  Each additional deletion forces another insertion, so
-			 * the best-case total cost increases by ins_c + del_c. If the
-			 * string is shrinking, the minimum theoretical cost assumes no
-			 * excess deletions; that is, we're starting no further right than
-			 * column n - m.  If we do start further right, the best-case
-			 * total cost increases by ins_c + del_c for each move right.
-			 */
-			int			slack_d = max_d - min_theo_d;
-			int			best_column = net_inserts < 0 ? -net_inserts : 0;
-
-			stop_column = best_column + (slack_d / (ins_c + del_c)) + 1;
-			if (stop_column > m)
-				stop_column = m + 1;
-		}
-	}
-#endif
-
-	/*
-	 * In order to avoid calling pg_mblen() repeatedly on each character in s,
-	 * we cache all the lengths before starting the main loop -- but if all
-	 * the characters in both strings are single byte, then we skip this and
-	 * use a fast-path in the main loop.  If only one string contains
-	 * multi-byte characters, we still build the array, so that the fast-path
-	 * needn't deal with the case where the array hasn't been initialized.
-	 */
-	if (m != s_bytes || n != t_bytes)
-	{
-		int			i;
-		const char *cp = s_data;
-
-		s_char_len = (int *) palloc((m + 1) * sizeof(int));
-		for (i = 0; i < m; ++i)
-		{
-			s_char_len[i] = pg_mblen(cp);
-			cp += s_char_len[i];
-		}
-		s_char_len[i] = 0;
-	}
-
-	/* One more cell for initialization column and row. */
-	++m;
-	++n;
-
-	/* Previous and current rows of notional array. */
-	prev = (int *) palloc(2 * m * sizeof(int));
-	curr = prev + m;
-
-	/*
-	 * To transform the first i characters of s into the first 0 characters of
-	 * t, we must perform i deletions.
-	 */
-	for (i = START_COLUMN; i < STOP_COLUMN; i++)
-		prev[i] = i * del_c;
-
-	/* Loop through rows of the notional array */
-	for (y = t_data, j = 1; j < n; j++)
-	{
-		int		   *temp;
-		const char *x = s_data;
-		int			y_char_len = n != t_bytes + 1 ? pg_mblen(y) : 1;
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
-
-		/*
-		 * In the best case, values percolate down the diagonal unchanged, so
-		 * we must increment stop_column unless it's already on the right end
-		 * of the array.  The inner loop will read prev[stop_column], so we
-		 * have to initialize it even though it shouldn't affect the result.
-		 */
-		if (stop_column < m)
-		{
-			prev[stop_column] = max_d + 1;
-			++stop_column;
-		}
-
-		/*
-		 * The main loop fills in curr, but curr[0] needs a special case: to
-		 * transform the first 0 characters of s into the first j characters
-		 * of t, we must perform j insertions.  However, if start_column > 0,
-		 * this special case does not apply.
-		 */
-		if (start_column == 0)
-		{
-			curr[0] = j * ins_c;
-			i = 1;
-		}
-		else
-			i = start_column;
-#else
-		curr[0] = j * ins_c;
-		i = 1;
-#endif
-
-		/*
-		 * This inner loop is critical to performance, so we include a
-		 * fast-path to handle the (fairly common) case where no multibyte
-		 * characters are in the mix.  The fast-path is entitled to assume
-		 * that if s_char_len is not initialized then BOTH strings contain
-		 * only single-byte characters.
-		 */
-		if (s_char_len != NULL)
-		{
-			for (; i < STOP_COLUMN; i++)
-			{
-				int			ins;
-				int			del;
-				int			sub;
-				int			x_char_len = s_char_len[i - 1];
-
-				/*
-				 * Calculate costs for insertion, deletion, and substitution.
-				 *
-				 * When calculating cost for substitution, we compare the last
-				 * character of each possibly-multibyte character first,
-				 * because that's enough to rule out most mis-matches.  If we
-				 * get past that test, then we compare the lengths and the
-				 * remaining bytes.
-				 */
-				ins = prev[i] + ins_c;
-				del = curr[i - 1] + del_c;
-				if (x[x_char_len - 1] == y[y_char_len - 1]
-					&& x_char_len == y_char_len &&
-					(x_char_len == 1 || rest_of_char_same(x, y, x_char_len)))
-					sub = prev[i - 1];
-				else
-					sub = prev[i - 1] + sub_c;
-
-				/* Take the one with minimum cost. */
-				curr[i] = Min(ins, del);
-				curr[i] = Min(curr[i], sub);
-
-				/* Point to next character. */
-				x += x_char_len;
-			}
-		}
-		else
-		{
-			for (; i < STOP_COLUMN; i++)
-			{
-				int			ins;
-				int			del;
-				int			sub;
-
-				/* Calculate costs for insertion, deletion, and substitution. */
-				ins = prev[i] + ins_c;
-				del = curr[i - 1] + del_c;
-				sub = prev[i - 1] + ((*x == *y) ? 0 : sub_c);
-
-				/* Take the one with minimum cost. */
-				curr[i] = Min(ins, del);
-				curr[i] = Min(curr[i], sub);
-
-				/* Point to next character. */
-				x++;
-			}
-		}
-
-		/* Swap current row with previous row. */
-		temp = curr;
-		curr = prev;
-		prev = temp;
-
-		/* Point to next character. */
-		y += y_char_len;
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
-
-		/*
-		 * This chunk of code represents a significant performance hit if used
-		 * in the case where there is no max_d bound.  This is probably not
-		 * because the max_d >= 0 test itself is expensive, but rather because
-		 * the possibility of needing to execute this code prevents tight
-		 * optimization of the loop as a whole.
-		 */
-		if (max_d >= 0)
-		{
-			/*
-			 * The "zero point" is the column of the current row where the
-			 * remaining portions of the strings are of equal length.  There
-			 * are (n - 1) characters in the target string, of which j have
-			 * been transformed.  There are (m - 1) characters in the source
-			 * string, so we want to find the value for zp where (n - 1) - j =
-			 * (m - 1) - zp.
-			 */
-			int			zp = j - (n - m);
-
-			/* Check whether the stop column can slide left. */
-			while (stop_column > 0)
-			{
-				int			ii = stop_column - 1;
-				int			net_inserts = ii - zp;
-
-				if (prev[ii] + (net_inserts > 0 ? net_inserts * ins_c :
-								-net_inserts * del_c) <= max_d)
-					break;
-				stop_column--;
-			}
-
-			/* Check whether the start column can slide right. */
-			while (start_column < stop_column)
-			{
-				int			net_inserts = start_column - zp;
-
-				if (prev[start_column] +
-					(net_inserts > 0 ? net_inserts * ins_c :
-					 -net_inserts * del_c) <= max_d)
-					break;
-
-				/*
-				 * We'll never again update these values, so we must make sure
-				 * there's nothing here that could confuse any future
-				 * iteration of the outer loop.
-				 */
-				prev[start_column] = max_d + 1;
-				curr[start_column] = max_d + 1;
-				if (start_column != 0)
-					s_data += (s_char_len != NULL) ? s_char_len[start_column - 1] : 1;
-				start_column++;
-			}
-
-			/* If they cross, we're going to exceed the bound. */
-			if (start_column >= stop_column)
-				return max_d + 1;
-		}
-#endif
-	}
-
-	/*
-	 * Because the final value was swapped from the previous row to the
-	 * current row, that's where we'll find it.
-	 */
-	return prev[m - 1];
-}
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index bf13140..979f87f 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -4029,20 +4029,20 @@ regexp_replace('foobarbaz', 'b(..)', E'X\\1Y', 'g')
     Some examples:
 <programlisting>
 SELECT regexp_matches('foobarbequebaz', '(bar)(beque)');
- regexp_matches 
+ regexp_matches
 ----------------
  {bar,beque}
 (1 row)
 
 SELECT regexp_matches('foobarbequebazilbarfbonk', '(b[^b]+)(b[^b]+)', 'g');
- regexp_matches 
+ regexp_matches
 ----------------
  {bar,beque}
  {bazil,barf}
 (2 rows)
 
 SELECT regexp_matches('foobarbequebaz', 'barbeque');
- regexp_matches 
+ regexp_matches
 ----------------
  {barbeque}
 (1 row)
@@ -4089,44 +4089,44 @@ SELECT col1, (SELECT regexp_matches(col2, '(bar)(beque)')) FROM tab;
 <programlisting>
 
 SELECT foo FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', E'\\s+') AS foo;
-  foo   
+  foo
 -------
- the    
- quick  
- brown  
- fox    
- jumps 
- over   
- the    
- lazy   
- dog    
+ the
+ quick
+ brown
+ fox
+ jumps
+ over
+ the
+ lazy
+ dog
 (9 rows)
 
 SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', E'\\s+');
-              regexp_split_to_array             
+              regexp_split_to_array
 -----------------------------------------------
  {the,quick,brown,fox,jumps,over,the,lazy,dog}
 (1 row)
 
 SELECT foo FROM regexp_split_to_table('the quick brown fox', E'\\s*') AS foo;
- foo 
+ foo
 -----
- t         
- h         
- e         
- q         
- u         
- i         
- c         
- k         
- b         
- r         
- o         
- w         
- n         
- f         
- o         
- x         
+ t
+ h
+ e
+ q
+ u
+ i
+ c
+ k
+ b
+ r
+ o
+ w
+ n
+ f
+ o
+ x
 (16 rows)
 </programlisting>
    </para>
@@ -5796,7 +5796,7 @@ SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');
        Casting does not have this behavior.
       </para>
      </listitem>
-  
+
      <listitem>
       <para>
        Ordinary text is allowed in <function>to_char</function>
@@ -7893,6 +7893,88 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
    </para>
  </sect1>
 
+ <sect1 id="functions-levenshtein">
+  <title>Levenshtein Functions</title>
+
+  <para>
+   Levenshtein functions provide ways to calculate a distance between two
+   strings.
+  </para>
+
+  <table id="functions-levenshtein-table">
+    <title>Levenshtein Functions</title>
+    <tgroup cols="4">
+     <thead>
+      <row>
+       <entry>Function</entry>
+       <entry>Description</entry>
+       <entry>Example</entry>
+       <entry>Example Result</entry>
+      </row>
+     </thead>
+     <tbody>
+      <row>
+       <entry>
+         <indexterm>
+          <primary>levenshtein</primary>
+         </indexterm>
+         <literal>levenshtein(text source, text target)</literal>
+       </entry>
+       <entry>Returns the distance between the two given strings</entry>
+       <entry><literal>levenshtein('GUMBO', 'GAMBOL')</literal></entry>
+       <entry><literal>2</literal></entry>
+      </row>
+      <row>
+       <entry>
+         <indexterm>
+          <primary>levenshtein_with_costs</primary>
+         </indexterm>
+         <literal>levenshtein_with_costs(text source, text target, int ins_cost, int del_cost, int sub_cost)</literal>
+       </entry>
+       <entry>Returns the distance between the two given strings depending on costs</entry>
+       <entry><literal>levenshtein_with_costs('GUMBO', 'GAMBOL', 2, 1, 1)</literal></entry>
+       <entry><literal>3</literal></entry>
+      </row>
+      <row>
+       <entry>
+         <indexterm>
+          <primary>levenshtein_less_equal</primary>
+         </indexterm>
+         <literal>levenshtein_less_equal(text source, text target, int max_d)</literal>
+       </entry>
+       <entry>Returns the distance between the two given strings if it is at most <literal>max_d</literal>, otherwise some value greater than <literal>max_d</literal></entry>
+       <entry><literal>levenshtein_less_equal('extensive', 'exhaustive', 2)</literal></entry>
+       <entry><literal>3</literal></entry>
+      </row>
+      <row>
+       <entry>
+         <indexterm>
+          <primary>levenshtein_less_equal_with_costs</primary>
+         </indexterm>
+         <literal>levenshtein_less_equal_with_costs(text source, text target, int ins_cost, int del_cost, int sub_cost, int max_d)</literal>
+       </entry>
+       <entry>Same as <literal>levenshtein_less_equal</literal>, but using the given insertion, deletion, and substitution costs</entry>
+       <entry><literal>levenshtein_less_equal_with_costs('extensive', 'exhaustive', 1, 1, 1, 4)</literal></entry>
+       <entry><literal>4</literal></entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <para>
+    Both <literal>source</literal> and <literal>target</literal> can be any
+    non-null string, with a maximum of 255 bytes.  The cost parameters
+    specify how much to charge for a character insertion, deletion, or
+    substitution, respectively.  The variants without cost parameters
+    charge 1 for each operation.
+    <literal>levenshtein_less_equal</literal> is an accelerated version of
+    the levenshtein function for low values of distance.  If the actual
+    distance is less than or equal to max_d, <literal>levenshtein_less_equal</literal>
+    returns the exact distance.  Otherwise it returns a value
+    that is greater than max_d.
+   </para>
+ </sect1>
+
  <sect1 id="functions-geometry">
   <title>Geometric Functions and Operators</title>
 
@@ -9686,32 +9768,32 @@ SELECT xmlexists('//town[text() = ''Toronto'']' PASSING BY REF '<towns><town>Tor
 <screen><![CDATA[
 SET xmloption TO DOCUMENT;
 SELECT xml_is_well_formed('<>');
- xml_is_well_formed 
+ xml_is_well_formed
 --------------------
  f
 (1 row)
 
 SELECT xml_is_well_formed('<abc/>');
- xml_is_well_formed 
+ xml_is_well_formed
 --------------------
  t
 (1 row)
 
 SET xmloption TO CONTENT;
 SELECT xml_is_well_formed('abc');
- xml_is_well_formed 
+ xml_is_well_formed
 --------------------
  t
 (1 row)
 
 SELECT xml_is_well_formed_document('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</pg:foo>');
- xml_is_well_formed_document 
+ xml_is_well_formed_document
 -----------------------------
  t
 (1 row)
 
 SELECT xml_is_well_formed_document('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</my:foo>');
- xml_is_well_formed_document 
+ xml_is_well_formed_document
 -----------------------------
  f
 (1 row)
@@ -9774,7 +9856,7 @@ SELECT xml_is_well_formed_document('<pg:foo xmlns:pg="http://postgresql.org/stuf
 SELECT xpath('/my:a/text()', '<my:a xmlns:my="http://example.com">test</my:a>',
              ARRAY[ARRAY['my', 'http://example.com']]);
 
- xpath  
+ xpath
 --------
  {test}
 (1 row)
@@ -9817,7 +9899,7 @@ SELECT xpath('//mydefns:b/text()', '<a xmlns="http://example.com"><b>test</b></a
 SELECT xpath_exists('/my:a/text()', '<my:a xmlns:my="http://example.com">test</my:a>',
                      ARRAY[ARRAY['my', 'http://example.com']]);
 
- xpath_exists  
+ xpath_exists
 --------------
  t
 (1 row)
@@ -14125,7 +14207,7 @@ SELECT current_date + s.a AS dates FROM generate_series(0,14,7) AS s(a);
 
 SELECT * FROM generate_series('2008-03-01 00:00'::timestamp,
                               '2008-03-04 12:00', '10 hours');
-   generate_series   
+   generate_series
 ---------------------
  2008-03-01 00:00:00
  2008-03-01 10:00:00
@@ -14188,7 +14270,7 @@ SELECT * FROM generate_series('2008-03-01 00:00'::timestamp,
 <programlisting>
 -- basic usage
 SELECT generate_subscripts('{NULL,1,NULL,2}'::int[], 1) AS s;
- s 
+ s
 ---
  1
  2
@@ -14199,7 +14281,7 @@ SELECT generate_subscripts('{NULL,1,NULL,2}'::int[], 1) AS s;
 -- presenting an array, the subscript and the subscripted
 -- value requires a subquery
 SELECT * FROM arrays;
-         a          
+         a
 --------------------
  {-1,-2}
  {100,200,300}
@@ -14225,7 +14307,7 @@ select $1[i][j]
 $$ LANGUAGE sql IMMUTABLE;
 CREATE FUNCTION
 SELECT * FROM unnest2(ARRAY[[1,2],[3,4]]);
- unnest2 
+ unnest2
 ---------
        1
        2
@@ -15619,13 +15701,13 @@ SELECT pg_type_is_visible('myschema.widget'::regtype);
 <programlisting>
 SELECT pg_typeof(33);
 
- pg_typeof 
+ pg_typeof
 -----------
  integer
 (1 row)
 
 SELECT typlen FROM pg_type WHERE oid = pg_typeof(33);
- typlen 
+ typlen
 --------
       4
 (1 row)
@@ -15637,13 +15719,13 @@ SELECT typlen FROM pg_type WHERE oid = pg_typeof(33);
    value that is passed to it.  Example:
 <programlisting>
 SELECT collation for (description) FROM pg_description LIMIT 1;
- pg_collation_for 
+ pg_collation_for
 ------------------
  "default"
 (1 row)
 
 SELECT collation for ('foo' COLLATE "de_DE");
- pg_collation_for 
+ pg_collation_for
 ------------------
  "de_DE"
 (1 row)
@@ -16313,7 +16395,7 @@ postgres=# select pg_start_backup('label_goes_here');
     above functions.  For example:
 <programlisting>
 postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
-        file_name         | file_offset 
+        file_name         | file_offset
 --------------------------+-------------
  00000001000000000000000D |     4039624
 (1 row)
diff --git a/doc/src/sgml/fuzzystrmatch.sgml b/doc/src/sgml/fuzzystrmatch.sgml
index f26bd90..f95d5aa 100644
--- a/doc/src/sgml/fuzzystrmatch.sgml
+++ b/doc/src/sgml/fuzzystrmatch.sgml
@@ -83,72 +83,6 @@ SELECT * FROM s WHERE difference(s.nm, 'john') &gt; 2;
  </sect2>
 
  <sect2>
-  <title>Levenshtein</title>
-
-  <para>
-   This function calculates the Levenshtein distance between two strings:
-  </para>
-
-  <indexterm>
-   <primary>levenshtein</primary>
-  </indexterm>
-
-  <indexterm>
-   <primary>levenshtein_less_equal</primary>
-  </indexterm>
-
-<synopsis>
-levenshtein(text source, text target, int ins_cost, int del_cost, int sub_cost) returns int
-levenshtein(text source, text target) returns int
-levenshtein_less_equal(text source, text target, int ins_cost, int del_cost, int sub_cost, int max_d) returns int
-levenshtein_less_equal(text source, text target, int max_d) returns int
-</synopsis>
-
-  <para>
-   Both <literal>source</literal> and <literal>target</literal> can be any
-   non-null string, with a maximum of 255 bytes.  The cost parameters
-   specify how much to charge for a character insertion, deletion, or
-   substitution, respectively.  You can omit the cost parameters, as in
-   the second version of the function; in that case they all default to 1.
-   <literal>levenshtein_less_equal</literal> is accelerated version of
-   levenshtein function for low values of distance. If actual distance
-   is less or equal then max_d, then <literal>levenshtein_less_equal</literal>
-   returns accurate value of it. Otherwise this function returns value
-   which is greater than max_d.
-  </para>
-
-  <para>
-   Examples:
-  </para>
-
-<screen>
-test=# SELECT levenshtein('GUMBO', 'GAMBOL');
- levenshtein
--------------
-           2
-(1 row)
-
-test=# SELECT levenshtein('GUMBO', 'GAMBOL', 2,1,1);
- levenshtein
--------------
-           3
-(1 row)
-
-test=# SELECT levenshtein_less_equal('extensive', 'exhaustive',2);
- levenshtein_less_equal
-------------------------
-                      3
-(1 row)
-
-test=# SELECT levenshtein_less_equal('extensive', 'exhaustive',4);
- levenshtein_less_equal
-------------------------
-                      4
-(1 row)
-</screen>
- </sect2>
-
- <sect2>
   <title>Metaphone</title>
 
   <para>
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 7b4391b..7071afe 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,8 +22,8 @@ OBJS = acl.o arrayfuncs.o array_selfuncs.o array_typanalyze.o \
 	encode.o enum.o float.o format_type.o formatting.o genfile.o \
 	geo_ops.o geo_selfuncs.o inet_cidr_ntop.o inet_net_pton.o int.o \
 	int8.o json.o jsonb.o jsonb_gin.o jsonb_op.o jsonb_util.o \
-	jsonfuncs.o like.o lockfuncs.o mac.o misc.o nabstime.o name.o \
-	network.o network_gist.o network_selfuncs.o \
+	jsonfuncs.o levenshtein.o like.o lockfuncs.o mac.o misc.o nabstime.o \
+	name.o network.o network_gist.o network_selfuncs.o \
 	numeric.o numutils.o oid.o oracle_compat.o \
 	orderedsetaggs.o pg_lzcompress.o pg_locale.o pg_lsn.o \
 	pgstatfuncs.o pseudotypes.o quote.o rangetypes.o rangetypes_gist.o \
diff --git a/src/backend/utils/adt/levenshtein.c b/src/backend/utils/adt/levenshtein.c
new file mode 100644
index 0000000..d7c9c68
--- /dev/null
+++ b/src/backend/utils/adt/levenshtein.c
@@ -0,0 +1,543 @@
+/*-------------------------------------------------------------------------
+ *
+ * levenshtein.c
+ *	  Levenshtein distance implementation.
+ *
+ * Original author:  Joe Conway <mail@joeconway.com>
+ *
+ * This file is included by varlena.c twice, to provide matching code for (1)
+ * Levenshtein distance with custom costings, and (2) Levenshtein distance with
+ * custom costings and a "max" value above which exact distances are not
+ * interesting.  Before the inclusion, we rely on the presence of the inline
+ * function rest_of_char_same().
+ *
+ * Written based on a description of the algorithm by Michael Gilleland found
+ * at http://www.merriampark.com/ld.htm. Also looked at levenshtein.c in the
+ * PHP 4.0.6 distribution for inspiration.  Configurable penalty costs
+ * extension is introduced by Volkan YAZICI <volkan.yazici@gmail.com>.
+ *
+ * Copyright (c) 2001-2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	src/backend/utils/adt/levenshtein.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "utils/levenshtein.h"
+
+#include "fmgr.h"
+#include "utils/builtins.h"
+
+#include "mb/pg_wchar.h"
+
+#define MAX_LEVENSHTEIN_STRLEN		255
+
+/*
+ * varstr_leven()
+ * varstr_leven_less_equal()
+ * Levenshtein distance functions.  All arguments should be strlen(s) <= 255.
+ * Guaranteed to work with Name datatype's cstrings.
+ * For full details see levenshtein.c.
+ *
+ * Helper function. Faster than memcmp(), for this use case.
+ */
+static inline bool
+rest_of_char_same(const char *s1, const char *s2, int len)
+{
+	while (len > 0)
+	{
+		len--;
+		if (s1[len] != s2[len])
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Calculates Levenshtein distance metric between supplied cstrings, which are
+ * not necessarily null-terminated.  Generally (1, 1, 1) penalty costs suffices
+ * for common cases, but your mileage may vary.
+ *
+ * One way to compute Levenshtein distance is to incrementally construct
+ * an (m+1)x(n+1) matrix where cell (i, j) represents the minimum number
+ * of operations required to transform the first i characters of s into
+ * the first j characters of t.  The last column of the final row is the
+ * answer.
+ *
+ * We use that algorithm here with some modification.  In lieu of holding
+ * the entire array in memory at once, we'll just use two arrays of size
+ * m+1 for storing accumulated values. At each step one array represents
+ * the "previous" row and one is the "current" row of the notional large
+ * array.
+ *
+ * If max_d >= 0, we only need to provide an accurate answer when that answer
+ * is less than or equal to the bound.  From any cell in the matrix, there is
+ * a theoretical "minimum residual distance" from that cell to the last column
+ * of the final row.  This minimum residual distance is zero when the
+ * untransformed portions of the strings are of equal length (because we might
+ * get lucky and find all the remaining characters matching) and is otherwise
+ * based on the minimum number of insertions or deletions needed to make them
+ * equal length.  The residual distance grows as we move toward the upper
+ * right or lower left corners of the matrix.  When the max_d bound is
+ * usefully tight, we can use this property to avoid computing the entirety
+ * of each row; instead, we maintain a start_column and stop_column that
+ * identify the portion of the matrix close to the diagonal which can still
+ * affect the final answer.
+ */
+/*
+ * levenshtein_common
+ *
+ * Common routine for all Levenshtein functions.
+ */
+static int
+levenshtein_common(const char *source, int slen, const char *target,
+					int tlen, int ins_c, int del_c, int sub_c, int max_d)
+{
+	int			m, n;
+	int		   *prev;
+	int		   *curr;
+	int		   *s_char_len = NULL;
+	int			i,
+				j;
+	const char *y;
+	int			max_init;
+	int			start_column,
+				stop_column;
+	int			start_column_local, stop_column_local;
+
+	/* Save value of max_d */
+	max_init = max_d;
+
+	m = pg_mbstrlen_with_len(source, slen);
+	n = pg_mbstrlen_with_len(target, tlen);
+
+	/*
+	 * We can transform an empty s into t with n insertions, or a non-empty t
+	 * into an empty s with m deletions.
+	 */
+	if (!m)
+		return n * ins_c;
+	if (!n)
+		return m * del_c;
+
+	/*
+	 * A common use for Levenshtein distance is to match column names.
+	 * Therefore, restrict the size of MAX_LEVENSHTEIN_STRLEN such that this is
+	 * guaranteed to work.
+	 */
+	StaticAssertStmt(NAMEDATALEN <= MAX_LEVENSHTEIN_STRLEN,
+					 "Levenshtein hinting mechanism restricts NAMEDATALEN");
+
+	/*
+	 * For security concerns, restrict excessive CPU+RAM usage. (This
+	 * implementation uses O(m) memory and has O(mn) complexity.)
+	 */
+	if (m > MAX_LEVENSHTEIN_STRLEN ||
+		n > MAX_LEVENSHTEIN_STRLEN)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("argument exceeds the maximum length of %d bytes",
+						MAX_LEVENSHTEIN_STRLEN)));
+
+	/*
+	 * XXX: This is the beginning of the first loop originally defined with
+	 * LEVENSHTEIN_LESS_EQUAL
+	 */
+	if (max_init >= 0)
+	{
+		/* Initialize start and stop columns. */
+		start_column = 0;
+		stop_column = m + 1;
+
+		/*
+		 * If max_d >= 0, determine whether the bound is impossibly tight.  If so,
+		 * return max_d + 1 immediately.  Otherwise, determine whether it's tight
+		 * enough to limit the computation we must perform.  If so, figure out
+		 * initial stop column.
+		 */
+		if (max_d >= 0)
+		{
+			int			min_theo_d; /* Theoretical minimum distance. */
+			int			max_theo_d; /* Theoretical maximum distance. */
+			int			net_inserts = n - m;
+
+			min_theo_d = net_inserts < 0 ?
+				-net_inserts * del_c : net_inserts * ins_c;
+			if (min_theo_d > max_d)
+				return max_d + 1;
+			if (ins_c + del_c < sub_c)
+				sub_c = ins_c + del_c;
+			max_theo_d = min_theo_d + sub_c * Min(m, n);
+			if (max_d >= max_theo_d)
+				max_d = -1;
+			else if (ins_c + del_c > 0)
+			{
+				/*
+				 * Figure out how much of the first row of the notional matrix we
+				 * need to fill in.  If the string is growing, the theoretical
+				 * minimum distance already incorporates the cost of deleting the
+				 * number of characters necessary to make the two strings equal in
+				 * length.  Each additional deletion forces another insertion, so
+				 * the best-case total cost increases by ins_c + del_c. If the
+				 * string is shrinking, the minimum theoretical cost assumes no
+				 * excess deletions; that is, we're starting no further right than
+				 * column n - m.  If we do start further right, the best-case
+				 * total cost increases by ins_c + del_c for each move right.
+				 */
+				int			slack_d = max_d - min_theo_d;
+				int			best_column = net_inserts < 0 ? -net_inserts : 0;
+
+				stop_column = best_column + (slack_d / (ins_c + del_c)) + 1;
+				if (stop_column > m)
+					stop_column = m + 1;
+			}
+		}
+	}
+	else
+	{
+		/*
+		 * Be sure to set the start and stop columns correctly in all cases.
+		 */
+		start_column = 0;
+		stop_column = m;
+	}
+
+	/*
+	 * In order to avoid calling pg_mblen() repeatedly on each character in s,
+	 * we cache all the lengths before starting the main loop -- but if all
+	 * the characters in both strings are single byte, then we skip this and
+	 * use a fast-path in the main loop.  If only one string contains
+	 * multi-byte characters, we still build the array, so that the fast-path
+	 * needn't deal with the case where the array hasn't been initialized.
+	 */
+	if (m != slen || n != tlen)
+	{
+		int			i;
+		const char *cp = source;
+
+		s_char_len = (int *) palloc((m + 1) * sizeof(int));
+		for (i = 0; i < m; ++i)
+		{
+			s_char_len[i] = pg_mblen(cp);
+			cp += s_char_len[i];
+		}
+		s_char_len[i] = 0;
+	}
+
+	/* One more cell for initialization column and row. */
+	++m;
+	++n;
+
+	/* Previous and current rows of notional array. */
+	prev = (int *) palloc(2 * m * sizeof(int));
+	curr = prev + m;
+
+	/*
+	 * To transform the first i characters of s into the first 0 characters of
+	 * t, we must perform i deletions.
+	 */
+	if (max_init >= 0)
+		stop_column_local = stop_column;
+	else
+		stop_column_local = m;
+
+	for (i = 0; i < stop_column_local; i++)
+		prev[i] = i * del_c;
+
+	/* Loop through rows of the notional array */
+	for (y = target, j = 1; j < n; j++)
+	{
+		int		   *temp;
+		const char *x = source;
+		int			y_char_len = n != tlen + 1 ? pg_mblen(y) : 1;
+
+		/*
+		 * XXX: This is the second loop originally defined with
+		 * LEVENSHTEIN_LESS_EQUAL
+		 */
+		if (max_init >= 0)
+		{
+			/*
+			 * In the best case, values percolate down the diagonal unchanged, so
+			 * we must increment stop_column unless it's already on the right end
+			 * of the array.  The inner loop will read prev[stop_column], so we
+			 * have to initialize it even though it shouldn't affect the result.
+			 */
+			if (stop_column < m)
+			{
+				prev[stop_column] = max_d + 1;
+				++stop_column;
+			}
+
+			/*
+			 * The main loop fills in curr, but curr[0] needs a special case: to
+			 * transform the first 0 characters of s into the first j characters
+			 * of t, we must perform j insertions.  However, if start_column > 0,
+			 * this special case does not apply.
+			 */
+			if (start_column == 0)
+			{
+				curr[0] = j * ins_c;
+				i = 1;
+			}
+			else
+				i = start_column;
+		}
+		else
+		{
+			curr[0] = j * ins_c;
+			i = 1;
+		}
+
+		/*
+		 * This inner loop is critical to performance, so we include a
+		 * fast-path to handle the (fairly common) case where no multibyte
+		 * characters are in the mix.  The fast-path is entitled to assume
+		 * that if s_char_len is not initialized then BOTH strings contain
+		 * only single-byte characters.
+		 */
+		if (s_char_len != NULL)
+		{
+			if (max_init < 0)
+				stop_column_local = m;
+			else
+				stop_column_local = stop_column;
+
+			for (; i < stop_column_local; i++)
+			{
+				int			ins;
+				int			del;
+				int			sub;
+				int			x_char_len = s_char_len[i - 1];
+
+				/*
+				 * Calculate costs for insertion, deletion, and substitution.
+				 *
+				 * When calculating cost for substitution, we compare the last
+				 * character of each possibly-multibyte character first,
+				 * because that's enough to rule out most mis-matches.  If we
+				 * get past that test, then we compare the lengths and the
+				 * remaining bytes.
+				 */
+				ins = prev[i] + ins_c;
+				del = curr[i - 1] + del_c;
+				if (x[x_char_len - 1] == y[y_char_len - 1]
+					&& x_char_len == y_char_len &&
+					(x_char_len == 1 || rest_of_char_same(x, y, x_char_len)))
+					sub = prev[i - 1];
+				else
+					sub = prev[i - 1] + sub_c;
+
+				/* Take the one with minimum cost. */
+				curr[i] = Min(ins, del);
+				curr[i] = Min(curr[i], sub);
+
+				/* Point to next character. */
+				x += x_char_len;
+			}
+		}
+		else
+		{
+			if (max_init < 0)
+				stop_column_local = m;
+			else
+				stop_column_local = stop_column;
+
+			for (; i < stop_column_local; i++)
+			{
+				int			ins;
+				int			del;
+				int			sub;
+
+				/* Calculate costs for insertion, deletion, and substitution. */
+				ins = prev[i] + ins_c;
+				del = curr[i - 1] + del_c;
+				sub = prev[i - 1] + ((*x == *y) ? 0 : sub_c);
+
+				/* Take the one with minimum cost. */
+				curr[i] = Min(ins, del);
+				curr[i] = Min(curr[i], sub);
+
+				/* Point to next character. */
+				x++;
+			}
+		}
+
+		/* Swap current row with previous row. */
+		temp = curr;
+		curr = prev;
+		prev = temp;
+
+		/* Point to next character. */
+		y += y_char_len;
+
+		/*
+		 * This chunk of code represents a significant performance hit if used
+		 * in the case where there is no max_d bound.  This is probably not
+		 * because the max_d >= 0 test itself is expensive, but rather because
+		 * the possibility of needing to execute this code prevents tight
+		 * optimization of the loop as a whole.
+		 */
+		if (max_init >= 0 && max_d >= 0)
+		{
+			/*
+			 * The "zero point" is the column of the current row where the
+			 * remaining portions of the strings are of equal length.  There
+			 * are (n - 1) characters in the target string, of which j have
+			 * been transformed.  There are (m - 1) characters in the source
+			 * string, so we want to find the value for zp where (n - 1) - j =
+			 * (m - 1) - zp.
+			 */
+			int			zp = j - (n - m);
+
+			/* Check whether the stop column can slide left. */
+			while (stop_column > 0)
+			{
+				int			ii = stop_column - 1;
+				int			net_inserts = ii - zp;
+
+				if (prev[ii] + (net_inserts > 0 ? net_inserts * ins_c :
+								-net_inserts * del_c) <= max_d)
+					break;
+				stop_column--;
+			}
+
+			/* Check whether the start column can slide right. */
+			while (start_column < stop_column)
+			{
+				int			net_inserts = start_column - zp;
+
+				if (prev[start_column] +
+					(net_inserts > 0 ? net_inserts * ins_c :
+					 -net_inserts * del_c) <= max_d)
+					break;
+
+				/*
+				 * We'll never again update these values, so we must make sure
+				 * there's nothing here that could confuse any future
+				 * iteration of the outer loop.
+				 */
+				prev[start_column] = max_d + 1;
+				curr[start_column] = max_d + 1;
+				if (start_column != 0)
+					source += (s_char_len != NULL) ? s_char_len[start_column - 1] : 1;
+				start_column++;
+			}
+
+			/* If they cross, we're going to exceed the bound. */
+			if (start_column >= stop_column)
+				return max_d + 1;
+		}
+	}
+
+	/*
+	 * Because the final value was swapped from the previous row to the
+	 * current row, that's where we'll find it.
+	 */
+	return prev[m - 1];
+}
+
+int
+levenshtein_less_equal_internal(const char *source, int slen, const char *target,
+						int tlen, int ins_c, int del_c, int sub_c, int max_d)
+{
+	return levenshtein_common(source, slen, target, tlen, ins_c,
+						del_c, sub_c, max_d);
+}
+
+int
+levenshtein_internal(const char *source, int slen, const char *target, int tlen,
+			 int ins_c, int del_c, int sub_c)
+{
+	return levenshtein_common(source, slen, target, tlen, ins_c,
+						del_c, sub_c, -1);
+}
+
+Datum
+levenshtein(PG_FUNCTION_ARGS)
+{
+	text	   *src = PG_GETARG_TEXT_PP(0);
+	text	   *dst = PG_GETARG_TEXT_PP(1);
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes, t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(levenshtein_internal(s_data, s_bytes, t_data,
+			t_bytes, 1, 1, 1));
+}
+
+Datum
+levenshtein_with_costs(PG_FUNCTION_ARGS)
+{
+	text	   *src = PG_GETARG_TEXT_PP(0);
+	text	   *dst = PG_GETARG_TEXT_PP(1);
+	int			ins_c = PG_GETARG_INT32(2);
+	int			del_c = PG_GETARG_INT32(3);
+	int			sub_c = PG_GETARG_INT32(4);
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes, t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(levenshtein_internal(s_data, s_bytes, t_data,
+			t_bytes, ins_c, del_c, sub_c));
+}
+
+Datum
+levenshtein_less_equal(PG_FUNCTION_ARGS)
+{
+	text		*src = PG_GETARG_TEXT_PP(0);
+	text		*dst = PG_GETARG_TEXT_PP(1);
+	int			 max_d = PG_GETARG_INT32(2);
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes, t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(levenshtein_less_equal_internal(s_data, s_bytes,
+			t_data, t_bytes, 1, 1, 1, max_d));
+}
+
+Datum
+levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS)
+{
+	text	   *src = PG_GETARG_TEXT_PP(0);
+	text	   *dst = PG_GETARG_TEXT_PP(1);
+	int			ins_c = PG_GETARG_INT32(2);
+	int			del_c = PG_GETARG_INT32(3);
+	int			sub_c = PG_GETARG_INT32(4);
+	int			max_d = PG_GETARG_INT32(5);
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes, t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(levenshtein_less_equal_internal(s_data, s_bytes,
+			t_data, t_bytes, ins_c, del_c, sub_c, max_d));
+}
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 0af1248..bcc13cb 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4973,6 +4973,16 @@ DESCR("peek at changes from replication slot");
 DATA(insert OID = 3785 (  pg_logical_slot_peek_binary_changes PGNSP PGUID 12 1000 1000 25 0 f f f f f t v 4 0 2249 "19 3220 23 1009" "{19,3220,23,1009,3220,28,17}" "{i,i,i,v,o,o,o}" "{slot_name,upto_lsn,upto_nchanges,options,location,xid,data}" _null_ pg_logical_slot_peek_binary_changes _null_ _null_ _null_ ));
 DESCR("peek at binary changes from replication slot");
 
+/* levenshtein distance */
+DATA(insert OID = 3366 ( levenshtein	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 23 "25 25" _null_ _null_ _null_ _null_ levenshtein _null_ _null_ _null_));
+DESCR("Levenshtein distance between two strings");
+DATA(insert OID = 3367 ( levenshtein_with_costs	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 5 0 23 "25 25 23 23 23" _null_ _null_ _null_ _null_ levenshtein_with_costs _null_ _null_ _null_));
+DESCR("Levenshtein distance between two strings with costs");
+DATA(insert OID = 3368 ( levenshtein_less_equal	PGNSP PGUID 12 1 0 0 0 f f f f t f i 3 0 23 "25 25 23" _null_ _null_ _null_ _null_ levenshtein_less_equal _null_ _null_ _null_));
+DESCR("Less-equal Levenshtein distance between two strings");
+DATA(insert OID = 3369 ( levenshtein_less_equal_with_costs PGNSP PGUID 12 1 0 0 0 f f f f t f i 6 0 23 "25 25 23 23 23 23" _null_ _null_ _null_ _null_ levenshtein_less_equal_with_costs _null_ _null_ _null_));
+DESCR("Less-equal Levenshtein distance between two strings with costs");
+
 /* event triggers */
 DATA(insert OID = 3566 (  pg_event_trigger_dropped_objects		PGNSP PGUID 12 10 100 0 0 f f f f t t s 0 0 2249 "" "{26,26,23,25,25,25,25}" "{o,o,o,o,o,o,o}" "{classid, objid, objsubid, object_type, schema_name, object_name, object_identity}" _null_ pg_event_trigger_dropped_objects _null_ _null_ _null_ ));
 DESCR("list objects dropped by the current command");
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index bbb5d39..b468c3c 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -851,6 +851,12 @@ extern Datum cidrecv(PG_FUNCTION_ARGS);
 extern Datum cidsend(PG_FUNCTION_ARGS);
 extern Datum cideq(PG_FUNCTION_ARGS);
 
+/* levenshtein.c */
+extern Datum levenshtein(PG_FUNCTION_ARGS);
+extern Datum levenshtein_with_costs(PG_FUNCTION_ARGS);
+extern Datum levenshtein_less_equal(PG_FUNCTION_ARGS);
+extern Datum levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS);
+
 /* like.c */
 extern Datum namelike(PG_FUNCTION_ARGS);
 extern Datum namenlike(PG_FUNCTION_ARGS);
diff --git a/src/include/utils/levenshtein.h b/src/include/utils/levenshtein.h
new file mode 100644
index 0000000..65829e2
--- /dev/null
+++ b/src/include/utils/levenshtein.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * src/include/utils/levenshtein.h
+ *	  Header file for the Levenshtein distance functions, internal
+ *	  and system functions.
+ *
+ * Copyright (c) 2007-2014, PostgreSQL Global Development Group
+ *
+ * src/include/utils/levenshtein.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef LEVENSHTEIN_H
+#define LEVENSHTEIN_H
+
+#include "postgres.h"
+
+/* Internal functions */
+extern int levenshtein_less_equal_internal(const char *source, int slen,
+								  const char *target, int tlen,
+								  int ins_c, int del_c, int sub_c, int max_d);
+
+extern int levenshtein_internal(const char *source, int slen,
+					   const char *target, int tlen,
+					   int ins_c, int del_c, int sub_c);
+
+#endif
diff --git a/src/test/regress/expected/levenshtein.out b/src/test/regress/expected/levenshtein.out
new file mode 100644
index 0000000..57fd083
--- /dev/null
+++ b/src/test/regress/expected/levenshtein.out
@@ -0,0 +1,27 @@
+--
+-- LEVENSHTEIN
+--
+SELECT levenshtein('GUMBO', 'GAMBOL');
+ levenshtein 
+-------------
+           2
+(1 row)
+
+SELECT levenshtein_with_costs('GUMBO', 'GAMBOL', 2, 1, 1);
+ levenshtein_with_costs 
+------------------------
+                      3
+(1 row)
+
+SELECT levenshtein_less_equal('extensive', 'exhaustive', 2);
+ levenshtein_less_equal 
+------------------------
+                      3
+(1 row)
+
+SELECT levenshtein_less_equal_with_costs('extensive', 'exhaustive', 1, 1, 1, 4);
+ levenshtein_less_equal_with_costs 
+-----------------------------------
+                                 4
+(1 row)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index c0416f4..5faf182 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -13,7 +13,7 @@ test: tablespace
 # ----------
 # The first group of parallel tests
 # ----------
-test: boolean char name varchar text int2 int4 int8 oid float4 float8 bit numeric txid uuid enum money rangetypes pg_lsn regproc
+test: boolean char name varchar text int2 int4 int8 oid float4 float8 bit numeric txid uuid enum money rangetypes pg_lsn regproc levenshtein
 
 # Depends on things setup during char, varchar and text
 test: strings
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 16a1905..e980619 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -21,6 +21,7 @@ test: money
 test: rangetypes
 test: pg_lsn
 test: regproc
+test: levenshtein
 test: strings
 test: numerology
 test: point
diff --git a/src/test/regress/sql/levenshtein.sql b/src/test/regress/sql/levenshtein.sql
new file mode 100644
index 0000000..ea69a37
--- /dev/null
+++ b/src/test/regress/sql/levenshtein.sql
@@ -0,0 +1,8 @@
+--
+-- LEVENSHTEIN
+--
+
+SELECT levenshtein('GUMBO', 'GAMBOL');
+SELECT levenshtein_with_costs('GUMBO', 'GAMBOL', 2, 1, 1);
+SELECT levenshtein_less_equal('extensive', 'exhaustive', 2);
+SELECT levenshtein_less_equal_with_costs('extensive', 'exhaustive', 1, 1, 1, 4);
-- 
2.0.1
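
For readers who want to check the regression outputs above by hand, here is a minimal sketch of the same two-row dynamic program with per-operation costs. It is not the patch's levenshtein_common(): it assumes single-byte characters only, has no max_d bound, and uses malloc() rather than palloc(). The name weighted_levenshtein() is made up for this illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Smaller of two ints. */
static int
min_int(int a, int b)
{
	return (a < b) ? a : b;
}

/*
 * Hypothetical helper, not part of the patch: cost of transforming
 * "source" into "target" given insertion, deletion and substitution
 * costs.  Single-byte characters only.  Returns -1 if memory cannot
 * be allocated.
 */
int
weighted_levenshtein(const char *source, const char *target,
					 int ins_c, int del_c, int sub_c)
{
	int			m = (int) strlen(source) + 1;	/* extra initialization column */
	int			n = (int) strlen(target) + 1;	/* extra initialization row */
	int		   *buf = malloc(2 * (size_t) m * sizeof(int));
	int		   *prev;
	int		   *curr;
	int			i;
	int			j;
	int			result;

	if (buf == NULL)
		return -1;

	/* Previous and current rows of the notional array. */
	prev = buf;
	curr = buf + m;

	/* Turning the first i characters of source into "" takes i deletions. */
	for (i = 0; i < m; i++)
		prev[i] = i * del_c;

	for (j = 1; j < n; j++)
	{
		int		   *temp;

		/* Turning "" into the first j characters of target takes j insertions. */
		curr[0] = j * ins_c;

		for (i = 1; i < m; i++)
		{
			int			ins = prev[i] + ins_c;
			int			del = curr[i - 1] + del_c;
			int			sub = prev[i - 1] +
				((source[i - 1] == target[j - 1]) ? 0 : sub_c);

			curr[i] = min_int(min_int(ins, del), sub);
		}

		/* Swap current row with previous row. */
		temp = curr;
		curr = prev;
		prev = temp;
	}

	/* After the final swap the answer sits in the previous row. */
	result = prev[m - 1];
	free(buf);
	return result;
}
```

Called as weighted_levenshtein("GUMBO", "GAMBOL", 2, 1, 1), this should agree with the levenshtein_with_costs() expected output above: one substitution at cost 1 plus one insertion at cost 2.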

0002-Support-for-column-hints.patch (text/x-diff; charset=US-ASCII)
From 4d0d46bd57b4f4f3a962f5b27634a6174ca2acde Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Thu, 17 Jul 2014 21:58:25 +0900
Subject: [PATCH 2/2] Support for column hints

If a query references a column name that does not exist, the parser
looks for columns in the existing RTEs whose names are within a short
Levenshtein distance of the misspelled name, and reports the closest
candidates to the user as hints.
---
 src/backend/parser/parse_expr.c           |   9 +-
 src/backend/parser/parse_func.c           |   2 +-
 src/backend/parser/parse_relation.c       | 318 ++++++++++++++++++++++++++----
 src/include/parser/parse_relation.h       |   3 +-
 src/test/regress/expected/alter_table.out |   8 +
 src/test/regress/expected/join.out        |  39 ++++
 src/test/regress/expected/plpgsql.out     |   1 +
 src/test/regress/expected/rowtypes.out    |   1 +
 src/test/regress/expected/rules.out       |   1 +
 src/test/regress/expected/without_oid.out |   1 +
 src/test/regress/sql/join.sql             |  24 +++
 11 files changed, 366 insertions(+), 41 deletions(-)
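
The selection rule this patch applies within a single RTE can be sketched in isolation as follows. This is only an approximation under stated assumptions: it uses the same costs as distanceName() (insertion 2, deletion 1, substitution 2), the same "a tie means no suggestion" rule, and the final threshold test, but it leaves out exact matches, the wrong-alias penalty, multiple RTEs and dropped columns. The names hint_distance(), suggest_column() and MAX_NAME are invented for the example, and single-byte column names are assumed.

```c
#include <limits.h>
#include <string.h>

#define MAX_NAME 64				/* assumed upper bound on name length */

/*
 * Hypothetical stand-in for distanceName(): cost of turning the actual
 * column name into what the user typed, charging 2 per insertion,
 * 1 per deletion and 2 per substitution.  No max_d bound is applied.
 */
static int
hint_distance(const char *actual, const char *typed)
{
	int			row_a[MAX_NAME + 1];
	int			row_b[MAX_NAME + 1];
	int		   *prev = row_a;
	int		   *curr = row_b;
	size_t		m = strlen(actual);
	size_t		n = strlen(typed);
	size_t		i;
	size_t		j;

	if (m > MAX_NAME || n > MAX_NAME)
		return INT_MAX;			/* too long for this sketch */

	for (i = 0; i <= m; i++)
		prev[i] = (int) i;		/* i deletions at cost 1 each */

	for (j = 1; j <= n; j++)
	{
		int		   *temp;

		curr[0] = (int) j * 2;	/* j insertions at cost 2 each */
		for (i = 1; i <= m; i++)
		{
			int			ins = prev[i] + 2;
			int			del = curr[i - 1] + 1;
			int			sub = prev[i - 1] +
				((actual[i - 1] == typed[j - 1]) ? 0 : 2);
			int			best = (ins < del) ? ins : del;

			curr[i] = (sub < best) ? sub : best;
		}
		temp = curr;
		curr = prev;
		prev = temp;
	}
	return prev[m];
}

/*
 * Return the index of the single closest column name, or -1 when there
 * is a tie for closest or the best distance is implausibly large.
 */
int
suggest_column(const char *typed, const char *const *cols, int ncols)
{
	int			best = -1;
	int			best_dist = INT_MAX;
	int			i;

	for (i = 0; i < ncols; i++)
	{
		int			d = hint_distance(cols[i], typed);

		if (d > best_dist)
			continue;
		if (d == best_dist)
		{
			best = -1;			/* tie: no single best match */
			continue;
		}
		best_dist = d;
		best = i;
	}

	/* Threshold as written in the patch; "* 2 / 2" reduces to strlen(typed). */
	if (best_dist > 6 && best_dist > (int) strlen(typed))
		return -1;
	return best;
}
```

Under these rules "oid" typed against a lone column "col" scores 5 (one deletion, one substitution, one insertion), which passes the threshold. That is consistent with the new HINT lines in the alter_table expected output.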

diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 4a8aaf6..9866198 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -621,7 +621,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field2);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										NULL, NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
@@ -666,7 +667,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field3);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										NULL, NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
@@ -724,7 +726,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field4);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										NULL, NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 9ebd3fd..e128adf 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -1779,7 +1779,7 @@ ParseComplexProjection(ParseState *pstate, char *funcname, Node *first_arg,
 									 ((Var *) first_arg)->varno,
 									 ((Var *) first_arg)->varlevelsup);
 		/* Return a Var if funcname matches a column, else NULL */
-		return scanRTEForColumn(pstate, rte, funcname, location);
+		return scanRTEForColumn(pstate, rte, funcname, location, NULL, NULL);
 	}
 
 	/*
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 478584d..2838f89 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include <ctype.h>
+#include <limits.h>
 
 #include "access/htup_details.h"
 #include "access/sysattr.h"
@@ -28,6 +29,7 @@
 #include "parser/parse_relation.h"
 #include "parser/parse_type.h"
 #include "utils/builtins.h"
+#include "utils/levenshtein.h"
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
@@ -520,6 +522,22 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
 }
 
 /*
+ * distanceName
+ *	  Return Levenshtein distance between an actual column name and possible
+ *	  partial match.
+ */
+static int
+distanceName(const char *actual, const char *match, int max)
+{
+	int len = strlen(actual),
+		match_len = strlen(match);
+
+	/* Charge half as much per deletion as per insertion or per substitution */
+	return levenshtein_less_equal_internal(actual, len, match, match_len,
+								   2, 1, 2, max);
+}
+
+/*
  * scanRTEForColumn
  *	  Search the column names of a single RTE for the given name.
  *	  If found, return an appropriate Var node, else return NULL.
@@ -527,10 +545,24 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
  *
  * Side effect: if we find a match, mark the RTE as requiring read access
  * for the column.
+ *
+ * For those callers that will settle for a fuzzy match (for the purposes of
+ * building diagnostic messages), we match the column attribute whose name has
+ * the lowest Levenshtein distance from colname, setting *closest and
+ * *distance.  Such callers should not rely on the return value (even when
+ * there is an exact match), nor should they expect the usual side effect
+ * (unless there is an exact match).  This hardly matters in practice, since an
+ * error is imminent.
+ *
+ * If there are two or more attributes in the range table entry tied for
+ * closest, accurately report the shortest distance found overall, while not
+ * setting a "closest" attribute on the assumption that only a per-entry single
+ * closest match is useful.  Note that we never consider system column names
+ * when performing fuzzy matching.
  */
 Node *
 scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
-				 int location)
+				 int location, AttrNumber *closest, int *distance)
 {
 	Node	   *result = NULL;
 	int			attnum = 0;
@@ -548,12 +580,16 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
 	 * Should this somehow go wrong and we try to access a dropped column,
 	 * we'll still catch it by virtue of the checks in
 	 * get_rte_attribute_type(), which is called by make_var().  That routine
-	 * has to do a cache lookup anyway, so the check there is cheap.
+	 * has to do a cache lookup anyway, so the check there is cheap.  Callers
+	 * interested in finding match with shortest distance need to defend
+	 * against this directly, though.
 	 */
 	foreach(c, rte->eref->colnames)
 	{
+		const char *attcolname = strVal(lfirst(c));
+
 		attnum++;
-		if (strcmp(strVal(lfirst(c)), colname) == 0)
+		if (strcmp(attcolname, colname) == 0)
 		{
 			if (result)
 				ereport(ERROR,
@@ -566,6 +602,39 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
 			markVarForSelectPriv(pstate, var, rte);
 			result = (Node *) var;
 		}
+
+		if (distance && *distance != 0)
+		{
+			if (result)
+			{
+				/* Exact match just found */
+				*distance = 0;
+			}
+			else
+			{
+				int lowestdistance = *distance;
+				int thisdistance = distanceName(attcolname, colname,
+												lowestdistance);
+
+				if (thisdistance >= lowestdistance)
+				{
+					/*
+					 * This match distance may equal a prior match within this
+					 * same range table.  When that happens, the prior match is
+					 * discarded as worthless, since a single best match is
+					 * required within a RTE.
+					 */
+					if (thisdistance == lowestdistance)
+						*closest = InvalidAttrNumber;
+
+					continue;
+				}
+
+				/* Store new lowest observed distance for RT */
+				*distance = thisdistance;
+			}
+			*closest = attnum;
+		}
 	}
 
 	/*
@@ -642,7 +711,8 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
 				continue;
 
 			/* use orig_pstate here to get the right sublevels_up */
-			newresult = scanRTEForColumn(orig_pstate, rte, colname, location);
+			newresult = scanRTEForColumn(orig_pstate, rte, colname, location,
+										 NULL, NULL);
 
 			if (newresult)
 			{
@@ -668,8 +738,14 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
 
 /*
  * searchRangeTableForCol
- *	  See if any RangeTblEntry could possibly provide the given column name.
- *	  If so, return a pointer to the RangeTblEntry; else return NULL.
+ *	  See if any RangeTblEntry could possibly provide the given column name (or
+ *	  find the best match available).  Returns a list of equally likely
+ *	  candidates, or NIL in the event of no plausible candidate.
+ *
+ * Column name may be matched fuzzily; we provide the closest columns if there
+ * was not an exact match.  Caller can depend on passed closest array to find
+ * right attribute within corresponding (first and second) returned list RTEs.
+ * If closest attributes are InvalidAttrNumber, that indicates an exact match.
  *
  * This is different from colNameToVar in that it considers every entry in
  * the ParseState's rangetable(s), not only those that are currently visible
@@ -678,26 +754,145 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
  * matches, but only one will be returned).  This must be used ONLY as a
  * heuristic in giving suitable error messages.  See errorMissingColumn.
  */
-static RangeTblEntry *
-searchRangeTableForCol(ParseState *pstate, char *colname, int location)
+static List *
+searchRangeTableForCol(ParseState *pstate, const char *alias, char *colname,
+					   int location, AttrNumber closest[2])
 {
-	ParseState *orig_pstate = pstate;
+	ParseState	   *orig_pstate = pstate;
+	int				distance = INT_MAX;
+	List		   *matchedrte = NIL;
+	ListCell	   *l;
+	int				i;
 
 	while (pstate != NULL)
 	{
-		ListCell   *l;
-
 		foreach(l, pstate->p_rtable)
 		{
-			RangeTblEntry *rte = (RangeTblEntry *) lfirst(l);
+			RangeTblEntry  *rte = (RangeTblEntry *) lfirst(l);
+			AttrNumber		rteclosest = InvalidAttrNumber;
+			int				rtdistance = INT_MAX;
+			bool			wrongalias;
 
-			if (scanRTEForColumn(orig_pstate, rte, colname, location))
-				return rte;
+			/*
+			 * Get single best match from each RTE, or no match for RTE if
+			 * there is a tie for best match within a given RTE
+			 */
+			scanRTEForColumn(orig_pstate, rte, colname, location, &rteclosest,
+							 &rtdistance);
+
+			/* Was alias provided by user that does not match entry's alias? */
+			wrongalias = (alias && strcmp(alias, rte->eref->aliasname) != 0);
+
+			if (rtdistance == 0)
+			{
+				/* Exact match (for "wrong alias" or "wrong level" cases) */
+				closest[0] = wrongalias? rteclosest : InvalidAttrNumber;
+
+				/*
+				 * Any exact match is always the uncontested best match.  It
+				 * doesn't seem worth considering the case where there are
+				 * multiple exact matches, so we're done.
+				 */
+				matchedrte = lappend(NIL, rte);
+				return matchedrte;
+			}
+
+			/*
+			 * Charge extra (for inexact matches only) when an alias was
+			 * specified that differs from what might have been used to
+			 * correctly qualify this RTE's closest column
+			 */
+			if (wrongalias)
+				rtdistance += 3;
+
+			if (rteclosest != InvalidAttrNumber)
+			{
+				if (rtdistance >= distance)
+				{
+					/*
+					 * Perhaps record this attribute as being just as close in
+					 * distance to closest attribute observed so far across
+					 * entire range table.  Iff this distance is ultimately the
+					 * lowest distance observed overall, it may end up as the
+					 * second match.
+					 */
+					if (rtdistance == distance)
+					{
+						closest[1] = rteclosest;
+						matchedrte = lappend(matchedrte, rte);
+					}
+
+					continue;
+				}
+
+				/*
+				 * One best match (better than any others in previous RTEs) was
+				 * found within this RTE
+				 */
+				distance = rtdistance;
+				/* New uncontested best match */
+				matchedrte = lappend(NIL, rte);
+				closest[0] = rteclosest;
+			}
+			else
+			{
+				/*
+				 * Even though there were perhaps multiple joint-best matches
+				 * within this RTE (implying that there can be no attribute
+				 * suggestion from it), the shortest distance should still
+				 * serve as the distance for later RTEs to beat (but naturally
+				 * only if it happens to be the lowest so far across the entire
+				 * range table).
+				 */
+				distance = Min(distance, rtdistance);
+			}
 		}
 
 		pstate = pstate->parentParseState;
 	}
-	return NULL;
+
+	/*
+	 * Too many equally close partial matches found?
+	 *
+	 * It's useful to provide two matches for the common case where two range
+	 * tables each have one equally distant candidate column, as when an
+	 * unqualified (and therefore would-be ambiguous) column name is specified
+	 * which is also misspelled by the user.  It seems unhelpful to show no
+	 * hint when this occurs, since in practice one attribute probably
+	 * references the other in a foreign key relationship.  However, when there
+	 * are more than 2 range tables with equally distant matches that's
+	 * probably because the matches are not useful, so don't suggest anything.
+	 */
+	if (list_length(matchedrte) > 2)
+		return NIL;
+
+	/*
+	 * Handle dropped columns, which can appear here as empty colnames per
+	 * remarks within scanRTEForColumn().  If either the first or second
+	 * suggested attributes are dropped, do not provide any suggestion.
+	 */
+	i = 0;
+	foreach(l, matchedrte)
+	{
+		RangeTblEntry  *rte = (RangeTblEntry *) lfirst(l);
+		char		   *closestcol;
+
+		closestcol = strVal(list_nth(rte->eref->colnames, closest[i++] - 1));
+
+		if (strcmp(closestcol, "") == 0)
+			return NIL;
+	}
+
+	/*
+	 * Distance must be less than a normalized threshold in order to avoid
+	 * completely ludicrous suggestions.  Note that a distance of 6 will be
+	 * seen when 6 deletions are required against actual attribute name, or 3
+	 * insertions/substitutions.
+	 */
+	if (distance > 6 && distance > strlen(colname) * 2 / 2)
+		return NIL;
+
+	return matchedrte;
 }
 
 /*
@@ -2855,41 +3050,92 @@ errorMissingRTE(ParseState *pstate, RangeVar *relation)
 /*
  * Generate a suitable error about a missing column.
  *
- * Since this is a very common type of error, we work rather hard to
- * produce a helpful message.
+ * Since this is a very common type of error, we work rather hard to produce a
+ * helpful message, going so far as to guess user's intent when a missing
+ * column name is probably intended to reference one of two would-be ambiguous
+ * attributes (when no alias/qualification was provided).
  */
 void
 errorMissingColumn(ParseState *pstate,
 				   char *relname, char *colname, int location)
 {
-	RangeTblEntry *rte;
+	List		   *matchedrte;
+	AttrNumber	    closest[2];
+	RangeTblEntry  *rte1 = NULL,
+				   *rte2 = NULL;
+	char		   *closestcol1;
+	char		   *closestcol2;
 
 	/*
-	 * If relname was given, just play dumb and report it.  (In practice, a
-	 * bad qualification name should end up at errorMissingRTE, not here, so
-	 * no need to work hard on this case.)
+	 * closest[0] will remain InvalidAttrNumber in event of exact match, and in
+	 * the event of an exact match there is only ever one suggestion
 	 */
-	if (relname)
-		ereport(ERROR,
-				(errcode(ERRCODE_UNDEFINED_COLUMN),
-				 errmsg("column %s.%s does not exist", relname, colname),
-				 parser_errposition(pstate, location)));
+	closest[0] = closest[1] = InvalidAttrNumber;
 
 	/*
-	 * Otherwise, search the entire rtable looking for possible matches.  If
-	 * we find one, emit a hint about it.
+	 * Search the entire rtable looking for possible matches.  If we find one,
+	 * emit a hint about it.
 	 *
 	 * TODO: improve this code (and also errorMissingRTE) to mention using
 	 * LATERAL if appropriate.
 	 */
-	rte = searchRangeTableForCol(pstate, colname, location);
-
-	ereport(ERROR,
-			(errcode(ERRCODE_UNDEFINED_COLUMN),
-			 errmsg("column \"%s\" does not exist", colname),
-			 rte ? errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
-						   colname, rte->eref->aliasname) : 0,
-			 parser_errposition(pstate, location)));
+	matchedrte = searchRangeTableForCol(pstate, relname, colname, location,
+										closest);
+
+	/*
+	 * In practice a bad qualification name should end up at errorMissingRTE,
+	 * not here, so no need to work hard on this case.
+	 *
+	 * Extract RTEs for best match, if any, and joint best match, if any.
+	 */
+	if (matchedrte)
+	{
+		rte1 = (RangeTblEntry *) lfirst(list_head(matchedrte));
+
+		if (list_length(matchedrte) > 1)
+			rte2 = (RangeTblEntry *) lsecond(matchedrte);
+
+		if (rte1 && closest[0] != InvalidAttrNumber)
+			closestcol1 = strVal(list_nth(rte1->eref->colnames, closest[0] - 1));
+
+		if (rte2 && closest[1] != InvalidAttrNumber)
+			closestcol2 = strVal(list_nth(rte2->eref->colnames, closest[1] - 1));
+	}
+
+	if (!rte2)
+	{
+		/*
+		 * Handle case where there is zero or one column suggestion to hint,
+		 * including exact matches referenced but not visible.
+		 *
+		 * Infer an exact match referenced despite not being visible from the
+		 * fact that an attribute number was not passed back.
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_COLUMN),
+				 relname?
+				 errmsg("column %s.%s does not exist", relname, colname):
+				 errmsg("column \"%s\" does not exist", colname),
+				 rte1? closest[0] != InvalidAttrNumber?
+				 errhint("Perhaps you meant to reference the column \"%s\".\"%s\".",
+						 rte1->eref->aliasname, closestcol1):
+				 errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
+						 colname, rte1->eref->aliasname): 0,
+				 parser_errposition(pstate, location)));
+	}
+	else
+	{
+		/* Handle case where there are two equally useful column hints */
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_COLUMN),
+				 relname?
+				 errmsg("column %s.%s does not exist", relname, colname):
+				 errmsg("column \"%s\" does not exist", colname),
+				 errhint("Perhaps you meant to reference the column \"%s\".\"%s\" or the column \"%s\".\"%s\".",
+						 rte1->eref->aliasname, closestcol1,
+						 rte2->eref->aliasname, closestcol2),
+				 parser_errposition(pstate, location)));
+	}
 }
 
 
diff --git a/src/include/parser/parse_relation.h b/src/include/parser/parse_relation.h
index d8b9493..c18157a 100644
--- a/src/include/parser/parse_relation.h
+++ b/src/include/parser/parse_relation.h
@@ -35,7 +35,8 @@ extern RangeTblEntry *GetRTEByRangeTablePosn(ParseState *pstate,
 extern CommonTableExpr *GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte,
 			 int rtelevelsup);
 extern Node *scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte,
-				 char *colname, int location);
+				 char *colname, int location, AttrNumber *matchedatt,
+				 int *distance);
 extern Node *colNameToVar(ParseState *pstate, char *colname, bool localonly,
 			 int location);
 extern void markVarForSelectPriv(ParseState *pstate, Var *var,
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 9b89e58..77829dc 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -536,6 +536,7 @@ create table atacc1 ( test int );
 -- add a check constraint (fails)
 alter table atacc1 add constraint atacc_test1 check (test1>3);
 ERROR:  column "test1" does not exist
+HINT:  Perhaps you meant to reference the column "atacc1"."test".
 drop table atacc1;
 -- something a little more complicated
 create table atacc1 ( test int, test2 int, test3 int);
@@ -1342,6 +1343,7 @@ select f1 from c1;
 ERROR:  column "f1" does not exist
 LINE 1: select f1 from c1;
                ^
+HINT:  Perhaps you meant to reference the column "c1"."f2".
 drop table p1 cascade;
 NOTICE:  drop cascades to table c1
 create table p1 (f1 int, f2 int);
@@ -1355,6 +1357,7 @@ select f1 from c1;
 ERROR:  column "f1" does not exist
 LINE 1: select f1 from c1;
                ^
+HINT:  Perhaps you meant to reference the column "c1"."f2".
 drop table p1 cascade;
 NOTICE:  drop cascades to table c1
 create table p1 (f1 int, f2 int);
@@ -1479,6 +1482,7 @@ select oid > 0, * from altstartwith; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altstartwith;
                ^
+HINT:  Perhaps you meant to reference the column "altstartwith"."col".
 select * from altstartwith;
  col 
 -----
@@ -1515,10 +1519,12 @@ select oid > 0, * from altwithoid; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altwithoid;
                ^
+HINT:  Perhaps you meant to reference the column "altwithoid"."col".
 select oid > 0, * from altinhoid; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altinhoid;
                ^
+HINT:  Perhaps you meant to reference the column "altinhoid"."col".
 select * from altwithoid;
  col 
 -----
@@ -1554,6 +1560,7 @@ select oid > 0, * from altwithoid; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altwithoid;
                ^
+HINT:  Perhaps you meant to reference the column "altwithoid"."col".
 select oid > 0, * from altinhoid;
  ?column? | col 
 ----------+-----
@@ -1580,6 +1587,7 @@ select oid > 0, * from altwithoid; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altwithoid;
                ^
+HINT:  Perhaps you meant to reference the column "altwithoid"."col".
 select oid > 0, * from altinhoid;
  ?column? | col 
 ----------+-----
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 1cb1c51..f4edcbe 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2222,6 +2222,12 @@ select * from t1 left join t2 on (t1.a = t2.a);
  200 | 1000 | 200 | 2001
 (5 rows)
 
+-- Test matching of column name with wrong alias
+select t1.x from t1 join t3 on (t1.a = t3.x);
+ERROR:  column t1.x does not exist
+LINE 1: select t1.x from t1 join t3 on (t1.a = t3.x);
+               ^
+HINT:  Perhaps you meant to reference the column "t3"."x".
 --
 -- regression test for 8.1 merge right join bug
 --
@@ -3388,6 +3394,39 @@ select * from
 (0 rows)
 
 --
+-- Test hints given on incorrect column references are useful
+--
+select t1.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestion
+ERROR:  column t1.uunique1 does not exist
+LINE 1: select t1.uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t1"."unique1".
+select t2.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+ERROR:  column t2.uunique1 does not exist
+LINE 1: select t2.uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t2"."unique1".
+select uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+ERROR:  column "uunique1" does not exist
+LINE 1: select uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t1"."unique1" or the column "t2"."unique1".
+--
+-- Take care to reference the correct RTE
+--
+select atts.relid::regclass, s.* from pg_stats s join
+    pg_attribute a on s.attname = a.attname and s.tablename =
+    a.attrelid::regclass::text join (select unnest(indkey) attnum,
+    indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+    schemaname != 'pg_catalog';
+ERROR:  column atts.relid does not exist
+LINE 1: select atts.relid::regclass, s.* from pg_stats s join
+               ^
+HINT:  Perhaps you meant to reference the column "atts"."indexrelid".
+--
 -- Test LATERAL
 --
 select unique2, x.*
diff --git a/src/test/regress/expected/plpgsql.out b/src/test/regress/expected/plpgsql.out
index 8892bb4..2cb4aa1 100644
--- a/src/test/regress/expected/plpgsql.out
+++ b/src/test/regress/expected/plpgsql.out
@@ -4771,6 +4771,7 @@ END$$;
 ERROR:  column "foo" does not exist
 LINE 1: SELECT rtrim(roomno) AS roomno, foo FROM Room ORDER BY roomn...
                                         ^
+HINT:  Perhaps you meant to reference the column "room"."roomno".
 QUERY:  SELECT rtrim(roomno) AS roomno, foo FROM Room ORDER BY roomno
 CONTEXT:  PL/pgSQL function inline_code_block line 4 at FOR over SELECT rows
 -- Check handling of errors thrown from/into anonymous code blocks.
diff --git a/src/test/regress/expected/rowtypes.out b/src/test/regress/expected/rowtypes.out
index 88e7bfa..19a6e98 100644
--- a/src/test/regress/expected/rowtypes.out
+++ b/src/test/regress/expected/rowtypes.out
@@ -452,6 +452,7 @@ select fullname.text from fullname;  -- error
 ERROR:  column fullname.text does not exist
 LINE 1: select fullname.text from fullname;
                ^
+HINT:  Perhaps you meant to reference the column "fullname"."last".
 -- same, but RECORD instead of named composite type:
 select cast (row('Jim', 'Beam') as text);
     row     
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ca56b47..48c75fd 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2368,6 +2368,7 @@ select xmin, * from fooview;  -- fail, views don't have such a column
 ERROR:  column "xmin" does not exist
 LINE 1: select xmin, * from fooview;
                ^
+HINT:  Perhaps you meant to reference the column "fooview"."x".
 select reltoastrelid, relkind, relfrozenxid
   from pg_class where oid = 'fooview'::regclass;
  reltoastrelid | relkind | relfrozenxid 
diff --git a/src/test/regress/expected/without_oid.out b/src/test/regress/expected/without_oid.out
index cb2c0c0..fbff011 100644
--- a/src/test/regress/expected/without_oid.out
+++ b/src/test/regress/expected/without_oid.out
@@ -46,6 +46,7 @@ SELECT count(oid) FROM wo;
 ERROR:  column "oid" does not exist
 LINE 1: SELECT count(oid) FROM wo;
                      ^
+HINT:  Perhaps you meant to reference the column "wo"."i".
 VACUUM ANALYZE wi;
 VACUUM ANALYZE wo;
 SELECT min(relpages) < max(relpages), min(reltuples) - max(reltuples)
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index fa3e068..4d60f9e 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -397,6 +397,10 @@ insert into t2a values (200, 2001);
 
 select * from t1 left join t2 on (t1.a = t2.a);
 
+-- Test matching of column name with wrong alias
+
+select t1.x from t1 join t3 on (t1.a = t3.x);
+
 --
 -- regression test for 8.1 merge right join bug
 --
@@ -1047,6 +1051,26 @@ select * from
   int8_tbl x join (int4_tbl x cross join int4_tbl y(ff)) j on q1 = f1; -- ok
 
 --
+-- Test hints given on incorrect column references are useful
+--
+
+select t1.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestion
+select t2.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+select uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+
+--
+-- Take care to reference the correct RTE
+--
+
+select atts.relid::regclass, s.* from pg_stats s join
+    pg_attribute a on s.attname = a.attname and s.tablename =
+    a.attrelid::regclass::text join (select unnest(indkey) attnum,
+    indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+    schemaname != 'pg_catalog';
+--
 -- Test LATERAL
 --
 
-- 
2.0.1
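
The two ereport() branches at the end of errorMissingColumn() in the
patch above pick between three hint shapes: one suggestion, two equally
good suggestions, or an exact name that exists but is not visible. A
minimal standalone sketch of that selection follows. It is not backend
code: ColumnHint and format_hint() are hypothetical names introduced
here for illustration, and a plain buffer stands in for errhint().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for one suggestion; not a PostgreSQL type. */
typedef struct ColumnHint
{
	const char *alias;		/* range-table alias the suggestion comes from */
	const char *column;		/* suggested column; NULL means the exact name
							 * exists in "alias" but is not visible here */
} ColumnHint;

/*
 * Write the hint text into buf and return its length, or 0 (with an
 * empty buf) when there is nothing to suggest.
 */
static int
format_hint(char *buf, size_t buflen, const char *colname,
			const ColumnHint *first, const ColumnHint *second)
{
	assert(buf != NULL && buflen > 0);
	/* A second suggestion only makes sense alongside a first one. */
	assert(first != NULL || second == NULL);

	if (first == NULL)
	{
		buf[0] = '\0';
		return 0;
	}

	/* Two equally useful suggestions from different relations. */
	if (second != NULL)
		return snprintf(buf, buflen,
						"Perhaps you meant to reference the column "
						"\"%s\".\"%s\" or the column \"%s\".\"%s\".",
						first->alias, first->column,
						second->alias, second->column);

	/* A single close match. */
	if (first->column != NULL)
		return snprintf(buf, buflen,
						"Perhaps you meant to reference the column "
						"\"%s\".\"%s\".",
						first->alias, first->column);

	/* Exact name exists, but cannot be referenced from here. */
	return snprintf(buf, buflen,
					"There is a column named \"%s\" in table \"%s\", but it "
					"cannot be referenced from this part of the query.",
					colname, first->alias);
}
```

Passing NULL for both suggestions yields an empty hint, which corresponds
to the patch passing 0 in place of errhint() when rte1 is not set.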

#65Peter Geoghegan
pg@heroku.com
In reply to: Michael Paquier (#64)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

I am not opposed to moving the contrib code into core in the manner
that you propose. I don't feel strongly either way.

I noticed in passing that your revision says this *within* levenshtein.c:

+ * Guaranteed to work with Name datatype's cstrings.
+ * For full details see levenshtein.c.

On Thu, Jul 17, 2014 at 6:34 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Patch 2 is a rebase of the feature of Peter that can be applied on top of
patch 1. The code is rather untouched (haven't much played with Peter's
thingies), well-commented, but I think that this needs more work,
particularly when a query has a single RTE like in this case where no hints
are proposed to the user (mentioned upthread):

The only source of disagreement that I am aware of at this point is
the question of whether or not we should accept two candidates from
the same RTE. I lean slightly towards "no", as already explained [1]
[2] [...] approach of looking for only a single best candidate per RTE
taken in deference to the concerns of others.

I imagined that when a committer picked this up, an executive decision
would be made one way or the other. I am quite willing to revise the
patch to alter this behavior at the request of a committer.
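
To make the "single best candidate per RTE" behavior concrete, here is a
minimal sketch, assuming single-byte column names and unit edit costs.
edit_distance() and best_column_in_rel() are hypothetical helpers written
for this illustration; they are not the functions in the patch, which
works on range-table entries and uses the multibyte-aware Levenshtein
code.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAME_LEN 63			/* assumption: names fit in 63 bytes */

/* Unit-cost Levenshtein distance, keeping only two rows of the matrix. */
static int
edit_distance(const char *s, const char *t)
{
	size_t		m = strlen(s);
	size_t		n = strlen(t);
	int			rows[2][MAX_NAME_LEN + 1];
	int		   *prev = rows[0];
	int		   *curr = rows[1];
	size_t		i,
				j;

	assert(m <= MAX_NAME_LEN && n <= MAX_NAME_LEN);

	/* Turning the first i characters of s into "" takes i deletions. */
	for (i = 0; i <= m; i++)
		prev[i] = (int) i;

	for (j = 1; j <= n; j++)
	{
		int		   *tmp;

		/* Turning "" into the first j characters of t takes j insertions. */
		curr[0] = (int) j;
		for (i = 1; i <= m; i++)
		{
			int			ins = prev[i] + 1;
			int			del = curr[i - 1] + 1;
			int			sub = prev[i - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);
			int			best = ins < del ? ins : del;

			curr[i] = best < sub ? best : sub;
		}

		/* The row just filled becomes the previous row. */
		tmp = prev;
		prev = curr;
		curr = tmp;
	}
	return prev[m];
}

/*
 * Return the index of the column closest to colname within one relation,
 * or -1 if it has no columns.  On a tie the first column seen wins, so a
 * relation never contributes more than one candidate.
 */
static int
best_column_in_rel(const char *colname, const char *const *cols, int ncols,
				   int *distance)
{
	int			best = -1;
	int			i;

	*distance = INT_MAX;
	for (i = 0; i < ncols; i++)
	{
		int			d = edit_distance(colname, cols[i]);

		if (d < *distance)
		{
			*distance = d;
			best = i;
		}
	}
	return best;
}
```

Accepting two candidates from the same RTE would mean tracking the two
smallest distances per relation here instead of one; that is the
behavior still left open for a committer to decide.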

[1]: /messages/by-id/CAM3SWZTrm4PmqMmL9=eYx-8f-Vx-ha7DmE4KOmS2vCOMOzGHrw@mail.gmail.com
[2]: /messages/by-id/CAM3SWZS6kiQEqJz4pV3Fkp6cgw1wS26exOQTjb_XMW3zE5b6mA@mail.gmail.com
--
Peter Geoghegan

#66Michael Paquier
michael.paquier@gmail.com
In reply to: Peter Geoghegan (#65)
2 attachment(s)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Fri, Jul 18, 2014 at 3:54 AM, Peter Geoghegan <pg@heroku.com> wrote:

I am not opposed to moving the contrib code into core in the manner
that you propose. I don't feel strongly either way.

I noticed in passing that your revision says this *within* levenshtein.c:

+ * Guaranteed to work with Name datatype's cstrings.
+ * For full details see levenshtein.c.

Yeah, I looked again at what I produced last night and came across a
couple of similar things :) I reworked a few of them in the attached
version, mainly wordsmithing and adding comments here and there, and
made the names of the Levenshtein functions in core the same as the
ones in fuzzystrmatch 1.0.

I imagined that when a committer picked this up, an executive decision
would be made one way or the other. I am quite willing to revise the
patch to alter this behavior at the request of a committer.

Fine by me. I'll move this patch to the next stage then.
--
Michael

Attachments:

0001-Move-Levenshtein-functions-to-core.patch (text/x-diff; charset=US-ASCII)
From 65b66309444767129c81c6aee8df33a214bcf4c5 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Fri, 18 Jul 2014 16:44:28 +0900
Subject: [PATCH 1/2] Move Levenshtein functions to core

The fuzzystrmatch functions that evaluate the distance between two
strings are moved into core:
- levenshtein
- levenshtein_less_equal
To keep function names in the catalogs consistent, the variants taking
cost arguments get the suffix *_with_costs.

Documentation and regression tests are added. fuzzystrmatch is bumped
to 1.1 on the same occasion.
---
 contrib/fuzzystrmatch/Makefile                    |   6 +-
 contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql |   9 +
 contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql      |  44 --
 contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql      |  28 ++
 contrib/fuzzystrmatch/fuzzystrmatch.c             |  69 ---
 contrib/fuzzystrmatch/fuzzystrmatch.control       |   2 +-
 contrib/fuzzystrmatch/levenshtein.c               | 403 ---------------
 doc/src/sgml/func.sgml                            |  68 +++
 doc/src/sgml/fuzzystrmatch.sgml                   |  66 ---
 src/backend/utils/adt/Makefile                    |   4 +-
 src/backend/utils/adt/levenshtein.c               | 565 ++++++++++++++++++++++
 src/include/catalog/pg_proc.h                     |  10 +
 src/include/utils/builtins.h                      |   6 +
 src/include/utils/levenshtein.h                   |  27 ++
 src/test/regress/expected/levenshtein.out         |  27 ++
 src/test/regress/parallel_schedule                |   2 +-
 src/test/regress/serial_schedule                  |   1 +
 src/test/regress/sql/levenshtein.sql              |   8 +
 18 files changed, 755 insertions(+), 590 deletions(-)
 create mode 100644 contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql
 delete mode 100644 contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql
 create mode 100644 contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
 delete mode 100644 contrib/fuzzystrmatch/levenshtein.c
 create mode 100644 src/backend/utils/adt/levenshtein.c
 create mode 100644 src/include/utils/levenshtein.h
 create mode 100644 src/test/regress/expected/levenshtein.out
 create mode 100644 src/test/regress/sql/levenshtein.sql

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 024265d..3d3c773 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -4,7 +4,8 @@ MODULE_big = fuzzystrmatch
 OBJS = fuzzystrmatch.o dmetaphone.o $(WIN32RES)
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.0.sql fuzzystrmatch--unpackaged--1.0.sql
+DATA =	fuzzystrmatch--1.0.sql fuzzystrmatch--unpackaged--1.0.sql \
+	fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
 ifdef USE_PGXS
@@ -17,6 +18,3 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
-
-# levenshtein.c is #included by fuzzystrmatch.c
-fuzzystrmatch.o: fuzzystrmatch.c levenshtein.c
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql
new file mode 100644
index 0000000..0fca2a6
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql
@@ -0,0 +1,9 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION fuzzystrmatch UPDATE TO 1.1" to load this file. \quit
+
+DROP FUNCTION levenshtein (text,text);
+DROP FUNCTION levenshtein (text,text,int,int,int);
+DROP FUNCTION levenshtein_less_equal (text,text,int);
+DROP FUNCTION levenshtein_less_equal (text,text,int,int,int,int);
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql
deleted file mode 100644
index 1cf9b61..0000000
--- a/contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql
+++ /dev/null
@@ -1,44 +0,0 @@
-/* contrib/fuzzystrmatch/fuzzystrmatch--1.0.sql */
-
--- complain if script is sourced in psql, rather than via CREATE EXTENSION
-\echo Use "CREATE EXTENSION fuzzystrmatch" to load this file. \quit
-
-CREATE FUNCTION levenshtein (text,text) RETURNS int
-AS 'MODULE_PATHNAME','levenshtein'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION levenshtein (text,text,int,int,int) RETURNS int
-AS 'MODULE_PATHNAME','levenshtein_with_costs'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION levenshtein_less_equal (text,text,int) RETURNS int
-AS 'MODULE_PATHNAME','levenshtein_less_equal'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION levenshtein_less_equal (text,text,int,int,int,int) RETURNS int
-AS 'MODULE_PATHNAME','levenshtein_less_equal_with_costs'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION metaphone (text,int) RETURNS text
-AS 'MODULE_PATHNAME','metaphone'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION soundex(text) RETURNS text
-AS 'MODULE_PATHNAME', 'soundex'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION text_soundex(text) RETURNS text
-AS 'MODULE_PATHNAME', 'soundex'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION difference(text,text) RETURNS int
-AS 'MODULE_PATHNAME', 'difference'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION dmetaphone (text) RETURNS text
-AS 'MODULE_PATHNAME', 'dmetaphone'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION dmetaphone_alt (text) RETURNS text
-AS 'MODULE_PATHNAME', 'dmetaphone_alt'
-LANGUAGE C IMMUTABLE STRICT;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
new file mode 100644
index 0000000..a4861ee
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION fuzzystrmatch" to load this file. \quit
+
+CREATE FUNCTION metaphone (text,int) RETURNS text
+AS 'MODULE_PATHNAME','metaphone'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION soundex(text) RETURNS text
+AS 'MODULE_PATHNAME', 'soundex'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION text_soundex(text) RETURNS text
+AS 'MODULE_PATHNAME', 'soundex'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION difference(text,text) RETURNS int
+AS 'MODULE_PATHNAME', 'difference'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION dmetaphone (text) RETURNS text
+AS 'MODULE_PATHNAME', 'dmetaphone'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE FUNCTION dmetaphone_alt (text) RETURNS text
+AS 'MODULE_PATHNAME', 'dmetaphone_alt'
+LANGUAGE C IMMUTABLE STRICT;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c
index 7a53d8a..9923c17 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.c
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.c
@@ -40,7 +40,6 @@
 
 #include <ctype.h>
 
-#include "mb/pg_wchar.h"
 #include "utils/builtins.h"
 
 PG_MODULE_MAGIC;
@@ -154,74 +153,6 @@ getcode(char c)
 /* These prevent GH from becoming F */
 #define NOGHTOF(c)	(getcode(c) & 16)	/* BDH */
 
-/* Faster than memcmp(), for this use case. */
-static inline bool
-rest_of_char_same(const char *s1, const char *s2, int len)
-{
-	while (len > 0)
-	{
-		len--;
-		if (s1[len] != s2[len])
-			return false;
-	}
-	return true;
-}
-
-#include "levenshtein.c"
-#define LEVENSHTEIN_LESS_EQUAL
-#include "levenshtein.c"
-
-PG_FUNCTION_INFO_V1(levenshtein_with_costs);
-Datum
-levenshtein_with_costs(PG_FUNCTION_ARGS)
-{
-	text	   *src = PG_GETARG_TEXT_PP(0);
-	text	   *dst = PG_GETARG_TEXT_PP(1);
-	int			ins_c = PG_GETARG_INT32(2);
-	int			del_c = PG_GETARG_INT32(3);
-	int			sub_c = PG_GETARG_INT32(4);
-
-	PG_RETURN_INT32(levenshtein_internal(src, dst, ins_c, del_c, sub_c));
-}
-
-
-PG_FUNCTION_INFO_V1(levenshtein);
-Datum
-levenshtein(PG_FUNCTION_ARGS)
-{
-	text	   *src = PG_GETARG_TEXT_PP(0);
-	text	   *dst = PG_GETARG_TEXT_PP(1);
-
-	PG_RETURN_INT32(levenshtein_internal(src, dst, 1, 1, 1));
-}
-
-
-PG_FUNCTION_INFO_V1(levenshtein_less_equal_with_costs);
-Datum
-levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS)
-{
-	text	   *src = PG_GETARG_TEXT_PP(0);
-	text	   *dst = PG_GETARG_TEXT_PP(1);
-	int			ins_c = PG_GETARG_INT32(2);
-	int			del_c = PG_GETARG_INT32(3);
-	int			sub_c = PG_GETARG_INT32(4);
-	int			max_d = PG_GETARG_INT32(5);
-
-	PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, ins_c, del_c, sub_c, max_d));
-}
-
-
-PG_FUNCTION_INFO_V1(levenshtein_less_equal);
-Datum
-levenshtein_less_equal(PG_FUNCTION_ARGS)
-{
-	text	   *src = PG_GETARG_TEXT_PP(0);
-	text	   *dst = PG_GETARG_TEXT_PP(1);
-	int			max_d = PG_GETARG_INT32(2);
-
-	PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, 1, 1, 1, max_d));
-}
-
 
 /*
  * Calculates the metaphone of an input string.
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.control b/contrib/fuzzystrmatch/fuzzystrmatch.control
index e257f09..6b2832a 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.control
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.control
@@ -1,5 +1,5 @@
 # fuzzystrmatch extension
 comment = 'determine similarities and distance between strings'
-default_version = '1.0'
+default_version = '1.1'
 module_pathname = '$libdir/fuzzystrmatch'
 relocatable = true
diff --git a/contrib/fuzzystrmatch/levenshtein.c b/contrib/fuzzystrmatch/levenshtein.c
deleted file mode 100644
index 4f37a54..0000000
--- a/contrib/fuzzystrmatch/levenshtein.c
+++ /dev/null
@@ -1,403 +0,0 @@
-/*
- * levenshtein.c
- *
- * Functions for "fuzzy" comparison of strings
- *
- * Joe Conway <mail@joeconway.com>
- *
- * Copyright (c) 2001-2014, PostgreSQL Global Development Group
- * ALL RIGHTS RESERVED;
- *
- * levenshtein()
- * -------------
- * Written based on a description of the algorithm by Michael Gilleland
- * found at http://www.merriampark.com/ld.htm
- * Also looked at levenshtein.c in the PHP 4.0.6 distribution for
- * inspiration.
- * Configurable penalty costs extension is introduced by Volkan
- * YAZICI <volkan.yazici@gmail.com>.
- */
-
-/*
- * External declarations for exported functions
- */
-#ifdef LEVENSHTEIN_LESS_EQUAL
-static int levenshtein_less_equal_internal(text *s, text *t,
-								int ins_c, int del_c, int sub_c, int max_d);
-#else
-static int levenshtein_internal(text *s, text *t,
-					 int ins_c, int del_c, int sub_c);
-#endif
-
-#define MAX_LEVENSHTEIN_STRLEN		255
-
-
-/*
- * Calculates Levenshtein distance metric between supplied strings. Generally
- * (1, 1, 1) penalty costs suffices for common cases, but your mileage may
- * vary.
- *
- * One way to compute Levenshtein distance is to incrementally construct
- * an (m+1)x(n+1) matrix where cell (i, j) represents the minimum number
- * of operations required to transform the first i characters of s into
- * the first j characters of t.  The last column of the final row is the
- * answer.
- *
- * We use that algorithm here with some modification.  In lieu of holding
- * the entire array in memory at once, we'll just use two arrays of size
- * m+1 for storing accumulated values. At each step one array represents
- * the "previous" row and one is the "current" row of the notional large
- * array.
- *
- * If max_d >= 0, we only need to provide an accurate answer when that answer
- * is less than or equal to the bound.  From any cell in the matrix, there is
- * theoretical "minimum residual distance" from that cell to the last column
- * of the final row.  This minimum residual distance is zero when the
- * untransformed portions of the strings are of equal length (because we might
- * get lucky and find all the remaining characters matching) and is otherwise
- * based on the minimum number of insertions or deletions needed to make them
- * equal length.  The residual distance grows as we move toward the upper
- * right or lower left corners of the matrix.  When the max_d bound is
- * usefully tight, we can use this property to avoid computing the entirety
- * of each row; instead, we maintain a start_column and stop_column that
- * identify the portion of the matrix close to the diagonal which can still
- * affect the final answer.
- */
-static int
-#ifdef LEVENSHTEIN_LESS_EQUAL
-levenshtein_less_equal_internal(text *s, text *t,
-								int ins_c, int del_c, int sub_c, int max_d)
-#else
-levenshtein_internal(text *s, text *t,
-					 int ins_c, int del_c, int sub_c)
-#endif
-{
-	int			m,
-				n,
-				s_bytes,
-				t_bytes;
-	int		   *prev;
-	int		   *curr;
-	int		   *s_char_len = NULL;
-	int			i,
-				j;
-	const char *s_data;
-	const char *t_data;
-	const char *y;
-
-	/*
-	 * For levenshtein_less_equal_internal, we have real variables called
-	 * start_column and stop_column; otherwise it's just short-hand for 0 and
-	 * m.
-	 */
-#ifdef LEVENSHTEIN_LESS_EQUAL
-	int			start_column,
-				stop_column;
-
-#undef START_COLUMN
-#undef STOP_COLUMN
-#define START_COLUMN start_column
-#define STOP_COLUMN stop_column
-#else
-#undef START_COLUMN
-#undef STOP_COLUMN
-#define START_COLUMN 0
-#define STOP_COLUMN m
-#endif
-
-	/* Extract a pointer to the actual character data. */
-	s_data = VARDATA_ANY(s);
-	t_data = VARDATA_ANY(t);
-
-	/* Determine length of each string in bytes and characters. */
-	s_bytes = VARSIZE_ANY_EXHDR(s);
-	t_bytes = VARSIZE_ANY_EXHDR(t);
-	m = pg_mbstrlen_with_len(s_data, s_bytes);
-	n = pg_mbstrlen_with_len(t_data, t_bytes);
-
-	/*
-	 * We can transform an empty s into t with n insertions, or a non-empty t
-	 * into an empty s with m deletions.
-	 */
-	if (!m)
-		return n * ins_c;
-	if (!n)
-		return m * del_c;
-
-	/*
-	 * For security concerns, restrict excessive CPU+RAM usage. (This
-	 * implementation uses O(m) memory and has O(mn) complexity.)
-	 */
-	if (m > MAX_LEVENSHTEIN_STRLEN ||
-		n > MAX_LEVENSHTEIN_STRLEN)
-		ereport(ERROR,
-				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-				 errmsg("argument exceeds the maximum length of %d bytes",
-						MAX_LEVENSHTEIN_STRLEN)));
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
-	/* Initialize start and stop columns. */
-	start_column = 0;
-	stop_column = m + 1;
-
-	/*
-	 * If max_d >= 0, determine whether the bound is impossibly tight.  If so,
-	 * return max_d + 1 immediately.  Otherwise, determine whether it's tight
-	 * enough to limit the computation we must perform.  If so, figure out
-	 * initial stop column.
-	 */
-	if (max_d >= 0)
-	{
-		int			min_theo_d; /* Theoretical minimum distance. */
-		int			max_theo_d; /* Theoretical maximum distance. */
-		int			net_inserts = n - m;
-
-		min_theo_d = net_inserts < 0 ?
-			-net_inserts * del_c : net_inserts * ins_c;
-		if (min_theo_d > max_d)
-			return max_d + 1;
-		if (ins_c + del_c < sub_c)
-			sub_c = ins_c + del_c;
-		max_theo_d = min_theo_d + sub_c * Min(m, n);
-		if (max_d >= max_theo_d)
-			max_d = -1;
-		else if (ins_c + del_c > 0)
-		{
-			/*
-			 * Figure out how much of the first row of the notional matrix we
-			 * need to fill in.  If the string is growing, the theoretical
-			 * minimum distance already incorporates the cost of deleting the
-			 * number of characters necessary to make the two strings equal in
-			 * length.  Each additional deletion forces another insertion, so
-			 * the best-case total cost increases by ins_c + del_c. If the
-			 * string is shrinking, the minimum theoretical cost assumes no
-			 * excess deletions; that is, we're starting no further right than
-			 * column n - m.  If we do start further right, the best-case
-			 * total cost increases by ins_c + del_c for each move right.
-			 */
-			int			slack_d = max_d - min_theo_d;
-			int			best_column = net_inserts < 0 ? -net_inserts : 0;
-
-			stop_column = best_column + (slack_d / (ins_c + del_c)) + 1;
-			if (stop_column > m)
-				stop_column = m + 1;
-		}
-	}
-#endif
-
-	/*
-	 * In order to avoid calling pg_mblen() repeatedly on each character in s,
-	 * we cache all the lengths before starting the main loop -- but if all
-	 * the characters in both strings are single byte, then we skip this and
-	 * use a fast-path in the main loop.  If only one string contains
-	 * multi-byte characters, we still build the array, so that the fast-path
-	 * needn't deal with the case where the array hasn't been initialized.
-	 */
-	if (m != s_bytes || n != t_bytes)
-	{
-		int			i;
-		const char *cp = s_data;
-
-		s_char_len = (int *) palloc((m + 1) * sizeof(int));
-		for (i = 0; i < m; ++i)
-		{
-			s_char_len[i] = pg_mblen(cp);
-			cp += s_char_len[i];
-		}
-		s_char_len[i] = 0;
-	}
-
-	/* One more cell for initialization column and row. */
-	++m;
-	++n;
-
-	/* Previous and current rows of notional array. */
-	prev = (int *) palloc(2 * m * sizeof(int));
-	curr = prev + m;
-
-	/*
-	 * To transform the first i characters of s into the first 0 characters of
-	 * t, we must perform i deletions.
-	 */
-	for (i = START_COLUMN; i < STOP_COLUMN; i++)
-		prev[i] = i * del_c;
-
-	/* Loop through rows of the notional array */
-	for (y = t_data, j = 1; j < n; j++)
-	{
-		int		   *temp;
-		const char *x = s_data;
-		int			y_char_len = n != t_bytes + 1 ? pg_mblen(y) : 1;
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
-
-		/*
-		 * In the best case, values percolate down the diagonal unchanged, so
-		 * we must increment stop_column unless it's already on the right end
-		 * of the array.  The inner loop will read prev[stop_column], so we
-		 * have to initialize it even though it shouldn't affect the result.
-		 */
-		if (stop_column < m)
-		{
-			prev[stop_column] = max_d + 1;
-			++stop_column;
-		}
-
-		/*
-		 * The main loop fills in curr, but curr[0] needs a special case: to
-		 * transform the first 0 characters of s into the first j characters
-		 * of t, we must perform j insertions.  However, if start_column > 0,
-		 * this special case does not apply.
-		 */
-		if (start_column == 0)
-		{
-			curr[0] = j * ins_c;
-			i = 1;
-		}
-		else
-			i = start_column;
-#else
-		curr[0] = j * ins_c;
-		i = 1;
-#endif
-
-		/*
-		 * This inner loop is critical to performance, so we include a
-		 * fast-path to handle the (fairly common) case where no multibyte
-		 * characters are in the mix.  The fast-path is entitled to assume
-		 * that if s_char_len is not initialized then BOTH strings contain
-		 * only single-byte characters.
-		 */
-		if (s_char_len != NULL)
-		{
-			for (; i < STOP_COLUMN; i++)
-			{
-				int			ins;
-				int			del;
-				int			sub;
-				int			x_char_len = s_char_len[i - 1];
-
-				/*
-				 * Calculate costs for insertion, deletion, and substitution.
-				 *
-				 * When calculating cost for substitution, we compare the last
-				 * character of each possibly-multibyte character first,
-				 * because that's enough to rule out most mis-matches.  If we
-				 * get past that test, then we compare the lengths and the
-				 * remaining bytes.
-				 */
-				ins = prev[i] + ins_c;
-				del = curr[i - 1] + del_c;
-				if (x[x_char_len - 1] == y[y_char_len - 1]
-					&& x_char_len == y_char_len &&
-					(x_char_len == 1 || rest_of_char_same(x, y, x_char_len)))
-					sub = prev[i - 1];
-				else
-					sub = prev[i - 1] + sub_c;
-
-				/* Take the one with minimum cost. */
-				curr[i] = Min(ins, del);
-				curr[i] = Min(curr[i], sub);
-
-				/* Point to next character. */
-				x += x_char_len;
-			}
-		}
-		else
-		{
-			for (; i < STOP_COLUMN; i++)
-			{
-				int			ins;
-				int			del;
-				int			sub;
-
-				/* Calculate costs for insertion, deletion, and substitution. */
-				ins = prev[i] + ins_c;
-				del = curr[i - 1] + del_c;
-				sub = prev[i - 1] + ((*x == *y) ? 0 : sub_c);
-
-				/* Take the one with minimum cost. */
-				curr[i] = Min(ins, del);
-				curr[i] = Min(curr[i], sub);
-
-				/* Point to next character. */
-				x++;
-			}
-		}
-
-		/* Swap current row with previous row. */
-		temp = curr;
-		curr = prev;
-		prev = temp;
-
-		/* Point to next character. */
-		y += y_char_len;
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
-
-		/*
-		 * This chunk of code represents a significant performance hit if used
-		 * in the case where there is no max_d bound.  This is probably not
-		 * because the max_d >= 0 test itself is expensive, but rather because
-		 * the possibility of needing to execute this code prevents tight
-		 * optimization of the loop as a whole.
-		 */
-		if (max_d >= 0)
-		{
-			/*
-			 * The "zero point" is the column of the current row where the
-			 * remaining portions of the strings are of equal length.  There
-			 * are (n - 1) characters in the target string, of which j have
-			 * been transformed.  There are (m - 1) characters in the source
-			 * string, so we want to find the value for zp where (n - 1) - j =
-			 * (m - 1) - zp.
-			 */
-			int			zp = j - (n - m);
-
-			/* Check whether the stop column can slide left. */
-			while (stop_column > 0)
-			{
-				int			ii = stop_column - 1;
-				int			net_inserts = ii - zp;
-
-				if (prev[ii] + (net_inserts > 0 ? net_inserts * ins_c :
-								-net_inserts * del_c) <= max_d)
-					break;
-				stop_column--;
-			}
-
-			/* Check whether the start column can slide right. */
-			while (start_column < stop_column)
-			{
-				int			net_inserts = start_column - zp;
-
-				if (prev[start_column] +
-					(net_inserts > 0 ? net_inserts * ins_c :
-					 -net_inserts * del_c) <= max_d)
-					break;
-
-				/*
-				 * We'll never again update these values, so we must make sure
-				 * there's nothing here that could confuse any future
-				 * iteration of the outer loop.
-				 */
-				prev[start_column] = max_d + 1;
-				curr[start_column] = max_d + 1;
-				if (start_column != 0)
-					s_data += (s_char_len != NULL) ? s_char_len[start_column - 1] : 1;
-				start_column++;
-			}
-
-			/* If they cross, we're going to exceed the bound. */
-			if (start_column >= stop_column)
-				return max_d + 1;
-		}
-#endif
-	}
-
-	/*
-	 * Because the final value was swapped from the previous row to the
-	 * current row, that's where we'll find it.
-	 */
-	return prev[m - 1];
-}
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index bf13140..84ae29c 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -7893,6 +7893,74 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
    </para>
  </sect1>
 
+ <sect1 id="functions-levenshtein">
+  <title>Levenshtein functions</title>
+
+  <para>
+   Levenshtein functions provide ways to calculate a distance between two
+   strings.
+  </para>
+
+  <table id="functions-levenshtein-table">
+    <title>Levenshtein Functions</title>
+    <tgroup cols="4">
+     <thead>
+      <row>
+       <entry>Function</entry>
+       <entry>Description</entry>
+       <entry>Example</entry>
+       <entry>Example Result</entry>
+      </row>
+     </thead>
+     <tbody>
+      <row>
+       <entry>
+         <indexterm>
+          <primary>levenshtein</primary>
+         </indexterm>
+         <literal>levenshtein(text source, text target [, int ins_cost, int del_cost, int sub_cost])</literal>
+       </entry>
+       <entry>
+        Returns the distance between the two given strings <literal>source</>
+        and <literal>target</>.
+       </entry>
+       <entry><literal>levenshtein('GUMBO', 'GAMBOL')</literal></entry>
+       <entry><literal>2</literal></entry>
+      </row>
+      <row>
+       <entry>
+         <indexterm>
+          <primary>levenshtein_less_equal</primary>
+         </indexterm>
+         <literal>levenshtein_less_equal(text source, text target [, int ins_cost, int del_cost, int sub_cost], int max_d)</literal>
+       </entry>
+       <entry>
+        Returns the distance between <literal>source</> and <literal>target</> if it
+        is at most <literal>max_d</>; otherwise returns a value greater than <literal>max_d</>.
+       </entry>
+       <entry><literal>levenshtein_less_equal('extensive', 'exhaustive', 2)</literal></entry>
+       <entry><literal>3</literal></entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <para>
+    Both <literal>source</literal> and <literal>target</literal> can be any
+    non-null string, with a maximum of 255 bytes.  The cost parameters
+    <literal>ins_cost</>, <literal>del_cost</> and <literal>sub_cost</>
+    specify how much to charge for a character insertion, deletion, or
+    substitution, respectively.  You can omit the cost parameters; in that
+    case they all default to 1.
+    <literal>levenshtein_less_equal</literal> is an accelerated version of
+    <literal>levenshtein</literal> for small distances.  If the actual
+    distance is less than or equal to <literal>max_d</>, then
+    <literal>levenshtein_less_equal</literal> returns the exact distance.
+    Otherwise it returns a value greater than
+    <literal>max_d</>.
+   </para>
+ </sect1>
+
  <sect1 id="functions-geometry">
   <title>Geometric Functions and Operators</title>
 
diff --git a/doc/src/sgml/fuzzystrmatch.sgml b/doc/src/sgml/fuzzystrmatch.sgml
index f26bd90..f95d5aa 100644
--- a/doc/src/sgml/fuzzystrmatch.sgml
+++ b/doc/src/sgml/fuzzystrmatch.sgml
@@ -83,72 +83,6 @@ SELECT * FROM s WHERE difference(s.nm, 'john') &gt; 2;
  </sect2>
 
  <sect2>
-  <title>Levenshtein</title>
-
-  <para>
-   This function calculates the Levenshtein distance between two strings:
-  </para>
-
-  <indexterm>
-   <primary>levenshtein</primary>
-  </indexterm>
-
-  <indexterm>
-   <primary>levenshtein_less_equal</primary>
-  </indexterm>
-
-<synopsis>
-levenshtein(text source, text target, int ins_cost, int del_cost, int sub_cost) returns int
-levenshtein(text source, text target) returns int
-levenshtein_less_equal(text source, text target, int ins_cost, int del_cost, int sub_cost, int max_d) returns int
-levenshtein_less_equal(text source, text target, int max_d) returns int
-</synopsis>
-
-  <para>
-   Both <literal>source</literal> and <literal>target</literal> can be any
-   non-null string, with a maximum of 255 bytes.  The cost parameters
-   specify how much to charge for a character insertion, deletion, or
-   substitution, respectively.  You can omit the cost parameters, as in
-   the second version of the function; in that case they all default to 1.
-   <literal>levenshtein_less_equal</literal> is accelerated version of
-   levenshtein function for low values of distance. If actual distance
-   is less or equal then max_d, then <literal>levenshtein_less_equal</literal>
-   returns accurate value of it. Otherwise this function returns value
-   which is greater than max_d.
-  </para>
-
-  <para>
-   Examples:
-  </para>
-
-<screen>
-test=# SELECT levenshtein('GUMBO', 'GAMBOL');
- levenshtein
--------------
-           2
-(1 row)
-
-test=# SELECT levenshtein('GUMBO', 'GAMBOL', 2,1,1);
- levenshtein
--------------
-           3
-(1 row)
-
-test=# SELECT levenshtein_less_equal('extensive', 'exhaustive',2);
- levenshtein_less_equal
-------------------------
-                      3
-(1 row)
-
-test=# SELECT levenshtein_less_equal('extensive', 'exhaustive',4);
- levenshtein_less_equal
-------------------------
-                      4
-(1 row)
-</screen>
- </sect2>
-
- <sect2>
   <title>Metaphone</title>
 
   <para>
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 7b4391b..7071afe 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,8 +22,8 @@ OBJS = acl.o arrayfuncs.o array_selfuncs.o array_typanalyze.o \
 	encode.o enum.o float.o format_type.o formatting.o genfile.o \
 	geo_ops.o geo_selfuncs.o inet_cidr_ntop.o inet_net_pton.o int.o \
 	int8.o json.o jsonb.o jsonb_gin.o jsonb_op.o jsonb_util.o \
-	jsonfuncs.o like.o lockfuncs.o mac.o misc.o nabstime.o name.o \
-	network.o network_gist.o network_selfuncs.o \
+	jsonfuncs.o levenshtein.o like.o lockfuncs.o mac.o misc.o nabstime.o \
+	name.o network.o network_gist.o network_selfuncs.o \
 	numeric.o numutils.o oid.o oracle_compat.o \
 	orderedsetaggs.o pg_lzcompress.o pg_locale.o pg_lsn.o \
 	pgstatfuncs.o pseudotypes.o quote.o rangetypes.o rangetypes_gist.o \
diff --git a/src/backend/utils/adt/levenshtein.c b/src/backend/utils/adt/levenshtein.c
new file mode 100644
index 0000000..e405ad8
--- /dev/null
+++ b/src/backend/utils/adt/levenshtein.c
@@ -0,0 +1,565 @@
+/*-------------------------------------------------------------------------
+ *
+ * levenshtein.c
+ *	  Levenshtein distance implementation.
+ *
+ * Original author:  Joe Conway <mail@joeconway.com>
+ *
+ * This file provides matching code for (1) Levenshtein distance with custom
+ * costings, and (2) Levenshtein distance with custom costings and a "max"
+ * value above which exact distances are not interesting.  Both variants
+ * share levenshtein_common() and rely on the inline function
+ * rest_of_char_same() defined below.
+ *
+ * All arguments should be strlen(s) <= MAX_LEVENSHTEIN_STRLEN.
+ *
+ * Written based on a description of the algorithm by Michael Gilleland found
+ * at http://www.merriampark.com/ld.htm. Also looked at levenshtein.c in the
+ * PHP 4.0.6 distribution for inspiration.  Configurable penalty costs
+ * extension was introduced by Volkan YAZICI <volkan.yazici@gmail.com>.
+ *
+ * Copyright (c) 2001-2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	src/backend/utils/adt/levenshtein.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "utils/levenshtein.h"
+
+#include "fmgr.h"
+#include "utils/builtins.h"
+
+#include "mb/pg_wchar.h"
+
+/*
+ * Maximum length of strings authorized for distance calculation.
+ */
+#define MAX_LEVENSHTEIN_STRLEN		255
+
+/*
+ * Helper function. Faster than memcmp(), for this use case.
+ */
+static inline bool
+rest_of_char_same(const char *s1, const char *s2, int len)
+{
+	while (len > 0)
+	{
+		len--;
+		if (s1[len] != s2[len])
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Calculates Levenshtein distance metric between supplied cstrings, which are
+ * not necessarily null-terminated.  Generally (1, 1, 1) penalty costs suffice
+ * for common cases, but your mileage may vary.
+ *
+ * One way to compute Levenshtein distance is to incrementally construct
+ * an (m+1)x(n+1) matrix where cell (i, j) represents the minimum number
+ * of operations required to transform the first i characters of s into
+ * the first j characters of t.  The last column of the final row is the
+ * answer.
+ *
+ * We use that algorithm here with some modification.  In lieu of holding
+ * the entire array in memory at once, we'll just use two arrays of size
+ * m+1 for storing accumulated values. At each step one array represents
+ * the "previous" row and one is the "current" row of the notional large
+ * array.
+ *
+ * If max_d >= 0, we only need to provide an accurate answer when that answer
+ * is less than or equal to the bound.  From any cell in the matrix, there is
+ * a theoretical "minimum residual distance" from that cell to the last column
+ * of the final row.  This minimum residual distance is zero when the
+ * untransformed portions of the strings are of equal length (because we might
+ * get lucky and find all the remaining characters matching) and is otherwise
+ * based on the minimum number of insertions or deletions needed to make them
+ * equal length.  The residual distance grows as we move toward the upper
+ * right or lower left corners of the matrix.  When the max_d bound is
+ * usefully tight, we can use this property to avoid computing the entirety
+ * of each row; instead, we maintain a start_column and stop_column that
+ * identify the portion of the matrix close to the diagonal which can still
+ * affect the final answer.
+ */
+
+/*
+ * levenshtein_common
+ *
+ * Common routine for all Levenshtein functions.
+ */
+static int
+levenshtein_common(const char *source, int slen, const char *target,
+					int tlen, int ins_c, int del_c, int sub_c, int max_d)
+{
+	int			m, n;
+	int		   *prev;
+	int		   *curr;
+	int		   *s_char_len = NULL;
+	int			i,
+				j;
+	const char *y;
+	int			max_init;
+	int			start_column,
+				stop_column;
+	int			start_column_local, stop_column_local;
+
+	/* Save value of max_d */
+	max_init = max_d;
+
+	m = pg_mbstrlen_with_len(source, slen);
+	n = pg_mbstrlen_with_len(target, tlen);
+
+	/*
+	 * We can transform an empty s into t with n insertions, or a non-empty s
+	 * into an empty t with m deletions.
+	 */
+	if (!m)
+		return n * ins_c;
+	if (!n)
+		return m * del_c;
+
+	/*
+	 * A common use for Levenshtein distance is to match column names.
+	 * Therefore, restrict the size of MAX_LEVENSHTEIN_STRLEN such that this is
+	 * guaranteed to work.
+	 */
+	StaticAssertStmt(NAMEDATALEN <= MAX_LEVENSHTEIN_STRLEN,
+					 "Levenshtein hinting mechanism restricts NAMEDATALEN");
+
+	/*
+	 * For security concerns, restrict excessive CPU+RAM usage. (This
+	 * implementation uses O(m) memory and has O(mn) complexity.)
+	 */
+	if (m > MAX_LEVENSHTEIN_STRLEN ||
+		n > MAX_LEVENSHTEIN_STRLEN)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("argument exceeds the maximum length of %d bytes",
+						MAX_LEVENSHTEIN_STRLEN)));
+
+	/* This optimization is done only for less-equal calculation */
+	if (max_init >= 0)
+	{
+		/* Initialize start and stop columns. */
+		start_column = 0;
+		stop_column = m + 1;
+
+		/*
+		 * If max_d >= 0, determine whether the bound is impossibly tight.  If so,
+		 * return max_d + 1 immediately.  Otherwise, determine whether it's tight
+		 * enough to limit the computation we must perform.  If so, figure out
+		 * initial stop column.
+		 */
+		if (max_d >= 0)
+		{
+			int			min_theo_d; /* Theoretical minimum distance. */
+			int			max_theo_d; /* Theoretical maximum distance. */
+			int			net_inserts = n - m;
+
+			min_theo_d = net_inserts < 0 ?
+				-net_inserts * del_c : net_inserts * ins_c;
+			if (min_theo_d > max_d)
+				return max_d + 1;
+			if (ins_c + del_c < sub_c)
+				sub_c = ins_c + del_c;
+			max_theo_d = min_theo_d + sub_c * Min(m, n);
+			if (max_d >= max_theo_d)
+				max_d = -1;
+			else if (ins_c + del_c > 0)
+			{
+				/*
+				 * Figure out how much of the first row of the notional matrix we
+				 * need to fill in.  If the string is growing, the theoretical
+				 * minimum distance already incorporates the cost of deleting the
+				 * number of characters necessary to make the two strings equal in
+				 * length.  Each additional deletion forces another insertion, so
+				 * the best-case total cost increases by ins_c + del_c. If the
+				 * string is shrinking, the minimum theoretical cost assumes no
+				 * excess deletions; that is, we're starting no further right than
+				 * column n - m.  If we do start further right, the best-case
+				 * total cost increases by ins_c + del_c for each move right.
+				 */
+				int			slack_d = max_d - min_theo_d;
+				int			best_column = net_inserts < 0 ? -net_inserts : 0;
+
+				stop_column = best_column + (slack_d / (ins_c + del_c)) + 1;
+				if (stop_column > m)
+					stop_column = m + 1;
+			}
+		}
+	}
+	else
+	{
+		/*
+		 * Be sure to set the start and stop columns correctly in all cases.
+		 */
+		start_column = 0;
+		stop_column = m;
+	}
+
+	/*
+	 * In order to avoid calling pg_mblen() repeatedly on each character in s,
+	 * we cache all the lengths before starting the main loop -- but if all
+	 * the characters in both strings are single byte, then we skip this and
+	 * use a fast-path in the main loop.  If only one string contains
+	 * multi-byte characters, we still build the array, so that the fast-path
+	 * needn't deal with the case where the array hasn't been initialized.
+	 */
+	if (m != slen || n != tlen)
+	{
+		int			i;
+		const char *cp = source;
+
+		s_char_len = (int *) palloc((m + 1) * sizeof(int));
+		for (i = 0; i < m; ++i)
+		{
+			s_char_len[i] = pg_mblen(cp);
+			cp += s_char_len[i];
+		}
+		s_char_len[i] = 0;
+	}
+
+	/* One more cell for initialization column and row. */
+	++m;
+	++n;
+
+	/* Previous and current rows of notional array. */
+	prev = (int *) palloc(2 * m * sizeof(int));
+	curr = prev + m;
+
+	/*
+	 * To transform the first i characters of s into the first 0 characters of
+	 * t, we must perform i deletions.
+	 */
+	if (max_init >= 0)
+		stop_column_local = stop_column;
+	else
+		stop_column_local = m;
+
+	for (i = 0; i < stop_column_local; i++)
+		prev[i] = i * del_c;
+
+	/* Loop through rows of the notional array */
+	for (y = target, j = 1; j < n; j++)
+	{
+		int		   *temp;
+		const char *x = source;
+		int			y_char_len = n != tlen + 1 ? pg_mblen(y) : 1;
+
+		/* This optimization is done only for less-equal calculation */
+		if (max_init >= 0)
+		{
+			/*
+			 * In the best case, values percolate down the diagonal unchanged, so
+			 * we must increment stop_column unless it's already on the right end
+			 * of the array.  The inner loop will read prev[stop_column], so we
+			 * have to initialize it even though it shouldn't affect the result.
+			 */
+			if (stop_column < m)
+			{
+				prev[stop_column] = max_d + 1;
+				++stop_column;
+			}
+
+			/*
+			 * The main loop fills in curr, but curr[0] needs a special case: to
+			 * transform the first 0 characters of s into the first j characters
+			 * of t, we must perform j insertions.  However, if start_column > 0,
+			 * this special case does not apply.
+			 */
+			if (start_column == 0)
+			{
+				curr[0] = j * ins_c;
+				i = 1;
+			}
+			else
+				i = start_column;
+		}
+		else
+		{
+			curr[0] = j * ins_c;
+			i = 1;
+		}
+
+		/*
+		 * This inner loop is critical to performance, so we include a
+		 * fast-path to handle the (fairly common) case where no multibyte
+		 * characters are in the mix.  The fast-path is entitled to assume
+		 * that if s_char_len is not initialized then BOTH strings contain
+		 * only single-byte characters.
+		 */
+		if (s_char_len != NULL)
+		{
+			if (max_init < 0)
+				stop_column_local = m;
+			else
+				stop_column_local = stop_column;
+
+			for (; i < stop_column_local; i++)
+			{
+				int			ins;
+				int			del;
+				int			sub;
+				int			x_char_len = s_char_len[i - 1];
+
+				/*
+				 * Calculate costs for insertion, deletion, and substitution.
+				 *
+				 * When calculating cost for substitution, we compare the last
+				 * character of each possibly-multibyte character first,
+				 * because that's enough to rule out most mis-matches.  If we
+				 * get past that test, then we compare the lengths and the
+				 * remaining bytes.
+				 */
+				ins = prev[i] + ins_c;
+				del = curr[i - 1] + del_c;
+				if (x[x_char_len - 1] == y[y_char_len - 1]
+					&& x_char_len == y_char_len &&
+					(x_char_len == 1 || rest_of_char_same(x, y, x_char_len)))
+					sub = prev[i - 1];
+				else
+					sub = prev[i - 1] + sub_c;
+
+				/* Take the one with minimum cost. */
+				curr[i] = Min(ins, del);
+				curr[i] = Min(curr[i], sub);
+
+				/* Point to next character. */
+				x += x_char_len;
+			}
+		}
+		else
+		{
+			if (max_init < 0)
+				stop_column_local = m;
+			else
+				stop_column_local = stop_column;
+
+			for (; i < stop_column_local; i++)
+			{
+				int			ins;
+				int			del;
+				int			sub;
+
+				/* Calculate costs for insertion, deletion, and substitution. */
+				ins = prev[i] + ins_c;
+				del = curr[i - 1] + del_c;
+				sub = prev[i - 1] + ((*x == *y) ? 0 : sub_c);
+
+				/* Take the one with minimum cost. */
+				curr[i] = Min(ins, del);
+				curr[i] = Min(curr[i], sub);
+
+				/* Point to next character. */
+				x++;
+			}
+		}
+
+		/* Swap current row with previous row. */
+		temp = curr;
+		curr = prev;
+		prev = temp;
+
+		/* Point to next character. */
+		y += y_char_len;
+
+		/*
+		 * This chunk of code represents a significant performance hit if used
+		 * in the case where there is no max_d bound.  This is probably not
+		 * because the max_d >= 0 test itself is expensive, but rather because
+		 * the possibility of needing to execute this code prevents tight
+		 * optimization of the loop as a whole.
+		 */
+		if (max_init >= 0 && max_d >= 0)
+		{
+			/*
+			 * The "zero point" is the column of the current row where the
+			 * remaining portions of the strings are of equal length.  There
+			 * are (n - 1) characters in the target string, of which j have
+			 * been transformed.  There are (m - 1) characters in the source
+			 * string, so we want to find the value for zp where (n - 1) - j =
+			 * (m - 1) - zp.
+			 */
+			int			zp = j - (n - m);
+
+			/* Check whether the stop column can slide left. */
+			while (stop_column > 0)
+			{
+				int			ii = stop_column - 1;
+				int			net_inserts = ii - zp;
+
+				if (prev[ii] + (net_inserts > 0 ? net_inserts * ins_c :
+								-net_inserts * del_c) <= max_d)
+					break;
+				stop_column--;
+			}
+
+			/* Check whether the start column can slide right. */
+			while (start_column < stop_column)
+			{
+				int			net_inserts = start_column - zp;
+
+				if (prev[start_column] +
+					(net_inserts > 0 ? net_inserts * ins_c :
+					 -net_inserts * del_c) <= max_d)
+					break;
+
+				/*
+				 * We'll never again update these values, so we must make sure
+				 * there's nothing here that could confuse any future
+				 * iteration of the outer loop.
+				 */
+				prev[start_column] = max_d + 1;
+				curr[start_column] = max_d + 1;
+				if (start_column != 0)
+					source += (s_char_len != NULL) ? s_char_len[start_column - 1] : 1;
+				start_column++;
+			}
+
+			/* If they cross, we're going to exceed the bound. */
+			if (start_column >= stop_column)
+				return max_d + 1;
+		}
+	}
+
+	/*
+	 * Because the final value was swapped from the previous row to the
+	 * current row, that's where we'll find it.
+	 */
+	return prev[m - 1];
+}
+
+/*
+ * levenshtein_internal
+ * levenshtein_less_equal_internal
+ *
+ * Internal procedures for Levenshtein distance calculation.
+ */
+int
+levenshtein_less_equal_internal(const char *source, int slen, const char *target,
+						int tlen, int ins_c, int del_c, int sub_c, int max_d)
+{
+	return levenshtein_common(source, slen, target, tlen, ins_c,
+						del_c, sub_c, max_d);
+}
+
+int
+levenshtein_internal(const char *source, int slen, const char *target, int tlen,
+			 int ins_c, int del_c, int sub_c)
+{
+	return levenshtein_common(source, slen, target, tlen, ins_c,
+						del_c, sub_c, -1);
+}
+
+/*
+ * levenshtein
+ *
+ * Calculate the Levenshtein distance of two strings.
+ */
+Datum
+levenshtein(PG_FUNCTION_ARGS)
+{
+	text	   *src = PG_GETARG_TEXT_PP(0);
+	text	   *dst = PG_GETARG_TEXT_PP(1);
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes, t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(levenshtein_internal(s_data, s_bytes, t_data,
+			t_bytes, 1, 1, 1));
+}
+
+/*
+ * levenshtein_with_costs
+ *
+ * Improved version of levenshtein with costs for character insertion,
+ * deletion and substitution.
+ */
+Datum
+levenshtein_with_costs(PG_FUNCTION_ARGS)
+{
+	text	   *src = PG_GETARG_TEXT_PP(0);
+	text	   *dst = PG_GETARG_TEXT_PP(1);
+	int			ins_c = PG_GETARG_INT32(2);
+	int			del_c = PG_GETARG_INT32(3);
+	int			sub_c = PG_GETARG_INT32(4);
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes, t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(levenshtein_internal(s_data, s_bytes, t_data,
+			t_bytes, ins_c, del_c, sub_c));
+}
+
+/*
+ * levenshtein_less_equal
+ *
+ * Accelerated version of levenshtein for low distances.
+ */
+Datum
+levenshtein_less_equal(PG_FUNCTION_ARGS)
+{
+	text		*src = PG_GETARG_TEXT_PP(0);
+	text		*dst = PG_GETARG_TEXT_PP(1);
+	int			 max_d = PG_GETARG_INT32(2);
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes, t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(levenshtein_less_equal_internal(s_data, s_bytes,
+			t_data, t_bytes, 1, 1, 1, max_d));
+}
+
+/*
+ * levenshtein_less_equal_with_costs
+ *
+ * Accelerated version of levenshtein for low distances with costs for
+ * character insertion, deletion and substitution.
+ */
+Datum
+levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS)
+{
+	text	   *src = PG_GETARG_TEXT_PP(0);
+	text	   *dst = PG_GETARG_TEXT_PP(1);
+	int			ins_c = PG_GETARG_INT32(2);
+	int			del_c = PG_GETARG_INT32(3);
+	int			sub_c = PG_GETARG_INT32(4);
+	int			max_d = PG_GETARG_INT32(5);
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes, t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(levenshtein_less_equal_internal(s_data, s_bytes,
+			t_data, t_bytes, ins_c, del_c, sub_c, max_d));
+}
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 0af1248..e7bd6ff 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4973,6 +4973,16 @@ DESCR("peek at changes from replication slot");
 DATA(insert OID = 3785 (  pg_logical_slot_peek_binary_changes PGNSP PGUID 12 1000 1000 25 0 f f f f f t v 4 0 2249 "19 3220 23 1009" "{19,3220,23,1009,3220,28,17}" "{i,i,i,v,o,o,o}" "{slot_name,upto_lsn,upto_nchanges,options,location,xid,data}" _null_ pg_logical_slot_peek_binary_changes _null_ _null_ _null_ ));
 DESCR("peek at binary changes from replication slot");
 
+/* levenshtein distance */
+DATA(insert OID = 3366 ( levenshtein	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 23 "25 25" _null_ _null_ _null_ _null_ levenshtein _null_ _null_ _null_));
+DESCR("Levenshtein distance between two strings");
+DATA(insert OID = 3367 ( levenshtein	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 5 0 23 "25 25 23 23 23" _null_ _null_ _null_ _null_ levenshtein_with_costs _null_ _null_ _null_));
+DESCR("Levenshtein distance between two strings with costs");
+DATA(insert OID = 3368 ( levenshtein_less_equal	PGNSP PGUID 12 1 0 0 0 f f f f t f i 3 0 23 "25 25 23" _null_ _null_ _null_ _null_ levenshtein_less_equal _null_ _null_ _null_));
+DESCR("Less-equal Levenshtein distance between two strings");
+DATA(insert OID = 3369 ( levenshtein_less_equal PGNSP PGUID 12 1 0 0 0 f f f f t f i 6 0 23 "25 25 23 23 23 23" _null_ _null_ _null_ _null_ levenshtein_less_equal_with_costs _null_ _null_ _null_));
+DESCR("Less-equal Levenshtein distance between two strings with costs");
+
 /* event triggers */
 DATA(insert OID = 3566 (  pg_event_trigger_dropped_objects		PGNSP PGUID 12 10 100 0 0 f f f f t t s 0 0 2249 "" "{26,26,23,25,25,25,25}" "{o,o,o,o,o,o,o}" "{classid, objid, objsubid, object_type, schema_name, object_name, object_identity}" _null_ pg_event_trigger_dropped_objects _null_ _null_ _null_ ));
 DESCR("list objects dropped by the current command");
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index bbb5d39..b468c3c 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -851,6 +851,12 @@ extern Datum cidrecv(PG_FUNCTION_ARGS);
 extern Datum cidsend(PG_FUNCTION_ARGS);
 extern Datum cideq(PG_FUNCTION_ARGS);
 
+/* levenshtein.c */
+extern Datum levenshtein(PG_FUNCTION_ARGS);
+extern Datum levenshtein_with_costs(PG_FUNCTION_ARGS);
+extern Datum levenshtein_less_equal(PG_FUNCTION_ARGS);
+extern Datum levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS);
+
 /* like.c */
 extern Datum namelike(PG_FUNCTION_ARGS);
 extern Datum namenlike(PG_FUNCTION_ARGS);
diff --git a/src/include/utils/levenshtein.h b/src/include/utils/levenshtein.h
new file mode 100644
index 0000000..6fa01d0
--- /dev/null
+++ b/src/include/utils/levenshtein.h
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * levenshtein.h
+ *	  Header file for the Levenshtein distance functions.
+ *
+ * Copyright (c) 2001-2014, PostgreSQL Global Development Group
+ *
+ * src/include/utils/levenshtein.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef LEVENSHTEIN_H
+#define LEVENSHTEIN_H
+
+#include "postgres.h"
+
+/* Internal functions */
+extern int levenshtein_less_equal_internal(const char *source, int slen,
+								  const char *target, int tlen,
+								  int ins_c, int del_c, int sub_c, int max_d);
+
+extern int levenshtein_internal(const char *source, int slen,
+					   const char *target, int tlen,
+					   int ins_c, int del_c, int sub_c);
+
+#endif
diff --git a/src/test/regress/expected/levenshtein.out b/src/test/regress/expected/levenshtein.out
new file mode 100644
index 0000000..e5a77c3
--- /dev/null
+++ b/src/test/regress/expected/levenshtein.out
@@ -0,0 +1,27 @@
+--
+-- LEVENSHTEIN
+--
+SELECT levenshtein('GUMBO', 'GAMBOL');
+ levenshtein 
+-------------
+           2
+(1 row)
+
+SELECT levenshtein('GUMBO', 'GAMBOL', 2, 1, 1);
+ levenshtein 
+-------------
+           3
+(1 row)
+
+SELECT levenshtein_less_equal('extensive', 'exhaustive', 2);
+ levenshtein_less_equal 
+------------------------
+                      3
+(1 row)
+
+SELECT levenshtein_less_equal('extensive', 'exhaustive', 1, 1, 1, 4);
+ levenshtein_less_equal 
+------------------------
+                      4
+(1 row)
+
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index c0416f4..5faf182 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -13,7 +13,7 @@ test: tablespace
 # ----------
 # The first group of parallel tests
 # ----------
-test: boolean char name varchar text int2 int4 int8 oid float4 float8 bit numeric txid uuid enum money rangetypes pg_lsn regproc
+test: boolean char name varchar text int2 int4 int8 oid float4 float8 bit numeric txid uuid enum money rangetypes pg_lsn regproc levenshtein
 
 # Depends on things setup during char, varchar and text
 test: strings
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 16a1905..e980619 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -21,6 +21,7 @@ test: money
 test: rangetypes
 test: pg_lsn
 test: regproc
+test: levenshtein
 test: strings
 test: numerology
 test: point
diff --git a/src/test/regress/sql/levenshtein.sql b/src/test/regress/sql/levenshtein.sql
new file mode 100644
index 0000000..7806bde
--- /dev/null
+++ b/src/test/regress/sql/levenshtein.sql
@@ -0,0 +1,8 @@
+--
+-- LEVENSHTEIN
+--
+
+SELECT levenshtein('GUMBO', 'GAMBOL');
+SELECT levenshtein('GUMBO', 'GAMBOL', 2, 1, 1);
+SELECT levenshtein_less_equal('extensive', 'exhaustive', 2);
+SELECT levenshtein_less_equal('extensive', 'exhaustive', 1, 1, 1, 4);
-- 
2.0.1
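
For readers following the first patch, the two-row technique described in the levenshtein.c header comment can be sketched in isolation. This is only an illustration, not the patch's code: it assumes single-byte, NUL-terminated strings, uses malloc() rather than palloc(), leaves out the max_d bound and the multibyte path entirely, and the names lev_two_row and min3 are made up for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not in the patch): smallest of three ints. */
static int
min3(int a, int b, int c)
{
	int			m = a < b ? a : b;

	return m < c ? m : c;
}

/*
 * lev_two_row (illustrative name): cost of transforming s into t using only
 * two rows of the notional (m+1) x (n+1) matrix.  Single-byte characters
 * only.  Returns -1 if memory cannot be allocated.
 */
int
lev_two_row(const char *s, const char *t, int ins_c, int del_c, int sub_c)
{
	int			m,
				n,
				i,
				j,
				result;
	int		   *rows,
			   *prev,
			   *curr,
			   *tmp;

	assert(s != NULL && t != NULL);
	m = (int) strlen(s);
	n = (int) strlen(t);

	/* Empty s needs n insertions; empty t needs m deletions. */
	if (m == 0)
		return n * ins_c;
	if (n == 0)
		return m * del_c;

	/* One allocation holds both the "previous" and "current" rows. */
	rows = malloc(2 * (size_t) (m + 1) * sizeof(int));
	if (rows == NULL)
		return -1;
	prev = rows;
	curr = rows + (m + 1);

	/* First i characters of s into zero characters of t: i deletions. */
	for (i = 0; i <= m; i++)
		prev[i] = i * del_c;

	for (j = 1; j <= n; j++)
	{
		/* Zero characters of s into first j characters of t: j insertions. */
		curr[0] = j * ins_c;
		for (i = 1; i <= m; i++)
		{
			int			ins = prev[i] + ins_c;
			int			del = curr[i - 1] + del_c;
			int			sub = prev[i - 1] + (s[i - 1] == t[j - 1] ? 0 : sub_c);

			curr[i] = min3(ins, del, sub);
		}

		/* Swap rows; the row just filled becomes "previous". */
		tmp = curr;
		curr = prev;
		prev = tmp;
	}

	/* After the final swap the answer sits in the "previous" row. */
	result = prev[m];
	free(rows);
	return result;
}
```

With unit costs, lev_two_row("GUMBO", "GAMBOL", 1, 1, 1) gives 2, and with costs (2, 1, 1) it gives 3, matching the regression test output in the patch above.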

Attachment: 0002-Support-for-column-hints.patch (text/x-diff; charset=US-ASCII)
From 7574945acced9af579e32f5e216483a1a50e0218 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Thu, 17 Jul 2014 21:58:25 +0900
Subject: [PATCH 2/2] Support for column hints

If a query references a column name that does not exist, the system
checks whether any column of the existing RTEs is close in distance
to the mistyped name and, if so, returns a hint to the user
suggesting that column.
---
 src/backend/parser/parse_expr.c           |   9 +-
 src/backend/parser/parse_func.c           |   2 +-
 src/backend/parser/parse_relation.c       | 318 ++++++++++++++++++++++++++----
 src/include/parser/parse_relation.h       |   3 +-
 src/test/regress/expected/alter_table.out |   8 +
 src/test/regress/expected/join.out        |  39 ++++
 src/test/regress/expected/plpgsql.out     |   1 +
 src/test/regress/expected/rowtypes.out    |   1 +
 src/test/regress/expected/rules.out       |   1 +
 src/test/regress/expected/without_oid.out |   1 +
 src/test/regress/sql/join.sql             |  24 +++
 11 files changed, 366 insertions(+), 41 deletions(-)
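
The general idea behind the hint (pick the candidate column whose name is closest to what was typed, and stay silent if nothing is close enough) can be sketched outside the parser. This is a hedged illustration only: name_distance, closest_column, MAX_NAME, and the max_d cutoff are all assumptions invented for the example, it handles single-byte names with unit costs, and it does not reproduce how the patch walks RTEs or chooses its threshold.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAME 64				/* assumption: mirrors the default NAMEDATALEN */

/*
 * name_distance (hypothetical): unit-cost Levenshtein distance between two
 * single-byte names, using two fixed-size rows.  Names longer than MAX_NAME
 * are treated as infinitely far apart.
 */
static int
name_distance(const char *s, const char *t)
{
	int			row[2][MAX_NAME + 1];
	int			m = (int) strlen(s);
	int			n = (int) strlen(t);
	int			i,
				j;

	if (m > MAX_NAME || n > MAX_NAME)
		return INT_MAX;

	for (i = 0; i <= m; i++)
		row[0][i] = i;

	for (j = 1; j <= n; j++)
	{
		int		   *prev = row[(j - 1) & 1];
		int		   *curr = row[j & 1];

		curr[0] = j;
		for (i = 1; i <= m; i++)
		{
			/* Start from substitution (free on a match), then compare. */
			int			best = prev[i - 1] + (s[i - 1] != t[j - 1]);

			if (prev[i] + 1 < best)
				best = prev[i] + 1;		/* insertion */
			if (curr[i - 1] + 1 < best)
				best = curr[i - 1] + 1; /* deletion */
			curr[i] = best;
		}
	}
	return row[n & 1][m];
}

/*
 * closest_column (hypothetical): return the candidate nearest to "typed",
 * or NULL if every candidate is farther away than max_d.  Ties go to the
 * earliest candidate in the list.
 */
const char *
closest_column(const char *typed, const char *const *cols, int ncols,
			   int max_d)
{
	const char *best = NULL;
	int			best_d = max_d + 1;
	int			k;

	for (k = 0; k < ncols; k++)
	{
		int			d = name_distance(typed, cols[k]);

		if (d < best_d)
		{
			best_d = d;
			best = cols[k];
		}
	}
	return best;
}
```

Given a made-up candidate list containing "orderlineid", "orderid", and "orderdate", a lookup of "orderids" with max_d = 3 selects "orderid", which is the kind of suggestion shown in the sample session at the top of the thread. Returning NULL when nothing is within the cutoff corresponds to emitting no hint at all.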

diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 4a8aaf6..9866198 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -621,7 +621,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field2);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										NULL, NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
@@ -666,7 +667,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field3);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										NULL, NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
@@ -724,7 +726,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field4);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										NULL, NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 9ebd3fd..e128adf 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -1779,7 +1779,7 @@ ParseComplexProjection(ParseState *pstate, char *funcname, Node *first_arg,
 									 ((Var *) first_arg)->varno,
 									 ((Var *) first_arg)->varlevelsup);
 		/* Return a Var if funcname matches a column, else NULL */
-		return scanRTEForColumn(pstate, rte, funcname, location);
+		return scanRTEForColumn(pstate, rte, funcname, location, NULL, NULL);
 	}
 
 	/*
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 478584d..2838f89 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include <ctype.h>
+#include <limits.h>
 
 #include "access/htup_details.h"
 #include "access/sysattr.h"
@@ -28,6 +29,7 @@
 #include "parser/parse_relation.h"
 #include "parser/parse_type.h"
 #include "utils/builtins.h"
+#include "utils/levenshtein.h"
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
@@ -520,6 +522,22 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
 }
 
 /*
+ * distanceName
+ *	  Return Levenshtein distance between an actual column name and possible
+ *	  partial match.
+ */
+static int
+distanceName(const char *actual, const char *match, int max)
+{
+	int len = strlen(actual),
+		match_len = strlen(match);
+
+	/* Charge half as much per deletion as per insertion or per substitution */
+	return levenshtein_less_equal_internal(actual, len, match, match_len,
+								   2, 1, 2, max);
+}
+
+/*
  * scanRTEForColumn
  *	  Search the column names of a single RTE for the given name.
  *	  If found, return an appropriate Var node, else return NULL.
@@ -527,10 +545,24 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
  *
  * Side effect: if we find a match, mark the RTE as requiring read access
  * for the column.
+ *
+ * For those callers that will settle for a fuzzy match (for the purposes of
+ * building diagnostic messages), we match the column attribute whose name has
+ * the lowest Levenshtein distance from colname, setting *closest and
+ * *distance.  Such callers should not rely on the return value (even when
+ * there is an exact match), nor should they expect the usual side effect
+ * (unless there is an exact match).  This hardly matters in practice, since an
+ * error is imminent.
+ *
+ * If there are two or more attributes in the range table entry tied for
+ * closest, accurately report the shortest distance found overall, while not
+ * setting a "closest" attribute on the assumption that only a per-entry single
+ * closest match is useful.  Note that we never consider system column names
+ * when performing fuzzy matching.
  */
 Node *
 scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
-				 int location)
+				 int location, AttrNumber *closest, int *distance)
 {
 	Node	   *result = NULL;
 	int			attnum = 0;
@@ -548,12 +580,16 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
 	 * Should this somehow go wrong and we try to access a dropped column,
 	 * we'll still catch it by virtue of the checks in
 	 * get_rte_attribute_type(), which is called by make_var().  That routine
-	 * has to do a cache lookup anyway, so the check there is cheap.
+	 * has to do a cache lookup anyway, so the check there is cheap.  Callers
+	 * interested in finding match with shortest distance need to defend
+	 * against this directly, though.
 	 */
 	foreach(c, rte->eref->colnames)
 	{
+		const char *attcolname = strVal(lfirst(c));
+
 		attnum++;
-		if (strcmp(strVal(lfirst(c)), colname) == 0)
+		if (strcmp(attcolname, colname) == 0)
 		{
 			if (result)
 				ereport(ERROR,
@@ -566,6 +602,39 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
 			markVarForSelectPriv(pstate, var, rte);
 			result = (Node *) var;
 		}
+
+		if (distance && *distance != 0)
+		{
+			if (result)
+			{
+				/* Exact match just found */
+				*distance = 0;
+			}
+			else
+			{
+				int lowestdistance = *distance;
+				int thisdistance = distanceName(attcolname, colname,
+												lowestdistance);
+
+				if (thisdistance >= lowestdistance)
+				{
+					/*
+					 * This match distance may equal a prior match within this
+					 * same range table.  When that happens, the prior match is
+					 * discarded as worthless, since a single best match is
+					 * required within a RTE.
+					 */
+					if (thisdistance == lowestdistance)
+						*closest = InvalidAttrNumber;
+
+					continue;
+				}
+
+				/* Store new lowest observed distance for RT */
+				*distance = thisdistance;
+			}
+			*closest = attnum;
+		}
 	}
 
 	/*
@@ -642,7 +711,8 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
 				continue;
 
 			/* use orig_pstate here to get the right sublevels_up */
-			newresult = scanRTEForColumn(orig_pstate, rte, colname, location);
+			newresult = scanRTEForColumn(orig_pstate, rte, colname, location,
+										 NULL, NULL);
 
 			if (newresult)
 			{
@@ -668,8 +738,14 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
 
 /*
  * searchRangeTableForCol
- *	  See if any RangeTblEntry could possibly provide the given column name.
- *	  If so, return a pointer to the RangeTblEntry; else return NULL.
+ *	  See if any RangeTblEntry could possibly provide the given column name (or
+ *	  find the best match available).  Returns a list of equally likely
+ *	  candidates, or NIL in the event of no plausible candidate.
+ *
+ * Column name may be matched fuzzily; we provide the closest columns if there
+ * was not an exact match.  Caller can depend on passed closest array to find
+ * right attribute within corresponding (first and second) returned list RTEs.
+ * If closest attributes are InvalidAttrNumber, that indicates an exact match.
  *
  * This is different from colNameToVar in that it considers every entry in
  * the ParseState's rangetable(s), not only those that are currently visible
@@ -678,26 +754,145 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
  * matches, but only one will be returned).  This must be used ONLY as a
  * heuristic in giving suitable error messages.  See errorMissingColumn.
  */
-static RangeTblEntry *
-searchRangeTableForCol(ParseState *pstate, char *colname, int location)
+static List *
+searchRangeTableForCol(ParseState *pstate, const char *alias, char *colname,
+					   int location, AttrNumber closest[2])
 {
-	ParseState *orig_pstate = pstate;
+	ParseState	   *orig_pstate = pstate;
+	int				distance = INT_MAX;
+	List		   *matchedrte = NIL;
+	ListCell	   *l;
+	int				i;
 
 	while (pstate != NULL)
 	{
-		ListCell   *l;
-
 		foreach(l, pstate->p_rtable)
 		{
-			RangeTblEntry *rte = (RangeTblEntry *) lfirst(l);
+			RangeTblEntry  *rte = (RangeTblEntry *) lfirst(l);
+			AttrNumber		rteclosest = InvalidAttrNumber;
+			int				rtdistance = INT_MAX;
+			bool			wrongalias;
 
-			if (scanRTEForColumn(orig_pstate, rte, colname, location))
-				return rte;
+			/*
+			 * Get single best match from each RTE, or no match for RTE if
+			 * there is a tie for best match within a given RTE
+			 */
+			scanRTEForColumn(orig_pstate, rte, colname, location, &rteclosest,
+							 &rtdistance);
+
+			/* Was alias provided by user that does not match entry's alias? */
+			wrongalias = (alias && strcmp(alias, rte->eref->aliasname) != 0);
+
+			if (rtdistance == 0)
+			{
+				/* Exact match (for "wrong alias" or "wrong level" cases) */
+				closest[0] = wrongalias? rteclosest : InvalidAttrNumber;
+
+				/*
+				 * Any exact match is always the uncontested best match.  It
+				 * doesn't seem worth considering the case where there are
+				 * multiple exact matches, so we're done.
+				 */
+				matchedrte = lappend(NIL, rte);
+				return matchedrte;
+			}
+
+			/*
+			 * Charge extra (for inexact matches only) when an alias was
+			 * specified that differs from what might have been used to
+			 * correctly qualify this RTE's closest column
+			 */
+			if (wrongalias)
+				rtdistance += 3;
+
+			if (rteclosest != InvalidAttrNumber)
+			{
+				if (rtdistance >= distance)
+				{
+					/*
+					 * Perhaps record this attribute as being just as close in
+					 * distance to closest attribute observed so far across
+					 * entire range table.  Iff this distance is ultimately the
+					 * lowest distance observed overall, it may end up as the
+					 * second match.
+					 */
+					if (rtdistance == distance)
+					{
+						closest[1] = rteclosest;
+						matchedrte = lappend(matchedrte, rte);
+					}
+
+					continue;
+				}
+
+				/*
+				 * One best match (better than any others in previous RTEs) was
+				 * found within this RTE
+				 */
+				distance = rtdistance;
+				/* New uncontested best match */
+				matchedrte = lappend(NIL, rte);
+				closest[0] = rteclosest;
+			}
+			else
+			{
+				/*
+				 * Even though there were perhaps multiple joint-best matches
+				 * within this RTE (implying that there can be no attribute
+				 * suggestion from it), the shortest distance should still
+				 * serve as the distance for later RTEs to beat (but naturally
+				 * only if it happens to be the lowest so far across the entire
+				 * range table).
+				 */
+				distance = Min(distance, rtdistance);
+			}
 		}
 
 		pstate = pstate->parentParseState;
 	}
-	return NULL;
+
+	/*
+	 * Too many equally close partial matches found?
+	 *
+	 * It's useful to provide two matches for the common case where two range
+	 * tables each have one equally distant candidate column, as when an
+	 * unqualified (and therefore would-be ambiguous) column name is specified
+	 * which is also misspelled by the user.  It seems unhelpful to show no
+	 * hint when this occurs, since in practice one attribute probably
+	 * references the other in a foreign key relationship.  However, when there
+	 * are more than 2 range tables with equally distant matches that's
+	 * probably because the matches are not useful, so don't suggest anything.
+	 */
+	if (list_length(matchedrte) > 2)
+		return NIL;
+
+	/*
+	 * Handle dropped columns, which can appear here as empty colnames per
+	 * remarks within scanRTEForColumn().  If either the first or second
+	 * suggested attributes are dropped, do not provide any suggestion.
+	 */
+	i = 0;
+	foreach(l, matchedrte)
+	{
+		RangeTblEntry  *rte = (RangeTblEntry *) lfirst(l);
+		char		   *closestcol;
+
+		closestcol = strVal(list_nth(rte->eref->colnames, closest[i++] - 1));
+
+		if (strcmp(closestcol, "") == 0)
+			return NIL;
+	}
+
+	/*
+	 * Distance must be less than a normalized threshold in order to avoid
+	 * completely ludicrous suggestions.  Note that a distance of 6 will be
+	 * seen when 6 deletions are required against actual attribute name, or 3
+	 * insertions/substitutions.
+	 */
+	if (distance > 6 && distance > strlen(colname) * 2 / 2)
+		return NIL;
+
+	return matchedrte;
 }
 
 /*
@@ -2855,41 +3050,92 @@ errorMissingRTE(ParseState *pstate, RangeVar *relation)
 /*
  * Generate a suitable error about a missing column.
  *
- * Since this is a very common type of error, we work rather hard to
- * produce a helpful message.
+ * Since this is a very common type of error, we work rather hard to produce a
+ * helpful message, going so far as to guess user's intent when a missing
+ * column name is probably intended to reference one of two would-be ambiguous
+ * attributes (when no alias/qualification was provided).
  */
 void
 errorMissingColumn(ParseState *pstate,
 				   char *relname, char *colname, int location)
 {
-	RangeTblEntry *rte;
+	List		   *matchedrte;
+	AttrNumber	    closest[2];
+	RangeTblEntry  *rte1 = NULL,
+				   *rte2 = NULL;
+	char		   *closestcol1;
+	char		   *closestcol2;
 
 	/*
-	 * If relname was given, just play dumb and report it.  (In practice, a
-	 * bad qualification name should end up at errorMissingRTE, not here, so
-	 * no need to work hard on this case.)
+	 * closest[0] will remain InvalidAttrNumber in event of exact match, and in
+	 * the event of an exact match there is only ever one suggestion
 	 */
-	if (relname)
-		ereport(ERROR,
-				(errcode(ERRCODE_UNDEFINED_COLUMN),
-				 errmsg("column %s.%s does not exist", relname, colname),
-				 parser_errposition(pstate, location)));
+	closest[0] = closest[1] = InvalidAttrNumber;
 
 	/*
-	 * Otherwise, search the entire rtable looking for possible matches.  If
-	 * we find one, emit a hint about it.
+	 * Search the entire rtable looking for possible matches.  If we find one,
+	 * emit a hint about it.
 	 *
 	 * TODO: improve this code (and also errorMissingRTE) to mention using
 	 * LATERAL if appropriate.
 	 */
-	rte = searchRangeTableForCol(pstate, colname, location);
-
-	ereport(ERROR,
-			(errcode(ERRCODE_UNDEFINED_COLUMN),
-			 errmsg("column \"%s\" does not exist", colname),
-			 rte ? errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
-						   colname, rte->eref->aliasname) : 0,
-			 parser_errposition(pstate, location)));
+	matchedrte = searchRangeTableForCol(pstate, relname, colname, location,
+										closest);
+
+	/*
+	 * In practice a bad qualification name should end up at errorMissingRTE,
+	 * not here, so no need to work hard on this case.
+	 *
+	 * Extract RTEs for best match, if any, and joint best match, if any.
+	 */
+	if (matchedrte)
+	{
+		rte1 = (RangeTblEntry *) lfirst(list_head(matchedrte));
+
+		if (list_length(matchedrte) > 1)
+			rte2 = (RangeTblEntry *) lsecond(matchedrte);
+
+		if (rte1 && closest[0] != InvalidAttrNumber)
+			closestcol1 = strVal(list_nth(rte1->eref->colnames, closest[0] - 1));
+
+		if (rte2 && closest[1] != InvalidAttrNumber)
+			closestcol2 = strVal(list_nth(rte2->eref->colnames, closest[1] - 1));
+	}
+
+	if (!rte2)
+	{
+		/*
+		 * Handle case where there is zero or one column suggestions to hint,
+		 * including exact matches referenced but not visible.
+		 *
+		 * Infer an exact match referenced despite not being visible from the
+		 * fact that an attribute number was not passed back.
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_COLUMN),
+				 relname?
+				 errmsg("column %s.%s does not exist", relname, colname):
+				 errmsg("column \"%s\" does not exist", colname),
+				 rte1? closest[0] != InvalidAttrNumber?
+				 errhint("Perhaps you meant to reference the column \"%s\".\"%s\".",
+						 rte1->eref->aliasname, closestcol1):
+				 errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
+						 colname, rte1->eref->aliasname): 0,
+				 parser_errposition(pstate, location)));
+	}
+	else
+	{
+		/* Handle case where there are two equally useful column hints */
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_COLUMN),
+				 relname?
+				 errmsg("column %s.%s does not exist", relname, colname):
+				 errmsg("column \"%s\" does not exist", colname),
+				 errhint("Perhaps you meant to reference the column \"%s\".\"%s\" or the column \"%s\".\"%s\".",
+						 rte1->eref->aliasname, closestcol1,
+						 rte2->eref->aliasname, closestcol2),
+				 parser_errposition(pstate, location)));
+	}
 }
 
 
diff --git a/src/include/parser/parse_relation.h b/src/include/parser/parse_relation.h
index d8b9493..c18157a 100644
--- a/src/include/parser/parse_relation.h
+++ b/src/include/parser/parse_relation.h
@@ -35,7 +35,8 @@ extern RangeTblEntry *GetRTEByRangeTablePosn(ParseState *pstate,
 extern CommonTableExpr *GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte,
 			 int rtelevelsup);
 extern Node *scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte,
-				 char *colname, int location);
+				 char *colname, int location, AttrNumber *matchedatt,
+				 int *distance);
 extern Node *colNameToVar(ParseState *pstate, char *colname, bool localonly,
 			 int location);
 extern void markVarForSelectPriv(ParseState *pstate, Var *var,
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 9b89e58..77829dc 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -536,6 +536,7 @@ create table atacc1 ( test int );
 -- add a check constraint (fails)
 alter table atacc1 add constraint atacc_test1 check (test1>3);
 ERROR:  column "test1" does not exist
+HINT:  Perhaps you meant to reference the column "atacc1"."test".
 drop table atacc1;
 -- something a little more complicated
 create table atacc1 ( test int, test2 int, test3 int);
@@ -1342,6 +1343,7 @@ select f1 from c1;
 ERROR:  column "f1" does not exist
 LINE 1: select f1 from c1;
                ^
+HINT:  Perhaps you meant to reference the column "c1"."f2".
 drop table p1 cascade;
 NOTICE:  drop cascades to table c1
 create table p1 (f1 int, f2 int);
@@ -1355,6 +1357,7 @@ select f1 from c1;
 ERROR:  column "f1" does not exist
 LINE 1: select f1 from c1;
                ^
+HINT:  Perhaps you meant to reference the column "c1"."f2".
 drop table p1 cascade;
 NOTICE:  drop cascades to table c1
 create table p1 (f1 int, f2 int);
@@ -1479,6 +1482,7 @@ select oid > 0, * from altstartwith; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altstartwith;
                ^
+HINT:  Perhaps you meant to reference the column "altstartwith"."col".
 select * from altstartwith;
  col 
 -----
@@ -1515,10 +1519,12 @@ select oid > 0, * from altwithoid; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altwithoid;
                ^
+HINT:  Perhaps you meant to reference the column "altwithoid"."col".
 select oid > 0, * from altinhoid; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altinhoid;
                ^
+HINT:  Perhaps you meant to reference the column "altinhoid"."col".
 select * from altwithoid;
  col 
 -----
@@ -1554,6 +1560,7 @@ select oid > 0, * from altwithoid; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altwithoid;
                ^
+HINT:  Perhaps you meant to reference the column "altwithoid"."col".
 select oid > 0, * from altinhoid;
  ?column? | col 
 ----------+-----
@@ -1580,6 +1587,7 @@ select oid > 0, * from altwithoid; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altwithoid;
                ^
+HINT:  Perhaps you meant to reference the column "altwithoid"."col".
 select oid > 0, * from altinhoid;
  ?column? | col 
 ----------+-----
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 1cb1c51..f4edcbe 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2222,6 +2222,12 @@ select * from t1 left join t2 on (t1.a = t2.a);
  200 | 1000 | 200 | 2001
 (5 rows)
 
+-- Test matching of column name with wrong alias
+select t1.x from t1 join t3 on (t1.a = t3.x);
+ERROR:  column t1.x does not exist
+LINE 1: select t1.x from t1 join t3 on (t1.a = t3.x);
+               ^
+HINT:  Perhaps you meant to reference the column "t3"."x".
 --
 -- regression test for 8.1 merge right join bug
 --
@@ -3388,6 +3394,39 @@ select * from
 (0 rows)
 
 --
+-- Test hints given on incorrect column references are useful
+--
+select t1.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestipn
+ERROR:  column t1.uunique1 does not exist
+LINE 1: select t1.uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t1"."unique1".
+select t2.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+ERROR:  column t2.uunique1 does not exist
+LINE 1: select t2.uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t2"."unique1".
+select uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+ERROR:  column "uunique1" does not exist
+LINE 1: select uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t1"."unique1" or the column "t2"."unique1".
+--
+-- Take care to reference the correct RTE
+--
+select atts.relid::regclass, s.* from pg_stats s join
+    pg_attribute a on s.attname = a.attname and s.tablename =
+    a.attrelid::regclass::text join (select unnest(indkey) attnum,
+    indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+    schemaname != 'pg_catalog';
+ERROR:  column atts.relid does not exist
+LINE 1: select atts.relid::regclass, s.* from pg_stats s join
+               ^
+HINT:  Perhaps you meant to reference the column "atts"."indexrelid".
+--
 -- Test LATERAL
 --
 select unique2, x.*
diff --git a/src/test/regress/expected/plpgsql.out b/src/test/regress/expected/plpgsql.out
index 8892bb4..2cb4aa1 100644
--- a/src/test/regress/expected/plpgsql.out
+++ b/src/test/regress/expected/plpgsql.out
@@ -4771,6 +4771,7 @@ END$$;
 ERROR:  column "foo" does not exist
 LINE 1: SELECT rtrim(roomno) AS roomno, foo FROM Room ORDER BY roomn...
                                         ^
+HINT:  Perhaps you meant to reference the column "room"."roomno".
 QUERY:  SELECT rtrim(roomno) AS roomno, foo FROM Room ORDER BY roomno
 CONTEXT:  PL/pgSQL function inline_code_block line 4 at FOR over SELECT rows
 -- Check handling of errors thrown from/into anonymous code blocks.
diff --git a/src/test/regress/expected/rowtypes.out b/src/test/regress/expected/rowtypes.out
index 88e7bfa..19a6e98 100644
--- a/src/test/regress/expected/rowtypes.out
+++ b/src/test/regress/expected/rowtypes.out
@@ -452,6 +452,7 @@ select fullname.text from fullname;  -- error
 ERROR:  column fullname.text does not exist
 LINE 1: select fullname.text from fullname;
                ^
+HINT:  Perhaps you meant to reference the column "fullname"."last".
 -- same, but RECORD instead of named composite type:
 select cast (row('Jim', 'Beam') as text);
     row     
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ca56b47..48c75fd 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2368,6 +2368,7 @@ select xmin, * from fooview;  -- fail, views don't have such a column
 ERROR:  column "xmin" does not exist
 LINE 1: select xmin, * from fooview;
                ^
+HINT:  Perhaps you meant to reference the column "fooview"."x".
 select reltoastrelid, relkind, relfrozenxid
   from pg_class where oid = 'fooview'::regclass;
  reltoastrelid | relkind | relfrozenxid 
diff --git a/src/test/regress/expected/without_oid.out b/src/test/regress/expected/without_oid.out
index cb2c0c0..fbff011 100644
--- a/src/test/regress/expected/without_oid.out
+++ b/src/test/regress/expected/without_oid.out
@@ -46,6 +46,7 @@ SELECT count(oid) FROM wo;
 ERROR:  column "oid" does not exist
 LINE 1: SELECT count(oid) FROM wo;
                      ^
+HINT:  Perhaps you meant to reference the column "wo"."i".
 VACUUM ANALYZE wi;
 VACUUM ANALYZE wo;
 SELECT min(relpages) < max(relpages), min(reltuples) - max(reltuples)
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index fa3e068..4d60f9e 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -397,6 +397,10 @@ insert into t2a values (200, 2001);
 
 select * from t1 left join t2 on (t1.a = t2.a);
 
+-- Test matching of column name with wrong alias
+
+select t1.x from t1 join t3 on (t1.a = t3.x);
+
 --
 -- regression test for 8.1 merge right join bug
 --
@@ -1047,6 +1051,26 @@ select * from
   int8_tbl x join (int4_tbl x cross join int4_tbl y(ff)) j on q1 = f1; -- ok
 
 --
+-- Test hints given on incorrect column references are useful
+--
+
+select t1.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestipn
+select t2.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+select uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+
+--
+-- Take care to reference the correct RTE
+--
+
+select atts.relid::regclass, s.* from pg_stats s join
+    pg_attribute a on s.attname = a.attname and s.tablename =
+    a.attrelid::regclass::text join (select unnest(indkey) attnum,
+    indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+    schemaname != 'pg_catalog';
+--
 -- Test LATERAL
 --
 
-- 
2.0.1
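
For readers following the cost scheme in distanceName() and the tie handling in scanRTEForColumn(), here is a minimal standalone sketch. It is not the patch's code: sketch_distance(), sketch_closest_column(), SKETCH_MAX_NAME and the max_distance parameter are hypothetical names made up for illustration, and the plain cutoff stands in for the patch's threshold, which also takes the length of the name into account. The costs (2 per insertion, 1 per deletion, 2 per substitution) and the argument order (real column name as source, the user's spelling as target) follow the patch.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define SKETCH_MAX_NAME 64      /* hypothetical cap on name length */

/*
 * Weighted edit distance for turning "source" into "target".
 * Illustrative only; not the routine the patch calls.
 */
int
sketch_distance(const char *source, const char *target,
                int ins_c, int del_c, int sub_c)
{
    int     m = (int) strlen(source);
    int     n = (int) strlen(target);
    int     prev[SKETCH_MAX_NAME + 1];
    int     curr[SKETCH_MAX_NAME + 1];
    int     i, j;

    assert(m <= SKETCH_MAX_NAME && n <= SKETCH_MAX_NAME);

    /* Empty source prefix: reach each target prefix by insertions alone */
    for (j = 0; j <= n; j++)
        prev[j] = j * ins_c;

    for (i = 1; i <= m; i++)
    {
        /* Empty target prefix: delete every source character seen so far */
        curr[0] = i * del_c;
        for (j = 1; j <= n; j++)
        {
            int     sub = prev[j - 1] +
                          (source[i - 1] == target[j - 1] ? 0 : sub_c);
            int     del = prev[j] + del_c;
            int     ins = curr[j - 1] + ins_c;
            int     best = sub < del ? sub : del;

            curr[j] = best < ins ? best : ins;
        }
        memcpy(prev, curr, (n + 1) * sizeof(int));
    }
    return prev[n];
}

/*
 * Return the index of the single closest column name, or -1 if there is a
 * tie for closest or the best distance exceeds max_distance (hypothetical
 * simplified cutoff).
 */
int
sketch_closest_column(const char *typed, const char *const *cols, int ncols,
                      int max_distance)
{
    int     best = INT_MAX;
    int     best_idx = -1;
    int     i;

    for (i = 0; i < ncols; i++)
    {
        /* Source is the real column name, target is what the user typed */
        int     d = sketch_distance(cols[i], typed, 2, 1, 2);

        if (d < best)
        {
            best = d;
            best_idx = i;
        }
        else if (d == best)
            best_idx = -1;      /* tie: no single best match */
    }
    return (best <= max_distance) ? best_idx : -1;
}
```

With these costs, going from "indexrelid" to the typed "relid" is five deletions for a distance of 5, while "unique1" to "uunique1" is one insertion for a distance of 2. Two candidates such as "f2" and "f3" are equally far from "f1", so the sketch reports no single best match.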

#67Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#64)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Jul 17, 2014 at 9:34 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Patch 1 does a couple of things:
- fuzzystrmatch is bumped to 1.1, as the Levenshtein functions are no longer
part of it and have moved to core.
- Removal of the LESS_EQUAL flag that made the original submission patch
harder to understand. All the Levenshtein functions wrap a single common
function.
- Documentation is moved, and regression tests for Levenshtein functions are
added.
- Functions with costs are renamed with a suffix with costs.
After hacking on this feature, I came to the conclusion that it would be
better for the user experience to move all the Levenshtein functions directly
into backend code, instead of moving only the common wrapper as Peter
did in his original patches. This avoids keeping portions
of the same feature in two different places in the code (backend with the
common routine, fuzzystrmatch with the levenshtein functions) and concentrates
all the logic in a single place. Now, we may as well consider renaming the
levenshtein functions to something smarter, like str_distance, and keeping
fuzzystrmatch at 1.0, with the levenshtein_* functions only calling the
str_distance functions.

This is not cool. Anyone who is running a 9.4 or earlier database
using fuzzystrmatch and upgrades, either via dump-and-restore or
pg_upgrade, to a version with this patch applied will have a broken
database. They will still have the catalog entries for the 1.0
definitions, but those definitions won't be resolvable inside the new
cluster's .so file. The user will get a fairly-unfriendly error
message that won't go away until they upgrade the extension, which may
involve dealing with dependency hell since the new definitions are in
a different place than the old definitions, and there may be
dependencies on the old definitions. One of the great advantages of
extension packaging is that this kind of problem is quite easily
avoidable, so let's avoid it.

There are several possible methods of doing that, but I think the best
one is just to leave the SQL-callable C functions in fuzzystrmatch and
move only the underlying code that supports them into core. Then, the
whole thing will be completely transparent to users. They won't need
to upgrade their fuzzystrmatch definitions at all, and everything will
just work; under the covers, the fuzzystrmatch code will now be
calling into core code rather than to code located in that same
module, but the user doesn't need to know or care about that.
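
The arrangement described above can be sketched in plain standalone C. This is only an illustration of the layering, under stated assumptions: core_edit_distance(), ext_levenshtein() and SKETCH_MAX_LEN are hypothetical names, and real SQL-callable functions would use the backend's function-manager interface rather than bare C strings. The point is that the entry point the extension exposes keeps its name and signature, while the actual work happens in a shared routine.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_MAX_LEN 255      /* hypothetical cap, for this sketch only */

/*
 * "Core" side: hypothetical shared routine taking explicit lengths.
 * Plain unit-cost Levenshtein distance.
 */
int
core_edit_distance(const char *s, int slen, const char *t, int tlen)
{
    int     prev[SKETCH_MAX_LEN + 1];
    int     curr[SKETCH_MAX_LEN + 1];
    int     i, j;

    assert(slen >= 0 && slen <= SKETCH_MAX_LEN);
    assert(tlen >= 0 && tlen <= SKETCH_MAX_LEN);

    for (j = 0; j <= tlen; j++)
        prev[j] = j;

    for (i = 1; i <= slen; i++)
    {
        curr[0] = i;
        for (j = 1; j <= tlen; j++)
        {
            int     sub = prev[j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);
            int     del = prev[j] + 1;
            int     ins = curr[j - 1] + 1;
            int     best = sub < del ? sub : del;

            curr[j] = best < ins ? best : ins;
        }
        memcpy(prev, curr, (tlen + 1) * sizeof(int));
    }
    return prev[tlen];
}

/*
 * "Extension" side: the entry point callers already know keeps its name
 * and signature, and simply forwards to the shared routine.
 */
int
ext_levenshtein(const char *s, const char *t)
{
    return core_edit_distance(s, (int) strlen(s), t, (int) strlen(t));
}
```

Callers of ext_levenshtein() see no change when the implementation moves, which is the transparency argued for above.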

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#68Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#67)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Robert Haas <robertmhaas@gmail.com> writes:

There are several possible methods of doing that, but I think the best
one is just to leave the SQL-callable C functions in fuzzystrmatch and
move only the underlying code that supports them into core.

I hadn't been paying close attention to this thread, but I'd just assumed
that that would be the approach.

It might be worth introducing new differently-named pg_proc entries for
the same functions in core, but only if we can agree that there are better
names for them than what the extension uses.

regards, tom lane

#69Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#67)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Jul 23, 2014 at 8:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:

There are several possible methods of doing that, but I think the best
one is just to leave the SQL-callable C functions in fuzzystrmatch and
move only the underlying code that supports them into core.

For some reason I thought that that was what Michael was proposing - a
more comprehensive move of code into core than the structuring that I
proposed. I actually thought about a Levenshtein distance operator at
one point months ago, before I entirely gave up on that. The
MAX_LEVENSHTEIN_STRLEN limitation made me think that the Levenshtein
distance functions are not suitable for core as is (although that
doesn't matter for my purposes, since all I need is something that
accommodates NAMEDATALEN sized strings). MAX_LEVENSHTEIN_STRLEN is a
considerable limitation for an in-core feature. I didn't get around to
forming an opinion on how and if that should be fixed.

--
Peter Geoghegan

#70Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Geoghegan (#69)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Peter Geoghegan wrote:

For some reason I thought that that was what Michael was proposing - a
more comprehensive move of code into core than the structuring that I
proposed. I actually thought about a Levenshtein distance operator at
one point months ago, before I entirely gave up on that. The
MAX_LEVENSHTEIN_STRLEN limitation made me think that the Levenshtein
distance functions are not suitable for core as is (although that
doesn't matter for my purposes, since all I need is something that
accommodates NAMEDATALEN sized strings). MAX_LEVENSHTEIN_STRLEN is a
considerable limitation for an in-core feature. I didn't get around to
forming an opinion on how and if that should be fixed.

I had two thoughts:

1. Should we consider making levenshtein available to frontend programs
as well as backend?
2. Would it provide better matching to use Damerau-Levenshtein[1] instead
of raw Levenshtein?

.oO(Would anyone be so bold as to attempt to implement bitap[2] using
bitmapsets ...)

[1]: http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
[2]: http://en.wikipedia.org/wiki/Bitap_algorithm

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#71Peter Geoghegan
pg@heroku.com
In reply to: Alvaro Herrera (#70)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Jul 23, 2014 at 1:10 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

I had two thoughts:

1. Should we consider making levenshtein available to frontend programs
as well as backend?

I don't think so. Why would that be useful?

2. Would it provide better matching to use Damerau-Levenshtein[1] instead
of raw Levenshtein?

Maybe that would be marginally better than classic Levenshtein
distance, but I doubt it would pay for itself. It's just more code to
maintain. Are we really expecting to not get the best possible
suggestion due to some number of transposition errors very frequently?
You still have to have a worse suggestion spuriously get ahead of
yours, and typically there just aren't that many to begin with. I'm
not targeting spelling errors so much as thinkos around plurals and
whether or not an underscore was used. Damerau-Levenshtein seems like
an algorithm with fairly specialized applications.
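
To make the trade-off concrete, here is a minimal, self-contained sketch (not code from the patch) that computes classic Levenshtein distance and, optionally, the restricted Damerau-Levenshtein variant that counts a swap of two adjacent characters as one edit. The function name edit_distance and the allow_transpose flag are made up for illustration, and unit costs are assumed throughout.

```c
#include <stdlib.h>
#include <string.h>

static int
min3(int a, int b, int c)
{
    int m = a < b ? a : b;

    return m < c ? m : c;
}

/*
 * Illustrative only: unit-cost edit distance between s and t.
 *
 * With allow_transpose == 0 this is classic Levenshtein distance.  With
 * allow_transpose != 0, swapping two adjacent characters counts as a single
 * edit (the "optimal string alignment" restriction of Damerau-Levenshtein).
 * Returns -1 if memory cannot be allocated.
 */
int
edit_distance(const char *s, const char *t, int allow_transpose)
{
    int     m = (int) strlen(s);
    int     n = (int) strlen(t);
    int    *d = malloc((size_t) (m + 1) * (size_t) (n + 1) * sizeof(int));
    int     i, j, result;

    if (d == NULL)
        return -1;

#define D(row, col) d[(row) * (n + 1) + (col)]

    /* Transforming a prefix to or from the empty string costs its length. */
    for (i = 0; i <= m; i++)
        D(i, 0) = i;
    for (j = 0; j <= n; j++)
        D(0, j) = j;

    for (i = 1; i <= m; i++)
    {
        for (j = 1; j <= n; j++)
        {
            int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            int best = min3(D(i - 1, j) + 1,        /* deletion */
                            D(i, j - 1) + 1,        /* insertion */
                            D(i - 1, j - 1) + cost); /* substitution */

            /* Adjacent transposition, only in the Damerau variant. */
            if (allow_transpose && i > 1 && j > 1 &&
                s[i - 1] == t[j - 2] && s[i - 2] == t[j - 1] &&
                D(i - 2, j - 2) + 1 < best)
                best = D(i - 2, j - 2) + 1;

            D(i, j) = best;
        }
    }

    result = D(m, n);
#undef D
    free(d);
    return result;
}
```

In this sketch a transposed pair such as "ordreid" for "orderid" scores 2 under classic Levenshtein and 1 with transpositions enabled, while a pluralization or underscore slip ("orderids" or "order_id" for "orderid") scores 1 either way. The two variants only rank candidates differently when a transposition is involved.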

--
Peter Geoghegan


#72Michael Paquier
michael.paquier@gmail.com
In reply to: Tom Lane (#68)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Jul 24, 2014 at 1:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

There are several possible methods of doing that, but I think the best
one is just to leave the SQL-callable C functions in fuzzystrmatch and
move only the underlying code that supports them into core.

I hadn't been paying close attention to this thread, but I'd just assumed
that that would be the approach.

It might be worth introducing new differently-named pg_proc entries for
the same functions in core, but only if we can agree that there are better
names for them than what the extension uses.

Yes, that's a point I raised upthread as well. What about renaming those
functions as string_distance and string_distance_less_than? Then have only
fuzzystrmatch do some DirectFunctionCall using the in-core functions?
--
Michael

#73Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Geoghegan (#71)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Peter Geoghegan wrote:

Maybe that would be marginally better than classic Levenshtein
distance, but I doubt it would pay for itself. It's just more code to
maintain. Are we really expecting to not get the best possible
suggestion due to some number of transposition errors very frequently?
You still have to have a worse suggestion spuriously get ahead of
yours, and typically there just aren't that many to begin with. I'm
not targeting spelling errors so much as thinkos around plurals and
whether or not an underscore was used. Damerau-Levenshtein seems like
an algorithm with fairly specialized applications.

Yes, it's for typos. I guess it's an infrequent scenario to have both a
typoed column and a column that's missing the plural declension, which
is the case in which Damerau-Levenshtein would be a win.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#74Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Michael Paquier (#66)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On 07/18/2014 10:47 AM, Michael Paquier wrote:

On Fri, Jul 18, 2014 at 3:54 AM, Peter Geoghegan <pg@heroku.com> wrote:

I am not opposed to moving the contrib code into core in the manner
that you propose. I don't feel strongly either way.

I noticed in passing that your revision says this *within* levenshtein.c:

+ * Guaranteed to work with Name datatype's cstrings.
+ * For full details see levenshtein.c.

Yeah, I looked at what I produced yesterday night again and came
across a couple of similar things :) And reworked a couple of things
in the version attached, mainly wordsmithing and adding comments here
and there, as well as making the naming of the Levenshtein functions
in core the same as the ones in fuzzystrmatch 1.0.

I imagined that when a committer picked this up, an executive decision
would be made one way or the other. I am quite willing to revise the
patch to alter this behavior at the request of a committer.

Fine for me. I'll move this patch to the next stage then.

There are a bunch of compiler warnings:

parse_relation.c: In function ‘errorMissingColumn’:
parse_relation.c:3114:447: warning: ‘closestcol1’ may be used
uninitialized in this function [-Wmaybe-uninitialized]
parse_relation.c:3066:8: note: ‘closestcol1’ was declared here
parse_relation.c:3129:29: warning: ‘closestcol2’ may be used
uninitialized in this function [-Wmaybe-uninitialized]
parse_relation.c:3067:8: note: ‘closestcol2’ was declared here
levenshtein.c: In function ‘levenshtein_common’:
levenshtein.c:107:6: warning: unused variable ‘start_column_local’
[-Wunused-variable]
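
For readers unfamiliar with -Wmaybe-uninitialized: GCC emits it when a variable is assigned on only some code paths before it is read. A common remedy is to give such variables a sentinel value at the point of declaration. The following is a minimal sketch under assumed names (pick_closest, with -1 meaning "no candidate"); it is not the fix that was applied to parse_relation.c.

```c
#include <limits.h>

/*
 * Illustrative only: scan an array of distances and report the index of the
 * lowest one in *closest1, plus the index of the first candidate tied with
 * it (if any) in *closest2.  Both outputs start out as -1, so every path
 * through the function leaves them with a defined value, which is what keeps
 * -Wmaybe-uninitialized quiet in callers that read them unconditionally.
 * Returns the lowest distance, or INT_MAX when n is 0.
 */
int
pick_closest(const int *distances, int n, int *closest1, int *closest2)
{
    int best = INT_MAX;
    int c1 = -1;                /* sentinel: nothing found yet */
    int c2 = -1;                /* sentinel: no tie found yet */
    int i;

    for (i = 0; i < n; i++)
    {
        if (distances[i] < best)
        {
            /* New strict winner; any earlier tie no longer applies. */
            best = distances[i];
            c1 = i;
            c2 = -1;
        }
        else if (distances[i] == best && c2 < 0)
        {
            /* First candidate tied with the current winner. */
            c2 = i;
        }
    }

    *closest1 = c1;
    *closest2 = c2;
    return best;
}
```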

- Heikki


#75Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#74)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Mon, Oct 6, 2014 at 3:09 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 07/18/2014 10:47 AM, Michael Paquier wrote:

On Fri, Jul 18, 2014 at 3:54 AM, Peter Geoghegan <pg@heroku.com> wrote:

I am not opposed to moving the contrib code into core in the manner
that you propose. I don't feel strongly either way.

I noticed in passing that your revision says this *within* levenshtein.c:

+ * Guaranteed to work with Name datatype's cstrings.
+ * For full details see levenshtein.c.

Yeah, I looked at what I produced yesterday night again and came
across a couple of similar things :) And reworked a couple of things
in the version attached, mainly wordsmithing and adding comments here
and there, as well as making the naming of the Levenshtein functions
in core the same as the ones in fuzzystrmatch 1.0.

I imagined that when a committer picked this up, an executive decision
would be made one way or the other. I am quite willing to revise the
patch to alter this behavior at the request of a committer.

Fine for me. I'll move this patch to the next stage then.

There are a bunch of compiler warnings:

parse_relation.c: In function ‘errorMissingColumn’:
parse_relation.c:3114:447: warning: ‘closestcol1’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
parse_relation.c:3066:8: note: ‘closestcol1’ was declared here
parse_relation.c:3129:29: warning: ‘closestcol2’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
parse_relation.c:3067:8: note: ‘closestcol2’ was declared here
levenshtein.c: In function ‘levenshtein_common’:
levenshtein.c:107:6: warning: unused variable ‘start_column_local’
[-Wunused-variable]

Based on this review from a month ago, I'm going to mark this Waiting
on Author. If nobody updates the patch in a few days, I'll mark it
Returned with Feedback. Thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#76Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#75)
1 attachment(s)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Based on this review from a month ago, I'm going to mark this Waiting
on Author. If nobody updates the patch in a few days, I'll mark it
Returned with Feedback. Thanks.

Attached revision fixes the compiler warning that Heikki complained
about. I maintain SQL-callable stub functions from within contrib,
rather than follow Michael's approach. In other words, very little has
changed from my revision from July last [1].

Reminder: I maintain a slight preference for only offering one
suggestion per relation RTE, which is what this revision does (so no
change there). If a committer who picks this up wants me to alter
that, I don't mind doing so; since only Michael spoke up on this, I've
kept things my way.

This is not a completion mechanism; it is supposed to work on
*complete* column references with slight misspellings (e.g. incorrect
use of plurals, or column references with an omitted underscore
character). Weighing Tom's concerns about suggestions that are of
absolute low quality is what makes me conclude that this is the thing
to do.
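
As an illustration of the behavior described above (not an excerpt from the patch), the sketch below picks a single suggestion for a complete but slightly misspelled column name. The costs (insertion 2, deletion 1, substitution 2) mirror what distanceName() passes in the attached patch. Everything else is an assumption made for the example: the function names, the flat array of candidate names, and the cutoff of "at most half the length of the typed name"; the patch's actual normalized threshold may well differ.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Illustrative only: weighted Levenshtein cost of transforming s into t,
 * keeping just two rows of the notional (n+1) x (m+1) matrix in memory.
 * Returns -1 if memory cannot be allocated.
 */
int
weighted_distance(const char *s, const char *t,
                  int ins_c, int del_c, int sub_c)
{
    int     m = (int) strlen(s);
    int     n = (int) strlen(t);
    int    *base = malloc(2 * (size_t) (m + 1) * sizeof(int));
    int    *prev = base;
    int    *curr;
    int     i, j, result;

    if (base == NULL)
        return -1;
    curr = base + (m + 1);

    /* First i characters of s -> empty string: i deletions. */
    for (i = 0; i <= m; i++)
        prev[i] = i * del_c;

    for (j = 1; j <= n; j++)
    {
        int *tmp;

        /* Empty string -> first j characters of t: j insertions. */
        curr[0] = j * ins_c;
        for (i = 1; i <= m; i++)
        {
            int ins = prev[i] + ins_c;
            int del = curr[i - 1] + del_c;
            int sub = prev[i - 1] + (s[i - 1] == t[j - 1] ? 0 : sub_c);
            int best = ins < del ? ins : del;

            curr[i] = best < sub ? best : sub;
        }
        /* Swap rows; the finished row becomes "prev". */
        tmp = prev;
        prev = curr;
        curr = tmp;
    }

    result = prev[m];
    free(base);
    return result;
}

/*
 * Illustrative only: return the index of the single best candidate for a
 * mistyped column name, or -1 when nothing is worth suggesting.  Ties for
 * best are treated as "no suggestion", and so is any best distance above
 * the HYPOTHETICAL cutoff of half the typed name's length.
 */
int
suggest_column(const char *typed, const char **cols, int ncols)
{
    int typed_len = (int) strlen(typed);
    int best = -1;
    int best_d = INT_MAX;
    int tied = 0;
    int k;

    for (k = 0; k < ncols; k++)
    {
        int d = weighted_distance(typed, cols[k], 2, 1, 2);

        if (d < 0)
            continue;           /* allocation failure: skip candidate */
        if (d < best_d)
        {
            best_d = d;
            best = k;
            tied = 0;
        }
        else if (d == best_d)
            tied = 1;
    }

    if (best < 0 || tied || best_d * 2 > typed_len)
        return -1;
    return best;
}
```

With candidate names such as "orderid", "orderdate", "customerid" and "netamount" (chosen for illustration), typing "orderids" or "order_id" yields "orderid" at a cost of 1, since only a deletion is needed. Something unrelated such as "foo" exceeds the cutoff and produces no suggestion at all, which is the point of guarding against suggestions of low absolute quality.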

[1]: /messages/by-id/CAM3SWZTzQO=OY4jmfB-65ieFie8iHUkDErK-0oLJETm8dSrSpw@mail.gmail.com
--
Peter Geoghegan

Attachments:

0001-Levenshtein-distance-column-HINT.patch (text/x-patch; charset=US-ASCII)
From 830bf9f668972ba6b531df5d4fcbd73db3472434 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@heroku.com>
Date: Sat, 30 Nov 2013 23:15:00 -0800
Subject: [PATCH] Levenshtein distance column HINT

Add a new HINT -- a guess as to what column the user might have intended
to reference, to be shown in various contexts where an
ERRCODE_UNDEFINED_COLUMN error is raised.  The user will see this HINT
when he or she fat-fingers a column reference in his or her ad-hoc SQL
query, or incorrectly pluralizes or fails to pluralize a column
reference.

The HINT suggests a column in the range table with the lowest
Levenshtein distance, or the tied-for-best pair of matching columns in
the event of there being exactly two equally likely candidates (iff each
candidate column comes from a separate RTE).  Limiting the cases where
multiple equally likely suggestions are all offered at once is a measure
against suggestions that are of low quality in an absolute sense.

A further, final measure is taken against suggestions that are of low
absolute quality:  If the distance exceeds a normalized distance
threshold, no suggestion is given.

The contrib Levenshtein distance implementation is moved from /contrib
to core.  However, the SQL-callable functions may only be used with the
fuzzystrmatch extension installed, just as before -- the fuzzystrmatch
definitions become mere forwarding stubs.
---
 contrib/fuzzystrmatch/Makefile            |   3 -
 contrib/fuzzystrmatch/fuzzystrmatch.c     |  81 ++++--
 contrib/fuzzystrmatch/levenshtein.c       | 403 ------------------------------
 src/backend/parser/parse_expr.c           |   9 +-
 src/backend/parser/parse_func.c           |   2 +-
 src/backend/parser/parse_relation.c       | 319 ++++++++++++++++++++---
 src/backend/utils/adt/Makefile            |   2 +
 src/backend/utils/adt/levenshtein.c       | 393 +++++++++++++++++++++++++++++
 src/backend/utils/adt/varlena.c           |  25 ++
 src/include/parser/parse_relation.h       |   3 +-
 src/include/utils/builtins.h              |   5 +
 src/test/regress/expected/alter_table.out |   8 +
 src/test/regress/expected/join.out        |  39 +++
 src/test/regress/expected/plpgsql.out     |   1 +
 src/test/regress/expected/rowtypes.out    |   1 +
 src/test/regress/expected/rules.out       |   1 +
 src/test/regress/expected/without_oid.out |   1 +
 src/test/regress/sql/join.sql             |  24 ++
 18 files changed, 849 insertions(+), 471 deletions(-)
 delete mode 100644 contrib/fuzzystrmatch/levenshtein.c
 create mode 100644 src/backend/utils/adt/levenshtein.c

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 024265d..0327d95 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -17,6 +17,3 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
-
-# levenshtein.c is #included by fuzzystrmatch.c
-fuzzystrmatch.o: fuzzystrmatch.c levenshtein.c
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c
index 7a53d8a..62e650f 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.c
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.c
@@ -154,23 +154,6 @@ getcode(char c)
 /* These prevent GH from becoming F */
 #define NOGHTOF(c)	(getcode(c) & 16)	/* BDH */
 
-/* Faster than memcmp(), for this use case. */
-static inline bool
-rest_of_char_same(const char *s1, const char *s2, int len)
-{
-	while (len > 0)
-	{
-		len--;
-		if (s1[len] != s2[len])
-			return false;
-	}
-	return true;
-}
-
-#include "levenshtein.c"
-#define LEVENSHTEIN_LESS_EQUAL
-#include "levenshtein.c"
-
 PG_FUNCTION_INFO_V1(levenshtein_with_costs);
 Datum
 levenshtein_with_costs(PG_FUNCTION_ARGS)
@@ -180,8 +163,20 @@ levenshtein_with_costs(PG_FUNCTION_ARGS)
 	int			ins_c = PG_GETARG_INT32(2);
 	int			del_c = PG_GETARG_INT32(3);
 	int			sub_c = PG_GETARG_INT32(4);
-
-	PG_RETURN_INT32(levenshtein_internal(src, dst, ins_c, del_c, sub_c));
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes,
+				t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(varstr_leven(s_data, s_bytes, t_data, t_bytes, ins_c,
+								 del_c, sub_c));
 }
 
 
@@ -191,8 +186,20 @@ levenshtein(PG_FUNCTION_ARGS)
 {
 	text	   *src = PG_GETARG_TEXT_PP(0);
 	text	   *dst = PG_GETARG_TEXT_PP(1);
-
-	PG_RETURN_INT32(levenshtein_internal(src, dst, 1, 1, 1));
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes,
+				t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(varstr_leven(s_data, s_bytes, t_data, t_bytes, 1, 1,
+									   1));
 }
 
 
@@ -206,8 +213,20 @@ levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS)
 	int			del_c = PG_GETARG_INT32(3);
 	int			sub_c = PG_GETARG_INT32(4);
 	int			max_d = PG_GETARG_INT32(5);
-
-	PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, ins_c, del_c, sub_c, max_d));
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes,
+				t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(varstr_leven_less_equal(s_data, s_bytes, t_data, t_bytes,
+											ins_c, del_c, sub_c, max_d));
 }
 
 
@@ -218,8 +237,20 @@ levenshtein_less_equal(PG_FUNCTION_ARGS)
 	text	   *src = PG_GETARG_TEXT_PP(0);
 	text	   *dst = PG_GETARG_TEXT_PP(1);
 	int			max_d = PG_GETARG_INT32(2);
-
-	PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, 1, 1, 1, max_d));
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes,
+				t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(varstr_leven_less_equal(s_data, s_bytes, t_data, t_bytes,
+											1, 1, 1, max_d));
 }
 
 
diff --git a/contrib/fuzzystrmatch/levenshtein.c b/contrib/fuzzystrmatch/levenshtein.c
deleted file mode 100644
index 4f37a54..0000000
--- a/contrib/fuzzystrmatch/levenshtein.c
+++ /dev/null
@@ -1,403 +0,0 @@
-/*
- * levenshtein.c
- *
- * Functions for "fuzzy" comparison of strings
- *
- * Joe Conway <mail@joeconway.com>
- *
- * Copyright (c) 2001-2014, PostgreSQL Global Development Group
- * ALL RIGHTS RESERVED;
- *
- * levenshtein()
- * -------------
- * Written based on a description of the algorithm by Michael Gilleland
- * found at http://www.merriampark.com/ld.htm
- * Also looked at levenshtein.c in the PHP 4.0.6 distribution for
- * inspiration.
- * Configurable penalty costs extension is introduced by Volkan
- * YAZICI <volkan.yazici@gmail.com>.
- */
-
-/*
- * External declarations for exported functions
- */
-#ifdef LEVENSHTEIN_LESS_EQUAL
-static int levenshtein_less_equal_internal(text *s, text *t,
-								int ins_c, int del_c, int sub_c, int max_d);
-#else
-static int levenshtein_internal(text *s, text *t,
-					 int ins_c, int del_c, int sub_c);
-#endif
-
-#define MAX_LEVENSHTEIN_STRLEN		255
-
-
-/*
- * Calculates Levenshtein distance metric between supplied strings. Generally
- * (1, 1, 1) penalty costs suffices for common cases, but your mileage may
- * vary.
- *
- * One way to compute Levenshtein distance is to incrementally construct
- * an (m+1)x(n+1) matrix where cell (i, j) represents the minimum number
- * of operations required to transform the first i characters of s into
- * the first j characters of t.  The last column of the final row is the
- * answer.
- *
- * We use that algorithm here with some modification.  In lieu of holding
- * the entire array in memory at once, we'll just use two arrays of size
- * m+1 for storing accumulated values. At each step one array represents
- * the "previous" row and one is the "current" row of the notional large
- * array.
- *
- * If max_d >= 0, we only need to provide an accurate answer when that answer
- * is less than or equal to the bound.  From any cell in the matrix, there is
- * theoretical "minimum residual distance" from that cell to the last column
- * of the final row.  This minimum residual distance is zero when the
- * untransformed portions of the strings are of equal length (because we might
- * get lucky and find all the remaining characters matching) and is otherwise
- * based on the minimum number of insertions or deletions needed to make them
- * equal length.  The residual distance grows as we move toward the upper
- * right or lower left corners of the matrix.  When the max_d bound is
- * usefully tight, we can use this property to avoid computing the entirety
- * of each row; instead, we maintain a start_column and stop_column that
- * identify the portion of the matrix close to the diagonal which can still
- * affect the final answer.
- */
-static int
-#ifdef LEVENSHTEIN_LESS_EQUAL
-levenshtein_less_equal_internal(text *s, text *t,
-								int ins_c, int del_c, int sub_c, int max_d)
-#else
-levenshtein_internal(text *s, text *t,
-					 int ins_c, int del_c, int sub_c)
-#endif
-{
-	int			m,
-				n,
-				s_bytes,
-				t_bytes;
-	int		   *prev;
-	int		   *curr;
-	int		   *s_char_len = NULL;
-	int			i,
-				j;
-	const char *s_data;
-	const char *t_data;
-	const char *y;
-
-	/*
-	 * For levenshtein_less_equal_internal, we have real variables called
-	 * start_column and stop_column; otherwise it's just short-hand for 0 and
-	 * m.
-	 */
-#ifdef LEVENSHTEIN_LESS_EQUAL
-	int			start_column,
-				stop_column;
-
-#undef START_COLUMN
-#undef STOP_COLUMN
-#define START_COLUMN start_column
-#define STOP_COLUMN stop_column
-#else
-#undef START_COLUMN
-#undef STOP_COLUMN
-#define START_COLUMN 0
-#define STOP_COLUMN m
-#endif
-
-	/* Extract a pointer to the actual character data. */
-	s_data = VARDATA_ANY(s);
-	t_data = VARDATA_ANY(t);
-
-	/* Determine length of each string in bytes and characters. */
-	s_bytes = VARSIZE_ANY_EXHDR(s);
-	t_bytes = VARSIZE_ANY_EXHDR(t);
-	m = pg_mbstrlen_with_len(s_data, s_bytes);
-	n = pg_mbstrlen_with_len(t_data, t_bytes);
-
-	/*
-	 * We can transform an empty s into t with n insertions, or a non-empty t
-	 * into an empty s with m deletions.
-	 */
-	if (!m)
-		return n * ins_c;
-	if (!n)
-		return m * del_c;
-
-	/*
-	 * For security concerns, restrict excessive CPU+RAM usage. (This
-	 * implementation uses O(m) memory and has O(mn) complexity.)
-	 */
-	if (m > MAX_LEVENSHTEIN_STRLEN ||
-		n > MAX_LEVENSHTEIN_STRLEN)
-		ereport(ERROR,
-				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-				 errmsg("argument exceeds the maximum length of %d bytes",
-						MAX_LEVENSHTEIN_STRLEN)));
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
-	/* Initialize start and stop columns. */
-	start_column = 0;
-	stop_column = m + 1;
-
-	/*
-	 * If max_d >= 0, determine whether the bound is impossibly tight.  If so,
-	 * return max_d + 1 immediately.  Otherwise, determine whether it's tight
-	 * enough to limit the computation we must perform.  If so, figure out
-	 * initial stop column.
-	 */
-	if (max_d >= 0)
-	{
-		int			min_theo_d; /* Theoretical minimum distance. */
-		int			max_theo_d; /* Theoretical maximum distance. */
-		int			net_inserts = n - m;
-
-		min_theo_d = net_inserts < 0 ?
-			-net_inserts * del_c : net_inserts * ins_c;
-		if (min_theo_d > max_d)
-			return max_d + 1;
-		if (ins_c + del_c < sub_c)
-			sub_c = ins_c + del_c;
-		max_theo_d = min_theo_d + sub_c * Min(m, n);
-		if (max_d >= max_theo_d)
-			max_d = -1;
-		else if (ins_c + del_c > 0)
-		{
-			/*
-			 * Figure out how much of the first row of the notional matrix we
-			 * need to fill in.  If the string is growing, the theoretical
-			 * minimum distance already incorporates the cost of deleting the
-			 * number of characters necessary to make the two strings equal in
-			 * length.  Each additional deletion forces another insertion, so
-			 * the best-case total cost increases by ins_c + del_c. If the
-			 * string is shrinking, the minimum theoretical cost assumes no
-			 * excess deletions; that is, we're starting no further right than
-			 * column n - m.  If we do start further right, the best-case
-			 * total cost increases by ins_c + del_c for each move right.
-			 */
-			int			slack_d = max_d - min_theo_d;
-			int			best_column = net_inserts < 0 ? -net_inserts : 0;
-
-			stop_column = best_column + (slack_d / (ins_c + del_c)) + 1;
-			if (stop_column > m)
-				stop_column = m + 1;
-		}
-	}
-#endif
-
-	/*
-	 * In order to avoid calling pg_mblen() repeatedly on each character in s,
-	 * we cache all the lengths before starting the main loop -- but if all
-	 * the characters in both strings are single byte, then we skip this and
-	 * use a fast-path in the main loop.  If only one string contains
-	 * multi-byte characters, we still build the array, so that the fast-path
-	 * needn't deal with the case where the array hasn't been initialized.
-	 */
-	if (m != s_bytes || n != t_bytes)
-	{
-		int			i;
-		const char *cp = s_data;
-
-		s_char_len = (int *) palloc((m + 1) * sizeof(int));
-		for (i = 0; i < m; ++i)
-		{
-			s_char_len[i] = pg_mblen(cp);
-			cp += s_char_len[i];
-		}
-		s_char_len[i] = 0;
-	}
-
-	/* One more cell for initialization column and row. */
-	++m;
-	++n;
-
-	/* Previous and current rows of notional array. */
-	prev = (int *) palloc(2 * m * sizeof(int));
-	curr = prev + m;
-
-	/*
-	 * To transform the first i characters of s into the first 0 characters of
-	 * t, we must perform i deletions.
-	 */
-	for (i = START_COLUMN; i < STOP_COLUMN; i++)
-		prev[i] = i * del_c;
-
-	/* Loop through rows of the notional array */
-	for (y = t_data, j = 1; j < n; j++)
-	{
-		int		   *temp;
-		const char *x = s_data;
-		int			y_char_len = n != t_bytes + 1 ? pg_mblen(y) : 1;
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
-
-		/*
-		 * In the best case, values percolate down the diagonal unchanged, so
-		 * we must increment stop_column unless it's already on the right end
-		 * of the array.  The inner loop will read prev[stop_column], so we
-		 * have to initialize it even though it shouldn't affect the result.
-		 */
-		if (stop_column < m)
-		{
-			prev[stop_column] = max_d + 1;
-			++stop_column;
-		}
-
-		/*
-		 * The main loop fills in curr, but curr[0] needs a special case: to
-		 * transform the first 0 characters of s into the first j characters
-		 * of t, we must perform j insertions.  However, if start_column > 0,
-		 * this special case does not apply.
-		 */
-		if (start_column == 0)
-		{
-			curr[0] = j * ins_c;
-			i = 1;
-		}
-		else
-			i = start_column;
-#else
-		curr[0] = j * ins_c;
-		i = 1;
-#endif
-
-		/*
-		 * This inner loop is critical to performance, so we include a
-		 * fast-path to handle the (fairly common) case where no multibyte
-		 * characters are in the mix.  The fast-path is entitled to assume
-		 * that if s_char_len is not initialized then BOTH strings contain
-		 * only single-byte characters.
-		 */
-		if (s_char_len != NULL)
-		{
-			for (; i < STOP_COLUMN; i++)
-			{
-				int			ins;
-				int			del;
-				int			sub;
-				int			x_char_len = s_char_len[i - 1];
-
-				/*
-				 * Calculate costs for insertion, deletion, and substitution.
-				 *
-				 * When calculating cost for substitution, we compare the last
-				 * character of each possibly-multibyte character first,
-				 * because that's enough to rule out most mis-matches.  If we
-				 * get past that test, then we compare the lengths and the
-				 * remaining bytes.
-				 */
-				ins = prev[i] + ins_c;
-				del = curr[i - 1] + del_c;
-				if (x[x_char_len - 1] == y[y_char_len - 1]
-					&& x_char_len == y_char_len &&
-					(x_char_len == 1 || rest_of_char_same(x, y, x_char_len)))
-					sub = prev[i - 1];
-				else
-					sub = prev[i - 1] + sub_c;
-
-				/* Take the one with minimum cost. */
-				curr[i] = Min(ins, del);
-				curr[i] = Min(curr[i], sub);
-
-				/* Point to next character. */
-				x += x_char_len;
-			}
-		}
-		else
-		{
-			for (; i < STOP_COLUMN; i++)
-			{
-				int			ins;
-				int			del;
-				int			sub;
-
-				/* Calculate costs for insertion, deletion, and substitution. */
-				ins = prev[i] + ins_c;
-				del = curr[i - 1] + del_c;
-				sub = prev[i - 1] + ((*x == *y) ? 0 : sub_c);
-
-				/* Take the one with minimum cost. */
-				curr[i] = Min(ins, del);
-				curr[i] = Min(curr[i], sub);
-
-				/* Point to next character. */
-				x++;
-			}
-		}
-
-		/* Swap current row with previous row. */
-		temp = curr;
-		curr = prev;
-		prev = temp;
-
-		/* Point to next character. */
-		y += y_char_len;
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
-
-		/*
-		 * This chunk of code represents a significant performance hit if used
-		 * in the case where there is no max_d bound.  This is probably not
-		 * because the max_d >= 0 test itself is expensive, but rather because
-		 * the possibility of needing to execute this code prevents tight
-		 * optimization of the loop as a whole.
-		 */
-		if (max_d >= 0)
-		{
-			/*
-			 * The "zero point" is the column of the current row where the
-			 * remaining portions of the strings are of equal length.  There
-			 * are (n - 1) characters in the target string, of which j have
-			 * been transformed.  There are (m - 1) characters in the source
-			 * string, so we want to find the value for zp where (n - 1) - j =
-			 * (m - 1) - zp.
-			 */
-			int			zp = j - (n - m);
-
-			/* Check whether the stop column can slide left. */
-			while (stop_column > 0)
-			{
-				int			ii = stop_column - 1;
-				int			net_inserts = ii - zp;
-
-				if (prev[ii] + (net_inserts > 0 ? net_inserts * ins_c :
-								-net_inserts * del_c) <= max_d)
-					break;
-				stop_column--;
-			}
-
-			/* Check whether the start column can slide right. */
-			while (start_column < stop_column)
-			{
-				int			net_inserts = start_column - zp;
-
-				if (prev[start_column] +
-					(net_inserts > 0 ? net_inserts * ins_c :
-					 -net_inserts * del_c) <= max_d)
-					break;
-
-				/*
-				 * We'll never again update these values, so we must make sure
-				 * there's nothing here that could confuse any future
-				 * iteration of the outer loop.
-				 */
-				prev[start_column] = max_d + 1;
-				curr[start_column] = max_d + 1;
-				if (start_column != 0)
-					s_data += (s_char_len != NULL) ? s_char_len[start_column - 1] : 1;
-				start_column++;
-			}
-
-			/* If they cross, we're going to exceed the bound. */
-			if (start_column >= stop_column)
-				return max_d + 1;
-		}
-#endif
-	}
-
-	/*
-	 * Because the final value was swapped from the previous row to the
-	 * current row, that's where we'll find it.
-	 */
-	return prev[m - 1];
-}
diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 4a8aaf6..9866198 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -621,7 +621,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field2);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										NULL, NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
@@ -666,7 +667,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field3);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										NULL, NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
@@ -724,7 +726,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field4);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										NULL, NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 9ebd3fd..e128adf 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -1779,7 +1779,7 @@ ParseComplexProjection(ParseState *pstate, char *funcname, Node *first_arg,
 									 ((Var *) first_arg)->varno,
 									 ((Var *) first_arg)->varlevelsup);
 		/* Return a Var if funcname matches a column, else NULL */
-		return scanRTEForColumn(pstate, rte, funcname, location);
+		return scanRTEForColumn(pstate, rte, funcname, location, NULL, NULL);
 	}
 
 	/*
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 478584d..1697b77 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include <ctype.h>
+#include <limits.h>
 
 #include "access/htup_details.h"
 #include "access/sysattr.h"
@@ -520,6 +521,22 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
 }
 
 /*
+ * distanceName
+ *	  Return Levenshtein distance between an actual column name and possible
+ *	  partial match.
+ */
+static int
+distanceName(const char *actual, const char *match, int max)
+{
+	int len = strlen(actual),
+		match_len = strlen(match);
+
+	/* Charge half as much per deletion as per insertion or per substitution */
+	return varstr_leven_less_equal(actual, len, match, match_len,
+								   2, 1, 2, max);
+}
+
+/*
  * scanRTEForColumn
  *	  Search the column names of a single RTE for the given name.
  *	  If found, return an appropriate Var node, else return NULL.
@@ -527,10 +544,24 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
  *
  * Side effect: if we find a match, mark the RTE as requiring read access
  * for the column.
+ *
+ * For those callers that will settle for a fuzzy match (for the purposes of
+ * building diagnostic messages), we match the column attribute whose name has
+ * the lowest Levenshtein distance from colname, setting *closest and
+ * *distance.  Such callers should not rely on the return value (even when
+ * there is an exact match), nor should they expect the usual side effect
+ * (unless there is an exact match).  This hardly matters in practice, since an
+ * error is imminent.
+ *
+ * If there are two or more attributes in the range table entry tied for
+ * closest, accurately report the shortest distance found overall, while not
+ * setting a "closest" attribute on the assumption that only a per-entry single
+ * closest match is useful.  Note that we never consider system column names
+ * when performing fuzzy matching.
  */
 Node *
 scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
-				 int location)
+				 int location, AttrNumber *closest, int *distance)
 {
 	Node	   *result = NULL;
 	int			attnum = 0;
@@ -548,12 +579,16 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
 	 * Should this somehow go wrong and we try to access a dropped column,
 	 * we'll still catch it by virtue of the checks in
 	 * get_rte_attribute_type(), which is called by make_var().  That routine
-	 * has to do a cache lookup anyway, so the check there is cheap.
+	 * has to do a cache lookup anyway, so the check there is cheap.  Callers
+	 * interested in finding the match with shortest distance need to defend
+	 * against this directly, though.
 	 */
 	foreach(c, rte->eref->colnames)
 	{
+		const char *attcolname = strVal(lfirst(c));
+
 		attnum++;
-		if (strcmp(strVal(lfirst(c)), colname) == 0)
+		if (strcmp(attcolname, colname) == 0)
 		{
 			if (result)
 				ereport(ERROR,
@@ -566,6 +601,39 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
 			markVarForSelectPriv(pstate, var, rte);
 			result = (Node *) var;
 		}
+
+		if (distance && *distance != 0)
+		{
+			if (result)
+			{
+				/* Exact match just found */
+				*distance = 0;
+			}
+			else
+			{
+				int lowestdistance = *distance;
+				int thisdistance = distanceName(attcolname, colname,
+												lowestdistance);
+
+				if (thisdistance >= lowestdistance)
+				{
+					/*
+					 * This match distance may equal a prior match within this
+					 * same range table entry.  When that happens, the prior match
+					 * is discarded as worthless, since a single best match is
+					 * required within an RTE.
+					 */
+					if (thisdistance == lowestdistance)
+						*closest = InvalidAttrNumber;
+
+					continue;
+				}
+
+				/* Store new lowest observed distance for RTE */
+				*distance = thisdistance;
+			}
+			*closest = attnum;
+		}
 	}
 
 	/*
@@ -642,7 +710,8 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
 				continue;
 
 			/* use orig_pstate here to get the right sublevels_up */
-			newresult = scanRTEForColumn(orig_pstate, rte, colname, location);
+			newresult = scanRTEForColumn(orig_pstate, rte, colname, location,
+										 NULL, NULL);
 
 			if (newresult)
 			{
@@ -668,8 +737,14 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
 
 /*
  * searchRangeTableForCol
- *	  See if any RangeTblEntry could possibly provide the given column name.
- *	  If so, return a pointer to the RangeTblEntry; else return NULL.
+ *	  See if any RangeTblEntry could possibly provide the given column name (or
+ *	  find the best match available).  Returns a list of equally likely
+ *	  candidates, or NIL in the event of no plausible candidate.
+ *
+ * Column name may be matched fuzzily; we provide the closest columns if there
+ * was not an exact match.  Caller can depend on passed closest array to find
+ * right attribute within corresponding (first and second) returned list RTEs.
+ * If closest attributes are InvalidAttrNumber, that indicates an exact match.
  *
  * This is different from colNameToVar in that it considers every entry in
  * the ParseState's rangetable(s), not only those that are currently visible
@@ -678,26 +753,145 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
  * matches, but only one will be returned).  This must be used ONLY as a
  * heuristic in giving suitable error messages.  See errorMissingColumn.
  */
-static RangeTblEntry *
-searchRangeTableForCol(ParseState *pstate, char *colname, int location)
+static List *
+searchRangeTableForCol(ParseState *pstate, const char *alias, char *colname,
+					   int location, AttrNumber closest[2])
 {
-	ParseState *orig_pstate = pstate;
+	ParseState	   *orig_pstate = pstate;
+	int				distance = INT_MAX;
+	List		   *matchedrte = NIL;
+	ListCell	   *l;
+	int				i;
 
 	while (pstate != NULL)
 	{
-		ListCell   *l;
-
 		foreach(l, pstate->p_rtable)
 		{
-			RangeTblEntry *rte = (RangeTblEntry *) lfirst(l);
+			RangeTblEntry  *rte = (RangeTblEntry *) lfirst(l);
+			AttrNumber		rteclosest = InvalidAttrNumber;
+			int				rtdistance = INT_MAX;
+			bool			wrongalias;
 
-			if (scanRTEForColumn(orig_pstate, rte, colname, location))
-				return rte;
+			/*
+			 * Get single best match from each RTE, or no match for RTE if
+			 * there is a tie for best match within a given RTE
+			 */
+			scanRTEForColumn(orig_pstate, rte, colname, location, &rteclosest,
+							 &rtdistance);
+
+			/* Was alias provided by user that does not match entry's alias? */
+			wrongalias = (alias && strcmp(alias, rte->eref->aliasname) != 0);
+
+			if (rtdistance == 0)
+			{
+				/* Exact match (for "wrong alias" or "wrong level" cases) */
+				closest[0] = wrongalias? rteclosest : InvalidAttrNumber;
+
+				/*
+				 * Any exact match is always the uncontested best match.  It
+				 * doesn't seem worth considering the case where there are
+				 * multiple exact matches, so we're done.
+				 */
+				matchedrte = lappend(NIL, rte);
+				return matchedrte;
+			}
+
+			/*
+			 * Charge extra (for inexact matches only) when an alias was
+			 * specified that differs from what might have been used to
+			 * correctly qualify this RTE's closest column
+			 */
+			if (wrongalias)
+				rtdistance += 3;
+
+			if (rteclosest != InvalidAttrNumber)
+			{
+				if (rtdistance >= distance)
+				{
+					/*
+					 * Perhaps record this attribute as being just as close in
+					 * distance to closest attribute observed so far across
+					 * entire range table.  Iff this distance is ultimately the
+					 * lowest distance observed overall, it may end up as the
+					 * second match.
+					 */
+					if (rtdistance == distance)
+					{
+						closest[1] = rteclosest;
+						matchedrte = lappend(matchedrte, rte);
+					}
+
+					continue;
+				}
+
+				/*
+				 * One best match (better than any others in previous RTEs) was
+				 * found within this RTE
+				 */
+				distance = rtdistance;
+				/* New uncontested best match */
+				matchedrte = lappend(NIL, rte);
+				closest[0] = rteclosest;
+			}
+			else
+			{
+				/*
+				 * Even though there were perhaps multiple joint-best matches
+				 * within this RTE (implying that there can be no attribute
+				 * suggestion from it), the shortest distance should still
+				 * serve as the distance for later RTEs to beat (but naturally
+				 * only if it happens to be the lowest so far across the entire
+				 * range table).
+				 */
+				distance = Min(distance, rtdistance);
+			}
 		}
 
 		pstate = pstate->parentParseState;
 	}
-	return NULL;
+
+	/*
+	 * Too many equally close partial matches found?
+	 *
+	 * It's useful to provide two matches for the common case where two range
+	 * tables each have one equally distant candidate column, as when an
+	 * unqualified (and therefore would-be ambiguous) column name is specified
+	 * which is also misspelled by the user.  It seems unhelpful to show no
+	 * hint when this occurs, since in practice one attribute probably
+	 * references the other in a foreign key relationship.  However, when there
+	 * are more than 2 range tables with equally distant matches that's
+	 * probably because the matches are not useful, so don't suggest anything.
+	 */
+	if (list_length(matchedrte) > 2)
+		return NIL;
+
+	/*
+	 * Handle dropped columns, which can appear here as empty colnames per
+	 * remarks within scanRTEForColumn().  If either the first or second
+	 * suggested attributes are dropped, do not provide any suggestion.
+	 */
+	i = 0;
+	foreach(l, matchedrte)
+	{
+		RangeTblEntry  *rte = (RangeTblEntry *) lfirst(l);
+		char		   *closestcol;
+
+		closestcol = strVal(list_nth(rte->eref->colnames, closest[i++] - 1));
+
+		if (strcmp(closestcol, "") == 0)
+			return NIL;
+	}
+
+	/*
+	 * Distance must be less than a normalized threshold in order to avoid
+	 * completely ludicrous suggestions.  Note that a distance of 6 will be
+	 * seen when 6 deletions are required against actual attribute name, or 3
+	 * insertions/substitutions.
+	 */
+	if (distance > 6 && distance > strlen(colname) / 2)
+		return NIL;
+
+	return matchedrte;
 }
 
 /*
@@ -2856,40 +3050,95 @@ errorMissingRTE(ParseState *pstate, RangeVar *relation)
  * Generate a suitable error about a missing column.
  *
  * Since this is a very common type of error, we work rather hard to
- * produce a helpful message.
+ * produce a helpful message, going so far as to guess user's intent
+ * when a missing column name is probably intended to reference one of
+ * two would-be ambiguous attributes (when no alias/qualification was
+ * provided).
  */
 void
 errorMissingColumn(ParseState *pstate,
 				   char *relname, char *colname, int location)
 {
-	RangeTblEntry *rte;
+	List		   *matchedrte;
+	AttrNumber	    closest[2];
+	RangeTblEntry  *rte1 = NULL,
+				   *rte2 = NULL;
+	char		   *closestcol1 = NULL;
+	char		   *closestcol2 = NULL;
 
 	/*
-	 * If relname was given, just play dumb and report it.  (In practice, a
-	 * bad qualification name should end up at errorMissingRTE, not here, so
-	 * no need to work hard on this case.)
+	 * closest[0] will remain InvalidAttrNumber in event of exact match, and in
+	 * the event of an exact match there is only ever one suggestion
 	 */
-	if (relname)
-		ereport(ERROR,
-				(errcode(ERRCODE_UNDEFINED_COLUMN),
-				 errmsg("column %s.%s does not exist", relname, colname),
-				 parser_errposition(pstate, location)));
+	closest[0] = closest[1] = InvalidAttrNumber;
 
 	/*
-	 * Otherwise, search the entire rtable looking for possible matches.  If
-	 * we find one, emit a hint about it.
+	 * Search the entire rtable looking for possible matches.  If we find one,
+	 * emit a hint about it.
 	 *
 	 * TODO: improve this code (and also errorMissingRTE) to mention using
 	 * LATERAL if appropriate.
 	 */
-	rte = searchRangeTableForCol(pstate, colname, location);
-
-	ereport(ERROR,
-			(errcode(ERRCODE_UNDEFINED_COLUMN),
-			 errmsg("column \"%s\" does not exist", colname),
-			 rte ? errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
-						   colname, rte->eref->aliasname) : 0,
-			 parser_errposition(pstate, location)));
+	matchedrte = searchRangeTableForCol(pstate, relname, colname, location,
+										closest);
+
+	/*
+	 * In practice a bad qualification name should end up at errorMissingRTE,
+	 * not here, so no need to work hard on this case.
+	 *
+	 * Extract RTEs for best match, if any, and joint best match, if any.
+	 */
+	if (matchedrte)
+	{
+		rte1 = (RangeTblEntry *) lfirst(list_head(matchedrte));
+
+		if (list_length(matchedrte) > 1)
+			rte2 = (RangeTblEntry *) lsecond(matchedrte);
+
+		if (rte1 && closest[0] != InvalidAttrNumber)
+			closestcol1 = strVal(list_nth(rte1->eref->colnames, closest[0] - 1));
+
+		if (rte2 && closest[1] != InvalidAttrNumber)
+			closestcol2 = strVal(list_nth(rte2->eref->colnames, closest[1] - 1));
+	}
+
+	if (!rte2)
+	{
+		/*
+		 * Handle case where there is at most one column suggestion to hint,
+		 * including exact matches referenced but not visible.
+		 *
+		 * Infer an exact match referenced despite not being visible from the
+		 * fact that an attribute number was not passed back.
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_COLUMN),
+				 relname?
+				 errmsg("column %s.%s does not exist", relname, colname):
+				 errmsg("column \"%s\" does not exist", colname),
+				 rte1? closest[0] != InvalidAttrNumber?
+				 errhint("Perhaps you meant to reference the column \"%s\".\"%s\".",
+						 rte1->eref->aliasname, closestcol1):
+				 errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
+						 colname, rte1->eref->aliasname): 0,
+				 parser_errposition(pstate, location)));
+	}
+	else
+	{
+		/*
+		 * Handle case where there are two equally useful column hints, each
+		 * from a different RTE
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_COLUMN),
+				 relname?
+				 errmsg("column %s.%s does not exist", relname, colname):
+				 errmsg("column \"%s\" does not exist", colname),
+				 errhint("Perhaps you meant to reference the column \"%s\".\"%s\" or the column \"%s\".\"%s\".",
+						 rte1->eref->aliasname, closestcol1,
+						 rte2->eref->aliasname, closestcol2),
+				 parser_errposition(pstate, location)));
+	}
 }
 
 
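For readers skimming the parse_relation.c changes above, here is a minimal sketch of the two numeric rules searchRangeTableForCol() applies to inexact matches: the extra charge when the user-supplied alias does not match the RTE, and the final plausibility cutoff. The function and macro names are hypothetical and not part of the patch; only the constants 3 and 6 and the shape of the final comparison are taken from it. Exact matches (distance 0) return before either rule runs, so this covers fuzzy matches only.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical names; only the values 3 and 6 come from the patch. */
#define WRONG_ALIAS_PENALTY 3
#define ABSOLUTE_THRESHOLD	6

/*
 * Distance used to rank an RTE's best fuzzy match: the raw distance, plus a
 * fixed penalty when the user wrote an alias that is not this RTE's alias.
 */
static int
penalized_distance(int rtdistance, bool wrongalias)
{
	return wrongalias ? rtdistance + WRONG_ALIAS_PENALTY : rtdistance;
}

/*
 * Mirror of the final check: a suggestion is dropped only when the distance
 * exceeds both the absolute threshold and half the length of the name the
 * user typed.
 */
static bool
suggestion_is_plausible(int distance, const char *colname)
{
	int			relative = (int) strlen(colname) / 2;

	return !(distance > ABSOLUTE_THRESHOLD && distance > relative);
}
```

Because the two conditions are joined with &&, a short name such as "oid" still gets a hint at any distance up to 6, while a long name tolerates a distance of up to half its length.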
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 7b4391b..3ea9bf4 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -38,4 +38,6 @@ OBJS = acl.o arrayfuncs.o array_selfuncs.o array_typanalyze.o \
 
 like.o: like.c like_match.c
 
+varlena.o: varlena.c levenshtein.c
+
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/utils/adt/levenshtein.c b/src/backend/utils/adt/levenshtein.c
new file mode 100644
index 0000000..bb8b7bf
--- /dev/null
+++ b/src/backend/utils/adt/levenshtein.c
@@ -0,0 +1,393 @@
+/*-------------------------------------------------------------------------
+ *
+ * levenshtein.c
+ *	  Levenshtein distance implementation.
+ *
+ * Original author:  Joe Conway <mail@joeconway.com>
+ *
+ * This file is included by varlena.c twice, to provide matching code for (1)
+ * Levenshtein distance with custom costings, and (2) Levenshtein distance with
+ * custom costings and a "max" value above which exact distances are not
+ * interesting.  Before the inclusion, we rely on the presence of the inline
+ * function rest_of_char_same().
+ *
+ * Written based on a description of the algorithm by Michael Gilleland found
+ * at http://www.merriampark.com/ld.htm.  Also looked at levenshtein.c in the
+ * PHP 4.0.6 distribution for inspiration.  Configurable penalty costs
+ * extension is introduced by Volkan YAZICI <volkan.yazici@gmail.com>.
+ *
+ * Copyright (c) 2001-2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	src/backend/utils/adt/levenshtein.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#define MAX_LEVENSHTEIN_STRLEN		255
+
+/*
+ * Calculates Levenshtein distance metric between supplied strings, which are
+ * not necessarily null-terminated.  Generally (1, 1, 1) penalty costs suffice
+ * for common cases, but your mileage may vary.
+ *
+ * One way to compute Levenshtein distance is to incrementally construct
+ * an (m+1)x(n+1) matrix where cell (i, j) represents the minimum number
+ * of operations required to transform the first i characters of s into
+ * the first j characters of t.  The last column of the final row is the
+ * answer.
+ *
+ * We use that algorithm here with some modification.  In lieu of holding
+ * the entire array in memory at once, we'll just use two arrays of size
+ * m+1 for storing accumulated values. At each step one array represents
+ * the "previous" row and one is the "current" row of the notional large
+ * array.
+ *
+ * If max_d >= 0, we only need to provide an accurate answer when that answer
+ * is less than or equal to the bound.  From any cell in the matrix, there is
+ * a theoretical "minimum residual distance" from that cell to the last column
+ * of the final row.  This minimum residual distance is zero when the
+ * untransformed portions of the strings are of equal length (because we might
+ * get lucky and find all the remaining characters matching) and is otherwise
+ * based on the minimum number of insertions or deletions needed to make them
+ * equal length.  The residual distance grows as we move toward the upper
+ * right or lower left corners of the matrix.  When the max_d bound is
+ * usefully tight, we can use this property to avoid computing the entirety
+ * of each row; instead, we maintain a start_column and stop_column that
+ * identify the portion of the matrix close to the diagonal which can still
+ * affect the final answer.
+ */
+int
+#ifdef LEVENSHTEIN_LESS_EQUAL
+varstr_leven_less_equal(const char *source, int slen, const char *target,
+						int tlen, int ins_c, int del_c, int sub_c, int max_d)
+#else
+varstr_leven(const char *source, int slen, const char *target, int tlen,
+			 int ins_c, int del_c, int sub_c)
+#endif
+{
+	int			m,
+				n;
+	int		   *prev;
+	int		   *curr;
+	int		   *s_char_len = NULL;
+	int			i,
+				j;
+	const char *y;
+
+	/*
+	 * For varstr_leven_less_equal, we have real variables called
+	 * start_column and stop_column; otherwise it's just short-hand for 0 and
+	 * m.
+	 */
+#ifdef LEVENSHTEIN_LESS_EQUAL
+	int			start_column,
+				stop_column;
+
+#undef START_COLUMN
+#undef STOP_COLUMN
+#define START_COLUMN start_column
+#define STOP_COLUMN stop_column
+#else
+#undef START_COLUMN
+#undef STOP_COLUMN
+#define START_COLUMN 0
+#define STOP_COLUMN m
+#endif
+
+	m = pg_mbstrlen_with_len(source, slen);
+	n = pg_mbstrlen_with_len(target, tlen);
+
+	/*
+	 * We can transform an empty s into t with n insertions, or a non-empty s
+	 * into an empty t with m deletions.
+	 */
+	if (!m)
+		return n * ins_c;
+	if (!n)
+		return m * del_c;
+
+	/*
+	 * A common use for Levenshtein distance is to match column names.
+	 * Therefore, restrict the size of MAX_LEVENSHTEIN_STRLEN such that this is
+	 * guaranteed to work.
+	 */
+	StaticAssertStmt(NAMEDATALEN <= MAX_LEVENSHTEIN_STRLEN,
+					 "Levenshtein hinting mechanism restricts NAMEDATALEN");
+
+	/*
+	 * For security concerns, restrict excessive CPU+RAM usage. (This
+	 * implementation uses O(m) memory and has O(mn) complexity.)
+	 */
+	if (m > MAX_LEVENSHTEIN_STRLEN ||
+		n > MAX_LEVENSHTEIN_STRLEN)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("argument exceeds the maximum length of %d bytes",
+						MAX_LEVENSHTEIN_STRLEN)));
+
+#ifdef LEVENSHTEIN_LESS_EQUAL
+	/* Initialize start and stop columns. */
+	start_column = 0;
+	stop_column = m + 1;
+
+	/*
+	 * If max_d >= 0, determine whether the bound is impossibly tight.  If so,
+	 * return max_d + 1 immediately.  Otherwise, determine whether it's tight
+	 * enough to limit the computation we must perform.  If so, figure out
+	 * initial stop column.
+	 */
+	if (max_d >= 0)
+	{
+		int			min_theo_d; /* Theoretical minimum distance. */
+		int			max_theo_d; /* Theoretical maximum distance. */
+		int			net_inserts = n - m;
+
+		min_theo_d = net_inserts < 0 ?
+			-net_inserts * del_c : net_inserts * ins_c;
+		if (min_theo_d > max_d)
+			return max_d + 1;
+		if (ins_c + del_c < sub_c)
+			sub_c = ins_c + del_c;
+		max_theo_d = min_theo_d + sub_c * Min(m, n);
+		if (max_d >= max_theo_d)
+			max_d = -1;
+		else if (ins_c + del_c > 0)
+		{
+			/*
+			 * Figure out how much of the first row of the notional matrix we
+			 * need to fill in.  If the string is growing, the theoretical
+			 * minimum distance already incorporates the cost of deleting the
+			 * number of characters necessary to make the two strings equal in
+			 * length.  Each additional deletion forces another insertion, so
+			 * the best-case total cost increases by ins_c + del_c. If the
+			 * string is shrinking, the minimum theoretical cost assumes no
+			 * excess deletions; that is, we're starting no further right than
+			 * column n - m.  If we do start further right, the best-case
+			 * total cost increases by ins_c + del_c for each move right.
+			 */
+			int			slack_d = max_d - min_theo_d;
+			int			best_column = net_inserts < 0 ? -net_inserts : 0;
+
+			stop_column = best_column + (slack_d / (ins_c + del_c)) + 1;
+			if (stop_column > m)
+				stop_column = m + 1;
+		}
+	}
+#endif
+
+	/*
+	 * In order to avoid calling pg_mblen() repeatedly on each character in s,
+	 * we cache all the lengths before starting the main loop -- but if all
+	 * the characters in both strings are single byte, then we skip this and
+	 * use a fast-path in the main loop.  If only one string contains
+	 * multi-byte characters, we still build the array, so that the fast-path
+	 * needn't deal with the case where the array hasn't been initialized.
+	 */
+	if (m != slen || n != tlen)
+	{
+		int			i;
+		const char *cp = source;
+
+		s_char_len = (int *) palloc((m + 1) * sizeof(int));
+		for (i = 0; i < m; ++i)
+		{
+			s_char_len[i] = pg_mblen(cp);
+			cp += s_char_len[i];
+		}
+		s_char_len[i] = 0;
+	}
+
+	/* One more cell for initialization column and row. */
+	++m;
+	++n;
+
+	/* Previous and current rows of notional array. */
+	prev = (int *) palloc(2 * m * sizeof(int));
+	curr = prev + m;
+
+	/*
+	 * To transform the first i characters of s into the first 0 characters of
+	 * t, we must perform i deletions.
+	 */
+	for (i = START_COLUMN; i < STOP_COLUMN; i++)
+		prev[i] = i * del_c;
+
+	/* Loop through rows of the notional array */
+	for (y = target, j = 1; j < n; j++)
+	{
+		int		   *temp;
+		const char *x = source;
+		int			y_char_len = n != tlen + 1 ? pg_mblen(y) : 1;
+
+#ifdef LEVENSHTEIN_LESS_EQUAL
+
+		/*
+		 * In the best case, values percolate down the diagonal unchanged, so
+		 * we must increment stop_column unless it's already on the right end
+		 * of the array.  The inner loop will read prev[stop_column], so we
+		 * have to initialize it even though it shouldn't affect the result.
+		 */
+		if (stop_column < m)
+		{
+			prev[stop_column] = max_d + 1;
+			++stop_column;
+		}
+
+		/*
+		 * The main loop fills in curr, but curr[0] needs a special case: to
+		 * transform the first 0 characters of s into the first j characters
+		 * of t, we must perform j insertions.  However, if start_column > 0,
+		 * this special case does not apply.
+		 */
+		if (start_column == 0)
+		{
+			curr[0] = j * ins_c;
+			i = 1;
+		}
+		else
+			i = start_column;
+#else
+		curr[0] = j * ins_c;
+		i = 1;
+#endif
+
+		/*
+		 * This inner loop is critical to performance, so we include a
+		 * fast-path to handle the (fairly common) case where no multibyte
+		 * characters are in the mix.  The fast-path is entitled to assume
+		 * that if s_char_len is not initialized then BOTH strings contain
+		 * only single-byte characters.
+		 */
+		if (s_char_len != NULL)
+		{
+			for (; i < STOP_COLUMN; i++)
+			{
+				int			ins;
+				int			del;
+				int			sub;
+				int			x_char_len = s_char_len[i - 1];
+
+				/*
+				 * Calculate costs for insertion, deletion, and substitution.
+				 *
+				 * When calculating cost for substitution, we compare the last
+				 * character of each possibly-multibyte character first,
+				 * because that's enough to rule out most mis-matches.  If we
+				 * get past that test, then we compare the lengths and the
+				 * remaining bytes.
+				 */
+				ins = prev[i] + ins_c;
+				del = curr[i - 1] + del_c;
+				if (x[x_char_len - 1] == y[y_char_len - 1]
+					&& x_char_len == y_char_len &&
+					(x_char_len == 1 || rest_of_char_same(x, y, x_char_len)))
+					sub = prev[i - 1];
+				else
+					sub = prev[i - 1] + sub_c;
+
+				/* Take the one with minimum cost. */
+				curr[i] = Min(ins, del);
+				curr[i] = Min(curr[i], sub);
+
+				/* Point to next character. */
+				x += x_char_len;
+			}
+		}
+		else
+		{
+			for (; i < STOP_COLUMN; i++)
+			{
+				int			ins;
+				int			del;
+				int			sub;
+
+				/* Calculate costs for insertion, deletion, and substitution. */
+				ins = prev[i] + ins_c;
+				del = curr[i - 1] + del_c;
+				sub = prev[i - 1] + ((*x == *y) ? 0 : sub_c);
+
+				/* Take the one with minimum cost. */
+				curr[i] = Min(ins, del);
+				curr[i] = Min(curr[i], sub);
+
+				/* Point to next character. */
+				x++;
+			}
+		}
+
+		/* Swap current row with previous row. */
+		temp = curr;
+		curr = prev;
+		prev = temp;
+
+		/* Point to next character. */
+		y += y_char_len;
+
+#ifdef LEVENSHTEIN_LESS_EQUAL
+
+		/*
+		 * This chunk of code represents a significant performance hit if used
+		 * in the case where there is no max_d bound.  This is probably not
+		 * because the max_d >= 0 test itself is expensive, but rather because
+		 * the possibility of needing to execute this code prevents tight
+		 * optimization of the loop as a whole.
+		 */
+		if (max_d >= 0)
+		{
+			/*
+			 * The "zero point" is the column of the current row where the
+			 * remaining portions of the strings are of equal length.  There
+			 * are (n - 1) characters in the target string, of which j have
+			 * been transformed.  There are (m - 1) characters in the source
+			 * string, so we want to find the value for zp where (n - 1) - j =
+			 * (m - 1) - zp.
+			 */
+			int			zp = j - (n - m);
+
+			/* Check whether the stop column can slide left. */
+			while (stop_column > 0)
+			{
+				int			ii = stop_column - 1;
+				int			net_inserts = ii - zp;
+
+				if (prev[ii] + (net_inserts > 0 ? net_inserts * ins_c :
+								-net_inserts * del_c) <= max_d)
+					break;
+				stop_column--;
+			}
+
+			/* Check whether the start column can slide right. */
+			while (start_column < stop_column)
+			{
+				int			net_inserts = start_column - zp;
+
+				if (prev[start_column] +
+					(net_inserts > 0 ? net_inserts * ins_c :
+					 -net_inserts * del_c) <= max_d)
+					break;
+
+				/*
+				 * We'll never again update these values, so we must make sure
+				 * there's nothing here that could confuse any future
+				 * iteration of the outer loop.
+				 */
+				prev[start_column] = max_d + 1;
+				curr[start_column] = max_d + 1;
+				if (start_column != 0)
+					source += (s_char_len != NULL) ? s_char_len[start_column - 1] : 1;
+				start_column++;
+			}
+
+			/* If they cross, we're going to exceed the bound. */
+			if (start_column >= stop_column)
+				return max_d + 1;
+		}
+#endif
+	}
+
+	/*
+	 * Because the final value was swapped from the previous row to the
+	 * current row, that's where we'll find it.
+	 */
+	return prev[m - 1];
+}
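To make the two-row scheme described in the header comment of levenshtein.c easier to follow, here is a stripped-down sketch of the same recurrence. It is an illustration under stated assumptions, not the patch's code: weighted_leven is a hypothetical name, it handles single-byte, null-terminated strings only, it uses malloc rather than palloc, and it leaves out the max_d bound with its start/stop column pruning.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: weighted Levenshtein distance from s (source) to t
 * (target), keeping only the previous and current rows of the notional
 * (m+1) x (n+1) matrix.  Returns -1 if memory allocation fails.
 */
static int
weighted_leven(const char *s, const char *t, int ins_c, int del_c, int sub_c)
{
	int			m = (int) strlen(s);
	int			n = (int) strlen(t);
	int		   *base, *prev, *curr, *temp;
	int			i, j, result;

	/* Trivial cases: only insertions or only deletions are needed. */
	if (m == 0)
		return n * ins_c;
	if (n == 0)
		return m * del_c;

	base = malloc(2 * (m + 1) * sizeof(int));
	if (base == NULL)
		return -1;
	prev = base;
	curr = base + (m + 1);

	/* Turning the first i characters of s into "" takes i deletions. */
	for (i = 0; i <= m; i++)
		prev[i] = i * del_c;

	for (j = 1; j <= n; j++)
	{
		/* Turning "" into the first j characters of t takes j insertions. */
		curr[0] = j * ins_c;
		for (i = 1; i <= m; i++)
		{
			int			ins = prev[i] + ins_c;
			int			del = curr[i - 1] + del_c;
			int			sub = prev[i - 1] + (s[i - 1] == t[j - 1] ? 0 : sub_c);
			int			best = ins < del ? ins : del;

			curr[i] = best < sub ? best : sub;
		}
		/* Swap rows; the row just filled becomes "prev". */
		temp = prev;
		prev = curr;
		curr = temp;
	}

	result = prev[m];
	free(base);
	return result;
}
```

With the (2, 1, 2) costs that distanceName() passes, going from the actual name "orderid" to the typed name "orderids" costs one insertion (2), while the reverse direction costs one deletion (1); that asymmetry is what the "charge half as much per deletion" comment refers to.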
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index c3171b5..4b9e62a 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1546,6 +1546,31 @@ varstr_cmp(char *arg1, int len1, char *arg2, int len2, Oid collid)
 	return result;
 }
 
+/*
+ * varstr_leven()
+ * varstr_leven_less_equal()
+ * Levenshtein distance functions.  All arguments should be strlen(s) <= 255.
+ * Guaranteed to work with Name datatype's cstrings.
+ * For full details see levenshtein.c.
+ *
+ * Helper function. Faster than memcmp(), for this use case.
+ */
+static inline bool
+rest_of_char_same(const char *s1, const char *s2, int len)
+{
+	while (len > 0)
+	{
+		len--;
+		if (s1[len] != s2[len])
+			return false;
+	}
+	return true;
+}
+/* Expand each Levenshtein distance variant */
+#include "levenshtein.c"
+#define LEVENSHTEIN_LESS_EQUAL
+#include "levenshtein.c"
+#undef LEVENSHTEIN_LESS_EQUAL
 
 /* text_cmp()
  * Internal comparison function for text strings.
diff --git a/src/include/parser/parse_relation.h b/src/include/parser/parse_relation.h
index d8b9493..c18157a 100644
--- a/src/include/parser/parse_relation.h
+++ b/src/include/parser/parse_relation.h
@@ -35,7 +35,8 @@ extern RangeTblEntry *GetRTEByRangeTablePosn(ParseState *pstate,
 extern CommonTableExpr *GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte,
 			 int rtelevelsup);
 extern Node *scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte,
-				 char *colname, int location);
+				 char *colname, int location, AttrNumber *matchedatt,
+				 int *distance);
 extern Node *colNameToVar(ParseState *pstate, char *colname, bool localonly,
 			 int location);
 extern void markVarForSelectPriv(ParseState *pstate, Var *var,
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 4e74d85..0abe9bf 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -786,6 +786,11 @@ extern Datum textoverlay_no_len(PG_FUNCTION_ARGS);
 extern Datum name_text(PG_FUNCTION_ARGS);
 extern Datum text_name(PG_FUNCTION_ARGS);
 extern int	varstr_cmp(char *arg1, int len1, char *arg2, int len2, Oid collid);
+extern int	varstr_leven(const char *source, int slen, const char *target,
+						 int tlen, int ins_c, int del_c, int sub_c);
+extern int	varstr_leven_less_equal(const char *source, int slen,
+									const char *target, int tlen, int ins_c,
+									int del_c, int sub_c, int max_d);
 extern List *textToQualifiedNameList(text *textval);
 extern bool SplitIdentifierString(char *rawstring, char separator,
 					  List **namelist);
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index d233710..b24fa43 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -536,6 +536,7 @@ create table atacc1 ( test int );
 -- add a check constraint (fails)
 alter table atacc1 add constraint atacc_test1 check (test1>3);
 ERROR:  column "test1" does not exist
+HINT:  Perhaps you meant to reference the column "atacc1"."test".
 drop table atacc1;
 -- something a little more complicated
 create table atacc1 ( test int, test2 int, test3 int);
@@ -1342,6 +1343,7 @@ select f1 from c1;
 ERROR:  column "f1" does not exist
 LINE 1: select f1 from c1;
                ^
+HINT:  Perhaps you meant to reference the column "c1"."f2".
 drop table p1 cascade;
 NOTICE:  drop cascades to table c1
 create table p1 (f1 int, f2 int);
@@ -1355,6 +1357,7 @@ select f1 from c1;
 ERROR:  column "f1" does not exist
 LINE 1: select f1 from c1;
                ^
+HINT:  Perhaps you meant to reference the column "c1"."f2".
 drop table p1 cascade;
 NOTICE:  drop cascades to table c1
 create table p1 (f1 int, f2 int);
@@ -1479,6 +1482,7 @@ select oid > 0, * from altstartwith; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altstartwith;
                ^
+HINT:  Perhaps you meant to reference the column "altstartwith"."col".
 select * from altstartwith;
  col 
 -----
@@ -1515,10 +1519,12 @@ select oid > 0, * from altwithoid; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altwithoid;
                ^
+HINT:  Perhaps you meant to reference the column "altwithoid"."col".
 select oid > 0, * from altinhoid; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altinhoid;
                ^
+HINT:  Perhaps you meant to reference the column "altinhoid"."col".
 select * from altwithoid;
  col 
 -----
@@ -1554,6 +1560,7 @@ select oid > 0, * from altwithoid; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altwithoid;
                ^
+HINT:  Perhaps you meant to reference the column "altwithoid"."col".
 select oid > 0, * from altinhoid;
  ?column? | col 
 ----------+-----
@@ -1580,6 +1587,7 @@ select oid > 0, * from altwithoid; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altwithoid;
                ^
+HINT:  Perhaps you meant to reference the column "altwithoid"."col".
 select oid > 0, * from altinhoid;
  ?column? | col 
 ----------+-----
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 2501184..3ef5580 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2222,6 +2222,12 @@ select * from t1 left join t2 on (t1.a = t2.a);
  200 | 1000 | 200 | 2001
 (5 rows)
 
+-- Test matching of column name with wrong alias
+select t1.x from t1 join t3 on (t1.a = t3.x);
+ERROR:  column t1.x does not exist
+LINE 1: select t1.x from t1 join t3 on (t1.a = t3.x);
+               ^
+HINT:  Perhaps you meant to reference the column "t3"."x".
 --
 -- regression test for 8.1 merge right join bug
 --
@@ -3415,6 +3421,39 @@ select * from
 (0 rows)
 
 --
+-- Test hints given on incorrect column references are useful
+--
+select t1.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestipn
+ERROR:  column t1.uunique1 does not exist
+LINE 1: select t1.uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t1"."unique1".
+select t2.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+ERROR:  column t2.uunique1 does not exist
+LINE 1: select t2.uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t2"."unique1".
+select uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+ERROR:  column "uunique1" does not exist
+LINE 1: select uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t1"."unique1" or the column "t2"."unique1".
+--
+-- Take care to reference the correct RTE
+--
+select atts.relid::regclass, s.* from pg_stats s join
+    pg_attribute a on s.attname = a.attname and s.tablename =
+    a.attrelid::regclass::text join (select unnest(indkey) attnum,
+    indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+    schemaname != 'pg_catalog';
+ERROR:  column atts.relid does not exist
+LINE 1: select atts.relid::regclass, s.* from pg_stats s join
+               ^
+HINT:  Perhaps you meant to reference the column "atts"."indexrelid".
+--
 -- Test LATERAL
 --
 select unique2, x.*
diff --git a/src/test/regress/expected/plpgsql.out b/src/test/regress/expected/plpgsql.out
index 983f1b8..fb4abe6 100644
--- a/src/test/regress/expected/plpgsql.out
+++ b/src/test/regress/expected/plpgsql.out
@@ -4782,6 +4782,7 @@ END$$;
 ERROR:  column "foo" does not exist
 LINE 1: SELECT rtrim(roomno) AS roomno, foo FROM Room ORDER BY roomn...
                                         ^
+HINT:  Perhaps you meant to reference the column "room"."roomno".
 QUERY:  SELECT rtrim(roomno) AS roomno, foo FROM Room ORDER BY roomno
 CONTEXT:  PL/pgSQL function inline_code_block line 4 at FOR over SELECT rows
 -- Check handling of errors thrown from/into anonymous code blocks.
diff --git a/src/test/regress/expected/rowtypes.out b/src/test/regress/expected/rowtypes.out
index 88e7bfa..19a6e98 100644
--- a/src/test/regress/expected/rowtypes.out
+++ b/src/test/regress/expected/rowtypes.out
@@ -452,6 +452,7 @@ select fullname.text from fullname;  -- error
 ERROR:  column fullname.text does not exist
 LINE 1: select fullname.text from fullname;
                ^
+HINT:  Perhaps you meant to reference the column "fullname"."last".
 -- same, but RECORD instead of named composite type:
 select cast (row('Jim', 'Beam') as text);
     row     
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c79b45c..01c80af 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2396,6 +2396,7 @@ select xmin, * from fooview;  -- fail, views don't have such a column
 ERROR:  column "xmin" does not exist
 LINE 1: select xmin, * from fooview;
                ^
+HINT:  Perhaps you meant to reference the column "fooview"."x".
 select reltoastrelid, relkind, relfrozenxid
   from pg_class where oid = 'fooview'::regclass;
  reltoastrelid | relkind | relfrozenxid 
diff --git a/src/test/regress/expected/without_oid.out b/src/test/regress/expected/without_oid.out
index cb2c0c0..fbff011 100644
--- a/src/test/regress/expected/without_oid.out
+++ b/src/test/regress/expected/without_oid.out
@@ -46,6 +46,7 @@ SELECT count(oid) FROM wo;
 ERROR:  column "oid" does not exist
 LINE 1: SELECT count(oid) FROM wo;
                      ^
+HINT:  Perhaps you meant to reference the column "wo"."i".
 VACUUM ANALYZE wi;
 VACUUM ANALYZE wo;
 SELECT min(relpages) < max(relpages), min(reltuples) - max(reltuples)
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 718e1d9..ca7f966 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -397,6 +397,10 @@ insert into t2a values (200, 2001);
 
 select * from t1 left join t2 on (t1.a = t2.a);
 
+-- Test matching of column name with wrong alias
+
+select t1.x from t1 join t3 on (t1.a = t3.x);
+
 --
 -- regression test for 8.1 merge right join bug
 --
@@ -1051,6 +1055,26 @@ select * from
   int8_tbl x join (int4_tbl x cross join int4_tbl y(ff)) j on q1 = f1; -- ok
 
 --
+-- Test hints given on incorrect column references are useful
+--
+
+select t1.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestipn
+select t2.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+select uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+
+--
+-- Take care to reference the correct RTE
+--
+
+select atts.relid::regclass, s.* from pg_stats s join
+    pg_attribute a on s.attname = a.attname and s.tablename =
+    a.attrelid::regclass::text join (select unnest(indkey) attnum,
+    indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+    schemaname != 'pg_catalog';
+--
 -- Test LATERAL
 --
 
-- 
1.9.1

#77Michael Paquier
michael.paquier@gmail.com
In reply to: Peter Geoghegan (#76)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Mon, Nov 10, 2014 at 1:48 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

Based on this review from a month ago, I'm going to mark this Waiting
on Author. If nobody updates the patch in a few days, I'll mark it
Returned with Feedback. Thanks.

Attached revision fixes the compiler warning that Heikki complained
about. I maintain SQL-callable stub functions from within contrib,
rather than follow Michael's approach. In other words, very little has
changed from my revision from July last [1].

FWIW, I still find this bit of code that this patch adds in varlena.c ugly:
+#include "levenshtein.c"
+#define LEVENSHTEIN_LESS_EQUAL
+#include "levenshtein.c"
+#undef LEVENSHTEIN_LESS_EQUAL
-- 
Michael
#78Peter Geoghegan
pg@heroku.com
In reply to: Michael Paquier (#77)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Sun, Nov 9, 2014 at 8:56 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

FWIW, I still find this bit of code that this patch adds in varlena.c ugly:

+#include "levenshtein.c"
+#define LEVENSHTEIN_LESS_EQUAL
+#include "levenshtein.c"
+#undef LEVENSHTEIN_LESS_EQUAL

Okay, but this is the coding that currently appears within contrib's
fuzzystrmatch.c, more or less unchanged. The "#undef
LEVENSHTEIN_LESS_EQUAL" line that I added ought to be unnecessary.
I'll give the final word on that to whoever picks this up.
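
For readers unfamiliar with the idiom under discussion, here is a minimal single-file sketch of what the double `#include` achieves: one function body compiled twice, once plain and once with an early-exit bound. It is an analogue rather than the patch's code. A macro stands in for the shared levenshtein.c file, and the names `DEFINE_COUNT_DIFFS`, `count_diffs`, and `count_diffs_less_equal` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical shared body.  In the patch this role is played by
 * levenshtein.c, which is #included twice; here a macro stands in for
 * the file so the example fits in one translation unit.
 */
#define DEFINE_COUNT_DIFFS(fname, BOUNDED)                          \
int                                                                 \
fname(const char *a, const char *b, size_t len, int max_d)          \
{                                                                   \
    int     diffs = 0;                                              \
    size_t  i;                                                      \
                                                                    \
    assert(a != NULL && b != NULL);                                 \
    for (i = 0; i < len; i++)                                       \
    {                                                               \
        if (a[i] != b[i])                                           \
            diffs++;                                                \
        /* Only the bounded variant gives up early. */              \
        if (BOUNDED && diffs > max_d)                               \
            return max_d + 1;                                       \
    }                                                               \
    return diffs;                                                   \
}

/* First expansion: plain variant, max_d is ignored. */
DEFINE_COUNT_DIFFS(count_diffs, 0)

/* Second expansion: same body, early exit enabled. */
DEFINE_COUNT_DIFFS(count_diffs_less_equal, 1)
```

With the bound set to 1, `count_diffs_less_equal("abcd", "xyzw", 4, 1)` stops after the second mismatch and reports `max_d + 1`, while `count_diffs` scans the whole string. The `#define`/`#include` pair in the patch selects between the two behaviours in the same way, only at file granularity.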

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#79Michael Paquier
michael.paquier@gmail.com
In reply to: Peter Geoghegan (#76)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Mon, Nov 10, 2014 at 1:48 PM, Peter Geoghegan <pg@heroku.com> wrote:

Reminder: I maintain a slight preference for only offering one
suggestion per relation RTE, which is what this revision does (so no
change there). If a committer who picks this up wants me to alter
that, I don't mind doing so; since only Michael spoke up on this, I've
kept things my way.

Hm. The last version of this patch has not really changed since my
first review, and I have no more feedback to provide about it except what I
already mentioned. I honestly don't think that this patch is ready for
committer as-is... If someone wants to review it further, well extra
opinions I am sure are welcome.
--
Michael

#80Peter Geoghegan
pg@heroku.com
In reply to: Michael Paquier (#79)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Mon, Nov 10, 2014 at 8:13 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Hm. The last version of this patch has not really changed since my
first review, and I have no more feedback to provide about it except what I
already mentioned. I honestly don't think that this patch is ready for
committer as-is... If someone wants to review it further, well extra
opinions I am sure are welcome.

Why not?

You've already said that you're happy to defer to whatever committer
picks this up with regard to whether or not more than a single
suggestion can come from an RTE. I agreed with this (i.e. I said I'd
defer to their opinion too), and once again drew particular attention
to this state of affairs alongside my most recent revision.

What does that leave?

--
Peter Geoghegan

#81Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#80)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Mon, Nov 10, 2014 at 10:29 PM, Peter Geoghegan <pg@heroku.com> wrote:

Why not?

You've already said that you're happy to defer to whatever committer
picks this up with regard to whether or not more than a single
suggestion can come from an RTE. I agreed with this (i.e. I said I'd
defer to their opinion too), and once again drew particular attention
to this state of affairs alongside my most recent revision.

What does that leave?

I see you've marked this "Needs Review", even though you previously
marked it "Ready for Committer" a few months back (Robert marked it
"Waiting on Author" very recently because of the compiler warning, and
then I marked it back to "Ready for Committer" once that was
addressed, before you finally marked it back to "Needs Review" and
removed yourself as the reviewer just now).

I'm pretty puzzled by this. Other than our "agree to disagree and
defer to committer" position on the question of whether or not more
than one suggestion can come from a single RTE, which you were fine
with before [1], I have only restored the core/contrib separation to a
state recently suggested by Robert as the best and simplest all around
[2].

Did I miss something else?

[1]: /messages/by-id/CAB7nPqQObEeQ298F0Rb5+vrgex5_r=j-BVqzgP0qA1Y_xDC_1g@mail.gmail.com
[2]: /messages/by-id/CA+TgmoYKiiq8MC0UJ5i5XfkTYBg1qqfN4YRCkZ60YDUnumkzzQ@mail.gmail.com
--
Peter Geoghegan

#82Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#76)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Sun, Nov 9, 2014 at 11:48 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Based on this review from a month ago, I'm going to mark this Waiting
on Author. If nobody updates the patch in a few days, I'll mark it
Returned with Feedback. Thanks.

Attached revision fixes the compiler warning that Heikki complained
about. I maintain SQL-callable stub functions from within contrib,
rather than follow Michael's approach. In other words, very little has
changed from my revision from July last [1].

I agree with your proposed approach to moving Levenshtein into core.
However, I think this should be separated into two patches, one of
them moving the Levenshtein functionality into core, and the other
adding the new treatment for missing column errors. If you can do
that relatively soon, I'll make an effort to get the refactoring patch
committed in the near future. Once that's done, we can focus in on
the interesting part of the patch, which is the actual machinery for
suggesting alternatives.

On that topic, I think there's unanimous consensus against the design
where equally-distant matches are treated differently based on whether
they are in the same RTE or different RTEs. I think you need to
change that if you want to get anywhere with this. On a related note,
the use of the additional parameter AttrNumber closest[2] to
searchRangeTableForCol() and of the additional parameters AttrNumber
*matchedatt and int *distance to scanRTEForColumn() is less than
self-documenting. I suggest creating a structure called something
like FuzzyAttrMatchState and passing a pointer to it down to both
functions.
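
One possible shape for such a structure is sketched below. It is an illustration only: the field names, the `MAX_FUZZY_MATCHES` cap, and the helper functions are assumptions, and `AttrNumber` is redeclared as a stand-in so that the example compiles outside the PostgreSQL tree. The update rule treats equally distant candidates alike whether they come from the same RTE or from different ones.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

typedef short AttrNumber;       /* stand-in for PostgreSQL's 16-bit typedef */

#define MAX_FUZZY_MATCHES 2     /* hypothetical cap on stored suggestions */

/* Hypothetical state passed down to both lookup functions. */
typedef struct FuzzyAttrMatchState
{
    int         distance;                       /* best distance so far */
    int         nmatches;                       /* candidates at that distance */
    int         rteindex[MAX_FUZZY_MATCHES];    /* range-table index of each */
    AttrNumber  attnum[MAX_FUZZY_MATCHES];      /* attribute number of each */
} FuzzyAttrMatchState;

void
fuzzy_state_init(FuzzyAttrMatchState *state)
{
    assert(state != NULL);
    state->distance = INT_MAX;
    state->nmatches = 0;
}

void
fuzzy_state_update(FuzzyAttrMatchState *state, int rteindex,
                   AttrNumber attnum, int distance)
{
    assert(state != NULL);

    if (distance > state->distance)
        return;                 /* worse than what we already have */

    if (distance < state->distance)
    {
        /* Strictly better: forget every earlier candidate. */
        state->distance = distance;
        state->nmatches = 0;
    }

    /* Ties are recorded the same way regardless of which RTE they are in. */
    if (state->nmatches < MAX_FUZZY_MATCHES)
    {
        state->rteindex[state->nmatches] = rteindex;
        state->attnum[state->nmatches] = attnum;
    }
    state->nmatches++;          /* counts all ties, even those not stored */
}
```

A caller could then decline to emit a HINT when `nmatches` exceeds `MAX_FUZZY_MATCHES`, since too many equally good candidates make any single suggestion a guess.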

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#83Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#82)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I agree with your proposed approach to moving Levenshtein into core.
However, I think this should be separated into two patches, one of
them moving the Levenshtein functionality into core, and the other
adding the new treatment for missing column errors. If you can do
that relatively soon, I'll make an effort to get the refactoring patch
committed in the near future. Once that's done, we can focus in on
the interesting part of the patch, which is the actual machinery for
suggesting alternatives.

Okay, thanks. I think I can do that fairly soon.

On that topic, I think there's unanimous consensus against the design
where equally-distant matches are treated differently based on whether
they are in the same RTE or different RTEs. I think you need to
change that if you want to get anywhere with this.

Alright. It wasn't as if I felt very strongly about it either way.

On a related note,
the use of the additional parameter AttrNumber closest[2] to
searchRangeTableForCol() and of the additional parameters AttrNumber
*matchedatt and int *distance to scanRTEForColumn() is less than
self-documenting. I suggest creating a structure called something
like FuzzyAttrMatchState and passing a pointer to it down to both
functions.

Sure.
--
Peter Geoghegan

#84Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#83)
1 attachment(s)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 12, 2014 at 1:13 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I agree with your proposed approach to moving Levenshtein into core.
However, I think this should be separated into two patches, one of
them moving the Levenshtein functionality into core, and the other
adding the new treatment for missing column errors. If you can do
that relatively soon, I'll make an effort to get the refactoring patch
committed in the near future. Once that's done, we can focus in on
the interesting part of the patch, which is the actual machinery for
suggesting alternatives.

Okay, thanks. I think I can do that fairly soon.

Attached patch moves the Levenshtein distance implementation into core.

You're missing patch 2 of 2 here, because I have yet to incorporate
your feedback on the HINT itself -- when I've done that, I'll post a
newly rebased patch 2/2, with those items taken care of. As you
pointed out, there is no reason to wait for that.
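
As a reminder of what is being moved, the following is a minimal sketch of the two-row Levenshtein computation with a pointer-plus-length interface in the style of `varstr_leven()`. It makes simplifying assumptions that the real code does not: it compares bytes rather than multibyte characters, it uses `malloc` instead of `palloc`, it enforces no length limit, and the name `leven_bytes` is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Byte-wise Levenshtein distance between source[0..slen) and
 * target[0..tlen) with configurable insert/delete/substitute costs.
 * Returns -1 if memory cannot be allocated.
 */
int
leven_bytes(const char *source, int slen, const char *target, int tlen,
            int ins_c, int del_c, int sub_c)
{
    int    *buf;
    int    *prev;
    int    *curr;
    int     i;
    int     j;
    int     result;

    assert(source != NULL && target != NULL && slen >= 0 && tlen >= 0);

    /* An empty side costs only insertions or only deletions. */
    if (slen == 0)
        return tlen * ins_c;
    if (tlen == 0)
        return slen * del_c;

    /* Two rows of slen + 1 cells instead of the full matrix. */
    buf = malloc(2 * (size_t) (slen + 1) * sizeof(int));
    if (buf == NULL)
        return -1;
    prev = buf;
    curr = buf + (slen + 1);

    /* Turning the first i bytes of source into "" takes i deletions. */
    for (i = 0; i <= slen; i++)
        prev[i] = i * del_c;

    for (j = 1; j <= tlen; j++)
    {
        int    *tmp;

        /* Building the first j bytes of target from "" takes j insertions. */
        curr[0] = j * ins_c;

        for (i = 1; i <= slen; i++)
        {
            int     ins = prev[i] + ins_c;
            int     del = curr[i - 1] + del_c;
            int     sub = prev[i - 1] +
                (source[i - 1] == target[j - 1] ? 0 : sub_c);
            int     best = ins < del ? ins : del;

            curr[i] = best < sub ? best : sub;
        }

        /* The row just filled becomes the previous row. */
        tmp = prev;
        prev = curr;
        curr = tmp;
    }

    result = prev[slen];
    free(buf);
    return result;
}
```

For a typo like the ones in the regression tests, `leven_bytes("uunique1", 8, "unique1", 7, 1, 1, 1)` yields 1, a single deletion.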

--
Peter Geoghegan

Attachments:

.0001-Move-Levenshtein-distance-implementation-into-core.patch.swp (application/octet-stream; binary Vim swap file, contents omitted)

#85Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#84)
1 attachment(s)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 12, 2014 at 4:54 PM, Peter Geoghegan <pg@heroku.com> wrote:

Attached patch moves the Levenshtein distance implementation into core.

Oops. Somehow managed to send a *.patch.swp file. :-)

Here is the actual patch.

--
Peter Geoghegan

Attachments:

0001-Move-Levenshtein-distance-implementation-into-core.patch (text/x-patch; charset=US-ASCII)
From b7df918f1a52107637600f3b22d1cff18bd07ae1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@heroku.com>
Date: Sat, 30 Nov 2013 23:15:00 -0800
Subject: [PATCH 1/2] Move Levenshtein distance implementation into core

The fuzzystrmatch Levenshtein distance implementation is moved from
/contrib to core.  However, the related SQL-callable functions may still
only be used with the fuzzystrmatch extension installed -- the
fuzzystrmatch definitions become simple forwarding stubs.

It is not anticipated that the user-facing SQL functions will be moved
into core in the future, due to the MAX_LEVENSHTEIN_STRLEN restriction.
An in-core Levenshtein distance implementation is only anticipated to be
helpful in building diagnostic messages.
---
 contrib/fuzzystrmatch/Makefile        |   3 -
 contrib/fuzzystrmatch/fuzzystrmatch.c |  81 ++++---
 contrib/fuzzystrmatch/levenshtein.c   | 403 ----------------------------------
 src/backend/utils/adt/Makefile        |   2 +
 src/backend/utils/adt/levenshtein.c   | 394 +++++++++++++++++++++++++++++++++
 src/backend/utils/adt/varlena.c       |  24 ++
 src/include/utils/builtins.h          |   5 +
 7 files changed, 481 insertions(+), 431 deletions(-)
 delete mode 100644 contrib/fuzzystrmatch/levenshtein.c
 create mode 100644 src/backend/utils/adt/levenshtein.c

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 024265d..0327d95 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -17,6 +17,3 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
-
-# levenshtein.c is #included by fuzzystrmatch.c
-fuzzystrmatch.o: fuzzystrmatch.c levenshtein.c
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c
index 7a53d8a..62e650f 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.c
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.c
@@ -154,23 +154,6 @@ getcode(char c)
 /* These prevent GH from becoming F */
 #define NOGHTOF(c)	(getcode(c) & 16)	/* BDH */
 
-/* Faster than memcmp(), for this use case. */
-static inline bool
-rest_of_char_same(const char *s1, const char *s2, int len)
-{
-	while (len > 0)
-	{
-		len--;
-		if (s1[len] != s2[len])
-			return false;
-	}
-	return true;
-}
-
-#include "levenshtein.c"
-#define LEVENSHTEIN_LESS_EQUAL
-#include "levenshtein.c"
-
 PG_FUNCTION_INFO_V1(levenshtein_with_costs);
 Datum
 levenshtein_with_costs(PG_FUNCTION_ARGS)
@@ -180,8 +163,20 @@ levenshtein_with_costs(PG_FUNCTION_ARGS)
 	int			ins_c = PG_GETARG_INT32(2);
 	int			del_c = PG_GETARG_INT32(3);
 	int			sub_c = PG_GETARG_INT32(4);
-
-	PG_RETURN_INT32(levenshtein_internal(src, dst, ins_c, del_c, sub_c));
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes,
+				t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes and characters */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(varstr_leven(s_data, s_bytes, t_data, t_bytes, ins_c,
+								 del_c, sub_c));
 }
 
 
@@ -191,8 +186,20 @@ levenshtein(PG_FUNCTION_ARGS)
 {
 	text	   *src = PG_GETARG_TEXT_PP(0);
 	text	   *dst = PG_GETARG_TEXT_PP(1);
-
-	PG_RETURN_INT32(levenshtein_internal(src, dst, 1, 1, 1));
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes,
+				t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes and characters */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(varstr_leven(s_data, s_bytes, t_data, t_bytes, 1, 1,
+									   1));
 }
 
 
@@ -206,8 +213,20 @@ levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS)
 	int			del_c = PG_GETARG_INT32(3);
 	int			sub_c = PG_GETARG_INT32(4);
 	int			max_d = PG_GETARG_INT32(5);
-
-	PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, ins_c, del_c, sub_c, max_d));
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes,
+				t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes and characters */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(varstr_leven_less_equal(s_data, s_bytes, t_data, t_bytes,
+											ins_c, del_c, sub_c, max_d));
 }
 
 
@@ -218,8 +237,20 @@ levenshtein_less_equal(PG_FUNCTION_ARGS)
 	text	   *src = PG_GETARG_TEXT_PP(0);
 	text	   *dst = PG_GETARG_TEXT_PP(1);
 	int			max_d = PG_GETARG_INT32(2);
-
-	PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, 1, 1, 1, max_d));
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes,
+				t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes and characters */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(varstr_leven_less_equal(s_data, s_bytes, t_data, t_bytes,
+											1, 1, 1, max_d));
 }
 
 
diff --git a/contrib/fuzzystrmatch/levenshtein.c b/contrib/fuzzystrmatch/levenshtein.c
deleted file mode 100644
index 4f37a54..0000000
--- a/contrib/fuzzystrmatch/levenshtein.c
+++ /dev/null
@@ -1,403 +0,0 @@
-/*
- * levenshtein.c
- *
- * Functions for "fuzzy" comparison of strings
- *
- * Joe Conway <mail@joeconway.com>
- *
- * Copyright (c) 2001-2014, PostgreSQL Global Development Group
- * ALL RIGHTS RESERVED;
- *
- * levenshtein()
- * -------------
- * Written based on a description of the algorithm by Michael Gilleland
- * found at http://www.merriampark.com/ld.htm
- * Also looked at levenshtein.c in the PHP 4.0.6 distribution for
- * inspiration.
- * Configurable penalty costs extension is introduced by Volkan
- * YAZICI <volkan.yazici@gmail.com>.
- */
-
-/*
- * External declarations for exported functions
- */
-#ifdef LEVENSHTEIN_LESS_EQUAL
-static int levenshtein_less_equal_internal(text *s, text *t,
-								int ins_c, int del_c, int sub_c, int max_d);
-#else
-static int levenshtein_internal(text *s, text *t,
-					 int ins_c, int del_c, int sub_c);
-#endif
-
-#define MAX_LEVENSHTEIN_STRLEN		255
-
-
-/*
- * Calculates Levenshtein distance metric between supplied strings. Generally
- * (1, 1, 1) penalty costs suffices for common cases, but your mileage may
- * vary.
- *
- * One way to compute Levenshtein distance is to incrementally construct
- * an (m+1)x(n+1) matrix where cell (i, j) represents the minimum number
- * of operations required to transform the first i characters of s into
- * the first j characters of t.  The last column of the final row is the
- * answer.
- *
- * We use that algorithm here with some modification.  In lieu of holding
- * the entire array in memory at once, we'll just use two arrays of size
- * m+1 for storing accumulated values. At each step one array represents
- * the "previous" row and one is the "current" row of the notional large
- * array.
- *
- * If max_d >= 0, we only need to provide an accurate answer when that answer
- * is less than or equal to the bound.  From any cell in the matrix, there is
- * theoretical "minimum residual distance" from that cell to the last column
- * of the final row.  This minimum residual distance is zero when the
- * untransformed portions of the strings are of equal length (because we might
- * get lucky and find all the remaining characters matching) and is otherwise
- * based on the minimum number of insertions or deletions needed to make them
- * equal length.  The residual distance grows as we move toward the upper
- * right or lower left corners of the matrix.  When the max_d bound is
- * usefully tight, we can use this property to avoid computing the entirety
- * of each row; instead, we maintain a start_column and stop_column that
- * identify the portion of the matrix close to the diagonal which can still
- * affect the final answer.
- */
-static int
-#ifdef LEVENSHTEIN_LESS_EQUAL
-levenshtein_less_equal_internal(text *s, text *t,
-								int ins_c, int del_c, int sub_c, int max_d)
-#else
-levenshtein_internal(text *s, text *t,
-					 int ins_c, int del_c, int sub_c)
-#endif
-{
-	int			m,
-				n,
-				s_bytes,
-				t_bytes;
-	int		   *prev;
-	int		   *curr;
-	int		   *s_char_len = NULL;
-	int			i,
-				j;
-	const char *s_data;
-	const char *t_data;
-	const char *y;
-
-	/*
-	 * For levenshtein_less_equal_internal, we have real variables called
-	 * start_column and stop_column; otherwise it's just short-hand for 0 and
-	 * m.
-	 */
-#ifdef LEVENSHTEIN_LESS_EQUAL
-	int			start_column,
-				stop_column;
-
-#undef START_COLUMN
-#undef STOP_COLUMN
-#define START_COLUMN start_column
-#define STOP_COLUMN stop_column
-#else
-#undef START_COLUMN
-#undef STOP_COLUMN
-#define START_COLUMN 0
-#define STOP_COLUMN m
-#endif
-
-	/* Extract a pointer to the actual character data. */
-	s_data = VARDATA_ANY(s);
-	t_data = VARDATA_ANY(t);
-
-	/* Determine length of each string in bytes and characters. */
-	s_bytes = VARSIZE_ANY_EXHDR(s);
-	t_bytes = VARSIZE_ANY_EXHDR(t);
-	m = pg_mbstrlen_with_len(s_data, s_bytes);
-	n = pg_mbstrlen_with_len(t_data, t_bytes);
-
-	/*
-	 * We can transform an empty s into t with n insertions, or a non-empty t
-	 * into an empty s with m deletions.
-	 */
-	if (!m)
-		return n * ins_c;
-	if (!n)
-		return m * del_c;
-
-	/*
-	 * For security concerns, restrict excessive CPU+RAM usage. (This
-	 * implementation uses O(m) memory and has O(mn) complexity.)
-	 */
-	if (m > MAX_LEVENSHTEIN_STRLEN ||
-		n > MAX_LEVENSHTEIN_STRLEN)
-		ereport(ERROR,
-				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-				 errmsg("argument exceeds the maximum length of %d bytes",
-						MAX_LEVENSHTEIN_STRLEN)));
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
-	/* Initialize start and stop columns. */
-	start_column = 0;
-	stop_column = m + 1;
-
-	/*
-	 * If max_d >= 0, determine whether the bound is impossibly tight.  If so,
-	 * return max_d + 1 immediately.  Otherwise, determine whether it's tight
-	 * enough to limit the computation we must perform.  If so, figure out
-	 * initial stop column.
-	 */
-	if (max_d >= 0)
-	{
-		int			min_theo_d; /* Theoretical minimum distance. */
-		int			max_theo_d; /* Theoretical maximum distance. */
-		int			net_inserts = n - m;
-
-		min_theo_d = net_inserts < 0 ?
-			-net_inserts * del_c : net_inserts * ins_c;
-		if (min_theo_d > max_d)
-			return max_d + 1;
-		if (ins_c + del_c < sub_c)
-			sub_c = ins_c + del_c;
-		max_theo_d = min_theo_d + sub_c * Min(m, n);
-		if (max_d >= max_theo_d)
-			max_d = -1;
-		else if (ins_c + del_c > 0)
-		{
-			/*
-			 * Figure out how much of the first row of the notional matrix we
-			 * need to fill in.  If the string is growing, the theoretical
-			 * minimum distance already incorporates the cost of deleting the
-			 * number of characters necessary to make the two strings equal in
-			 * length.  Each additional deletion forces another insertion, so
-			 * the best-case total cost increases by ins_c + del_c. If the
-			 * string is shrinking, the minimum theoretical cost assumes no
-			 * excess deletions; that is, we're starting no further right than
-			 * column n - m.  If we do start further right, the best-case
-			 * total cost increases by ins_c + del_c for each move right.
-			 */
-			int			slack_d = max_d - min_theo_d;
-			int			best_column = net_inserts < 0 ? -net_inserts : 0;
-
-			stop_column = best_column + (slack_d / (ins_c + del_c)) + 1;
-			if (stop_column > m)
-				stop_column = m + 1;
-		}
-	}
-#endif
-
-	/*
-	 * In order to avoid calling pg_mblen() repeatedly on each character in s,
-	 * we cache all the lengths before starting the main loop -- but if all
-	 * the characters in both strings are single byte, then we skip this and
-	 * use a fast-path in the main loop.  If only one string contains
-	 * multi-byte characters, we still build the array, so that the fast-path
-	 * needn't deal with the case where the array hasn't been initialized.
-	 */
-	if (m != s_bytes || n != t_bytes)
-	{
-		int			i;
-		const char *cp = s_data;
-
-		s_char_len = (int *) palloc((m + 1) * sizeof(int));
-		for (i = 0; i < m; ++i)
-		{
-			s_char_len[i] = pg_mblen(cp);
-			cp += s_char_len[i];
-		}
-		s_char_len[i] = 0;
-	}
-
-	/* One more cell for initialization column and row. */
-	++m;
-	++n;
-
-	/* Previous and current rows of notional array. */
-	prev = (int *) palloc(2 * m * sizeof(int));
-	curr = prev + m;
-
-	/*
-	 * To transform the first i characters of s into the first 0 characters of
-	 * t, we must perform i deletions.
-	 */
-	for (i = START_COLUMN; i < STOP_COLUMN; i++)
-		prev[i] = i * del_c;
-
-	/* Loop through rows of the notional array */
-	for (y = t_data, j = 1; j < n; j++)
-	{
-		int		   *temp;
-		const char *x = s_data;
-		int			y_char_len = n != t_bytes + 1 ? pg_mblen(y) : 1;
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
-
-		/*
-		 * In the best case, values percolate down the diagonal unchanged, so
-		 * we must increment stop_column unless it's already on the right end
-		 * of the array.  The inner loop will read prev[stop_column], so we
-		 * have to initialize it even though it shouldn't affect the result.
-		 */
-		if (stop_column < m)
-		{
-			prev[stop_column] = max_d + 1;
-			++stop_column;
-		}
-
-		/*
-		 * The main loop fills in curr, but curr[0] needs a special case: to
-		 * transform the first 0 characters of s into the first j characters
-		 * of t, we must perform j insertions.  However, if start_column > 0,
-		 * this special case does not apply.
-		 */
-		if (start_column == 0)
-		{
-			curr[0] = j * ins_c;
-			i = 1;
-		}
-		else
-			i = start_column;
-#else
-		curr[0] = j * ins_c;
-		i = 1;
-#endif
-
-		/*
-		 * This inner loop is critical to performance, so we include a
-		 * fast-path to handle the (fairly common) case where no multibyte
-		 * characters are in the mix.  The fast-path is entitled to assume
-		 * that if s_char_len is not initialized then BOTH strings contain
-		 * only single-byte characters.
-		 */
-		if (s_char_len != NULL)
-		{
-			for (; i < STOP_COLUMN; i++)
-			{
-				int			ins;
-				int			del;
-				int			sub;
-				int			x_char_len = s_char_len[i - 1];
-
-				/*
-				 * Calculate costs for insertion, deletion, and substitution.
-				 *
-				 * When calculating cost for substitution, we compare the last
-				 * character of each possibly-multibyte character first,
-				 * because that's enough to rule out most mis-matches.  If we
-				 * get past that test, then we compare the lengths and the
-				 * remaining bytes.
-				 */
-				ins = prev[i] + ins_c;
-				del = curr[i - 1] + del_c;
-				if (x[x_char_len - 1] == y[y_char_len - 1]
-					&& x_char_len == y_char_len &&
-					(x_char_len == 1 || rest_of_char_same(x, y, x_char_len)))
-					sub = prev[i - 1];
-				else
-					sub = prev[i - 1] + sub_c;
-
-				/* Take the one with minimum cost. */
-				curr[i] = Min(ins, del);
-				curr[i] = Min(curr[i], sub);
-
-				/* Point to next character. */
-				x += x_char_len;
-			}
-		}
-		else
-		{
-			for (; i < STOP_COLUMN; i++)
-			{
-				int			ins;
-				int			del;
-				int			sub;
-
-				/* Calculate costs for insertion, deletion, and substitution. */
-				ins = prev[i] + ins_c;
-				del = curr[i - 1] + del_c;
-				sub = prev[i - 1] + ((*x == *y) ? 0 : sub_c);
-
-				/* Take the one with minimum cost. */
-				curr[i] = Min(ins, del);
-				curr[i] = Min(curr[i], sub);
-
-				/* Point to next character. */
-				x++;
-			}
-		}
-
-		/* Swap current row with previous row. */
-		temp = curr;
-		curr = prev;
-		prev = temp;
-
-		/* Point to next character. */
-		y += y_char_len;
-
-#ifdef LEVENSHTEIN_LESS_EQUAL
-
-		/*
-		 * This chunk of code represents a significant performance hit if used
-		 * in the case where there is no max_d bound.  This is probably not
-		 * because the max_d >= 0 test itself is expensive, but rather because
-		 * the possibility of needing to execute this code prevents tight
-		 * optimization of the loop as a whole.
-		 */
-		if (max_d >= 0)
-		{
-			/*
-			 * The "zero point" is the column of the current row where the
-			 * remaining portions of the strings are of equal length.  There
-			 * are (n - 1) characters in the target string, of which j have
-			 * been transformed.  There are (m - 1) characters in the source
-			 * string, so we want to find the value for zp where (n - 1) - j =
-			 * (m - 1) - zp.
-			 */
-			int			zp = j - (n - m);
-
-			/* Check whether the stop column can slide left. */
-			while (stop_column > 0)
-			{
-				int			ii = stop_column - 1;
-				int			net_inserts = ii - zp;
-
-				if (prev[ii] + (net_inserts > 0 ? net_inserts * ins_c :
-								-net_inserts * del_c) <= max_d)
-					break;
-				stop_column--;
-			}
-
-			/* Check whether the start column can slide right. */
-			while (start_column < stop_column)
-			{
-				int			net_inserts = start_column - zp;
-
-				if (prev[start_column] +
-					(net_inserts > 0 ? net_inserts * ins_c :
-					 -net_inserts * del_c) <= max_d)
-					break;
-
-				/*
-				 * We'll never again update these values, so we must make sure
-				 * there's nothing here that could confuse any future
-				 * iteration of the outer loop.
-				 */
-				prev[start_column] = max_d + 1;
-				curr[start_column] = max_d + 1;
-				if (start_column != 0)
-					s_data += (s_char_len != NULL) ? s_char_len[start_column - 1] : 1;
-				start_column++;
-			}
-
-			/* If they cross, we're going to exceed the bound. */
-			if (start_column >= stop_column)
-				return max_d + 1;
-		}
-#endif
-	}
-
-	/*
-	 * Because the final value was swapped from the previous row to the
-	 * current row, that's where we'll find it.
-	 */
-	return prev[m - 1];
-}
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 7b4391b..3ea9bf4 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -38,4 +38,6 @@ OBJS = acl.o arrayfuncs.o array_selfuncs.o array_typanalyze.o \
 
 like.o: like.c like_match.c
 
+varlena.o: varlena.c levenshtein.c
+
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/utils/adt/levenshtein.c b/src/backend/utils/adt/levenshtein.c
new file mode 100644
index 0000000..bf4f1dd
--- /dev/null
+++ b/src/backend/utils/adt/levenshtein.c
@@ -0,0 +1,394 @@
+/*-------------------------------------------------------------------------
+ *
+ * levenshtein.c
+ *	  Levenshtein distance implementation.
+ *
+ * Original author:  Joe Conway <mail@joeconway.com>
+ *
+ * This file is included by varlena.c twice, to provide matching code for (1)
+ * Levenshtein distance with custom costings, and (2) Levenshtein distance with
+ * custom costings and a "max" value above which exact distances are not
+ * interesting.  Before the inclusion, we rely on the presence of the inline
+ * function rest_of_char_same().
+ *
+ * Written based on a description of the algorithm by Michael Gilleland found
+ * at http://www.merriampark.com/ld.htm.  Also looked at levenshtein.c in the
+ * PHP 4.0.6 distribution for inspiration.  Configurable penalty costs
+ * extension is introduced by Volkan YAZICI <volkan.yazici@gmail.com>.
+ *
+ * Copyright (c) 2001-2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	src/backend/utils/adt/levenshtein.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#define MAX_LEVENSHTEIN_STRLEN		255
+
+/*
+ * Calculates Levenshtein distance metric between supplied strings, which are
+ * not necessarily null-terminated.  Generally (1, 1, 1) penalty costs suffice
+ * for common cases, but your mileage may vary.
+ *
+ * One way to compute Levenshtein distance is to incrementally construct
+ * an (m+1)x(n+1) matrix where cell (i, j) represents the minimum number
+ * of operations required to transform the first i characters of s into
+ * the first j characters of t.  The last column of the final row is the
+ * answer.
+ *
+ * We use that algorithm here with some modification.  In lieu of holding
+ * the entire array in memory at once, we'll just use two arrays of size
+ * m+1 for storing accumulated values. At each step one array represents
+ * the "previous" row and one is the "current" row of the notional large
+ * array.
+ *
+ * If max_d >= 0, we only need to provide an accurate answer when that answer
+ * is less than or equal to the bound.  From any cell in the matrix, there is a
+ * theoretical "minimum residual distance" from that cell to the last column
+ * of the final row.  This minimum residual distance is zero when the
+ * untransformed portions of the strings are of equal length (because we might
+ * get lucky and find all the remaining characters matching) and is otherwise
+ * based on the minimum number of insertions or deletions needed to make them
+ * equal length.  The residual distance grows as we move toward the upper
+ * right or lower left corners of the matrix.  When the max_d bound is
+ * usefully tight, we can use this property to avoid computing the entirety
+ * of each row; instead, we maintain a start_column and stop_column that
+ * identify the portion of the matrix close to the diagonal which can still
+ * affect the final answer.
+ */
+int
+#ifdef LEVENSHTEIN_LESS_EQUAL
+varstr_leven_less_equal(const char *source, int slen, const char *target,
+						int tlen, int ins_c, int del_c, int sub_c, int max_d)
+#else
+varstr_leven(const char *source, int slen, const char *target, int tlen,
+			 int ins_c, int del_c, int sub_c)
+#endif
+{
+	int			m,
+				n;
+	int		   *prev;
+	int		   *curr;
+	int		   *s_char_len = NULL;
+	int			i,
+				j;
+	const char *y;
+
+	/*
+	 * For varstr_leven_less_equal, we have real variables called
+	 * start_column and stop_column; otherwise it's just short-hand for 0 and
+	 * m.
+	 */
+#ifdef LEVENSHTEIN_LESS_EQUAL
+	int			start_column,
+				stop_column;
+
+#undef START_COLUMN
+#undef STOP_COLUMN
+#define START_COLUMN start_column
+#define STOP_COLUMN stop_column
+#else
+#undef START_COLUMN
+#undef STOP_COLUMN
+#define START_COLUMN 0
+#define STOP_COLUMN m
+#endif
+
+	/*
+	 * A common use for Levenshtein distance is to match attributes when
+	 * building diagnostic, user-visible messages.  Restrict the size of
+	 * MAX_LEVENSHTEIN_STRLEN at compile time such that this is guaranteed to
+	 * work.
+	 */
+	StaticAssertStmt(NAMEDATALEN <= MAX_LEVENSHTEIN_STRLEN,
+					 "Levenshtein hinting mechanism restricts NAMEDATALEN");
+
+	m = pg_mbstrlen_with_len(source, slen);
+	n = pg_mbstrlen_with_len(target, tlen);
+
+	/*
+	 * We can transform an empty s into t with n insertions, or a non-empty t
+	 * into an empty s with m deletions.
+	 */
+	if (!m)
+		return n * ins_c;
+	if (!n)
+		return m * del_c;
+
+	/*
+	 * For security concerns, restrict excessive CPU+RAM usage. (This
+	 * implementation uses O(m) memory and has O(mn) complexity.)
+	 */
+	if (m > MAX_LEVENSHTEIN_STRLEN ||
+		n > MAX_LEVENSHTEIN_STRLEN)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("argument exceeds the maximum length of %d bytes",
+						MAX_LEVENSHTEIN_STRLEN)));
+
+#ifdef LEVENSHTEIN_LESS_EQUAL
+	/* Initialize start and stop columns. */
+	start_column = 0;
+	stop_column = m + 1;
+
+	/*
+	 * If max_d >= 0, determine whether the bound is impossibly tight.  If so,
+	 * return max_d + 1 immediately.  Otherwise, determine whether it's tight
+	 * enough to limit the computation we must perform.  If so, figure out
+	 * initial stop column.
+	 */
+	if (max_d >= 0)
+	{
+		int			min_theo_d; /* Theoretical minimum distance. */
+		int			max_theo_d; /* Theoretical maximum distance. */
+		int			net_inserts = n - m;
+
+		min_theo_d = net_inserts < 0 ?
+			-net_inserts * del_c : net_inserts * ins_c;
+		if (min_theo_d > max_d)
+			return max_d + 1;
+		if (ins_c + del_c < sub_c)
+			sub_c = ins_c + del_c;
+		max_theo_d = min_theo_d + sub_c * Min(m, n);
+		if (max_d >= max_theo_d)
+			max_d = -1;
+		else if (ins_c + del_c > 0)
+		{
+			/*
+			 * Figure out how much of the first row of the notional matrix we
+			 * need to fill in.  If the string is growing, the theoretical
+			 * minimum distance already incorporates the cost of deleting the
+			 * number of characters necessary to make the two strings equal in
+			 * length.  Each additional deletion forces another insertion, so
+			 * the best-case total cost increases by ins_c + del_c. If the
+			 * string is shrinking, the minimum theoretical cost assumes no
+			 * excess deletions; that is, we're starting no further right than
+			 * column n - m.  If we do start further right, the best-case
+			 * total cost increases by ins_c + del_c for each move right.
+			 */
+			int			slack_d = max_d - min_theo_d;
+			int			best_column = net_inserts < 0 ? -net_inserts : 0;
+
+			stop_column = best_column + (slack_d / (ins_c + del_c)) + 1;
+			if (stop_column > m)
+				stop_column = m + 1;
+		}
+	}
+#endif
+
+	/*
+	 * In order to avoid calling pg_mblen() repeatedly on each character in s,
+	 * we cache all the lengths before starting the main loop -- but if all
+	 * the characters in both strings are single byte, then we skip this and
+	 * use a fast-path in the main loop.  If only one string contains
+	 * multi-byte characters, we still build the array, so that the fast-path
+	 * needn't deal with the case where the array hasn't been initialized.
+	 */
+	if (m != slen || n != tlen)
+	{
+		int			i;
+		const char *cp = source;
+
+		s_char_len = (int *) palloc((m + 1) * sizeof(int));
+		for (i = 0; i < m; ++i)
+		{
+			s_char_len[i] = pg_mblen(cp);
+			cp += s_char_len[i];
+		}
+		s_char_len[i] = 0;
+	}
+
+	/* One more cell for initialization column and row. */
+	++m;
+	++n;
+
+	/* Previous and current rows of notional array. */
+	prev = (int *) palloc(2 * m * sizeof(int));
+	curr = prev + m;
+
+	/*
+	 * To transform the first i characters of s into the first 0 characters of
+	 * t, we must perform i deletions.
+	 */
+	for (i = START_COLUMN; i < STOP_COLUMN; i++)
+		prev[i] = i * del_c;
+
+	/* Loop through rows of the notional array */
+	for (y = target, j = 1; j < n; j++)
+	{
+		int		   *temp;
+		const char *x = source;
+		int			y_char_len = n != tlen + 1 ? pg_mblen(y) : 1;
+
+#ifdef LEVENSHTEIN_LESS_EQUAL
+
+		/*
+		 * In the best case, values percolate down the diagonal unchanged, so
+		 * we must increment stop_column unless it's already on the right end
+		 * of the array.  The inner loop will read prev[stop_column], so we
+		 * have to initialize it even though it shouldn't affect the result.
+		 */
+		if (stop_column < m)
+		{
+			prev[stop_column] = max_d + 1;
+			++stop_column;
+		}
+
+		/*
+		 * The main loop fills in curr, but curr[0] needs a special case: to
+		 * transform the first 0 characters of s into the first j characters
+		 * of t, we must perform j insertions.  However, if start_column > 0,
+		 * this special case does not apply.
+		 */
+		if (start_column == 0)
+		{
+			curr[0] = j * ins_c;
+			i = 1;
+		}
+		else
+			i = start_column;
+#else
+		curr[0] = j * ins_c;
+		i = 1;
+#endif
+
+		/*
+		 * This inner loop is critical to performance, so we include a
+		 * fast-path to handle the (fairly common) case where no multibyte
+		 * characters are in the mix.  The fast-path is entitled to assume
+		 * that if s_char_len is not initialized then BOTH strings contain
+		 * only single-byte characters.
+		 */
+		if (s_char_len != NULL)
+		{
+			for (; i < STOP_COLUMN; i++)
+			{
+				int			ins;
+				int			del;
+				int			sub;
+				int			x_char_len = s_char_len[i - 1];
+
+				/*
+				 * Calculate costs for insertion, deletion, and substitution.
+				 *
+				 * When calculating cost for substitution, we compare the last
+				 * character of each possibly-multibyte character first,
+				 * because that's enough to rule out most mis-matches.  If we
+				 * get past that test, then we compare the lengths and the
+				 * remaining bytes.
+				 */
+				ins = prev[i] + ins_c;
+				del = curr[i - 1] + del_c;
+				if (x[x_char_len - 1] == y[y_char_len - 1]
+					&& x_char_len == y_char_len &&
+					(x_char_len == 1 || rest_of_char_same(x, y, x_char_len)))
+					sub = prev[i - 1];
+				else
+					sub = prev[i - 1] + sub_c;
+
+				/* Take the one with minimum cost. */
+				curr[i] = Min(ins, del);
+				curr[i] = Min(curr[i], sub);
+
+				/* Point to next character. */
+				x += x_char_len;
+			}
+		}
+		else
+		{
+			for (; i < STOP_COLUMN; i++)
+			{
+				int			ins;
+				int			del;
+				int			sub;
+
+				/* Calculate costs for insertion, deletion, and substitution. */
+				ins = prev[i] + ins_c;
+				del = curr[i - 1] + del_c;
+				sub = prev[i - 1] + ((*x == *y) ? 0 : sub_c);
+
+				/* Take the one with minimum cost. */
+				curr[i] = Min(ins, del);
+				curr[i] = Min(curr[i], sub);
+
+				/* Point to next character. */
+				x++;
+			}
+		}
+
+		/* Swap current row with previous row. */
+		temp = curr;
+		curr = prev;
+		prev = temp;
+
+		/* Point to next character. */
+		y += y_char_len;
+
+#ifdef LEVENSHTEIN_LESS_EQUAL
+
+		/*
+		 * This chunk of code represents a significant performance hit if used
+		 * in the case where there is no max_d bound.  This is probably not
+		 * because the max_d >= 0 test itself is expensive, but rather because
+		 * the possibility of needing to execute this code prevents tight
+		 * optimization of the loop as a whole.
+		 */
+		if (max_d >= 0)
+		{
+			/*
+			 * The "zero point" is the column of the current row where the
+			 * remaining portions of the strings are of equal length.  There
+			 * are (n - 1) characters in the target string, of which j have
+			 * been transformed.  There are (m - 1) characters in the source
+			 * string, so we want to find the value for zp where (n - 1) - j =
+			 * (m - 1) - zp.
+			 */
+			int			zp = j - (n - m);
+
+			/* Check whether the stop column can slide left. */
+			while (stop_column > 0)
+			{
+				int			ii = stop_column - 1;
+				int			net_inserts = ii - zp;
+
+				if (prev[ii] + (net_inserts > 0 ? net_inserts * ins_c :
+								-net_inserts * del_c) <= max_d)
+					break;
+				stop_column--;
+			}
+
+			/* Check whether the start column can slide right. */
+			while (start_column < stop_column)
+			{
+				int			net_inserts = start_column - zp;
+
+				if (prev[start_column] +
+					(net_inserts > 0 ? net_inserts * ins_c :
+					 -net_inserts * del_c) <= max_d)
+					break;
+
+				/*
+				 * We'll never again update these values, so we must make sure
+				 * there's nothing here that could confuse any future
+				 * iteration of the outer loop.
+				 */
+				prev[start_column] = max_d + 1;
+				curr[start_column] = max_d + 1;
+				if (start_column != 0)
+					source += (s_char_len != NULL) ? s_char_len[start_column - 1] : 1;
+				start_column++;
+			}
+
+			/* If they cross, we're going to exceed the bound. */
+			if (start_column >= stop_column)
+				return max_d + 1;
+		}
+#endif
+	}
+
+	/*
+	 * Because the final value was swapped from the previous row to the
+	 * current row, that's where we'll find it.
+	 */
+	return prev[m - 1];
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index c3171b5..48afc61 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1546,6 +1546,30 @@ varstr_cmp(char *arg1, int len1, char *arg2, int len2, Oid collid)
 	return result;
 }
 
+/*
+ * varstr_leven()
+ * varstr_leven_less_equal()
+ * Levenshtein distance functions.  All arguments should be strlen(s) <= 255.
+ * Guaranteed to work with Name datatype's cstrings.
+ * For full details see levenshtein.c.
+ *
+ * Helper function. Faster than memcmp(), for this use case.
+ */
+static inline bool
+rest_of_char_same(const char *s1, const char *s2, int len)
+{
+	while (len > 0)
+	{
+		len--;
+		if (s1[len] != s2[len])
+			return false;
+	}
+	return true;
+}
+/* Expand each Levenshtein distance variant */
+#include "levenshtein.c"
+#define LEVENSHTEIN_LESS_EQUAL
+#include "levenshtein.c"
 
 /* text_cmp()
  * Internal comparison function for text strings.
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 3ba34f8..7298c93 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -786,6 +786,11 @@ extern Datum textoverlay_no_len(PG_FUNCTION_ARGS);
 extern Datum name_text(PG_FUNCTION_ARGS);
 extern Datum text_name(PG_FUNCTION_ARGS);
 extern int	varstr_cmp(char *arg1, int len1, char *arg2, int len2, Oid collid);
+extern int	varstr_leven(const char *source, int slen, const char *target,
+						 int tlen, int ins_c, int del_c, int sub_c);
+extern int	varstr_leven_less_equal(const char *source, int slen,
+									const char *target, int tlen, int ins_c,
+									int del_c, int sub_c, int max_d);
 extern List *textToQualifiedNameList(text *textval);
 extern bool SplitIdentifierString(char *rawstring, char separator,
 					  List **namelist);
-- 
1.9.1
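
For readers following the patch above, here is a minimal standalone sketch of the two-row technique that the levenshtein.c header comment describes, with separate insertion, deletion and substitution costs. It assumes single-byte, NUL-terminated strings and leaves out the multibyte handling and the start_column/stop_column pruning. The names lev_two_row() and lev_min_possible() are hypothetical and are not part of the patch; lev_min_possible() only mirrors the min_theo_d computation used to reject an impossibly tight bound early.

```c
#include <stdlib.h>
#include <string.h>

/* Smallest of three ints. */
static int
min3(int a, int b, int c)
{
	int			m = a < b ? a : b;

	return m < c ? m : c;
}

/*
 * Hypothetical helper: the cheapest conceivable cost of turning a source of
 * m characters into a target of n characters, i.e. only the insertions or
 * deletions needed to equalize the lengths (compare min_theo_d in the patch).
 */
int
lev_min_possible(int m, int n, int ins_c, int del_c)
{
	int			net_inserts = n - m;

	return net_inserts < 0 ? -net_inserts * del_c : net_inserts * ins_c;
}

/*
 * Hypothetical sketch: cost of transforming s into t, keeping only the
 * previous and current rows of the notional (m+1) x (n+1) matrix.
 * Returns -1 if memory cannot be allocated.
 */
int
lev_two_row(const char *s, const char *t, int ins_c, int del_c, int sub_c)
{
	int			m = (int) strlen(s);
	int			n = (int) strlen(t);
	int		   *base;
	int		   *prev;
	int		   *curr;
	int			i,
				j,
				result;

	/* Empty source needs n insertions; empty target needs m deletions. */
	if (m == 0)
		return n * ins_c;
	if (n == 0)
		return m * del_c;

	base = malloc(2 * (m + 1) * sizeof(int));
	if (base == NULL)
		return -1;
	prev = base;
	curr = base + (m + 1);

	/* Row 0: the first i characters of s become "" with i deletions. */
	for (i = 0; i <= m; i++)
		prev[i] = i * del_c;

	for (j = 1; j <= n; j++)
	{
		int		   *temp;

		/* Column 0: "" becomes the first j characters of t with j insertions. */
		curr[0] = j * ins_c;

		for (i = 1; i <= m; i++)
		{
			int			ins = prev[i] + ins_c;
			int			del = curr[i - 1] + del_c;
			int			sub = prev[i - 1] +
				(s[i - 1] == t[j - 1] ? 0 : sub_c);

			curr[i] = min3(ins, del, sub);
		}

		/* Swap rows; the row just filled becomes "previous". */
		temp = curr;
		curr = prev;
		prev = temp;
	}

	/* After the final swap the answer sits in the "previous" row. */
	result = prev[m];
	free(base);
	return result;
}
```

With costs (2, 1, 2), as used by distanceName() later in the thread, lev_two_row("orderids", "orderid", 2, 1, 2) is 1 (one deletion), while the reverse direction costs 2 (one insertion), so the argument order matters whenever ins_c and del_c differ.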

#86Michael Paquier
michael.paquier@gmail.com
In reply to: Peter Geoghegan (#81)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Nov 11, 2014 at 3:52 PM, Peter Geoghegan <pg@heroku.com> wrote:

I'm pretty puzzled by this. Other than our "agree to disagree and
defer to committer" position on the question of whether or not more
than one suggestion can come from a single RTE, which you were fine
with before [1], I have only restored the core/contrib separation to a
state recently suggested by Robert as the best and simplest all around
[2].
Did I miss something else?

My point is: I am not sure I can be called a reviewer of this
patch, or take any credit for its review, given that the latest
version submitted is a simple rebase of the version I first
reviewed. Hence, code-wise, this patch is in the same state as
when it was first submitted.
Thanks,
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#87Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#86)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 12, 2014 at 10:42 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Nov 11, 2014 at 3:52 PM, Peter Geoghegan <pg@heroku.com> wrote:

I'm pretty puzzled by this. Other than our "agree to disagree and
defer to committer" position on the question of whether or not more
than one suggestion can come from a single RTE, which you were fine
with before [1], I have only restored the core/contrib separation to a
state recently suggested by Robert as the best and simplest all around
[2].
Did I miss something else?

My point is: I am not sure I can be called a reviewer of this
patch, or take any credit for its review, given that the latest
version submitted is a simple rebase of the version I first
reviewed. Hence, code-wise, this patch is in the same state as
when it was first submitted.

Of course you can. Time spent reviewing is time spent reviewing,
whether it results in changes to the patch or not.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#88Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#87)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 13, 2014 at 8:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:

My point is: I am not sure I can be called a reviewer of this
patch, or take any credit for its review, given that the latest
version submitted is a simple rebase of the version I first
reviewed. Hence, code-wise, this patch is in the same state as
when it was first submitted.

Of course you can. Time spent reviewing is time spent reviewing,
whether it results in changes to the patch or not.

My thoughts exactly. I thought Michael did a good job, even if I
didn't agree with everything he said.

--
Peter Geoghegan

#89Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#85)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 12, 2014 at 8:00 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Nov 12, 2014 at 4:54 PM, Peter Geoghegan <pg@heroku.com> wrote:

Attached patch moves the Levenshtein distance implementation into core.

Oops. Somehow managed to send a *.patch.swp file. :-)

Here is the actual patch.

Committed. I changed varstr_leven() to varstr_levenshtein() because
abbrvs cn mk the code hrd to undstnd. And to grep. And I removed the
StaticAssertStmt you added, because it's not actually used for
anything that necessitates that, yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#90Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#89)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 13, 2014 at 9:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Committed. I changed varstr_leven() to varstr_levenshtein() because
abbrvs cn mk the code hrd to undstnd. And to grep.

Thanks. I'll produce a revision of patch 2/2 soon.

And I removed the
StaticAssertStmt you added, because it's not actually used for
anything that necessitates that, yet.

I'll add it back in in patch 2/2, so.

--
Peter Geoghegan

#91Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#82)
1 attachment(s)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On that topic, I think there's unanimous consensus against the design
where equally-distant matches are treated differently based on whether
they are in the same RTE or different RTEs. I think you need to
change that if you want to get anywhere with this. On a related note,
the use of the additional parameter AttrNumber closest[2] to
searchRangeTableForCol() and of the additional parameters AttrNumber
*matchedatt and int *distance to scanRTEForColumn() is less than
self-documenting. I suggest creating a structure called something
like FuzzyAttrMatchState and passing a pointer to it down to both
functions.

Attached patch incorporates this feedback.

The only user-visible difference between this revision and the
previous revision is that it's now quite possible for two suggestions
to originate from the same RTE. For this reason there is exactly one
change in the regression tests' expected output as compared to the
last revision; the regression tests are otherwise unchanged. It's still
no matter where they might have originated from, which I think is
appropriate -- we continue to not present any HINT in the event of 3
or more equidistant matches.
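
The tie rule described above (one suggestion for a unique best match, two for exactly two equidistant matches, none for three or more) can be sketched with a small state struct that is passed down while candidates are scanned. This is a hypothetical illustration, not the struct from the attached patch: FuzzyMatchSketch, fuzzy_init(), fuzzy_consider() and fuzzy_suggestions() are invented names, and the max_distance cutoff stands in for the patch's normalized distance threshold, whose exact form is not shown in this message.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical state carried across all candidate columns. */
typedef struct FuzzyMatchSketch
{
	int			distance;	/* lowest distance seen so far */
	int			nbest;		/* number of candidates at that distance */
	const char *first;		/* first candidate at the best distance */
	const char *second;		/* second candidate at the best distance */
} FuzzyMatchSketch;

/* Start with no candidates. */
void
fuzzy_init(FuzzyMatchSketch *st)
{
	st->distance = INT_MAX;
	st->nbest = 0;
	st->first = NULL;
	st->second = NULL;
}

/* Offer one candidate column name together with its distance. */
void
fuzzy_consider(FuzzyMatchSketch *st, const char *name, int distance)
{
	if (distance < st->distance)
	{
		/* Strictly better: forget any earlier ties. */
		st->distance = distance;
		st->nbest = 1;
		st->first = name;
		st->second = NULL;
	}
	else if (distance == st->distance)
	{
		/* Tie: remember the second name, merely count any further ones. */
		st->nbest++;
		if (st->nbest == 2)
			st->second = name;
	}
}

/*
 * How many names the HINT should mention: 0, 1 or 2.  No hint is given when
 * nothing matched, when three or more candidates tie, or when even the best
 * distance exceeds the (assumed) max_distance cutoff.
 */
int
fuzzy_suggestions(const FuzzyMatchSketch *st, int max_distance)
{
	if (st->nbest == 0 || st->nbest > 2 || st->distance > max_distance)
		return 0;
	return st->nbest;
}
```

Because the state only counts ties rather than recording where each one came from, two equidistant candidates are treated the same whether they belong to one RTE or to two.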

I think that the restructuring required to pass around a state
variable has resulted in somewhat clearer code.

--
Peter Geoghegan

Attachments:

0001-Levenshtein-distance-column-HINT.patch (text/x-patch; charset=US-ASCII)
From 0aef5253f10ebb1ee5bbcc73782eff1352c7ab84 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@heroku.com>
Date: Wed, 12 Nov 2014 15:31:37 -0800
Subject: [PATCH] Levenshtein distance column HINT

Add a new HINT -- a guess as to what column the user might have intended
to reference, to be shown in various contexts where an
ERRCODE_UNDEFINED_COLUMN error is raised.  The user will see this HINT
when he or she fat-fingers a column reference in an ad-hoc SQL query, or
incorrectly pluralizes or fails to pluralize a column reference, or
incorrectly omits or includes an underscore or other punctuation
character.

The HINT suggests a column in the range table with the lowest
Levenshtein distance, or the tied-for-best pair of matching columns in
the event of there being exactly two equally likely candidates (these
may come from multiple RTEs, or the same RTE).  Limiting to two the
number of cases where multiple equally likely suggestions are all
offered at once (i.e.  giving no hint when the number of equally likely
candidates exceeds two) is a measure against suggestions that are of low
quality in an absolute sense.

A further, final measure is taken against suggestions that are of low
absolute quality:  If the distance exceeds a normalized distance
threshold, no suggestion is given.
---
 src/backend/parser/parse_expr.c           |   9 +-
 src/backend/parser/parse_func.c           |   2 +-
 src/backend/parser/parse_relation.c       | 345 +++++++++++++++++++++++++++---
 src/backend/utils/adt/levenshtein.c       |   9 +
 src/include/parser/parse_relation.h       |  20 +-
 src/test/regress/expected/alter_table.out |   8 +
 src/test/regress/expected/join.out        |  39 ++++
 src/test/regress/expected/plpgsql.out     |   1 +
 src/test/regress/expected/rowtypes.out    |   1 +
 src/test/regress/expected/rules.out       |   1 +
 src/test/regress/expected/without_oid.out |   2 +
 src/test/regress/sql/join.sql             |  24 +++
 12 files changed, 421 insertions(+), 40 deletions(-)

diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 4a8aaf6..a77a3a0 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -621,7 +621,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field2);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
@@ -666,7 +667,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field3);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
@@ -724,7 +726,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field4);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 9ebd3fd..472e15e 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -1779,7 +1779,7 @@ ParseComplexProjection(ParseState *pstate, char *funcname, Node *first_arg,
 									 ((Var *) first_arg)->varno,
 									 ((Var *) first_arg)->varlevelsup);
 		/* Return a Var if funcname matches a column, else NULL */
-		return scanRTEForColumn(pstate, rte, funcname, location);
+		return scanRTEForColumn(pstate, rte, funcname, location, NULL);
 	}
 
 	/*
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 478584d..40c69d7 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include <ctype.h>
+#include <limits.h>
 
 #include "access/htup_details.h"
 #include "access/sysattr.h"
@@ -520,6 +521,22 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
 }
 
 /*
+ * distanceName
+ *	  Return Levenshtein distance between an actual column name and possible
+ *	  partial match.
+ */
+static int
+distanceName(const char *actual, const char *match, int max)
+{
+	int len = strlen(actual),
+		match_len = strlen(match);
+
+	/* Charge half as much per deletion as per insertion or per substitution */
+	return varstr_levenshtein_less_equal(actual, len, match, match_len,
+										 2, 1, 2, max);
+}
+
+/*
  * scanRTEForColumn
  *	  Search the column names of a single RTE for the given name.
  *	  If found, return an appropriate Var node, else return NULL.
@@ -527,10 +544,22 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
  *
  * Side effect: if we find a match, mark the RTE as requiring read access
  * for the column.
+ *
+ * For those callers that will settle for a fuzzy match (for the purposes of
+ * building diagnostic messages), we match the column attribute whose name has
+ * the lowest Levenshtein distance from colname.  Such callers should not rely
+ * on the return value (even when there is an exact match), nor should they
+ * expect the usual side effect (unless there is an exact match).  This hardly
+ * matters in practice, since an error is imminent.
+ *
+ * If there are two or more attributes in the range table entry tied for
+ * closest, or if there are no matches, accurately report the shortest distance
+ * found overall while not setting a closest attribute.  Note that we never
+ * consider system column names when performing fuzzy matching.
  */
 Node *
 scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
-				 int location)
+				 int location, FuzzyAttrMatchState *rtestate)
 {
 	Node	   *result = NULL;
 	int			attnum = 0;
@@ -548,12 +577,16 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
 	 * Should this somehow go wrong and we try to access a dropped column,
 	 * we'll still catch it by virtue of the checks in
 	 * get_rte_attribute_type(), which is called by make_var().  That routine
-	 * has to do a cache lookup anyway, so the check there is cheap.
+	 * has to do a cache lookup anyway, so the check there is cheap.  Callers
+	 * interested in finding match with shortest distance need to defend
+	 * against this directly, though.
 	 */
 	foreach(c, rte->eref->colnames)
 	{
+		const char *attcolname = strVal(lfirst(c));
+
 		attnum++;
-		if (strcmp(strVal(lfirst(c)), colname) == 0)
+		if (strcmp(attcolname, colname) == 0)
 		{
 			if (result)
 				ereport(ERROR,
@@ -566,6 +599,49 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
 			markVarForSelectPriv(pstate, var, rte);
 			result = (Node *) var;
 		}
+
+		if (rtestate)
+		{
+			int				columndistance;
+
+			if (!result)
+				columndistance = distanceName(attcolname, colname,
+											  rtestate->distance);
+			else
+				columndistance = 0;
+
+			if (columndistance < rtestate->distance)
+			{
+				/* Store new lowest observed distance for RTE */
+				rtestate->distance = columndistance;
+				rtestate->first = attnum;
+				rtestate->second = InvalidAttrNumber;
+			}
+			else if (columndistance == rtestate->distance)
+			{
+				/*
+				 * This match distance may equal a prior match within this same
+				 * range table.  When that happens, the prior match may also be
+				 * given, but only if there is no more than two equally distant
+				 * matches from the RTE (in turn, our caller will only accept
+				 * two equally distant matches overall).
+				 */
+				Assert(AttributeNumberIsValid(rtestate->first));
+
+				if (AttributeNumberIsValid(rtestate->second))
+				{
+					/* Too many RTE-level matches */
+					rtestate->first = rtestate->second = InvalidAttrNumber;
+					/* Clearly, distance is too low a bar (for *any* RTE) */
+					rtestate->distance = columndistance - 1;
+				}
+				else
+				{
+					/* Record as provisional second match for RTE */
+					rtestate->second = attnum;
+				}
+			}
+		}
 	}
 
 	/*
@@ -642,7 +718,8 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
 				continue;
 
 			/* use orig_pstate here to get the right sublevels_up */
-			newresult = scanRTEForColumn(orig_pstate, rte, colname, location);
+			newresult = scanRTEForColumn(orig_pstate, rte, colname, location,
+										 NULL);
 
 			if (newresult)
 			{
@@ -668,8 +745,15 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
 
 /*
  * searchRangeTableForCol
- *	  See if any RangeTblEntry could possibly provide the given column name.
- *	  If so, return a pointer to the RangeTblEntry; else return NULL.
+ *	  See if any RangeTblEntry could possibly provide the given column name (or
+ *	  find the best match available).  Returns state with relevant details.
+ *
+ * Column name may be matched fuzzily;  we provide the closest column(s) if
+ * there was not an exact match.  Caller can depend on returned state to find
+ * right attribute.  If first attribute is InvalidAttrNumber, but corresponding
+ * RTE is set, that indicates an exact match (i.e. column name is present, but
+ * presumably not visible).  However, if the wrong alias was specified by user,
+ * the first match attribute *is* set.
  *
  * This is different from colNameToVar in that it considers every entry in
  * the ParseState's rangetable(s), not only those that are currently visible
@@ -678,26 +762,180 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
  * matches, but only one will be returned).  This must be used ONLY as a
  * heuristic in giving suitable error messages.  See errorMissingColumn.
  */
-static RangeTblEntry *
-searchRangeTableForCol(ParseState *pstate, char *colname, int location)
+static FuzzyAttrMatchState *
+searchRangeTableForCol(ParseState *pstate, const char *alias, char *colname,
+					   int location)
 {
 	ParseState *orig_pstate = pstate;
+	FuzzyAttrMatchState *state = palloc(sizeof(FuzzyAttrMatchState));
+	ListCell   *l;
+	int			i;
+
+	state->distance = INT_MAX;
+	state->rsecond = state->rfirst = NULL;
+	state->second = state->first = InvalidAttrNumber;
 
 	while (pstate != NULL)
 	{
-		ListCell   *l;
-
 		foreach(l, pstate->p_rtable)
 		{
-			RangeTblEntry *rte = (RangeTblEntry *) lfirst(l);
+			RangeTblEntry  *rte = (RangeTblEntry *) lfirst(l);
+			FuzzyAttrMatchState	rtestate;
+			bool			wrongalias;
 
-			if (scanRTEForColumn(orig_pstate, rte, colname, location))
-				return rte;
+			/*
+			 * Typically, it is not useful to look for matches within join
+			 * RTEs;  they effectively duplicate other RTEs for our purposes,
+			 * and if a match is chosen from a join RTE, an unhelpful alias is
+			 * displayed in the final diagnostic message.
+			 */
+			if (rte->rtekind == RTE_JOIN)
+				continue;
+
+			/*
+			 * Get single best match (or pair of joint best matches, or no
+			 * match) from each RTE -- the best two columns ultimately
+			 * suggested may or may not both be from the same RTE.
+			 *
+			 * Initialize RTE's distance to INT_MAX (and not RT state's current
+			 * lowest distance) to ensure that per-RTE penalties do not distort
+			 * per-RT costing.
+			 */
+			rtestate.distance = INT_MAX;
+			rtestate.rsecond = rtestate.rfirst = NULL;
+			rtestate.second = rtestate.first = InvalidAttrNumber;
+			scanRTEForColumn(orig_pstate, rte, colname, location, &rtestate);
+
+			/* Avoid totally non-matching RTEs (e.g. no RTE attributes) */
+			if (!AttributeNumberIsValid(rtestate.first))
+				continue;
+
+			/* Was alias provided by user that does not match entry's alias? */
+			wrongalias = (alias && strcmp(alias, rte->eref->aliasname) != 0);
+
+			if (rtestate.distance == 0)
+			{
+				/*
+				 * Exact match (for "wrong alias" or "wrong level" cases).
+				 *
+				 * Only consider first element for RTE, because there can only
+				 * be one exact match -- it doesn't seem worth considering the
+				 * case where there are multiple exact matches, so we're done.
+				 */
+				state->rfirst = rte;
+				state->first = wrongalias? rtestate.first : InvalidAttrNumber;
+				state->rsecond = NULL;
+				state->second = InvalidAttrNumber;
+
+				return state;
+			}
+
+			/*
+			 * Charge extra (for inexact matches only) when an alias was
+			 * specified that differs from what might have been used to
+			 * correctly qualify this RTE's closest column
+			 */
+			if (wrongalias)
+				rtestate.distance += 3;
+
+			if (rtestate.distance < state->distance)
+			{
+				/*
+				 * New, uncontested best match RTE, with 1 or 2 best match
+				 * columns
+				 */
+				state->distance = rtestate.distance;
+
+				state->rfirst = rte;
+				state->first = rtestate.first;
+				state->rsecond =
+					AttributeNumberIsValid(rtestate.second)? rte: NULL;
+				state->second = rtestate.second;
+			}
+			else if (rtestate.distance == state->distance)
+			{
+				/*
+				 * Can't have 3 or more matches at same distance.
+				 *
+				 * It's useful to provide two matches for the common case where
+				 * two range tables have single equidistant candidates, as when
+				 * an unqualified (and therefore would-be ambiguous) column
+				 * name is specified which is also misspelled by the user --
+				 * there is probably a foreign key relationship between
+				 * tables/RTEs.  It's also possible to usefully give two column
+				 * suggestions originating from the same RTE, which may be
+				 * useful when an alias strongly suggests that RTE, while there
+				 * are 2 somewhat close matches.
+				 *
+				 * However, when there are more than 2 equally distant matches,
+				 * that's probably because the matches are not useful at all,
+				 * so don't suggest anything.
+				 */
+				if (AttributeNumberIsValid(state->second) ||
+					AttributeNumberIsValid(rtestate.second))
+				{
+					/* 3 or more equidistant matches -- RTE is uninteresting */
+					state->rsecond = state->rfirst = NULL;
+					state->second = state->first = InvalidAttrNumber;
+					/* Clearly this distance is too low a bar generally */
+					state->distance--;
+				}
+				else
+				{
+					/* Record as provisional second match for RT */
+					Assert(state->rfirst != NULL &&
+						   AttributeNumberIsValid(state->first));
+					Assert(state->rsecond == NULL &&
+						   !AttributeNumberIsValid(state->second) );
+					state->rsecond = rte;
+					state->second = rtestate.first;
+				}
+			}
 		}
 
 		pstate = pstate->parentParseState;
 	}
-	return NULL;
+
+	/*
+	 * Handle dropped columns, which can appear here as empty colnames per
+	 * remarks within scanRTEForColumn().  If either the first or second
+	 * suggested attributes are dropped, do not provide any suggestion.
+	 */
+	for (i = 0; i < 2; i++)
+	{
+		AttrNumber		closest;
+		RangeTblEntry  *rte;
+		char		   *closestcol;
+
+		rte = (i == 0 ? state->rfirst: state->rsecond);
+		closest = (i == 0 ? state->first: state->second);
+
+		if (!AttributeNumberIsValid(closest))
+			break;
+
+		closestcol = strVal(list_nth(rte->eref->colnames, closest - 1));
+
+		if (strcmp(closestcol, "") == 0)
+		{
+			state->rsecond = state->rfirst = NULL;
+			state->second = state->first = InvalidAttrNumber;
+			break;
+		}
+	}
+
+	/*
+	 * Distance must be less than a normalized threshold in order to avoid
+	 * completely ludicrous suggestions.  Note that a distance of 6 will be
+	 * seen when 6 deletions are required against actual attribute name, or 3
+	 * insertions/substitutions.
+	 */
+	if (state->distance > 6 && state->distance > strlen(colname) / 2)
+	{
+		state->rsecond = state->rfirst = NULL;
+		state->second = state->first = InvalidAttrNumber;
+	}
+
+	return state;
 }
 
 /*
@@ -2862,34 +3100,71 @@ void
 errorMissingColumn(ParseState *pstate,
 				   char *relname, char *colname, int location)
 {
-	RangeTblEntry *rte;
+	FuzzyAttrMatchState	   *state;
+	char				   *closestfirst = NULL;
 
 	/*
-	 * If relname was given, just play dumb and report it.  (In practice, a
-	 * bad qualification name should end up at errorMissingRTE, not here, so
-	 * no need to work hard on this case.)
+	 * Search the entire rtable looking for possible matches.  If we find one,
+	 * emit a hint about it.
+	 *
+	 * TODO: improve this code (and also errorMissingRTE) to mention using
+	 * LATERAL if appropriate.
 	 */
-	if (relname)
-		ereport(ERROR,
-				(errcode(ERRCODE_UNDEFINED_COLUMN),
-				 errmsg("column %s.%s does not exist", relname, colname),
-				 parser_errposition(pstate, location)));
+	state = searchRangeTableForCol(pstate, relname, colname, location);
 
 	/*
-	 * Otherwise, search the entire rtable looking for possible matches.  If
-	 * we find one, emit a hint about it.
+	 * In practice a bad qualification name should end up at errorMissingRTE,
+	 * not here, so no need to work hard on this case.
 	 *
-	 * TODO: improve this code (and also errorMissingRTE) to mention using
-	 * LATERAL if appropriate.
+	 * Extract closest col string for best match, if any.
+	 *
+	 * Infer an exact match referenced despite not being visible from the fact
+	 * that an attribute number was not present in state passed back -- this is
+	 * what is reported when !closestfirst.  There might also be an exact match
+	 * that was qualified with an incorrect alias, in which case closestfirst
+	 * will be set (so hint is the same as generic fuzzy case).
 	 */
-	rte = searchRangeTableForCol(pstate, colname, location);
-
-	ereport(ERROR,
-			(errcode(ERRCODE_UNDEFINED_COLUMN),
-			 errmsg("column \"%s\" does not exist", colname),
-			 rte ? errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
-						   colname, rte->eref->aliasname) : 0,
-			 parser_errposition(pstate, location)));
+	if (state->rfirst && AttributeNumberIsValid(state->first))
+		closestfirst = strVal(list_nth(state->rfirst->eref->colnames,
+									   state->first - 1));
+
+	if (!state->rsecond)
+	{
+		/*
+		 * Handle case where there is zero or one column suggestions to hint,
+		 * including exact matches referenced but not visible.
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_COLUMN),
+				 relname?
+				 errmsg("column %s.%s does not exist", relname, colname):
+				 errmsg("column \"%s\" does not exist", colname),
+				 state->rfirst? closestfirst?
+				 errhint("Perhaps you meant to reference the column \"%s\".\"%s\".",
+						 state->rfirst->eref->aliasname, closestfirst):
+				 errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
+						 colname, state->rfirst->eref->aliasname): 0,
+				 parser_errposition(pstate, location)));
+	}
+	else
+	{
+		/* Extract closest col string for second, joint-best match, if any */
+		char				   *closestsecond;
+
+		closestsecond = strVal(list_nth(state->rsecond->eref->colnames,
+										state->second - 1));
+
+		/* Handle case where there are two equally useful column hints */
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_COLUMN),
+				 relname?
+				 errmsg("column %s.%s does not exist", relname, colname):
+				 errmsg("column \"%s\" does not exist", colname),
+				 errhint("Perhaps you meant to reference the column \"%s\".\"%s\" or the column \"%s\".\"%s\".",
+						 state->rfirst->eref->aliasname, closestfirst,
+						 state->rsecond->eref->aliasname, closestsecond),
+				 parser_errposition(pstate, location)));
+	}
 }
 
 
diff --git a/src/backend/utils/adt/levenshtein.c b/src/backend/utils/adt/levenshtein.c
index a8670e9..8d565c6 100644
--- a/src/backend/utils/adt/levenshtein.c
+++ b/src/backend/utils/adt/levenshtein.c
@@ -95,6 +95,15 @@ varstr_levenshtein(const char *source, int slen, const char *target, int tlen,
 #define STOP_COLUMN m
 #endif
 
+	/*
+	 * A common use for Levenshtein distance is to match attributes when building
+	 * diagnostic, user-visible messages.  Restrict the size of
+	 * MAX_LEVENSHTEIN_STRLEN at compile time so that this is guaranteed to
+	 * work.
+	 */
+	StaticAssertStmt(NAMEDATALEN <= MAX_LEVENSHTEIN_STRLEN,
+					 "Levenshtein hinting mechanism restricts NAMEDATALEN");
+
 	m = pg_mbstrlen_with_len(source, slen);
 	n = pg_mbstrlen_with_len(target, tlen);
 
diff --git a/src/include/parser/parse_relation.h b/src/include/parser/parse_relation.h
index d8b9493..7ab966e 100644
--- a/src/include/parser/parse_relation.h
+++ b/src/include/parser/parse_relation.h
@@ -16,6 +16,24 @@
 
 #include "parser/parse_node.h"
 
+
+/*
+ * Support for fuzzily matching column.
+ *
+ * This is for building diagnostic messages, where non-exact matching
+ * attributes are suggested to the user.  The struct's fields may be facets of
+ * a particular RTE, or of an entire range table, depending on context.
+ */
+typedef struct
+{
+	int				distance;	/* Weighted distance (lowest so far) */
+	RangeTblEntry  *rfirst;		/* RTE of first */
+	AttrNumber		first;		/* Closest attribute so far */
+	RangeTblEntry  *rsecond;	/* RTE of second */
+	AttrNumber		second;		/* Second closest attribute so far */
+} FuzzyAttrMatchState;
+
+
 extern RangeTblEntry *refnameRangeTblEntry(ParseState *pstate,
 					 const char *schemaname,
 					 const char *refname,
@@ -35,7 +53,7 @@ extern RangeTblEntry *GetRTEByRangeTablePosn(ParseState *pstate,
 extern CommonTableExpr *GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte,
 			 int rtelevelsup);
 extern Node *scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte,
-				 char *colname, int location);
+				 char *colname, int location, FuzzyAttrMatchState *rtestate);
 extern Node *colNameToVar(ParseState *pstate, char *colname, bool localonly,
 			 int location);
 extern void markVarForSelectPriv(ParseState *pstate, Var *var,
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index d233710..b24fa43 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -536,6 +536,7 @@ create table atacc1 ( test int );
 -- add a check constraint (fails)
 alter table atacc1 add constraint atacc_test1 check (test1>3);
 ERROR:  column "test1" does not exist
+HINT:  Perhaps you meant to reference the column "atacc1"."test".
 drop table atacc1;
 -- something a little more complicated
 create table atacc1 ( test int, test2 int, test3 int);
@@ -1342,6 +1343,7 @@ select f1 from c1;
 ERROR:  column "f1" does not exist
 LINE 1: select f1 from c1;
                ^
+HINT:  Perhaps you meant to reference the column "c1"."f2".
 drop table p1 cascade;
 NOTICE:  drop cascades to table c1
 create table p1 (f1 int, f2 int);
@@ -1355,6 +1357,7 @@ select f1 from c1;
 ERROR:  column "f1" does not exist
 LINE 1: select f1 from c1;
                ^
+HINT:  Perhaps you meant to reference the column "c1"."f2".
 drop table p1 cascade;
 NOTICE:  drop cascades to table c1
 create table p1 (f1 int, f2 int);
@@ -1479,6 +1482,7 @@ select oid > 0, * from altstartwith; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altstartwith;
                ^
+HINT:  Perhaps you meant to reference the column "altstartwith"."col".
 select * from altstartwith;
  col 
 -----
@@ -1515,10 +1519,12 @@ select oid > 0, * from altwithoid; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altwithoid;
                ^
+HINT:  Perhaps you meant to reference the column "altwithoid"."col".
 select oid > 0, * from altinhoid; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altinhoid;
                ^
+HINT:  Perhaps you meant to reference the column "altinhoid"."col".
 select * from altwithoid;
  col 
 -----
@@ -1554,6 +1560,7 @@ select oid > 0, * from altwithoid; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altwithoid;
                ^
+HINT:  Perhaps you meant to reference the column "altwithoid"."col".
 select oid > 0, * from altinhoid;
  ?column? | col 
 ----------+-----
@@ -1580,6 +1587,7 @@ select oid > 0, * from altwithoid; -- fails
 ERROR:  column "oid" does not exist
 LINE 1: select oid > 0, * from altwithoid;
                ^
+HINT:  Perhaps you meant to reference the column "altwithoid"."col".
 select oid > 0, * from altinhoid;
  ?column? | col 
 ----------+-----
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 2501184..3ef5580 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2222,6 +2222,12 @@ select * from t1 left join t2 on (t1.a = t2.a);
  200 | 1000 | 200 | 2001
 (5 rows)
 
+-- Test matching of column name with wrong alias
+select t1.x from t1 join t3 on (t1.a = t3.x);
+ERROR:  column t1.x does not exist
+LINE 1: select t1.x from t1 join t3 on (t1.a = t3.x);
+               ^
+HINT:  Perhaps you meant to reference the column "t3"."x".
 --
 -- regression test for 8.1 merge right join bug
 --
@@ -3415,6 +3421,39 @@ select * from
 (0 rows)
 
 --
+-- Test hints given on incorrect column references are useful
+--
+select t1.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestipn
+ERROR:  column t1.uunique1 does not exist
+LINE 1: select t1.uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t1"."unique1".
+select t2.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+ERROR:  column t2.uunique1 does not exist
+LINE 1: select t2.uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t2"."unique1".
+select uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+ERROR:  column "uunique1" does not exist
+LINE 1: select uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t1"."unique1" or the column "t2"."unique1".
+--
+-- Take care to reference the correct RTE
+--
+select atts.relid::regclass, s.* from pg_stats s join
+    pg_attribute a on s.attname = a.attname and s.tablename =
+    a.attrelid::regclass::text join (select unnest(indkey) attnum,
+    indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+    schemaname != 'pg_catalog';
+ERROR:  column atts.relid does not exist
+LINE 1: select atts.relid::regclass, s.* from pg_stats s join
+               ^
+HINT:  Perhaps you meant to reference the column "atts"."indexrelid".
+--
 -- Test LATERAL
 --
 select unique2, x.*
diff --git a/src/test/regress/expected/plpgsql.out b/src/test/regress/expected/plpgsql.out
index 983f1b8..fb4abe6 100644
--- a/src/test/regress/expected/plpgsql.out
+++ b/src/test/regress/expected/plpgsql.out
@@ -4782,6 +4782,7 @@ END$$;
 ERROR:  column "foo" does not exist
 LINE 1: SELECT rtrim(roomno) AS roomno, foo FROM Room ORDER BY roomn...
                                         ^
+HINT:  Perhaps you meant to reference the column "room"."roomno".
 QUERY:  SELECT rtrim(roomno) AS roomno, foo FROM Room ORDER BY roomno
 CONTEXT:  PL/pgSQL function inline_code_block line 4 at FOR over SELECT rows
 -- Check handling of errors thrown from/into anonymous code blocks.
diff --git a/src/test/regress/expected/rowtypes.out b/src/test/regress/expected/rowtypes.out
index 54525de..efd8fa9 100644
--- a/src/test/regress/expected/rowtypes.out
+++ b/src/test/regress/expected/rowtypes.out
@@ -452,6 +452,7 @@ select fullname.text from fullname;  -- error
 ERROR:  column fullname.text does not exist
 LINE 1: select fullname.text from fullname;
                ^
+HINT:  Perhaps you meant to reference the column "fullname"."last".
 -- same, but RECORD instead of named composite type:
 select cast (row('Jim', 'Beam') as text);
     row     
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c79b45c..01c80af 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2396,6 +2396,7 @@ select xmin, * from fooview;  -- fail, views don't have such a column
 ERROR:  column "xmin" does not exist
 LINE 1: select xmin, * from fooview;
                ^
+HINT:  Perhaps you meant to reference the column "fooview"."x".
 select reltoastrelid, relkind, relfrozenxid
   from pg_class where oid = 'fooview'::regclass;
  reltoastrelid | relkind | relfrozenxid 
diff --git a/src/test/regress/expected/without_oid.out b/src/test/regress/expected/without_oid.out
index cb2c0c0..e805a6a 100644
--- a/src/test/regress/expected/without_oid.out
+++ b/src/test/regress/expected/without_oid.out
@@ -46,6 +46,7 @@ SELECT count(oid) FROM wo;
 ERROR:  column "oid" does not exist
 LINE 1: SELECT count(oid) FROM wo;
                      ^
+HINT:  Perhaps you meant to reference the column "wo"."i".
 VACUUM ANALYZE wi;
 VACUUM ANALYZE wo;
 SELECT min(relpages) < max(relpages), min(reltuples) - max(reltuples)
@@ -81,6 +82,7 @@ SELECT count(oid) FROM create_table_test3;
 ERROR:  column "oid" does not exist
 LINE 1: SELECT count(oid) FROM create_table_test3;
                      ^
+HINT:  Perhaps you meant to reference the column "create_table_test3"."c1" or the column "create_table_test3"."c2".
 PREPARE table_source(int) AS
     SELECT a + b AS c1, a - b AS c2, $1 AS c3 FROM create_table_test;
 CREATE TABLE execute_with WITH OIDS AS EXECUTE table_source(1);
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 718e1d9..ca7f966 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -397,6 +397,10 @@ insert into t2a values (200, 2001);
 
 select * from t1 left join t2 on (t1.a = t2.a);
 
+-- Test matching of column name with wrong alias
+
+select t1.x from t1 join t3 on (t1.a = t3.x);
+
 --
 -- regression test for 8.1 merge right join bug
 --
@@ -1051,6 +1055,26 @@ select * from
   int8_tbl x join (int4_tbl x cross join int4_tbl y(ff)) j on q1 = f1; -- ok
 
 --
+-- Test hints given on incorrect column references are useful
+--
+
+select t1.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestipn
+select t2.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+select uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+
+--
+-- Take care to reference the correct RTE
+--
+
+select atts.relid::regclass, s.* from pg_stats s join
+    pg_attribute a on s.attname = a.attname and s.tablename =
+    a.attrelid::regclass::text join (select unnest(indkey) attnum,
+    indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+    schemaname != 'pg_catalog';
+--
 -- Test LATERAL
 --
 
-- 
1.9.1

#92Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#91)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Sat, Nov 15, 2014 at 7:36 PM, Peter Geoghegan <pg@heroku.com> wrote:

Attached patch incorporates this feedback.

The only user-visible difference between this revision and the
previous revision is that it's quite possible for two suggestions to
originate from the same RTE (there is exactly one change in the
regression test's expected output as compared to the last revision for
this reason; the regression tests are otherwise unchanged). It's still
not possible to see more than 2 suggestions under any circumstances,
no matter where they might have originated from, which I think is
appropriate -- we continue to not present any HINT in the event of 3
or more equidistant matches.

I think that the restructuring required to pass around a state
variable has resulted in somewhat clearer code.

Cool!

I'm grumpy about the distanceName() function. That seems too generic.
If we're going to keep this as it is, I suggest something like
computeRTEColumnDistance(). But see below.

On a related note, I'm also grumpy about this comment:

+    /* Charge half as much per deletion as per insertion or per substitution */
+    return varstr_levenshtein_less_equal(actual, len, match, match_len,
+                                         2, 1, 2, max);

The purpose of a code comment is to articulate WHY we did something,
rather than simply to restate what the code quite obviously does. I
haven't heard a compelling argument for why this should be 2, 1, 2
rather than the default 1, 1, 1; and I'm inclined to do the latter
unless you can make some very good argument for this combination of
weights. And if you can make such an argument, then there should be
comments so that the next person to come along and look at this code
doesn't go, huh, that's whacky, and change it.
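
To make the weighting question concrete, here is a minimal sketch of a
weighted Levenshtein distance in C. It is not the
varstr_levenshtein_less_equal() code from the patch: the names
weighted_levenshtein() and min3() are hypothetical, the sketch compares
bytes rather than multibyte characters, and it has no "max" cutoff. The
cost order (insertion, deletion, substitution) is assumed to mirror the
call quoted above, with the actual column name as the source string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Smallest of three ints. */
static int
min3(int a, int b, int c)
{
    int m = (a < b) ? a : b;

    return (m < c) ? m : c;
}

/*
 * Cheapest way to turn "source" into "target", charging ins_c per
 * inserted byte, del_c per deleted byte and sub_c per substituted byte.
 * Hypothetical helper for illustration only.
 */
static int
weighted_levenshtein(const char *source, const char *target,
                     int ins_c, int del_c, int sub_c)
{
    size_t m = strlen(source);
    size_t n = strlen(target);
    int   *prev = malloc((n + 1) * sizeof(int));
    int   *curr = malloc((n + 1) * sizeof(int));
    int    result;
    size_t i, j;

    assert(prev != NULL && curr != NULL);

    /* Turning an empty source into target[0..j) takes j insertions. */
    for (j = 0; j <= n; j++)
        prev[j] = (int) j * ins_c;

    for (i = 1; i <= m; i++)
    {
        int *tmp;

        /* Turning source[0..i) into an empty target takes i deletions. */
        curr[0] = (int) i * del_c;

        for (j = 1; j <= n; j++)
        {
            int sub = prev[j - 1] +
                      (source[i - 1] == target[j - 1] ? 0 : sub_c);

            curr[j] = min3(prev[j] + del_c, curr[j - 1] + ins_c, sub);
        }

        /* The row just filled becomes the previous row. */
        tmp = prev;
        prev = curr;
        curr = tmp;
    }

    result = prev[n];
    free(prev);
    free(curr);
    return result;
}
```

Under these assumptions, turning "indexrelid" into "relid" costs 5 with
either 2, 1, 2 or 1, 1, 1, because only deletions are needed. A single
stray character, as in "unique1" versus "uunique1", costs 2 under
2, 1, 2 and 1 under the default weights.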

+ int location, FuzzyAttrMatchState *rtestate)

I suggest calling this "fuzzystate" rather than "rtestate"; it's not
the state of the RTE, but the state of the fuzzy matching.

Within the scanRTEForColumn block, we have a rather large chunk of
code protected by if (rtestate), which contains the only call to
distanceName(). I suggest that we move all of this logic to a
separate, static function, and merge distanceName into it. I also
suggest testing against NULL explicitly instead of implicitly. So
this block of code would end up as something like:

    if (fuzzystate != NULL)
        updateFuzzyAttrMatchState(rte, attcolname, colname, &fuzzystate);

In searchRangeTableForCol, I'm fairly certain that you've changed the
behavior by adding a check for if (rte->rtekind == RTE_JOIN) before
the call to scanRTEForColumn(). Why not instead put this check into
updateFuzzyAttrMatchState? Then you can be sure you're not changing
the behavior in any other case.

On a similar note, I think the dropped-column test should happen early
as well, probably again in updateFuzzyAttrMatchState(). There's
little point in adding a suggestion only to throw it away again.
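
A rough, standalone sketch of what such a helper might look like
follows. Everything in it is an assumption made for illustration:
FuzzyState, fuzzy_state_init(), update_fuzzy_state() and name_distance()
are hypothetical stand-ins for the backend types and functions,
attribute numbers are plain ints with 0 meaning "none", and the
distance is an unweighted, byte-wise Levenshtein. It folds the join-RTE
and dropped-column checks into the helper, as suggested, and lets a
later column become the first match again after a three-way tie has
lowered the bar, which differs slightly from the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define MAXNAME 64              /* assumed upper bound on name length */

typedef struct
{
    int distance;               /* lowest distance seen so far */
    int first;                  /* closest attribute number, 0 if none */
    int second;                 /* attribute tied with "first", 0 if none */
} FuzzyState;

static void
fuzzy_state_init(FuzzyState *state)
{
    state->distance = INT_MAX;
    state->first = state->second = 0;
}

/* Unweighted Levenshtein distance between two short names. */
static int
name_distance(const char *source, const char *target)
{
    int    d[MAXNAME + 1][MAXNAME + 1];
    size_t m = strlen(source);
    size_t n = strlen(target);
    size_t i, j;

    assert(m <= MAXNAME && n <= MAXNAME);

    for (i = 0; i <= m; i++)
        d[i][0] = (int) i;
    for (j = 0; j <= n; j++)
        d[0][j] = (int) j;

    for (i = 1; i <= m; i++)
    {
        for (j = 1; j <= n; j++)
        {
            int best = d[i - 1][j - 1] + (source[i - 1] != target[j - 1]);

            if (d[i - 1][j] + 1 < best)
                best = d[i - 1][j] + 1;
            if (d[i][j - 1] + 1 < best)
                best = d[i][j - 1] + 1;
            d[i][j] = best;
        }
    }
    return d[m][n];
}

/* Consider one column of one RTE as a candidate for the hint. */
static void
update_fuzzy_state(FuzzyState *state, bool rte_is_join, int attnum,
                   const char *attname, const char *colname)
{
    int dist;

    /* Skip join RTEs and dropped columns (empty names) up front. */
    if (rte_is_join || attname[0] == '\0')
        return;

    dist = name_distance(attname, colname);

    if (dist < state->distance)
    {
        /* New uncontested best match. */
        state->distance = dist;
        state->first = attnum;
        state->second = 0;
    }
    else if (dist == state->distance)
    {
        if (state->first == 0)
            state->first = attnum;  /* bar was lowered by an earlier tie */
        else if (state->second == 0)
            state->second = attnum; /* provisional joint-best match */
        else
        {
            /* Three equally distant matches: suggest nothing. */
            state->first = state->second = 0;
            state->distance = dist - 1;
        }
    }
}
```

A caller would initialize one FuzzyState and then call
update_fuzzy_state() once per column name, so a dropped column or a
join RTE never becomes a candidate that has to be thrown away later.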

+            /*
+             * Charge extra (for inexact matches only) when an alias was
+             * specified that differs from what might have been used to
+             * correctly qualify this RTE's closest column
+             */
+            if (wrongalias)
+                rtestate.distance += 3;

I don't understand what situation this is catering to. Can you
explain? It seems to account for a good deal of complexity.

ERROR: column "oid" does not exist
LINE 1: select oid > 0, * from altwithoid;
               ^
+HINT: Perhaps you meant to reference the column "altwithoid"."col".

That seems like a stretch. I think I suggested before using a
distance threshold of at most 3 or half the word length, whichever is
less. For a three-letter column name that means not suggesting
anything if more than one character is different. What you
implemented here is close to that, yet somehow we've got a suggestion
slipping through that has 2 out of 3 characters different. I'm not
quite sure I see how that's getting through, but I think it shouldn't.

ERROR: column fullname.text does not exist
LINE 1: select fullname.text from fullname;
^
+HINT: Perhaps you meant to reference the column "fullname"."last".

Same problem, only worse! They've only got one letter of four in common.

ERROR: column "xmin" does not exist
LINE 1: select xmin, * from fooview;
^
+HINT: Perhaps you meant to reference the column "fooview"."x".

Basically the same problem again. I think the distance threshold in
this case should be half the shorter column name, i.e. 0.
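
For reference, the threshold rule described above ("at most 3 or half the word length, whichever is less") can be sketched in a few lines of C. This is only an illustration, not code from the patch: the names hint_threshold() and hint_allowed() are made up, the distance is assumed to be a plain Levenshtein distance with unit costs, and "word length" is taken to mean the shorter of the two names.

```c
#include <assert.h>
#include <string.h>

/*
 * Largest edit distance that still earns a hint under the proposed rule:
 * min(3, half the length of the shorter name).  Hypothetical helper.
 */
int
hint_threshold(const char *typed, const char *candidate)
{
    size_t      typed_len;
    size_t      cand_len;
    size_t      shorter;
    size_t      half;

    assert(typed != NULL && candidate != NULL);
    typed_len = strlen(typed);
    cand_len = strlen(candidate);
    shorter = typed_len < cand_len ? typed_len : cand_len;
    half = shorter / 2;         /* integer division rounds down */

    return (int) (half < 3 ? half : 3);
}

/* Nonzero if a candidate at this (unit-cost) distance should be suggested. */
int
hint_allowed(int distance, const char *typed, const char *candidate)
{
    return distance <= hint_threshold(typed, candidate);
}
```

With integer division the threshold is 1 for "oid" against "col", 2 for "text" against "last", and 0 for "xmin" against "x", so a candidate that differs in three positions is rejected in each of those cases.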

Your new test cases include no negative test cases; that is, cases
where the machinery declines to suggest a hint because of, say, 3
equally good possibilities. They probably should have something like
that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#93Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#92)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Mon, Nov 17, 2014 at 10:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I'm grumpy about the distanceName() function. That seems too generic.
If we're going to keep this as it is, I suggest something like
computeRTEColumnDistance(). But see below.

Fair point.

On a related note, I'm also grumpy about this comment:

+    /* Charge half as much per deletion as per insertion or per substitution */
+    return varstr_levenshtein_less_equal(actual, len, match, match_len,
+                                         2, 1, 2, max);

The purpose of a code comment is to articulate WHY we did something,
rather than simply to restate what the code quite obviously does. I
haven't heard a compelling argument for why this should be 2, 1, 2
rather than the default 1, 1, 1; and I'm inclined to do the latter
unless you can make some very good argument for this combination of
weights. And if you can make such an argument, then there should be
comments so that the next person to come along and look at this code
doesn't go, huh, that's whacky, and change it.

Okay. I agree that that deserves a comment. The actual argument for
this formulation is that it just seems to work better that way. For
example:

"""
postgres=# \d orderlines
Table "public.orderlines"
Column | Type | Modifiers
-------------+----------+-----------
orderlineid | integer | not null
orderid | integer | not null
prod_id | integer | not null
quantity | smallint | not null
orderdate | date | not null

postgres=# select qty from orderlines ;
ERROR: 42703: column "qty" does not exist
LINE 1: select qty from orderlines ;
^
HINT: Perhaps you meant to reference the column "orderlines"."quantity".
"""

The point is that the fact that the user supplied "qty" string has so
many fewer characters than what was obviously intended - "quantity" -
deserves to be weighed less. If you change the costing to weigh
character deletion as being equal to substitution/addition, this
example breaks. I also think it's pretty common to have noise words in
every attribute (e.g. every column in the "orderlines" table matches
"orderlines_*"), which might otherwise mess things up by overcharging
for deletion. Having extra characters in the correctly spelled column
name seems legitimately less significant to me.

Or, in other words: having actual characters from the misspelling
match the correct spelling (and having actual characters given not
fail to match) is most important. What was given by the user is more
important than what was not given but should have been, which is not
generally true for uses of Levenshtein distance. I reached this
conclusion through trying out the patch with a couple of real schemas,
and seeing what works best.

It's hard to express that idea tersely, in a comment, but I guess I'll try.
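
To make the costing concrete, here is a minimal, self-contained sketch of a weighted Levenshtein distance. It is not the patch's varstr_levenshtein_less_equal(): the function name is made up, there is no early-exit bound, and comparison is byte-wise with no multibyte handling. The argument order (actual column name as the source, user-supplied string as the target, then insertion, deletion and substitution costs) is an assumption chosen to mirror the 2, 1, 2 call quoted above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Cheapest way to turn "source" into "target" using insertions, deletions
 * and substitutions with the given per-operation costs.  Returns -1 if
 * memory cannot be allocated.  Illustrative only.
 */
int
weighted_levenshtein(const char *source, const char *target,
                     int ins_c, int del_c, int sub_c)
{
    size_t      m, n, i, j;
    int        *prev, *curr, *tmp;
    int         result;

    assert(source != NULL && target != NULL);
    m = strlen(source);
    n = strlen(target);

    prev = malloc((n + 1) * sizeof(int));
    curr = malloc((n + 1) * sizeof(int));
    if (prev == NULL || curr == NULL)
    {
        free(prev);
        free(curr);
        return -1;
    }

    /* Turning an empty source into target[0..j) takes j insertions. */
    for (j = 0; j <= n; j++)
        prev[j] = (int) j * ins_c;

    for (i = 1; i <= m; i++)
    {
        /* Turning source[0..i) into an empty target takes i deletions. */
        curr[0] = (int) i * del_c;
        for (j = 1; j <= n; j++)
        {
            int         del = prev[j] + del_c;      /* drop source[i-1] */
            int         ins = curr[j - 1] + ins_c;  /* add target[j-1] */
            int         sub = prev[j - 1] +
                (source[i - 1] == target[j - 1] ? 0 : sub_c);
            int         best = del < ins ? del : ins;

            curr[j] = sub < best ? sub : best;
        }
        /* The row just filled becomes the previous row. */
        tmp = prev;
        prev = curr;
        curr = tmp;
    }

    result = prev[n];
    free(prev);
    free(curr);
    return result;
}
```

Under these assumptions, weighted_levenshtein("quantity", "qty", 2, 1, 2) is 5 (five deletions at cost 1 each), and the same holds for any three-letter string that can be reached from "quantity" purely by deleting characters. A string such as "qqq" costs 9, because two substitutions at cost 2 are needed on top of the five deletions.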

+ int location, FuzzyAttrMatchState *rtestate)

I suggest calling this "fuzzystate" rather than "rtestate"; it's not
the state of the RTE, but the state of the fuzzy matching.

The idea here was to differentiate this state from the overall range
table state (in general, FuzzyAttrMatchState may be one or the other).
But okay.

Within the scanRTEForColumn block, we have a rather large chunk of
code protected by if (rtestate), which contains the only call to
distanceName(). I suggest that we move all of this logic to a
separate, static function, and merge distanceName into it. I also
suggest testing against NULL explicitly instead of implicitly. So
this block of code would end up as something like:

if (fuzzystate != NULL)
    updateFuzzyAttrMatchState(rte, attcolname, colname, &fuzzystate);

Okay.

In searchRangeTableForCol, I'm fairly certain that you've changed the
behavior by adding a check for if (rte->rtekind == RTE_JOIN) before
the call to scanRTEForColumn(). Why not instead put this check into
updateFuzzyAttrMatchState? Then you can be sure you're not changing
the behavior in any other case.

I thought that I had avoided changing things (beyond what was
advertised as changed in relation to this most recent revision)
because I also changed things WRT multiple matches per RTE. It's
fuzzy. Anyway, yeah, I could do it there instead.

On a similar note, I think the dropped-column test should happen early
as well, probably again in updateFuzzyAttrMatchState(). There's
little point in adding a suggestion only to throw it away again.

Agreed.

+            /*
+             * Charge extra (for inexact matches only) when an alias was
+             * specified that differs from what might have been used to
+             * correctly qualify this RTE's closest column
+             */
+            if (wrongalias)
+                rtestate.distance += 3;

I don't understand what situation this is catering to. Can you
explain? It seems to account for a good deal of complexity.

Two cases:

1. Distinguishing between the case where there was an exact match to a
column that isn't visible (i.e. the existing reason for
errorMissingColumn() to call here), and the case where there is a
visible column, but our alias was the wrong one. I guess that could
live in errorMissingColumn(), but overall it's more convenient to do
it here, so that errorMissingColumn() handles things almost uniformly
and doesn't really have to care.

2. For non-exact (fuzzy) matches, it seems more useful to give one
match rather than two when the user gave an alias that matches one
particular RTE. Consider this:

"""
postgres=# select ordersid from orders o join orderlines ol on
o.orderid = ol.orderid;
ERROR: 42703: column "ordersid" does not exist
LINE 1: select ordersid from orders o join orderlines ol on o.orderi...
^
HINT: Perhaps you meant to reference the column "o"."orderid" or the
column "ol"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:3166

postgres=# select ol.ordersid from orders o join orderlines ol on
o.orderid = ol.orderid;
ERROR: 42703: column ol.ordersid does not exist
LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord...
^
HINT: Perhaps you meant to reference the column "ol"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:3147
"""

One suggestion is better than two if it's evident that that single
suggestion is a better fit. And, more broadly, the fact that an alias
was given and matches ought to be weighed.
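
The effect of the penalty in the second case can be sketched as follows. This is a simplified illustration rather than the patch's code: the Candidate struct and count_best_matches() are hypothetical, the distances are assumed to have been computed already, and the "inexact matches only" special case is ignored.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical candidate: one column of one range table entry. */
typedef struct Candidate
{
    const char *alias;          /* alias of the RTE the column belongs to */
    const char *column;         /* column name */
    int         distance;       /* precomputed distance to the typed name */
} Candidate;

#define WRONG_ALIAS_PENALTY 3

/*
 * Count how many candidates tie for the lowest adjusted distance.
 * typed_alias may be NULL when the reference was unqualified.
 * *best receives the first candidate with the lowest score, or NULL
 * when there are no candidates.
 */
int
count_best_matches(const Candidate *cands, size_t n,
                   const char *typed_alias, const Candidate **best)
{
    int         best_score = INT_MAX;
    int         ties = 0;
    size_t      i;

    assert(best != NULL);
    *best = NULL;
    for (i = 0; i < n; i++)
    {
        int         score = cands[i].distance;

        /* Charge extra when the user named a different alias. */
        if (typed_alias != NULL && strcmp(typed_alias, cands[i].alias) != 0)
            score += WRONG_ALIAS_PENALTY;

        if (score < best_score)
        {
            best_score = score;
            *best = &cands[i];
            ties = 1;
        }
        else if (score == best_score)
            ties++;
    }
    return ties;
}
```

With "o"."orderid" and "ol"."orderid" both at the same distance from "ordersid", an unqualified reference leaves two equally good candidates, while qualifying it with ol pushes the "o" candidate 3 further away and leaves a single best match.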

ERROR: column "oid" does not exist
LINE 1: select oid > 0, * from altwithoid;
^
+HINT: Perhaps you meant to reference the column "altwithoid"."col".

That seems like a stretch. I think I suggested before using a
distance threshold of at most 3 or half the word length, whichever is
less. For a three-letter column name that means not suggesting
anything if more than one character is different. What you
implemented here is close to that, yet somehow we've got a suggestion
slipping through that has 2 out of 3 characters different. I'm not
quite sure I see how that's getting through, but I think it shouldn't.

It's because I don't apply the test on smaller strings. I felt that it
was riskier to apply an absolute quality test when the user-supplied
column reference does not exceed 6 characters. Consider my "qty" vs
"quantity" example. If I make this change:

--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -929,7 +929,7 @@ searchRangeTableForCol(ParseState *pstate, const char *alias, char *colname,
         * seen when 6 deletions are required against actual attribute name, or 3
         * insertions/substitutions.
         */
-       if (state->distance > 6 && state->distance > strlen(colname) / 2)
+       if (state->distance > strlen(colname) / 2)
        {
                state->rsecond = state->rfirst = NULL;
                state->second = state->first = InvalidAttrNumber;

Then a lot of the examples you complain about are fixed. But the "qty"
example is broken. Plus, this happens when the regression tests are
run:

*** /home/pg/postgresql/src/test/regress/expected/alter_table.out 2014-11-17 11:50:16.476426191 -0800
--- /home/pg/postgresql/src/test/regress/results/alter_table.out 2014-11-17 11:57:40.776410110 -0800
***************
*** 1343,1349 ****
  ERROR:  column "f1" does not exist
  LINE 1: select f1 from c1;
                 ^
- HINT:  Perhaps you meant to reference the column "c1"."f2".
  drop table p1 cascade;
  NOTICE:  drop cascades to table c1
  create table p1 (f1 int, f2 int);

And:

*** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-17 11:50:16.480426191 -0800
--- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-17 11:57:08.916411263 -0800
***************
*** 3452,3458 ****
  ERROR:  column atts.relid does not exist
  LINE 1: select atts.relid::regclass, s.* from pg_stats s join
                 ^
- HINT:  Perhaps you meant to reference the column "atts"."indexrelid".
  --
  -- Test LATERAL
  --

(So no hint given in either case)

ERROR: column fullname.text does not exist
LINE 1: select fullname.text from fullname;
^
+HINT: Perhaps you meant to reference the column "fullname"."last".

Basically the same problem again. I think the distance threshold in
this case should be half the shorter column name, i.e. 0.

Well, there is always going to be the most marginal possible case that
still gets to see a suggestion. These are non-organic examples from
the regression tests. I'm more worried about having the suggestions
work well for organic/representative cases than I am about suppressing
non-useful suggestions in non-organic/non-representative cases. As I
mentioned, the costing is more or less derived from what I found to work
well in what I thought to be representative cases.

Your new test cases include no negative test cases; that is, cases
where the machinery declines to suggest a hint because of, say, 3
equally good possibilities. They probably should have something like
that.

I'll think about that.

--
Peter Geoghegan


#94Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#93)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan <pg@heroku.com> wrote:

postgres=# select qty from orderlines ;
ERROR: 42703: column "qty" does not exist
LINE 1: select qty from orderlines ;
^
HINT: Perhaps you meant to reference the column "orderlines"."quantity".
"""

I don't buy this example, because it would give you the same hint if
you told it you wanted to access a column called ant, or uay, or tit.
And that's clearly ridiculous. The reason why quantity looks like a
reasonable suggestion for qty is because it's a conventional
abbreviation, but an extremely high percentage of comparable cases
won't be.

+            /*
+             * Charge extra (for inexact matches only) when an alias was
+             * specified that differs from what might have been used to
+             * correctly qualify this RTE's closest column
+             */
+            if (wrongalias)
+                rtestate.distance += 3;

I don't understand what situation this is catering to. Can you
explain? It seems to account for a good deal of complexity.

Two cases:

1. Distinguishing between the case where there was an exact match to a
column that isn't visible (i.e. the existing reason for
errorMissingColumn() to call here), and the case where there is a
visible column, but our alias was the wrong one. I guess that could
live in errorMissingColumn(), but overall it's more convenient to do
it here, so that errorMissingColumn() handles things almost uniformly
and doesn't really have to care.

2. For non-exact (fuzzy) matches, it seems more useful to give one
match rather than two when the user gave an alias that matches one
particular RTE. Consider this:

"""
postgres=# select ordersid from orders o join orderlines ol on
o.orderid = ol.orderid;
ERROR: 42703: column "ordersid" does not exist
LINE 1: select ordersid from orders o join orderlines ol on o.orderi...
^
HINT: Perhaps you meant to reference the column "o"."orderid" or the
column "ol"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:3166

postgres=# select ol.ordersid from orders o join orderlines ol on
o.orderid = ol.orderid;
ERROR: 42703: column ol.ordersid does not exist
LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord...
^
HINT: Perhaps you meant to reference the column "ol"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:3147
"""

I guess I'm confused at a broader level. If the alias is wrong, why
are we considering names in this RTE *at all*?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#95Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#94)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Nov 18, 2014 at 3:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan <pg@heroku.com> wrote:

postgres=# select qty from orderlines ;
ERROR: 42703: column "qty" does not exist
LINE 1: select qty from orderlines ;
^
HINT: Perhaps you meant to reference the column "orderlines"."quantity".
"""

I don't buy this example, because it would give you the same hint if
you told it you wanted to access a column called ant, or uay, or tit.
And that's clearly ridiculous. The reason why quantity looks like a
reasonable suggestion for qty is because it's a conventional
abbreviation, but an extremely high percentage of comparable cases
won't be.

Is that so terrible? Yes, if those *exact* strings are tried, that'll
happen. But the vast majority of 3 letter strings will not do that
(including many 3 letter strings that include one of the letters 'q',
't' and 'y', such as "qqq", "ttt", and "yyy"). Why, in practice, would
someone even attempt those strings? I'm worried about Murphy, not
Machiavelli. That seems like a pretty important distinction here.

I maintain that omission of part of the correct spelling should be
weighed less. I am optimizing for the case where the user has a rough
idea of the structure and spelling of things - if they're typing in
random strings, or totally distinct synonyms, there is little we can
do about that. As I said, there will always be the most marginal case
that still gets a suggestion. I see no reason to hurt the common case
where we help in order to save the user from seeing a "ridiculous"
suggestion. I have a final test for the absolute quality of a
suggestion, but I think we could easily be too conservative about
that. At worst, our "ridiculous" suggestion makes apparent that the
user's incorrect spelling was itself ridiculous. With larger strings,
we can afford to be more conservative, and we are, because we have
more information to go on. Terse column names are not uncommon,
though.

+            /*
+             * Charge extra (for inexact matches only) when an alias was
+             * specified that differs from what might have been used to
+             * correctly qualify this RTE's closest column
+             */
+            if (wrongalias)
+                rtestate.distance += 3;

I don't understand what situation this is catering to. Can you
explain? It seems to account for a good deal of complexity.

I guess I'm confused at a broader level. If the alias is wrong, why
are we considering names in this RTE *at all*?

Because it's a common mistake when writing ad-hoc queries. People may
forget which exact table their column comes from. We certainly want to
weigh the fact that an alias was specified, but it shouldn't totally
limit our guess to that RTE. If nothing else, the fact that there was
a much closer match from another RTE ought to result in forcing there
to be no suggestion (due to there being too many equally good
suggestions). That's because, as I said, an *absolute* test for the
quality of a match is problematic (which, again, is why I err on the
side of letting the final, "absolute quality" test not limit
suggestions, particularly with short strings).

--
Peter Geoghegan


#96Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#95)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Peter Geoghegan <pg@heroku.com> writes:

On Tue, Nov 18, 2014 at 3:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan <pg@heroku.com> wrote:

postgres=# select qty from orderlines ;
ERROR: 42703: column "qty" does not exist
HINT: Perhaps you meant to reference the column "orderlines"."quantity".

I don't buy this example, because it would give you the same hint if
you told it you wanted to access a column called ant, or uay, or tit.
And that's clearly ridiculous. The reason why quantity looks like a
reasonable suggestion for qty is because it's a conventional
abbreviation, but an extremely high percentage of comparable cases
won't be.

I maintain that omission of part of the correct spelling should be
weighed less.

I would say that omission of the first letter should completely disqualify
suggestions based on this heuristic; but it might make sense to weight
omissions less after the first letter.

regards, tom lane


#97Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#96)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Nov 18, 2014 at 8:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Geoghegan <pg@heroku.com> writes:

On Tue, Nov 18, 2014 at 3:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Nov 17, 2014 at 3:04 PM, Peter Geoghegan <pg@heroku.com> wrote:

postgres=# select qty from orderlines ;
ERROR: 42703: column "qty" does not exist
HINT: Perhaps you meant to reference the column "orderlines"."quantity".

I don't buy this example, because it would give you the same hint if
you told it you wanted to access a column called ant, or uay, or tit.
And that's clearly ridiculous. The reason why quantity looks like a
reasonable suggestion for qty is because it's a conventional
abbreviation, but an extremely high percentage of comparable cases
won't be.

I maintain that omission of part of the correct spelling should be
weighed less.

I would say that omission of the first letter should completely disqualify
suggestions based on this heuristic; but it might make sense to weight
omissions less after the first letter.

I think we would be well-advised not to start inventing our own
approximate matching algorithm. Peter's suggestion boils down to a
guess that the default cost parameters for Levenshtein suck, and your
suggestion boils down to a guess that we can fix the problems with
Peter's suggestion by bolting another heuristic on top of it - and
possibly running Levenshtein twice with different sets of cost
parameters. Ugh.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#98Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#97)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 5:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I think we would be well-advised not to start inventing our own
approximate matching algorithm. Peter's suggestion boils down to a
guess that the default cost parameters for Levenshtein suck, and your
suggestion boils down to a guess that we can fix the problems with
Peter's suggestion by bolting another heuristic on top of it - and
possibly running Levenshtein twice with different sets of cost
parameters. Ugh.

I agree.

While I am perfectly comfortable with the fact that we are guessing
here, my guesses are based on what I observed to work well with real
schemas, and simulated errors that I thought were representative of
human error. Obviously it's possible that another scheme will do
better sometimes, including for example a scheme that picks a match
entirely at random. But on average, I think that what I have here will
do better than anything else proposed so far.

--
Peter Geoghegan


#99Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#98)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 12:33 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Nov 19, 2014 at 5:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I think we would be well-advised not to start inventing our own
approximate matching algorithm. Peter's suggestion boils down to a
guess that the default cost parameters for Levenshtein suck, and your
suggestion boils down to a guess that we can fix the problems with
Peter's suggestion by bolting another heuristic on top of it - and
possibly running Levenshtein twice with different sets of cost
parameters. Ugh.

I agree.

While I am perfectly comfortable with the fact that we are guessing
here, my guesses are based on what I observed to work well with real
schemas, and simulated errors that I thought were representative of
human error. Obviously it's possible that another scheme will do
better sometimes, including for example a scheme that picks a match
entirely at random. But on average, I think that what I have here will
do better than anything else proposed so far.

If you agree, then I'm not being clear enough. I don't think
that tinkering with the Levenshtein cost factors is a good idea, and I
think it's unhelpful to suggest something when the suggestion and the
original word differ by more than 50% of the number of characters in the
shorter word. Suggesting "col" for "oid" or "x" for "xmax", as crops
up in the regression tests with this patch applied, shows the folly of
this: the user didn't mean the other named column; rather, the user
was confused about whether a particular system column existed for that
table.

If we had a large database of examples showing what the user typed and
what they intended, we could try different algorithms against it and
see which one performs best with fewest false positives. But if we
don't have that, we should do things that are like the things that
other people have done before.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#100Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#99)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 9:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:

If you agree, then I'm not being clear enough. I don't think
that tinkering with the Levenshtein cost factors is a good idea, and I
think it's unhelpful to suggest something when the suggestion and the
original word differ by more than 50% of the number of characters in the
shorter word.

I agree - except for very short strings, where there is insufficient
information to go on.

I was talking about the difficulty of bolting something on top of
Levenshtein distance that looked at the first character. That would
not need two costings, but would require an encoding aware matching of
the first code point.

Suggesting "col" for "oid" or "x" for "xmax", as crops
up in the regression tests with this patch applied, shows the folly of
this: the user didn't mean the other named column; rather, the user
was confused about whether a particular system column existed for that
table.

Those are all very terse strings. What you're overlooking is what is
broken by using straight Levenshtein distance, which includes things
in the regression test that are reasonable and helpful. As I mentioned
before, requiring a greater than 50% of total string size distance
breaks this, just within the regression tests:

"""
ERROR: column "f1" does not exist
LINE 1: select f1 from c1;
^
- HINT: Perhaps you meant to reference the column "c1"."f2".
"""

And:

"""
ERROR: column atts.relid does not exist
LINE 1: select atts.relid::regclass, s.* from pg_stats s join
^
- HINT: Perhaps you meant to reference the column "atts"."indexrelid".
"""

Those are really useful suggestions! And, they're much more
representative of real user error.

The downside of weighing deletion less than substitution and insertion
is much smaller than the upside. It's worth it. The downside is only
that the user gets to see the best suggestion that isn't all that good
in an absolute sense (which we have a much harder time concluding
using simple tests for short misspellings).

If we had a large database of examples showing what the user typed and
what they intended, we could try different algorithms against it and
see which one performs best with fewest false positives. But if we
don't have that, we should do things that are like the things that
other people have done before.

That seems totally impractical. No one has that kind of data that I'm aware of.

How about git as a kind of precedent? It is not at all conservative
about showing *some* suggestion:

"""
$ git aa
git: 'aa' is not a git command. See 'git --help'.

Did you mean this?
am

$ git d
git: 'd' is not a git command. See 'git --help'.

Did you mean one of these?
diff
add

$ git ddd
git: 'ddd' is not a git command. See 'git --help'.

Did you mean this?
add
"""

And why wouldn't git be? As far as it's concerned, you can only have
meant one of those small number of things. Similarly, with the patch,
the number of things we can pick from is fairly limited at this stage,
since we are actually fairly far along with parse analysis.

Now, this won't give a suggestion:

"""
$ git aaaa
git: 'aaaa' is not a git command. See 'git --help'.
"""

So it looks like git similarly weighs deletion less than
insertion/substitution. Are its suggestions any less ridiculous than
your examples of questionable hints from the modified regression test
expected output? This is just a guidance mechanism, and at worst we'll
show the best match (on the mechanism's own terms, which isn't too
bad).

--
Peter Geoghegan


#101Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#100)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 10:22 AM, Peter Geoghegan <pg@heroku.com> wrote:

Those are all very terse strings. What you're overlooking is what is
broken by using straight Levenshtein distance, which includes things
in the regression test that are reasonable and helpful. As I mentioned
before, requiring a greater than 50% of total string size distance
breaks this, just within the regression tests:

Maybe you'd prefer if there was a more gradual ramp-up to requiring a
distance of no greater than 50% of the string size (normalized to take
account of my non-default costings). Right now it's a step function of
the number of characters in the string - there is no "absolute
quality" requirement for strings of 6 or fewer requirements.
Otherwise, there is the 50% distance absolute quality test (the test
that you want to be applied generally). I think that would be better,
without being much more complicated.

--
Peter Geoghegan


#102Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#101)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 10:33 AM, Peter Geoghegan <pg@heroku.com> wrote:

there is no "absolute
quality" requirement for strings of 6 or fewer requirements.

I meant 6 or fewer *characters*, obviously.

--
Peter Geoghegan


#103Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#101)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 10:33 AM, Peter Geoghegan <pg@heroku.com> wrote:

Maybe you'd prefer if there was a more gradual ramp-up to requiring a
distance of no greater than 50% of the string size (normalized to take
account of my non-default costings)

I made this modification:

diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 40c69d7..cca075f 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -929,7 +929,8 @@ searchRangeTableForCol(ParseState *pstate, const char *alias, char *colname,
         * seen when 6 deletions are required against actual attribute name, or 3
         * insertions/substitutions.
         */
-       if (state->distance > 6 && state->distance > strlen(colname) / 2)
+       if ((state->distance > 3 && state->distance > strlen(colname)) ||
+               (state->distance > 6 && state->distance > strlen(colname) / 2))
        {
                state->rsecond = state->rfirst = NULL;
                state->second = state->first = InvalidAttrNumber;

When I run the regression tests now, then all the cases that you found
objectionable in the regression tests' previous expected output
disappear, while all the cases I think are useful that were previously
removed by applying a broad 50% standard remain. While I'm not 100%
sure that this exact formulation is the best one, I think that we can
reach a compromise on this point, that allows the costing to remain
the same without offering particularly bad suggestions for short
strings.
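
To see what the change does in isolation, the two forms of the test can be pulled out into standalone predicates. This is a sketch only: rejected_before() and rejected_after() are made-up names, the distance is assumed to be the weighted (2, 1, 2) distance, and plain int is used for the length where the patch compares against strlen(colname).

```c
#include <assert.h>

/* Nonzero if the original test would throw the best match away. */
int
rejected_before(int distance, int typed_len)
{
    assert(distance >= 0 && typed_len >= 0);
    return distance > 6 && distance > typed_len / 2;
}

/* Nonzero if the revised test would throw the best match away. */
int
rejected_after(int distance, int typed_len)
{
    assert(distance >= 0 && typed_len >= 0);
    return (distance > 3 && distance > typed_len) ||
        (distance > 6 && distance > typed_len / 2);
}
```

Under these assumptions a weighted distance of 6 against a four-character name, or 5 against a three-character name, is kept by the first form and discarded by the second. A distance of 2 against a two-character name, or 5 against a five-character name, is kept by both.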

--
Peter Geoghegan


#104Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#100)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 1:22 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Nov 19, 2014 at 9:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:

If you agree, then I'm not being clear enough. I don't think
that tinkering with the Levenshtein cost factors is a good idea, and I
think it's unhelpful to suggest something when the suggestion and the
original word differ by more than 50% of the number of characters in the
shorter word.

I agree - except for very short strings, where there is insufficient
information to go on.

That's precisely the time I think it's *most* important. In a very
long string, the threshold should be LESS than 50%. My original
proposal was "no more than 3 characters of difference, but in any
event not more than half the length of the shorter string".

Suggesting "col" for "oid" or "x" for "xmax", as crops
up in the regression tests with this patch applied, shows the folly of
this: the user didn't mean the other named column; rather, the user
was confused about whether a particular system column existed for that
table.

Those are all very terse strings. What you're overlooking is what is
broken by using straight Levenshtein distance, which includes things
in the regression test that are reasonable and helpful. As I mentioned
before, requiring a greater than 50% of total string size distance
breaks this, just within the regression tests:

"""
ERROR: column "f1" does not exist
LINE 1: select f1 from c1;
^
- HINT: Perhaps you meant to reference the column "c1"."f2".
"""

That's exactly 50%, not more than 50%.

(I'm also on the fence about whether the hint is actually helpful in
that case, but the rule I proposed wouldn't prohibit it.)

And:

"""
ERROR: column atts.relid does not exist
LINE 1: select atts.relid::regclass, s.* from pg_stats s join
^
- HINT: Perhaps you meant to reference the column "atts"."indexrelid".
"""

Those are really useful suggestions! And, they're much more
representative of real user error.

That one's right at 50% too, but it's certainly more than 3 characters
of difference. I think it's going to be pretty hard to emit a
suggestion in that case but not in a whole lot of cases that don't
make any sense.
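
A minimal sketch of the rule quoted above ("no more than 3 characters of
difference, but in any event not more than half the length of the
shorter string"), assuming distance is a plain unit-cost Levenshtein
distance computed elsewhere. The function name is made up for
illustration.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: true when a hint would still be allowed under the
 * "at most 3 edits, and at most half the shorter string" rule.
 * "distance" is assumed to be a non-negative unit-cost edit distance.
 */
bool
hint_allowed(int distance, const char *typed, const char *candidate)
{
    size_t a = strlen(typed);
    size_t b = strlen(candidate);
    size_t shorter = (a < b) ? a : b;

    if (distance > 3)
        return false;

    /* Compare 2 * distance against the length to avoid rounding down. */
    return (size_t) distance * 2 <= shorter;
}
```

Under this sketch "f1" against "f2" (distance 1) is allowed, while "oid"
against "col" (distance 3), "xmax" against "x" (distance 3) and "relid"
against "indexrelid" (distance 5) are not.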

How about git as a kind of precedent? It is not at all conservative
about showing *some* suggestion:

"""
$ git aa
git: 'aa' is not a git command. See 'git --help'.

Did you mean this?
am

$ git d
git: 'd' is not a git command. See 'git --help'.

Did you mean one of these?
diff
add

$ git ddd
git: 'ddd' is not a git command. See 'git --help'.

Did you mean this?
add
"""

And why wouldn't git be? As far as it's concerned, you can only have
meant one of those small number of things. Similarly, with the patch,
the number of things we can pick from is fairly limited at this stage,
since we are actually fairly far along with parse analysis.

I went and found the actual code git uses for this. It's here:

https://github.com/git/git/blob/d29e9c89dbbf0876145dc88615b99308cab5f187/help.c

And the underlying Levenshtein implementation is here:

https://github.com/git/git/blob/398dd4bd039680ba98497fbedffa415a43583c16/levenshtein.c

Apparently what they're doing is charging 0 for a transposition (which
we don't have as a separate concept), 2 for a substitution, 1 for an
insertion, and 3 for a deletion, with the constraint that anything
with a total distance of more than 6 isn't considered. And that does
overall seem to give pretty good suggestions. However, an interesting
point about the git algorithm is that it's not hard to make it do
stupid things on short strings:

[rhaas pgsql]$ git xy
git: 'xy' is not a git command. See 'git --help'.

Did you mean one of these?
am
gc
mv
rm

Maybe they should adopt my idea of a lower cutoff for short strings. :-)
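
To make the costing concrete, here is a rough sketch of an edit distance
with separate weights for transposition, substitution, insertion and
deletion. It is not git's code and not the patch's code; it is a
textbook restricted Damerau-Levenshtein table, and the function name and
argument order are assumptions made for the example. "src" is what the
user typed and "dst" is the candidate.

```c
#include <stdlib.h>
#include <string.h>

static int
min_int(int a, int b)
{
    return (a < b) ? a : b;
}

/*
 * Cost of turning src into dst with per-operation weights.
 * Adjacent transpositions are handled as a single operation.
 * Returns -1 if memory cannot be allocated.
 */
int
weighted_distance(const char *src, const char *dst,
                  int w_swap, int w_sub, int w_ins, int w_del)
{
    size_t n = strlen(src);
    size_t m = strlen(dst);
    size_t cols = m + 1;
    size_t i, j;
    int   *d = malloc((n + 1) * cols * sizeof(int));
    int    result;

    if (d == NULL)
        return -1;

    /* Emptying a prefix of src costs deletions; building dst costs insertions. */
    for (i = 0; i <= n; i++)
        d[i * cols] = (int) i * w_del;
    for (j = 0; j <= m; j++)
        d[j] = (int) j * w_ins;

    for (i = 1; i <= n; i++)
    {
        for (j = 1; j <= m; j++)
        {
            int best = d[(i - 1) * cols + (j - 1)] +
                ((src[i - 1] == dst[j - 1]) ? 0 : w_sub);

            best = min_int(best, d[(i - 1) * cols + j] + w_del);
            best = min_int(best, d[i * cols + (j - 1)] + w_ins);

            /* Two adjacent characters swapped. */
            if (i > 1 && j > 1 &&
                src[i - 1] == dst[j - 2] && src[i - 2] == dst[j - 1])
                best = min_int(best, d[(i - 2) * cols + (j - 2)] + w_swap);

            d[i * cols + j] = best;
        }
    }

    result = d[n * cols + m];
    free(d);
    return result;
}
```

With the weights described above (0, 2, 1, 3), this sketch gives 0 for
"quanttiy" against "quantity", 4 for "xy" against "am" (which is why a
cap of 6 still lets it through), and 5 for "relid" against
"indexrelid".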

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#105Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#104)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 11:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Apparently what they're doing is charging 0 for a transposition (which
we don't have as a separate concept), 2 for a substitution, 1 for an
insertion, and 3 for a deletion, with the constraint that anything
with a total distance of more than 6 isn't considered. And that does
overall seem to give pretty good suggestions.

The git people know that no git command is longer than 4 or 5
characters. That doesn't apply to us. I certainly would not like to
have an absolute distance test of n characters.

--
Peter Geoghegan

#106Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#104)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 11:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:

That's precisely the time I think it's *most* important. In a very
long string, the threshold should be LESS than 50%. My original
proposal was "no more than 3 characters of difference, but in any
event not more than half the length of the shorter string".

We can only hint based on the information given by the user. If they
give a lot of badly matching information, we have something to go on.

That one's right at 50% too, but it's certainly more than 3 characters
of difference. I think it's going to be pretty hard to emit a
suggestion in that case but not in a whole lot of cases that don't
make any sense.

I don't think that's the case. Other RTEs are penalized for having
non-matching aliases here.

In general, I think the cost of a bad suggestion is much lower than
the benefit of a good one. You seem to be suggesting that they're
equal. Or that they're equally likely in an organic situation. In my
estimation, this is not the case at all.

I'm curious about your thoughts on the compromise of a ramped up
distance threshold to apply a test for the absolute quality of a
match. I think that the fact that git gives bad suggestions with terse
strings tells us a lot, though. Note that unlike git, with terse
strings we may well have a good deal more equidistant matches, and as
soon as the number of would-be matches exceeds 2, we actually give no
matches at all. So that's an additional protection against poor
matches with terse strings.
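
The equidistant-match protection can be sketched roughly as follows.
This is an illustration only, with hypothetical names (MatchState,
match_offer, match_count); the patch keeps its own state in a different
structure.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical bookkeeping for the closest candidates seen so far. */
typedef struct
{
    int         best;    /* smallest distance seen so far */
    int         nties;   /* number of candidates at that distance */
    const char *first;   /* first candidate at the best distance */
    const char *second;  /* second candidate at the best distance, if any */
} MatchState;

void
match_init(MatchState *st)
{
    st->best = INT_MAX;
    st->nties = 0;
    st->first = NULL;
    st->second = NULL;
}

/* Consider one candidate column name and its distance. */
void
match_offer(MatchState *st, const char *candidate, int distance)
{
    if (distance < st->best)
    {
        /* Strictly better: forget everything seen before. */
        st->best = distance;
        st->nties = 1;
        st->first = candidate;
        st->second = NULL;
    }
    else if (distance == st->best)
    {
        st->nties++;
        if (st->nties == 2)
            st->second = candidate;
    }
}

/* Number of names to hint: 0, 1 or 2. More than two ties means no hint. */
int
match_count(const MatchState *st)
{
    return (st->nties > 2) ? 0 : st->nties;
}
```

A third candidate at the same distance as the first two drops the count
back to zero, so terse names with many equally poor neighbours produce
no hint at all.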

--
Peter Geoghegan

#107Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#104)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Robert Haas wrote:

And the underlying Levenshtein implementation is here:

https://github.com/git/git/blob/398dd4bd039680ba98497fbedffa415a43583c16/levenshtein.c

Apparently what they're doing is charging 0 for a transposition (which
we don't have as a separate concept), 2 for a substitution, 1 for an
insertion, and 3 for a deletion, with the constraint that anything
with a total distance of more than 6 isn't considered.

0 for a transposition, wow. I suggested adding transpositions but there
was no support for that idea. I suggested it because I thikn it's the
most common form of typo, and charging 2 for a deletion plus 1 for an
insertion makes a single transposition mistaek count as 3, which seems
wrong -- particularly seeing the git precedent.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#108Peter Geoghegan
pg@heroku.com
In reply to: Alvaro Herrera (#107)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

0 for a transposition, wow.

Again, they're optimizing for short strings (git commands) only. There
just aren't that many transposition errors possible with a 4 character
string.

--
Peter Geoghegan

#109Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Geoghegan (#108)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Peter Geoghegan wrote:

On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

0 for a transposition, wow.

Again, they're optimizing for short strings (git commands) only. There
just aren't that many transposition errors possible with a 4 character
string.

If there's logic in your statement, I can't see it.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#110Peter Geoghegan
pg@heroku.com
In reply to: Alvaro Herrera (#109)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 11:54 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Peter Geoghegan wrote:

On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

0 for a transposition, wow.

Again, they're optimizing for short strings (git commands) only. There
just aren't that many transposition errors possible with a 4 character
string.

If there's logic in your statement, I can't see it.

The point is that transposition errors should not have no cost. If git
did not have an absolute quality test of a distance of 6, which they
can only have because all git commands are terse, then you could
construct a counter example.

--
Peter Geoghegan

#111Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Geoghegan (#110)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Peter Geoghegan wrote:

On Wed, Nov 19, 2014 at 11:54 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Peter Geoghegan wrote:

On Wed, Nov 19, 2014 at 11:34 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

0 for a transposition, wow.

Again, they're optimizing for short strings (git commands) only. There
just aren't that many transposition errors possible with a 4 character
string.

If there's logic in your statement, I can't see it.

The point is that transposition errors should not have no cost. If git
did not have an absolute quality test of a distance of 6, which they
can only have because all git commands are terse, then you could
construct a counter example.

Okay. My point is just that whatever the string length, I think we'd do
well to regard transpositions as "cheap" in terms of error cost.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#112Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#106)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Nov 19, 2014 at 2:33 PM, Peter Geoghegan <pg@heroku.com> wrote:

I don't think that's the case. Other RTEs are penalized for having
non-matching aliases here.

The point I made did not, as far as I can see, have anything to do
with non-matching aliases; it could arise with just a single RTE.

In general, I think the cost of a bad suggestion is much lower than
the benefit of a good one. You seem to be suggesting that they're
equal. Or that they're equally likely in an organic situation. In my
estimation, this is not the case at all.

The way I see it, the main cost of a bad suggestion is that it annoys
the user with clutter which they may brand as "stupid". Think about
how much vitriol has been spewed over the years against progress bars
(or estimated completion times) that don't turn out to mirror reality.
Microsoft has gotten more cumulative flak about their inaccurate
progress bars over the years than they would have for dropping an
elevator on a cute baby. Or think about how annoying it is when a
spell-checker or grammar-checker underlines something you've written
that is, in your own opinion, correctly spelled or grammatical. Maybe
that kind of thing doesn't annoy you, but it definitely annoys me, and
I think probably a lot of other people. My general experience is that
people get quite pissed off by bad suggestions from a computer. At
least in my experience, users' actual level of agitation is often all
out of proportion to what might seem justified, but we are designing
this software for actual users, so their likely emotional reactions
are relevant.

I'm curious about your thoughts on the compromise of a ramped up
distance threshold to apply a test for the absolute quality of a
match. I think that the fact that git gives bad suggestions with terse
strings tells us a lot, though. Note that unlike git, with terse
strings we may well have a good deal more equidistant matches, and as
soon as the number of would-be matches exceeds 2, we actually give no
matches at all. So that's an additional protection against poor
matches with terse strings.

I don't know what you mean by a ramped-up distance threshold, exactly.
I think it's good for the distance threshold to be lower for small
strings and higher for large ones. I think I'm somewhat open to
negotiation on the details, but I think any system that's going to
suggest "quantity" for "tit" is going too far. If the user types
"qty" when they meant "quantity", they probably don't really need the
hint, because they're going to say to themselves "wait, I guess I
didn't abbreviate that". The time when they need the hint is when
they typed "quanttiy", because it's quite possible to read a query
with that sort of typo multiple times and not realize that you've made
one. You're sitting there puzzling over where the quantity column
went, and asking yourself how you can be mis-remembering the schema,
and saying "wait, didn't I just see that column in the \d output" ...
and you don't even think to check carefully for a spelling mistake.
The hint may well clue you in to what the real problem is.

In other words, I think there's value in trying to clue somebody in
when they've made a typo, but not when they've made a think-o. We
won't be able to do the latter accurately enough to make it more
useful than annoying.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#113Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#112)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Robert Haas <robertmhaas@gmail.com> writes:

... In other words, I think there's value in trying to clue somebody in
when they've made a typo, but not when they've made a think-o. We
won't be able to do the latter accurately enough to make it more
useful than annoying.

FWIW, I concur with Robert's analysis that wrong suggestions are likely to
be annoying. We should be erring on the side of not making a suggestion
rather than making one that's a low-probability guess.

I'm not particularly convinced that the "f1" -> "f2" example is a useful
behavior, and I'm downright horrified by the "qty" -> "quantity" case.
If the hint mechanism thinks the latter two are close enough together
to suggest, it's going to be spewing a whole lot of utterly ridiculous
suggestions. I'm going to be annoyed way more times than I'm going to
be helped.

The big picture is that this is more or less our first venture into
heuristic suggestions. I think we should start slow with a very
conservative set of heuristics. If it's a success maybe we can get more
aggressive over time --- but if we go over the top here, the entire
concept will be discredited in this community for the next ten years.

regards, tom lane

#114Peter Geoghegan
pg@heroku.com
In reply to: Tom Lane (#113)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 20, 2014 at 8:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm not particularly convinced that the "f1" -> "f2" example is a useful
behavior, and I'm downright horrified by the "qty" -> "quantity" case.
If the hint mechanism thinks the latter two are close enough together
to suggest, it's going to be spewing a whole lot of utterly ridiculous
suggestions. I'm going to be annoyed way more times than I'm going to
be helped.

I happen to think that that isn't the case, because the number of
possible suggestions is fairly low anyway, and people don't tend to
make those kinds of errors. Robert's examples of "ridiculous"
suggestions of "quantity" based on three letter strings other than
"qty" (e.g. "tit") were rather contrived. In fact, most 3 letter
strings will not offer a suggestion. Three or more equidistant would-be
matches tend to offer a lot of additional protection against bad
suggestions for these terse strings.

The big picture is that this is more or less our first venture into
heuristic suggestions. I think we should start slow with a very
conservative set of heuristics. If it's a success maybe we can get more
aggressive over time --- but if we go over the top here, the entire
concept will be discredited in this community for the next ten years.

I certainly see your point here. It's not as if we have an *evolved*
understanding of the usability issues. Besides, as Robert pointed out,
most of the value of this patch is added by simple cases, like a
failure to pluralize or not pluralize, or the omission of an
underscore.

I still think we should charge half for deletion, but I will concede
that it's prudent to apply a more restrictive absolute quality final
test.
--
Peter Geoghegan

#115Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#112)
1 attachment(s)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 20, 2014 at 7:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:

In general, I think the cost of a bad suggestion is much lower than
the benefit of a good one. You seem to be suggesting that they're
equal. Or that they're equally likely in an organic situation. In my
estimation, this is not the case at all.

The way I see it, the main cost of a bad suggestion is that it annoys
the user with clutter which they may brand as "stupid". Think about
how much vitriol has been spewed over the years against progress bars
(or estimated completion times) that don't turn out to mirror reality.

Well, you can judge the quality of the suggestion immediately. I
imagined a mechanism that gives a little bit more than the minimum
amount of guidance for things like contractions/abbreviations.

Microsoft has gotten more cumulative flak about their inaccurate
progress bars over the years than they would have for dropping an
elevator on a cute baby.

I haven't used a more recent version of Windows than Windows Vista,
but I'm pretty sure that they kept it up.

I'm curious about your thoughts on the compromise of a ramped up
distance threshold to apply a test for the absolute quality of a
match. I think that the fact that git gives bad suggestions with terse
strings tells us a lot, though. Note that unlike git, with terse
strings we may well have a good deal more equidistant matches, and as
soon as the number of would-be matches exceeds 2, we actually give no
matches at all. So that's an additional protection against poor
matches with terse strings.

I don't know what you mean by a ramped-up distance threshold, exactly.
I think it's good for the distance threshold to be lower for small
strings and higher for large ones. I think I'm somewhat open to
negotiation on the details, but I think any system that's going to
suggest "quantity" for "tit" is going too far.

I mean the suggestion of raising the cost threshold more gradually,
not as a step function of the number of characters in the string [1]
where it's either over 6 characters and must pass the 50% test, or
isn't and has no absolute quality test. The exact modification I
described will FWIW remove the "quantity" for "qty" suggestion, as
well as all the similar suggestions that you found objectionable (like
"tit" also offering a suggestion of "quantity").

If you look at the regression tests, none of the sensible suggestions
are lost (some would be by an across the board 50% absolute quality
threshold, as I previously pointed out [2]), but all the bad ones are.
I attach failed regression test output showing the difference between
the previous expected values, and actual values with that small
modification - it looks like most or all bad cases are now fixed.

If the user types
"qty" when they meant "quantity", they probably don't really need the
hint, because they're going to say to themselves "wait, I guess I
didn't abbreviate that". The time when they need the hint is when
they typed "quanttiy", because it's quite possible to read a query
with that sort of typo multiple times and not realize that you've made
one.

I agree that that's a more important case.

In other words, I think there's value in trying to clue somebody in
when they've made a typo, but not when they've made a think-o. We
won't be able to do the latter accurately enough to make it more
useful than annoying.

That's certainly true; I think that we only disagree about the exact
point at which we enter the think-o correction business.

[1]: /messages/by-id/CAM3SWZT+7hH29Go6ZuY2OrCS40=6yPVM_nt9NjfovP3XwjixDw@mail.gmail.com
[2]: /messages/by-id/CAM3SWZTSGokNhT8rK+0Eed7spNJg4pAdMbqqYi0FH9bWcNvTGA@mail.gmail.com
--
Peter Geoghegan

Attachments:

regression.diffs (application/octet-stream)
*** /home/pg/postgresql/src/test/regress/expected/rules.out	2014-11-20 10:17:55.046291912 -0800
--- /home/pg/postgresql/src/test/regress/results/rules.out	2014-11-20 10:20:41.062285903 -0800
***************
*** 2396,2402 ****
  ERROR:  column "xmin" does not exist
  LINE 1: select xmin, * from fooview;
                 ^
- HINT:  Perhaps you meant to reference the column "fooview"."x".
  select reltoastrelid, relkind, relfrozenxid
    from pg_class where oid = 'fooview'::regclass;
   reltoastrelid | relkind | relfrozenxid 
--- 2396,2401 ----

======================================================================

*** /home/pg/postgresql/src/test/regress/expected/plpgsql.out	2014-11-20 10:17:55.046291912 -0800
--- /home/pg/postgresql/src/test/regress/results/plpgsql.out	2014-11-20 10:20:51.230285535 -0800
***************
*** 4782,4788 ****
  ERROR:  column "foo" does not exist
  LINE 1: SELECT rtrim(roomno) AS roomno, foo FROM Room ORDER BY roomn...
                                          ^
- HINT:  Perhaps you meant to reference the column "room"."roomno".
  QUERY:  SELECT rtrim(roomno) AS roomno, foo FROM Room ORDER BY roomno
  CONTEXT:  PL/pgSQL function inline_code_block line 4 at FOR over SELECT rows
  -- Check handling of errors thrown from/into anonymous code blocks.
--- 4782,4787 ----

======================================================================

*** /home/pg/postgresql/src/test/regress/expected/without_oid.out	2014-11-20 10:17:55.050291912 -0800
--- /home/pg/postgresql/src/test/regress/results/without_oid.out	2014-11-20 10:20:52.702285482 -0800
***************
*** 46,52 ****
  ERROR:  column "oid" does not exist
  LINE 1: SELECT count(oid) FROM wo;
                       ^
- HINT:  Perhaps you meant to reference the column "wo"."i".
  VACUUM ANALYZE wi;
  VACUUM ANALYZE wo;
  SELECT min(relpages) < max(relpages), min(reltuples) - max(reltuples)
--- 46,51 ----
***************
*** 82,88 ****
  ERROR:  column "oid" does not exist
  LINE 1: SELECT count(oid) FROM create_table_test3;
                       ^
- HINT:  Perhaps you meant to reference the column "create_table_test3"."c1" or the column "create_table_test3"."c2".
  PREPARE table_source(int) AS
      SELECT a + b AS c1, a - b AS c2, $1 AS c3 FROM create_table_test;
  CREATE TABLE execute_with WITH OIDS AS EXECUTE table_source(1);
--- 81,86 ----

======================================================================

*** /home/pg/postgresql/src/test/regress/expected/alter_table.out	2014-11-20 10:17:55.038291912 -0800
--- /home/pg/postgresql/src/test/regress/results/alter_table.out	2014-11-20 10:20:56.710285337 -0800
***************
*** 1482,1488 ****
  ERROR:  column "oid" does not exist
  LINE 1: select oid > 0, * from altstartwith;
                 ^
- HINT:  Perhaps you meant to reference the column "altstartwith"."col".
  select * from altstartwith;
   col 
  -----
--- 1482,1487 ----
***************
*** 1519,1530 ****
  ERROR:  column "oid" does not exist
  LINE 1: select oid > 0, * from altwithoid;
                 ^
- HINT:  Perhaps you meant to reference the column "altwithoid"."col".
  select oid > 0, * from altinhoid; -- fails
  ERROR:  column "oid" does not exist
  LINE 1: select oid > 0, * from altinhoid;
                 ^
- HINT:  Perhaps you meant to reference the column "altinhoid"."col".
  select * from altwithoid;
   col 
  -----
--- 1518,1527 ----
***************
*** 1560,1566 ****
  ERROR:  column "oid" does not exist
  LINE 1: select oid > 0, * from altwithoid;
                 ^
- HINT:  Perhaps you meant to reference the column "altwithoid"."col".
  select oid > 0, * from altinhoid;
   ?column? | col 
  ----------+-----
--- 1557,1562 ----
***************
*** 1587,1593 ****
  ERROR:  column "oid" does not exist
  LINE 1: select oid > 0, * from altwithoid;
                 ^
- HINT:  Perhaps you meant to reference the column "altwithoid"."col".
  select oid > 0, * from altinhoid;
   ?column? | col 
  ----------+-----
--- 1583,1588 ----

======================================================================

*** /home/pg/postgresql/src/test/regress/expected/rowtypes.out	2014-11-20 10:17:55.046291912 -0800
--- /home/pg/postgresql/src/test/regress/results/rowtypes.out	2014-11-20 10:20:57.306285315 -0800
***************
*** 452,458 ****
  ERROR:  column fullname.text does not exist
  LINE 1: select fullname.text from fullname;
                 ^
- HINT:  Perhaps you meant to reference the column "fullname"."last".
  -- same, but RECORD instead of named composite type:
  select cast (row('Jim', 'Beam') as text);
      row     
--- 452,457 ----

======================================================================

#116Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#115)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 20, 2014 at 1:30 PM, Peter Geoghegan <pg@heroku.com> wrote:

I mean the suggestion of raising the cost threshold more gradually,
not as a step function of the number of characters in the string [1]
where it's either over 6 characters and must pass the 50% test, or
isn't and has no absolute quality test. The exact modification I
described will FWIW remove the "quantity" for "qty" suggestion, as
well as all the similar suggestions that you found objectionable (like
"tit" also offering a suggestion of "quantity").

If you look at the regression tests, none of the sensible suggestions
are lost (some would be by an across the board 50% absolute quality
threshold, as I previously pointed out [2]), but all the bad ones are.
I attach failed regression test output showing the difference between
the previous expected values, and actual values with that small
modification - it looks like most or all bad cases are now fixed.

That does seem to give better results, but it still seems awfully
complicated. If we just used Levenshtein with all-default cost
factors and a distance cap equal to Max(strlen(what_user_typed),
strlen(candidate_match), 3), what cases that you think are important
would be harmed?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#117Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#116)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 20, 2014 at 11:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:

That does seem to give better results, but it still seems awfully
complicated. If we just used Levenshtein with all-default cost
factors and a distance cap equal to Max(strlen(what_user_typed),
strlen(candidate_match), 3), what cases that you think are important
would be harmed?

Well, just by plugging in default Levenshtein cost factors, I can see
the following regression:

*** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-20 10:17:55.042291912 -0800
--- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-20 11:42:15.670108745 -0800
***************
*** 3452,3458 ****
  ERROR:  column atts.relid does not exist
  LINE 1: select atts.relid::regclass, s.* from pg_stats s join
                 ^
- HINT:  Perhaps you meant to reference the column "atts"."indexrelid".

Within the catalogs, the names of attributes are prefixed as a form of
what you might call internal namespacing. For example, pg_index has
attributes that all begin with "ind*". You could easily omit something
like that, while still more or less knowing what you're looking for.

In more concrete terms, this gets no suggestion:

postgres=# select key from pg_index;
ERROR: 42703: column "key" does not exist
LINE 1: select key from pg_index;
^

Only this does:

postgres=# select ikey from pg_index;
ERROR: 42703: column "ikey" does not exist
LINE 1: select ikey from pg_index;
^
HINT: Perhaps you meant to reference the column "pg_index"."indkey".
postgres=#

The git people varied their Levenshtein costings for a reason.

I also think that a one size fits all cap will break things. It will
independently break the example above, as well as the more marginal
"f1" vs "c1"."f2" case (okay, maybe that case was exactly on the
threshold, but others won't be).

I don't see that different costings actually saves any complexity.
Similarly, the final cap is quite straightforward. Anything with any
real complexity happens before that.

--
Peter Geoghegan

#118Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#117)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 20, 2014 at 3:00 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Thu, Nov 20, 2014 at 11:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:

That does seem to give better results, but it still seems awfully
complicated. If we just used Levenshtein with all-default cost
factors and a distance cap equal to Max(strlen(what_user_typed),
strlen(candidate_match), 3), what cases that you think are important
would be harmed?

Well, just by plugging in default Levenshtein cost factors, I can see
the following regression:

*** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-20 10:17:55.042291912 -0800
--- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-20 11:42:15.670108745 -0800
***************
*** 3452,3458 ****
ERROR:  column atts.relid does not exist
LINE 1: select atts.relid::regclass, s.* from pg_stats s join
^
- HINT:  Perhaps you meant to reference the column "atts"."indexrelid".

Within the catalogs, the names of attributes are prefixed as a form of
what you might call internal namespacing. For example, pg_index has
attributes that all begin with "ind*". You could easily omit something
like that, while still more or less knowing what you're looking for.

In more concrete terms, this gets no suggestion:

postgres=# select key from pg_index;
ERROR: 42703: column "key" does not exist
LINE 1: select key from pg_index;
^

Only this does:

postgres=# select ikey from pg_index;
ERROR: 42703: column "ikey" does not exist
LINE 1: select ikey from pg_index;
^
HINT: Perhaps you meant to reference the column "pg_index"."indkey".
postgres=#

Seems fine to me. If you typed relid rather than indexrelid or key
rather than indkey, that's a thinko, not a typo. ikey for indkey
could plausibly be a typo, though you'd have to be having a fairly bad
day at the keyboard.
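
To make the distances under discussion concrete, here is a minimal sketch of Levenshtein distance with unit (default) costs. It is a standalone illustration, not the fuzzystrmatch implementation; the function name unit_levenshtein and the MAX_NAME_LEN bound are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAME_LEN 63    /* assumed bound on identifier length */

/*
 * Levenshtein distance with a cost of 1 for each insertion, deletion and
 * substitution.  Uses two rows of the dynamic programming table.
 */
int
unit_levenshtein(const char *source, const char *target)
{
    size_t      slen = strlen(source);
    size_t      tlen = strlen(target);
    int         prev[MAX_NAME_LEN + 1];
    int         curr[MAX_NAME_LEN + 1];
    size_t      i;
    size_t      j;

    assert(slen <= MAX_NAME_LEN && tlen <= MAX_NAME_LEN);

    /* Turning an empty source prefix into target[0..j) takes j insertions */
    for (j = 0; j <= tlen; j++)
        prev[j] = (int) j;

    for (i = 1; i <= slen; i++)
    {
        /* Turning source[0..i) into an empty target takes i deletions */
        curr[0] = (int) i;

        for (j = 1; j <= tlen; j++)
        {
            int     del = prev[j] + 1;
            int     ins = curr[j - 1] + 1;
            int     sub = prev[j - 1] + (source[i - 1] != target[j - 1]);
            int     best = del < ins ? del : ins;

            curr[j] = sub < best ? sub : best;
        }
        memcpy(prev, curr, (tlen + 1) * sizeof(int));
    }
    return prev[tlen];
}
```

With unit costs, "key" is 3 edits from "indkey", "ikey" is 2, and "relid" is 5 from "indexrelid". A cap near half the length of the typed name would therefore keep the "ikey" hint and drop the other two.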

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#119Andres Freund
andres@2ndquadrant.com
In reply to: Peter Geoghegan (#117)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On 2014-11-20 12:00:51 -0800, Peter Geoghegan wrote:

In more concrete terms, this gets no suggestion:

postgres=# select key from pg_index;
ERROR: 42703: column "key" does not exist
LINE 1: select key from pg_index;
^

I don't think that's a bad thing. Yes, for a human those look pretty
similar, but it's easy to construct cases where that gives completely
hilarious results.

I think something simplistic like levenshtein, even with modified
distances, is good to catch typos. But not to find terms that are
related in more complex ways.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#120Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#118)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 20, 2014 at 12:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Seems fine to me. If you typed relid rather than indexrelid or key
rather than indkey, that's a thinko, not a typo. ikey for indkey
could plausibly be a typo, though you'd have to be having a fairly bad
day at the keyboard.

I can tell that I have no chance of convincing you otherwise. While I
think you're mistaken to go against the precedent set by git, you're
the one with the commit bit, and I think we've already spent enough
time discussing this. So default costings it is.
--
Peter Geoghegan

#121Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#117)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Peter Geoghegan <pg@heroku.com> writes:

On Thu, Nov 20, 2014 at 11:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:

That does seem to give better results, but it still seems awfully
complicated. If we just used Levenshtein with all-default cost
factors and a distance cap equal to Max(strlen(what_user_typed),
strlen(candidate_match), 3), what cases that you think are important
would be harmed?

Well, just by plugging in default Levenshtein cost factors, I can see
the following regression:

*** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-20
10:17:55.042291912 -0800
--- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-20
11:42:15.670108745 -0800
***************
*** 3452,3458 ****
ERROR:  column atts.relid does not exist
LINE 1: select atts.relid::regclass, s.* from pg_stats s join
^
- HINT:  Perhaps you meant to reference the column "atts"."indexrelid".

I do not have a problem with deciding that that is not a "regression";
in fact, not giving that hint seems like a good conservative behavior
here. By your logic, we should also be prepared to suggest
"supercalifragilisticexpialidocious" when the user enters "ocious".
It's simply a bridge too far for what is supposed to be a hint for
minor typos. You sound like you want to turn it into something that
will look up column names for people who are too lazy to even try to
type the right thing. While I can see the value of such a tool within
certain contexts, firing completed queries at a live SQL engine
is not one of them.

regards, tom lane

#122Peter Geoghegan
pg@heroku.com
In reply to: Tom Lane (#121)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 20, 2014 at 12:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I do not have a problem with deciding that that is not a "regression";
in fact, not giving that hint seems like a good conservative behavior
here. By your logic, we should also be prepared to suggest
"supercalifragilisticexpialidocious" when the user enters "ocious".

That's clearly not true. I just want to be a bit more forgiving of
omissions. That clearly isn't my logic, since that isn't a suggestion
that the implementation will give, or would be anywhere close to
giving - my weighing of deletions is only twice that of substitutions
or insertions, not ten times. git does not use Levenshtein default
costings either.

It's simply a bridge too far for what is supposed to be a hint for
minor typos.

Minor typos and minor omissions. My example was on the edge of what
would be tolerable under my proposed cost model.

You sound like you want to turn it into something that
will look up column names for people who are too lazy to even try to
type the right thing. While I can see the value of such a tool within
certain contexts, firing completed queries at a live SQL engine
is not one of them.

It's just a hint; a convenience. Users who imagine that it takes away
the need for putting any thought into their SQL queries have bigger
problems.

Anyway, that's all that needs to be said on that, since I've already
given up on a non-default costing. Also, we have default costing, and
we always apply a 50% standard (I see no point in doing otherwise with
default costings).

Robert: Where does that leave us? What about suggestions across RTEs?
Alias costing, etc?
--
Peter Geoghegan

#123Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#120)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 20, 2014 at 3:20 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Thu, Nov 20, 2014 at 12:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Seems fine to me. If you typed relid rather than indexrelid or key
rather than indkey, that's a thinko, not a typo. ikey for indkey
could plausibly be a typo, though you'd have to be having a fairly bad
day at the keyboard.

I can tell that I have no chance of convincing you otherwise. While I
think you're mistaken to go against the precedent set by git, you're
the one with the commit bit, and I think we've already spent enough
time discussing this. So default costings it is.

I've got a few +1s, too, if you notice.

I'm willing to be outvoted, but not by a majority of one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#124Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#123)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Thu, Nov 20, 2014 at 12:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I've got a few +1s, too, if you notice.

Then maybe I spoke too soon.

--
Peter Geoghegan

#125David G Johnston
david.g.johnston@gmail.com
In reply to: Andres Freund (#119)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Andres Freund wrote:

I think something simplistic like levenshtein, even with modified
distances, is good to catch typos. But not to find terms that are
related in more complex ways.

Tom Lane wrote:

The big picture is that this is more or less our first venture into
heuristic suggestions. I think we should start slow with a very
conservative set of heuristics. If it's a success maybe we can get more
aggressive over time --- but if we go over the top here, the entire
concept will be discredited in this community for the next ten years.

+1 for both of these conclusions.

The observations regarding standard column prefixes and thinking that
abbreviations are in use when in fact the names are spelled out are indeed
in-the-wild behaviors that should be considered but a levenshtein distance
algorithm is likely not going to be useful in pointing out mistakes in those
situations. Limiting the immediate focus to "fat/thin-fingering of keys" -
for which levenshtein is well suited - is useful and will provide data
points that can then guide future artificial intelligence endeavors.

David J.

#126Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#124)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Alright, so let me summarize what I think are the next steps in
working towards getting this patch committed. I should produce a new
revision which:

* Uses default costings.

* Applies a generic final quality check that enforces a distance of no
greater than 50% of the total string size. (The use of default
costings removes any reason to continue to do this)

* Work through Robert's suggestions on other aspects that need work
[1], most of which I already agreed to.

What is unclear is whether or not I should continue to charge extra
for non-matching user supplied alias (and, I think more broadly,
consider multiple RTEs iff the user did use an alias) - Robert was
skeptical, but didn't seem to have made his mind up. I still think I
should cost things based on aliases, and consider multiple RTEs even
when the user supplied an alias (the penalty should just be a distance
of 1 and not 3, though, in light of other changes to the
weighing/costing). If I don't hear anything in the next day or two,
I'll more or less preserve aliases-related aspects of the patch.

Did I miss something else?

[1]: /messages/by-id/CA+TgmoZLwzgyv=JAYfi6XfAK8OcBuTPYYhP5TbOqsS=YWVvzUw@mail.gmail.com
--
Peter Geoghegan

#127Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#126)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Nov 25, 2014 at 4:13 PM, Peter Geoghegan <pg@heroku.com> wrote:

If I don't hear anything in the next day or two,
I'll more or less preserve aliases-related aspects of the patch.

FYI, I didn't go ahead and work on this, because I thought that the
Thanksgiving holiday in the US probably kept you from giving feedback.

--
Peter Geoghegan

#128Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#126)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Nov 25, 2014 at 7:13 PM, Peter Geoghegan <pg@heroku.com> wrote:

Alright, so let me summarize what I think are the next steps in
working towards getting this patch committed. I should produce a new
revision which:

* Uses default costings.

* Applies a generic final quality check that enforces a distance of no
greater than 50% of the total string size. (The use of default
costings removes any reason to continue to do this)

* Work through Robert's suggestions on other aspects that need work
[1], most of which I already agreed to.

Sounds good so far.

What is unclear is whether or not I should continue to charge extra
for non-matching user supplied alias (and, I think more broadly,
consider multiple RTEs iff the user did use an alias) - Robert was
skeptical, but didn't seem to have made his mind up. I still think I
should cost things based on aliases, and consider multiple RTEs even
when the user supplied an alias (the penalty should just be a distance
of 1 and not 3, though, in light of other changes to the
weighing/costing). If I don't hear anything in the next day or two,
I'll more or less preserve aliases-related aspects of the patch.

Basically, the case in which I think it's helpful to issue a
suggestion here is when the user has used the table name rather than
the alias name. I wonder if it's worth checking for that case
specifically, in lieu of what you've done here, and issuing a totally
different hint in that case ("HINT: You must refer to this column
as "prime_minister.id" rather than "cameron.id").

Another idea, which I think I like less well, is to check the
Levenshtein distance between the allowed alias and the entered alias
and, if that's within the half-the-shorter-length threshold, consider
possible matches from that RTE, charge the distance between the
correct alias and the entered alias as a penalty to each potential
column match.

What I think won't do is to look at a situation where the user has
entered automobile.id and suggest that maybe they meant student.iq, or
even student.id. The amount of difference between the names has got to
matter for the RTE names, just as it does for the column names.
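
The second idea above (score the alias itself, and carry its distance into each column match) can be sketched as follows. This is only an illustration under stated assumptions: the names alias_edit_distance and alias_penalty are hypothetical, "half-the-shorter-length" is taken to mean integer division of the shorter alias length by two, and -1 is used as an arbitrary marker for "do not consider this RTE".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ALIAS_LEN 63    /* assumed bound on identifier length */

/* Unit-cost Levenshtein distance, single-row dynamic programming */
int
alias_edit_distance(const char *a, const char *b)
{
    size_t      alen = strlen(a);
    size_t      blen = strlen(b);
    int         row[MAX_ALIAS_LEN + 1];
    size_t      i;
    size_t      j;

    assert(blen <= MAX_ALIAS_LEN);

    for (j = 0; j <= blen; j++)
        row[j] = (int) j;

    for (i = 1; i <= alen; i++)
    {
        int     diag = row[0];      /* value for (i - 1, j - 1) */

        row[0] = (int) i;
        for (j = 1; j <= blen; j++)
        {
            int     up = row[j];    /* value for (i - 1, j) */
            int     best = diag + (a[i - 1] != b[j - 1]);

            if (up + 1 < best)
                best = up + 1;
            if (row[j - 1] + 1 < best)
                best = row[j - 1] + 1;
            row[j] = best;
            diag = up;
        }
    }
    return row[blen];
}

/*
 * Penalty to add to every column match taken from an RTE, or -1 when the
 * RTE's alias is too far from what the user entered for the RTE to be
 * considered at all.
 */
int
alias_penalty(const char *entered_alias, const char *rte_alias)
{
    size_t      elen = strlen(entered_alias);
    size_t      rlen = strlen(rte_alias);
    size_t      shorter = elen < rlen ? elen : rlen;
    int         distance = alias_edit_distance(rte_alias, entered_alias);

    if ((size_t) distance > shorter / 2)
        return -1;
    return distance;
}
```

Under these assumptions, an entered qualifier of "automobile" rules out an RTE named "student" entirely, while "order" against "orders" is accepted with a penalty of 1.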

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#129Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#128)
1 attachment(s)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Dec 2, 2014 at 1:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Basically, the case in which I think it's helpful to issue a
suggestion here is when the user has used the table name rather than
the alias name. I wonder if it's worth checking for that case
specifically, in lieu of what you've done here, and issuing a totally
different hint in that case ("HINT: You must refer to this column
as "prime_minister.id" rather than "cameron.id").

Well, if an alias is used, and you refer to an attribute using a
non-alias name (i.e. the original table name), then you'll already get
an error suggesting that the alias be used instead -- of course,
that's nothing new. It doesn't matter to the existing hinting
mechanism if the attribute name is otherwise wrong. Once you fix the
code to use the alias suggested, you'll then get this new
Levenshtein-based hint.

Another idea, which I think I like less well, is to check the
Levenshtein distance between the allowed alias and the entered alias
and, if that's within the half-the-shorter-length threshold, consider
possible matches from that RTE, charge the distance between the
correct alias and the entered alias as a penalty to each potential
column match.

I don't know about that either. Aliases are often totally arbitrary,
particularly for ad-hoc queries, which is what this is aimed at.

What I think won't do is to look at a situation where the user has
entered automobile.id and suggest that maybe they meant student.iq, or
even student.id.

I'm not sure I follow. If there is an automobile.ip, then it will be
suggested. If there is no automobile column that's much of a match (so
no "automobile.ip", say), then student.id will be suggested (and not
student.iq, *even if there is no student.id* - the final quality check
saves us). So this is possible:

postgres=# select iq, * from student, automobile;
ERROR: 42703: column "iq" does not exist
LINE 1: select iq, * from student, automobile;
^
HINT: Perhaps you meant to reference the column "student"."id".
postgres=# select automobile.iq, * from student, automobile;
ERROR: 42703: column automobile.iq does not exist
LINE 1: select automobile.iq, * from student, automobile;
^

(note that using the table name makes us *not* see a suggestion where
we otherwise would).

The point is that there is a fixed penalty for a wrong user-specified
alias, but all relation RTEs are considered.
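
The behaviour in the session above can be reproduced with a small sketch: every candidate column is scored with unit-cost Levenshtein distance, a fixed penalty of 1 is added when the user's qualifier names a different RTE, and the best score must pass a final check. Several things here are assumptions rather than the patch's code: the Candidate type and function names are hypothetical, the 50% check is measured against the length of the typed column name, and ties and exact matches are not handled.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAME_LEN 63    /* assumed bound on identifier length */

/* A column that could be suggested, with the alias of its RTE */
typedef struct Candidate
{
    const char *alias;
    const char *column;
} Candidate;

/* Unit-cost Levenshtein distance, single-row dynamic programming */
int
edit_distance(const char *a, const char *b)
{
    size_t      alen = strlen(a);
    size_t      blen = strlen(b);
    int         row[MAX_NAME_LEN + 1];
    size_t      i;
    size_t      j;

    assert(blen <= MAX_NAME_LEN);

    for (j = 0; j <= blen; j++)
        row[j] = (int) j;

    for (i = 1; i <= alen; i++)
    {
        int     diag = row[0];      /* value for (i - 1, j - 1) */

        row[0] = (int) i;
        for (j = 1; j <= blen; j++)
        {
            int     up = row[j];    /* value for (i - 1, j) */
            int     best = diag + (a[i - 1] != b[j - 1]);

            if (up + 1 < best)
                best = up + 1;
            if (row[j - 1] + 1 < best)
                best = row[j - 1] + 1;
            row[j] = best;
            diag = up;
        }
    }
    return row[blen];
}

/*
 * Return the index of the candidate to suggest, or -1 for no hint.
 * typed_alias is NULL for an unqualified column reference.
 */
int
suggest_column(const char *typed_alias, const char *typed_col,
               const Candidate *cands, int ncands)
{
    int         best = -1;
    int         best_distance = INT_MAX;
    int         i;

    for (i = 0; i < ncands; i++)
    {
        int     distance = edit_distance(cands[i].column, typed_col);

        /* Fixed penalty of 1 when the qualifier names another RTE */
        if (typed_alias != NULL && strcmp(typed_alias, cands[i].alias) != 0)
            distance++;

        if (distance < best_distance)
        {
            best_distance = distance;
            best = i;
        }
    }

    /* Final quality check: at most 50% of the typed name's length (assumed) */
    if (best >= 0 && (size_t) best_distance * 2 > strlen(typed_col))
        return -1;
    return best;
}
```

Given hypothetical columns student(id, name) and automobile(vin, model), an unqualified "iq" yields student.id at distance 1, which passes the check for a two-character name. Qualifying it as automobile.iq raises the best available score to 2, which fails the check, so no hint is given.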

The amount of difference between the names has got to
matter for the RTE names, just as it does for the column names.

I think it makes sense that it matters by a fixed amount. Besides,
this seems complicated enough already - I don't want to add more
complexity to worry about equidistant (but still actually valid)
RTE/table/alias names.

It sounds like your concern here is mostly a concern about the
relative distance among multiple matches, as opposed to the absolute
quality of suggestions. The former seems a lot less controversial than
the latter was, though - the user always gets the best match, or the
join pair of best matches, or no match when this new hinting mechanism
is involved.
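
The "best match, pair of best matches, or no match" rule can be sketched independently of how distances are computed. The following is a simplified standalone model of that bookkeeping, with assumed names (BestPair, best_pair_init, best_pair_consider) and NO_MATCH as a hypothetical sentinel; it tracks plain integer candidate indexes rather than attribute numbers and RTEs.

```c
#include <assert.h>
#include <limits.h>

#define NO_MATCH (-1)    /* hypothetical sentinel for "no candidate" */

typedef struct BestPair
{
    int     distance;   /* lowest distance accepted so far */
    int     first;      /* index of best candidate, or NO_MATCH */
    int     second;     /* index of a tied candidate, or NO_MATCH */
} BestPair;

void
best_pair_init(BestPair *state)
{
    state->distance = INT_MAX;
    state->first = NO_MATCH;
    state->second = NO_MATCH;
}

/* Offer one candidate, with its already-computed distance, to the state */
void
best_pair_consider(BestPair *state, int candidate, int distance)
{
    assert(candidate >= 0 && distance >= 0);

    if (distance < state->distance)
    {
        /* Strictly better: becomes the sole best match */
        state->distance = distance;
        state->first = candidate;
        state->second = NO_MATCH;
    }
    else if (distance == state->distance)
    {
        if (state->second != NO_MATCH)
        {
            /*
             * Third candidate at the same distance: drop all of them and
             * require later candidates to be strictly closer.
             */
            state->first = NO_MATCH;
            state->second = NO_MATCH;
            state->distance = distance - 1;
        }
        else if (state->first != NO_MATCH)
            state->second = candidate;  /* tied pair */
        else
            state->first = candidate;   /* bar was lowered by an earlier tie */
    }
}
```

A caller would run best_pair_consider() over every candidate and then report zero, one or two suggestions depending on which of first and second are set.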

I attach a new revision. The revision:

* Uses default costs for Levenshtein distance.

* Still charges extra for a non-alias-matching match (although it only
charges a fixed distance of 1 extra). This has regression test
coverage.

* Applies a generic final quality check that enforces a requirement
that a hint have a distance of no greater than 50% of the total string
size. No special treatment of shorter strings is involved anymore.

* Moves almost everything out of scanRTEForColumn() as you outlined
(into a new function, updateFuzzyAttrMatchState(), per your
suggestion).

* Moves dropped column detection into updateFuzzyAttrMatchState(), per
your suggestion.

* Still does the "if (rte->rtekind == RTE_JOIN)" thing in the existing
function searchRangeTableForCol().

I am quite confident that a suggestion from a join RTE will never be
useful, to either the existing use of searchRangeTableForCol() or this
expanded use, and it makes more sense to me to put it there. In fact,
the existing use of searchRangeTableForCol() is really rather similar
to this, and will give up on the first identical match (which is taken
as evidence that there is an attribute of that name, but isn't visible
at this level of the query). So I have not followed your suggestion
here.

Thoughts?
--
Peter Geoghegan

Attachments:

0001-Levenshtein-distance-column-HINT.patch (text/x-patch; charset=US-ASCII)
From 81c7b0691e9d03c1bdd99f4b264737306d1bd2cf Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@heroku.com>
Date: Wed, 12 Nov 2014 15:31:37 -0800
Subject: [PATCH] Levenshtein distance column HINT

Add a new HINT -- a guess as to what column the user might have intended
to reference, to be shown in various contexts where an
ERRCODE_UNDEFINED_COLUMN error is raised.  The user will see this HINT
when he or she fat-fingers a column reference in an ad-hoc SQL query, or
incorrectly pluralizes or fails to pluralize a column reference, or
incorrectly omits or includes an underscore or other punctuation
character.

The HINT suggests a column in the range table with the lowest
Levenshtein distance, or the tied-for-best pair of matching columns in
the event of there being exactly two equally likely candidates (these
may come from multiple RTEs, or the same RTE).  Limiting to two the
number of cases where multiple equally likely suggestions are all
offered at once (i.e.  giving no hint when the number of equally likely
candidates exceeds two) is a measure against suggestions that are of low
quality in an absolute sense.

A further, final measure is taken against suggestions that are of low
absolute quality:  If the distance exceeds a normalized distance
threshold, no suggestion is given.
---
 src/backend/parser/parse_expr.c           |   9 +-
 src/backend/parser/parse_func.c           |   2 +-
 src/backend/parser/parse_relation.c       | 321 +++++++++++++++++++++++++++---
 src/backend/utils/adt/levenshtein.c       |   9 +
 src/include/parser/parse_relation.h       |  20 +-
 src/test/regress/expected/alter_table.out |   3 +
 src/test/regress/expected/join.out        |  38 ++++
 src/test/regress/sql/join.sql             |  24 +++
 8 files changed, 388 insertions(+), 38 deletions(-)

diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 4a8aaf6..a77a3a0 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -621,7 +621,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field2);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
@@ -666,7 +667,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field3);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
@@ -724,7 +726,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field4);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 9ebd3fd..472e15e 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -1779,7 +1779,7 @@ ParseComplexProjection(ParseState *pstate, char *funcname, Node *first_arg,
 									 ((Var *) first_arg)->varno,
 									 ((Var *) first_arg)->varlevelsup);
 		/* Return a Var if funcname matches a column, else NULL */
-		return scanRTEForColumn(pstate, rte, funcname, location);
+		return scanRTEForColumn(pstate, rte, funcname, location, NULL);
 	}
 
 	/*
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 478584d..f32bf40 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include <ctype.h>
+#include <limits.h>
 
 #include "access/htup_details.h"
 #include "access/sysattr.h"
@@ -520,6 +521,69 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
 }
 
 /*
+ * updateFuzzyAttrMatchState
+ *	  Using Levenshtein distance, consider if column is best fuzzy match.
+ */
+static void
+updateFuzzyAttrMatchState(FuzzyAttrMatchState *fuzzystate, const char *actual,
+						  const char *match, int attnum)
+{
+	int		columndistance;
+
+	/*
+	 * Outright reject dropped columns, which can appear here with apparent
+	 * empty actual names, per remarks within scanRTEForColumn().
+	 */
+	if (strcmp(actual, "") == 0)
+		columndistance = INT_MAX;
+	else
+		/* Use standard costs for Levenshtein distance */
+		columndistance = varstr_levenshtein_less_equal(actual, strlen(actual),
+													   match, strlen(match),
+													   1, 1, 1,
+													   fuzzystate->distance);
+
+	if (columndistance < fuzzystate->distance)
+	{
+		/* Store new lowest observed distance for RTE */
+		fuzzystate->distance = columndistance;
+		fuzzystate->first = attnum;
+		fuzzystate->second = InvalidAttrNumber;
+	}
+	else if (columndistance == fuzzystate->distance)
+	{
+		/*
+		 * This match distance may equal a prior match within this same
+		 * range table.  When that happens, the prior match may also be
+		 * given, but only if there is no more than two equally distant
+		 * matches from the RTE (in turn, our caller will only accept
+		 * two equally distant matches overall).
+		 */
+		if (AttributeNumberIsValid(fuzzystate->second))
+		{
+			/* Too many RTE-level matches */
+			fuzzystate->first = fuzzystate->second = InvalidAttrNumber;
+			/* Clearly, distance is too low a bar (for *any* RTE) */
+			fuzzystate->distance = columndistance - 1;
+		}
+		else if (AttributeNumberIsValid(fuzzystate->first))
+		{
+			/* Record as provisional second match for RTE */
+			fuzzystate->second = attnum;
+		}
+		else
+		{
+			/*
+			 * Record as provisional first match (this can occasionally
+			 * occur because previous lowest distance was "too low a
+			 * bar", rather than being associated with a real match)
+			 */
+			fuzzystate->first = attnum;
+		}
+	}
+}
+
+/*
  * scanRTEForColumn
  *	  Search the column names of a single RTE for the given name.
  *	  If found, return an appropriate Var node, else return NULL.
@@ -527,10 +591,22 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
  *
  * Side effect: if we find a match, mark the RTE as requiring read access
  * for the column.
+ *
+ * For those callers that will settle for a fuzzy match (for the purposes of
+ * building diagnostic messages), we match the column attribute whose name has
+ * the lowest Levenshtein distance from colname.  Such callers should not rely
+ * on the return value (even when there is an exact match), nor should they
+ * expect the usual side effect (unless there is an exact match).  This hardly
+ * matters in practice, since an error is imminent.
+ *
+ * If there are two or more attributes in the range table entry tied for
+ * closest, or if there are no matches, accurately report the shortest distance
+ * found overall while not setting a closest attribute.  Note that we never
+ * consider system column names when performing fuzzy matching.
  */
 Node *
 scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
-				 int location)
+				 int location, FuzzyAttrMatchState *fuzzystate)
 {
 	Node	   *result = NULL;
 	int			attnum = 0;
@@ -548,12 +624,16 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
 	 * Should this somehow go wrong and we try to access a dropped column,
 	 * we'll still catch it by virtue of the checks in
 	 * get_rte_attribute_type(), which is called by make_var().  That routine
-	 * has to do a cache lookup anyway, so the check there is cheap.
+	 * has to do a cache lookup anyway, so the check there is cheap.  Callers
+	 * interested in finding match with shortest distance need to defend
+	 * against this directly, though.
 	 */
 	foreach(c, rte->eref->colnames)
 	{
+		const char *attcolname = strVal(lfirst(c));
+
 		attnum++;
-		if (strcmp(strVal(lfirst(c)), colname) == 0)
+		if (strcmp(attcolname, colname) == 0)
 		{
 			if (result)
 				ereport(ERROR,
@@ -566,6 +646,14 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
 			markVarForSelectPriv(pstate, var, rte);
 			result = (Node *) var;
 		}
+
+		/*
+		 * Consider updating fuzzy state passed by callers concerned with
+		 * diagnostic messages.  Fuzzy state will be set for the best (or joint
+		 * best) matching colname observed so far.
+		 */
+		if (fuzzystate != NULL)
+			updateFuzzyAttrMatchState(fuzzystate, attcolname, colname, attnum);
 	}
 
 	/*
@@ -642,7 +730,8 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
 				continue;
 
 			/* use orig_pstate here to get the right sublevels_up */
-			newresult = scanRTEForColumn(orig_pstate, rte, colname, location);
+			newresult = scanRTEForColumn(orig_pstate, rte, colname, location,
+										 NULL);
 
 			if (newresult)
 			{
@@ -668,8 +757,15 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
 
 /*
  * searchRangeTableForCol
- *	  See if any RangeTblEntry could possibly provide the given column name.
- *	  If so, return a pointer to the RangeTblEntry; else return NULL.
+ *	  See if any RangeTblEntry could possibly provide the given column name (or
+ *	  find the best match available).  Returns state with relevant details.
+ *
+ * Column name may be matched fuzzily;  we provide the closest column(s) if
+ * there was not an exact match.  Caller can depend on returned state to find
+ * right attribute.  If first attribute is InvalidAttrNumber, but corresponding
+ * RTE is set, that indicates an exact match (i.e. column name is present, but
+ * presumably not visible).  However, if the wrong alias was specified by user,
+ * the first match attribute *is* set.
  *
  * This is different from colNameToVar in that it considers every entry in
  * the ParseState's rangetable(s), not only those that are currently visible
@@ -678,10 +774,16 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
  * matches, but only one will be returned).  This must be used ONLY as a
  * heuristic in giving suitable error messages.  See errorMissingColumn.
  */
-static RangeTblEntry *
-searchRangeTableForCol(ParseState *pstate, char *colname, int location)
+static FuzzyAttrMatchState *
+searchRangeTableForCol(ParseState *pstate, const char *alias, char *colname,
+					   int location)
 {
 	ParseState *orig_pstate = pstate;
+	FuzzyAttrMatchState *state = palloc(sizeof(FuzzyAttrMatchState));
+
+	state->distance = INT_MAX;
+	state->rsecond = state->rfirst = NULL;
+	state->second = state->first = InvalidAttrNumber;
 
 	while (pstate != NULL)
 	{
@@ -689,15 +791,132 @@ searchRangeTableForCol(ParseState *pstate, char *colname, int location)
 
 		foreach(l, pstate->p_rtable)
 		{
-			RangeTblEntry *rte = (RangeTblEntry *) lfirst(l);
+			RangeTblEntry	   *rte = (RangeTblEntry *) lfirst(l);
+			FuzzyAttrMatchState	rtestate;
+			bool				wrongalias;
 
-			if (scanRTEForColumn(orig_pstate, rte, colname, location))
-				return rte;
+			/*
+			 * Typically, it is not useful to look for matches within join
+			 * RTEs;  they effectively duplicate other RTEs for our purposes,
+			 * and if a match is chosen from a join RTE, an unhelpful alias is
+			 * displayed in the final diagnostic message.
+			 */
+			if (rte->rtekind == RTE_JOIN)
+				continue;
+
+			/*
+			 * Get single best match (or pair of joint best matches, or no
+			 * match) from each RTE -- the best two columns ultimately
+			 * suggested may or may not both be from the same RTE.
+			 *
+			 * Initialize RTE's distance to INT_MAX (and not RT state's current
+			 * lowest distance) to ensure that per-RTE penalties do not distort
+			 * per-RT costing.
+			 */
+			rtestate.distance = INT_MAX;
+			rtestate.rsecond = rtestate.rfirst = NULL;
+			rtestate.second = rtestate.first = InvalidAttrNumber;
+			scanRTEForColumn(orig_pstate, rte, colname, location, &rtestate);
+
+			/* Avoid totally non-matching RTEs (e.g. no RTE attributes) */
+			if (!AttributeNumberIsValid(rtestate.first))
+				continue;
+
+			/* Was alias provided by user that does not match entry's alias? */
+			wrongalias = (alias && strcmp(alias, rte->eref->aliasname) != 0);
+
+			if (rtestate.distance == 0)
+			{
+				/*
+				 * Exact match (for "wrong alias" or "wrong level" cases).
+				 *
+				 * Only consider first element for RTE, because there can only
+				 * be one exact match -- it doesn't seem worth considering the
+				 * case where there are multiple exact matches, so we're done.
+				 */
+				state->rfirst = rte;
+				state->first = wrongalias? rtestate.first : InvalidAttrNumber;
+				state->rsecond = NULL;
+				state->second = InvalidAttrNumber;
+
+				return state;
+			}
+
+			/*
+			 * Charge extra (for inexact matches only) when an alias was
+			 * specified that differs from what might have been used to
+			 * correctly qualify this RTE's closest column
+			 */
+			if (wrongalias)
+				rtestate.distance++;
+
+			if (rtestate.distance < state->distance)
+			{
+				/*
+				 * New, uncontested best match RTE, with 1 or 2 best match
+				 * columns
+				 */
+				state->distance = rtestate.distance;
+
+				state->rfirst = rte;
+				state->first = rtestate.first;
+				state->rsecond =
+					AttributeNumberIsValid(rtestate.second)? rte: NULL;
+				state->second = rtestate.second;
+			}
+			else if (rtestate.distance == state->distance)
+			{
+				/*
+				 * Can't have 3 or more matches at same distance.
+				 *
+				 * It's useful to provide two matches for the common case where
+				 * two range tables have single equidistant candidates, as when
+				 * an unqualified (and therefore would-be ambiguous) column
+				 * name is specified which is also misspelled by the user --
+				 * there is probably a foreign key relationship between
+				 * tables/RTEs.  It's also possible to usefully give two column
+				 * suggestions originating from the same RTE, which may be
+				 * useful when an alias strongly suggests that RTE, while there
+				 * are 2 somewhat close matches.
+				 *
+				 * However, when there are more than 2 equally distant matches,
+				 * that's probably because the matches are not useful at all,
+				 * so don't suggest anything.
+				 */
+				if (AttributeNumberIsValid(state->second) ||
+					AttributeNumberIsValid(rtestate.second))
+				{
+					/* 3 or more equidistant matches -- RTE is uninteresting */
+					state->rsecond = state->rfirst = NULL;
+					state->second = state->first = InvalidAttrNumber;
+				}
+				else
+				{
+					/* Record as provisional second match for RT */
+					Assert(state->rfirst != NULL &&
+						   AttributeNumberIsValid(state->first));
+					Assert(state->rsecond == NULL &&
+						   !AttributeNumberIsValid(state->second));
+					state->rsecond = rte;
+					state->second = rtestate.first;
+				}
+			}
 		}
 
 		pstate = pstate->parentParseState;
 	}
-	return NULL;
+
+	/*
+	 * Final, absolute quality test:  distance must be less than a normalized
+	 * threshold in order to avoid completely ludicrous suggestions
+	 */
+	if (state->distance > strlen(colname) / 2)
+	{
+		state->rsecond = state->rfirst = NULL;
+		state->second = state->first = InvalidAttrNumber;
+	}
+
+	return state;
 }
 
 /*
@@ -2862,34 +3081,70 @@ void
 errorMissingColumn(ParseState *pstate,
 				   char *relname, char *colname, int location)
 {
-	RangeTblEntry *rte;
+	FuzzyAttrMatchState	   *state;
+	char				   *closestfirst = NULL;
 
 	/*
-	 * If relname was given, just play dumb and report it.  (In practice, a
-	 * bad qualification name should end up at errorMissingRTE, not here, so
-	 * no need to work hard on this case.)
+	 * Search the entire rtable looking for possible matches.  If we find one,
+	 * emit a hint about it.
+	 *
+	 * TODO: improve this code (and also errorMissingRTE) to mention using
+	 * LATERAL if appropriate.
 	 */
-	if (relname)
-		ereport(ERROR,
-				(errcode(ERRCODE_UNDEFINED_COLUMN),
-				 errmsg("column %s.%s does not exist", relname, colname),
-				 parser_errposition(pstate, location)));
+	state = searchRangeTableForCol(pstate, relname, colname, location);
 
 	/*
-	 * Otherwise, search the entire rtable looking for possible matches.  If
-	 * we find one, emit a hint about it.
+	 * In practice a bad qualification name should end up at errorMissingRTE,
+	 * not here, so no need to work hard on this case.
 	 *
-	 * TODO: improve this code (and also errorMissingRTE) to mention using
-	 * LATERAL if appropriate.
+	 * Extract closest col string for best match, if any.
+	 *
+	 * Infer an exact match referenced despite not being visible from the fact
+	 * that an attribute number was not present in state passed back -- this is
+	 * what is reported when !closestfirst.  There might also be an exact match
+	 * that was qualified with an incorrect alias, in which case closestfirst
+	 * will be set (so hint is the same as generic fuzzy case).
 	 */
-	rte = searchRangeTableForCol(pstate, colname, location);
-
-	ereport(ERROR,
-			(errcode(ERRCODE_UNDEFINED_COLUMN),
-			 errmsg("column \"%s\" does not exist", colname),
-			 rte ? errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
-						   colname, rte->eref->aliasname) : 0,
-			 parser_errposition(pstate, location)));
+	if (state->rfirst && AttributeNumberIsValid(state->first))
+		closestfirst = strVal(list_nth(state->rfirst->eref->colnames,
+									   state->first - 1));
+
+	if (!state->rsecond)
+	{
+		/*
+		 * Handle case where there is zero or one column suggestions to hint,
+		 * including exact matches referenced but not visible.
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_COLUMN),
+				 relname?
+				 errmsg("column %s.%s does not exist", relname, colname):
+				 errmsg("column \"%s\" does not exist", colname),
+				 state->rfirst? closestfirst?
+				 errhint("Perhaps you meant to reference the column \"%s\".\"%s\".",
+						 state->rfirst->eref->aliasname, closestfirst):
+				 errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
+						 colname, state->rfirst->eref->aliasname): 0,
+				 parser_errposition(pstate, location)));
+	}
+	else
+	{
+		/* Handle case where there are two equally useful column hints */
+		char				   *closestsecond;
+
+		closestsecond = strVal(list_nth(state->rsecond->eref->colnames,
+										state->second - 1));
+
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_COLUMN),
+				 relname?
+				 errmsg("column %s.%s does not exist", relname, colname):
+				 errmsg("column \"%s\" does not exist", colname),
+				 errhint("Perhaps you meant to reference the column \"%s\".\"%s\" or the column \"%s\".\"%s\".",
+						 state->rfirst->eref->aliasname, closestfirst,
+						 state->rsecond->eref->aliasname, closestsecond),
+				 parser_errposition(pstate, location)));
+	}
 }
 
 
diff --git a/src/backend/utils/adt/levenshtein.c b/src/backend/utils/adt/levenshtein.c
index a8670e9..8d565c6 100644
--- a/src/backend/utils/adt/levenshtein.c
+++ b/src/backend/utils/adt/levenshtein.c
@@ -95,6 +95,15 @@ varstr_levenshtein(const char *source, int slen, const char *target, int tlen,
 #define STOP_COLUMN m
 #endif
 
+	/*
+	 * A common use for Levenshtein distance is to match attributes when building
+	 * diagnostic, user-visible messages.  Restrict the size of
+	 * MAX_LEVENSHTEIN_STRLEN at compile time so that this is guaranteed to
+	 * work.
+	 */
+	StaticAssertStmt(NAMEDATALEN <= MAX_LEVENSHTEIN_STRLEN,
+					 "Levenshtein hinting mechanism restricts NAMEDATALEN");
+
 	m = pg_mbstrlen_with_len(source, slen);
 	n = pg_mbstrlen_with_len(target, tlen);
 
diff --git a/src/include/parser/parse_relation.h b/src/include/parser/parse_relation.h
index d8b9493..b587abc 100644
--- a/src/include/parser/parse_relation.h
+++ b/src/include/parser/parse_relation.h
@@ -16,6 +16,24 @@
 
 #include "parser/parse_node.h"
 
+
+/*
+ * Support for fuzzily matching columns.
+ *
+ * This is for building diagnostic messages, where non-exact matching
+ * attributes are suggested to the user.  The struct's fields may be facets of
+ * a particular RTE, or of an entire range table, depending on context.
+ */
+typedef struct
+{
+	int				distance;	/* Weighted distance (lowest so far) */
+	RangeTblEntry  *rfirst;		/* RTE of first */
+	AttrNumber		first;		/* Closest attribute so far */
+	RangeTblEntry  *rsecond;	/* RTE of second */
+	AttrNumber		second;		/* Second closest attribute so far */
+} FuzzyAttrMatchState;
+
+
 extern RangeTblEntry *refnameRangeTblEntry(ParseState *pstate,
 					 const char *schemaname,
 					 const char *refname,
@@ -35,7 +53,7 @@ extern RangeTblEntry *GetRTEByRangeTablePosn(ParseState *pstate,
 extern CommonTableExpr *GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte,
 			 int rtelevelsup);
 extern Node *scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte,
-				 char *colname, int location);
+				 char *colname, int location, FuzzyAttrMatchState *fuzzystate);
 extern Node *colNameToVar(ParseState *pstate, char *colname, bool localonly,
 			 int location);
 extern void markVarForSelectPriv(ParseState *pstate, Var *var,
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index d233710..51db1b6 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -536,6 +536,7 @@ create table atacc1 ( test int );
 -- add a check constraint (fails)
 alter table atacc1 add constraint atacc_test1 check (test1>3);
 ERROR:  column "test1" does not exist
+HINT:  Perhaps you meant to reference the column "atacc1"."test".
 drop table atacc1;
 -- something a little more complicated
 create table atacc1 ( test int, test2 int, test3 int);
@@ -1342,6 +1343,7 @@ select f1 from c1;
 ERROR:  column "f1" does not exist
 LINE 1: select f1 from c1;
                ^
+HINT:  Perhaps you meant to reference the column "c1"."f2".
 drop table p1 cascade;
 NOTICE:  drop cascades to table c1
 create table p1 (f1 int, f2 int);
@@ -1355,6 +1357,7 @@ select f1 from c1;
 ERROR:  column "f1" does not exist
 LINE 1: select f1 from c1;
                ^
+HINT:  Perhaps you meant to reference the column "c1"."f2".
 drop table p1 cascade;
 NOTICE:  drop cascades to table c1
 create table p1 (f1 int, f2 int);
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 2501184..1bb810d 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2222,6 +2222,12 @@ select * from t1 left join t2 on (t1.a = t2.a);
  200 | 1000 | 200 | 2001
 (5 rows)
 
+-- Test matching of column name with wrong alias
+select t1.x from t1 join t3 on (t1.a = t3.x);
+ERROR:  column t1.x does not exist
+LINE 1: select t1.x from t1 join t3 on (t1.a = t3.x);
+               ^
+HINT:  Perhaps you meant to reference the column "t3"."x".
 --
 -- regression test for 8.1 merge right join bug
 --
@@ -3415,6 +3421,38 @@ select * from
 (0 rows)
 
 --
+-- Test hints given on incorrect column references are useful
+--
+select t1.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestipn
+ERROR:  column t1.uunique1 does not exist
+LINE 1: select t1.uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t1"."unique1".
+select t2.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+ERROR:  column t2.uunique1 does not exist
+LINE 1: select t2.uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t2"."unique1".
+select uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+ERROR:  column "uunique1" does not exist
+LINE 1: select uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t1"."unique1" or the column "t2"."unique1".
+--
+-- Take care to reference the correct RTE
+--
+select atts.relid::regclass, s.* from pg_stats s join
+    pg_attribute a on s.attname = a.attname and s.tablename =
+    a.attrelid::regclass::text join (select unnest(indkey) attnum,
+    indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+    schemaname != 'pg_catalog';
+ERROR:  column atts.relid does not exist
+LINE 1: select atts.relid::regclass, s.* from pg_stats s join
+               ^
+--
 -- Test LATERAL
 --
 select unique2, x.*
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 718e1d9..ca7f966 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -397,6 +397,10 @@ insert into t2a values (200, 2001);
 
 select * from t1 left join t2 on (t1.a = t2.a);
 
+-- Test matching of column name with wrong alias
+
+select t1.x from t1 join t3 on (t1.a = t3.x);
+
 --
 -- regression test for 8.1 merge right join bug
 --
@@ -1051,6 +1055,26 @@ select * from
   int8_tbl x join (int4_tbl x cross join int4_tbl y(ff)) j on q1 = f1; -- ok
 
 --
+-- Test hints given on incorrect column references are useful
+--
+
+select t1.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestipn
+select t2.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+select uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+
+--
+-- Take care to reference the correct RTE
+--
+
+select atts.relid::regclass, s.* from pg_stats s join
+    pg_attribute a on s.attname = a.attname and s.tablename =
+    a.attrelid::regclass::text join (select unnest(indkey) attnum,
+    indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+    schemaname != 'pg_catalog';
+--
 -- Test LATERAL
 --
 
-- 
1.9.1

#130Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#129)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Wed, Dec 3, 2014 at 9:21 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Tue, Dec 2, 2014 at 1:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Basically, the case in which I think it's helpful to issue a
suggestion here is when the user has used the table name rather than
the alias name. I wonder if it's worth checking for that case
specifically, in lieu of what you've done here, and issuing a totally
different hint in that case ("HINT: You must refer to this column
as "prime_minister.id" rather than "cameron.id"").

Well, if an alias is used, and you refer to an attribute using a
non-alias name (i.e. the original table name), then you'll already get
an error suggesting that the alias be used instead -- of course,
that's nothing new. It doesn't matter to the existing hinting
mechanism if the attribute name is otherwise wrong. Once you fix the
code to use the alias suggested, you'll then get this new
Levenshtein-based hint.

In that case, I think I favor giving no hint at all when the RTE name
is specified but doesn't match exactly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#131Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#130)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Fri, Dec 5, 2014 at 12:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Well, if an alias is used, and you refer to an attribute using a
non-alias name (i.e. the original table name), then you'll already get
an error suggesting that the alias be used instead -- of course,
that's nothing new. It doesn't matter to the existing hinting
mechanism if the attribute name is otherwise wrong. Once you fix the
code to use the alias suggested, you'll then get this new
Levenshtein-based hint.

In that case, I think I favor giving no hint at all when the RTE name
is specified but doesn't match exactly.

I don't follow. The existing mechanism only concerns what to do when
the original table name was used when an alias should have been used
instead. What does that have to do with this patch?

--
Peter Geoghegan

#132Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#131)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Fri, Dec 5, 2014 at 3:45 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Fri, Dec 5, 2014 at 12:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Well, if an alias is used, and you refer to an attribute using a
non-alias name (i.e. the original table name), then you'll already get
an error suggesting that the alias be used instead -- of course,
that's nothing new. It doesn't matter to the existing hinting
mechanism if the attribute name is otherwise wrong. Once you fix the
code to use the alias suggested, you'll then get this new
Levenshtein-based hint.

In that case, I think I favor giving no hint at all when the RTE name
is specified but doesn't match exactly.

I don't follow. The existing mechanism only concerns what to do when
the original table name was used when an alias should have been used
instead. What does that have to do with this patch?

Just that that's the case in which it seems useful to give a hint.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#133Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#132)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Mon, Dec 8, 2014 at 9:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Just that that's the case in which it seems useful to give a hint.

I think it's very possible that the wrong alias may be provided by the
user, and that we should consider that when providing a hint. Besides,
considering every visible RTE (while penalizing non-exact alias names
iff the user provided an alias name) is actually going to make bad
hints less likely, by increasing the number of equidistant low quality
matches in a way that swamps the mechanism into providing no actual
match at all. That's an important additional protection against low
quality matches.
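
To make the tie-handling argument concrete, here is a minimal sketch of the "no more than two equidistant candidates" rule. It is not the patch's code: `TieTracker`, `tie_init`, `tie_add` and `tie_hint_count` are hypothetical names, and the real FuzzyAttrMatchState bookkeeping differs in detail (it tracks RTEs and attribute numbers, not just counts).

```c
#include <limits.h>

/* Hypothetical tracker, loosely modeled on the idea behind FuzzyAttrMatchState */
typedef struct
{
    int     best;       /* lowest distance observed so far */
    int     nties;      /* number of candidates seen at that distance */
} TieTracker;

void
tie_init(TieTracker *t)
{
    t->best = INT_MAX;
    t->nties = 0;
}

/* Feed one candidate column's distance into the tracker */
void
tie_add(TieTracker *t, int distance)
{
    if (distance < t->best)
    {
        /* New, uncontested best match */
        t->best = distance;
        t->nties = 1;
    }
    else if (distance == t->best)
        t->nties++;             /* another equally distant candidate */
}

/* How many columns are worth hinting: 0, 1 or 2 */
int
tie_hint_count(const TieTracker *t)
{
    /* Three or more equidistant matches are treated as noise */
    if (t->nties > 2)
        return 0;
    return t->nties;
}
```

Under this sketch, every extra RTE that contributes an equally mediocre candidate pushes the tie count up, so scanning more RTEs tends to suppress weak hints rather than produce them.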

What do other people think here?
--
Peter Geoghegan

#134Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#133)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Mon, Dec 8, 2014 at 9:43 AM, Peter Geoghegan <pg@heroku.com> wrote:

I think it's very possible that the wrong alias may be provided by the
user, and that we should consider that when providing a hint.

Note that the existing mechanism (the mechanism that I'm trying to
improve) only ever shows this error message:

"There is a column named \"%s\" in table \"%s\", but it cannot be
referenced from this part of the query."

I think it's pretty clear that this general class of user error is common.
--
Peter Geoghegan

#135Michael Paquier
michael.paquier@gmail.com
In reply to: Peter Geoghegan (#134)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Dec 9, 2014 at 2:52 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Mon, Dec 8, 2014 at 9:43 AM, Peter Geoghegan <pg@heroku.com> wrote:

I think it's very possible that the wrong alias may be provided by the
user, and that we should consider that when providing a hint.

Note that the existing mechanism (the mechanism that I'm trying to
improve) only ever shows this error message:

"There is a column named \"%s\" in table \"%s\", but it cannot be
referenced from this part of the query."

I think it's pretty clear that this general class of user error is common.

Moving this patch to CF 2014-12 as work is still going on. Note that
it is currently marked with Robert as reviewer and that its current
status is "Needs review".
--
Michael

#136Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#135)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Sun, Dec 14, 2014 at 8:24 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Dec 9, 2014 at 2:52 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Mon, Dec 8, 2014 at 9:43 AM, Peter Geoghegan <pg@heroku.com> wrote:

I think it's very possible that the wrong alias may be provided by the
user, and that we should consider that when providing a hint.

Note that the existing mechanism (the mechanism that I'm trying to
improve) only ever shows this error message:

"There is a column named \"%s\" in table \"%s\", but it cannot be
referenced from this part of the query."

I think it's pretty clear that this general class of user error is common.

Moving this patch to CF 2014-12 as work is still going on. Note that
it is currently marked with Robert as reviewer and that its current
status is "Needs review".

The status here is more like "waiting around to see if anyone else has
an opinion". The issue is what should happen when you enter a qualified
name like alvaro.herrera and there is no column named anything like
herrera in the RTE named alvaro, but there is some OTHER RTE that
contains a column with a name that is only a small Levenshtein
distance away from herrera, like roberto.correra. The questions are:

1. Should we EVER give a you-might-have-meant hint in a case like this?
2. If so, does it matter whether the RTE name is just a bit different
from the actual RTE or whether it's completely different? In other
words, might we skip the hint in the above case but give one for
alvara.correra?

My current feeling is that we should answer #1 "no", but Peter prefers
to answer it "yes". My further feeling is that if we do decide to say
"yes" to #1, then I would answer #2 as "yes" also, but Peter would
answer it "no", assigning a fixed penalty for a mismatched RTE rather
than one that varies by the Levenshtein distance between the RTEs.

If no one else expresses an opinion, I'm going to insist on doing it
my way, but I'm happy to have other people weigh in.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#137Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#136)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

Robert Haas <robertmhaas@gmail.com> writes:

On Sun, Dec 14, 2014 at 8:24 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Moving this patch to CF 2014-12 as work is still going on. Note that
it is currently marked with Robert as reviewer and that its current
status is "Needs review".

The status here is more like "waiting around to see if anyone else has
an opinion". The issue is what should happen when you enter a qualified
name like alvaro.herrera and there is no column named anything like
herrera in the RTE named alvaro, but there is some OTHER RTE that
contains a column with a name that is only a small Levenshtein
distance away from herrera, like roberto.correra. The questions are:

1. Should we EVER give a you-might-have-meant hint in a case like this?
2. If so, does it matter whether the RTE name is just a bit different
from the actual RTE or whether it's completely different? In other
words, might we skip the hint in the above case but give one for
alvara.correra?

It would be astonishingly silly to not care about the RTE name's distance,
if you ask me. This is supposed to detect typos, not thinkos.

I think there might be some value in a separate heuristic that, when
you typed foo.bar and that doesn't match but there is a baz.bar, suggests
that maybe you meant baz.bar, even if baz is not close typo-wise. This
would be addressing the thinko case not the typo case, so the rules ought
to be quite different --- in particular I doubt that it'd be a good idea
to hint this way if the column names don't match exactly. But in any
case the key point is that this is a different heuristic addressing a
different failure mode. We should not try to make the
levenshtein-distance heuristic address that case.
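
A minimal sketch of what such a separate "thinko" heuristic might look like, assuming a flat list of visible (alias, column) pairs. `QualifiedColumn` and `find_other_alias_for_column` are hypothetical names, and giving up when more than one other RTE has the column is an extra assumption, not something proposed above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flattened view of the columns visible in the range table */
typedef struct
{
    const char *alias;      /* RTE alias, e.g. "baz" */
    const char *colname;    /* column name, e.g. "bar" */
} QualifiedColumn;

/*
 * Given a failed reference alias.colname, return the alias of the single
 * other RTE that has a column with exactly that name, or NULL if there is
 * none (or more than one, in which case no suggestion is made).
 */
const char *
find_other_alias_for_column(const QualifiedColumn *cols, size_t ncols,
                            const char *alias, const char *colname)
{
    const char *found = NULL;
    size_t      i;

    for (i = 0; i < ncols; i++)
    {
        /* Column name must match exactly; no fuzzy matching here */
        if (strcmp(cols[i].colname, colname) != 0)
            continue;
        /* Same alias would mean the reference was valid after all */
        if (strcmp(cols[i].alias, alias) == 0)
            continue;
        /* Assumption: a second candidate makes the hint ambiguous */
        if (found != NULL)
            return NULL;
        found = cols[i].alias;
    }

    return found;
}
```

The point of the sketch is that alias closeness plays no part: foo.bar can be redirected to baz.bar however different "foo" and "baz" are, but only because "bar" matches exactly.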

So my two cents is that when considering a qualified name, this patch
should take levenshtein distance across the two components equally.
There's no good reason to suppose that typos will attack one name
component more (nor less) than the other.

regards, tom lane

#138Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#137)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

So my two cents is that when considering a qualified name, this patch
should take levenshtein distance across the two components equally.
There's no good reason to suppose that typos will attack one name
component more (nor less) than the other.

Agreed (since it seems like folks are curious for the opinions of
mostly bystanders).

+1 to the above for my part.

Thanks,

Stephen

#139Peter Geoghegan
pg@heroku.com
In reply to: Stephen Frost (#138)
1 attachment(s)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Dec 16, 2014 at 12:18 PM, Stephen Frost <sfrost@snowman.net> wrote:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

So my two cents is that when considering a qualified name, this patch
should take levenshtein distance across the two components equally.
There's no good reason to suppose that typos will attack one name
component more (nor less) than the other.

Agreed (since it seems like folks are curious for the opinions of
mostly bystanders).

+1 to the above for my part.

Okay, then. Attached patch implements this scheme. It is identical to
the previous revision, except that iff there was an alias specified
and that alias does not match the correct name (alias/table name) of
the RTE currently under consideration, we charge the distance between
the differing aliases rather than a fixed distance of 1.
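
For readers who want to see the difference between the two penalty schemes, here is a standalone sketch. It does not use the patch's code: `simple_levenshtein` is a plain byte-wise implementation with unit costs (core's varstr_levenshtein is multibyte-aware and takes configurable costs), and `fixed_penalty_distance` and `summed_distance` are hypothetical names. The patch's separate handling of exact column matches is omitted.

```c
#include <stdlib.h>
#include <string.h>

/* Byte-wise Levenshtein distance with unit costs; returns -1 on OOM */
int
simple_levenshtein(const char *s, const char *t)
{
    size_t  m = strlen(s);
    size_t  n = strlen(t);
    int    *prev = malloc((n + 1) * sizeof(int));
    int    *curr = malloc((n + 1) * sizeof(int));
    int     result;
    size_t  i, j;

    if (prev == NULL || curr == NULL)
    {
        free(prev);
        free(curr);
        return -1;
    }

    /* Distance from the empty prefix of s to each prefix of t */
    for (j = 0; j <= n; j++)
        prev[j] = (int) j;

    for (i = 1; i <= m; i++)
    {
        int    *tmp;

        curr[0] = (int) i;
        for (j = 1; j <= n; j++)
        {
            int     sub = prev[j - 1] + (s[i - 1] != t[j - 1]);
            int     del = prev[j] + 1;
            int     ins = curr[j - 1] + 1;
            int     best = (sub < del) ? sub : del;

            curr[j] = (best < ins) ? best : ins;
        }
        /* Current row becomes the previous row for the next iteration */
        tmp = prev;
        prev = curr;
        curr = tmp;
    }

    result = prev[n];
    free(prev);
    free(curr);
    return result;
}

/* Previous revision: a mismatched alias costs a flat 1 */
int
fixed_penalty_distance(const char *typed_alias, const char *typed_col,
                       const char *rte_alias, const char *rte_col)
{
    int     d = simple_levenshtein(typed_col, rte_col);

    if (typed_alias != NULL && strcmp(typed_alias, rte_alias) != 0)
        d += 1;
    return d;
}

/* This revision: a mismatched alias costs its own Levenshtein distance */
int
summed_distance(const char *typed_alias, const char *typed_col,
                const char *rte_alias, const char *rte_col)
{
    int     d = simple_levenshtein(typed_col, rte_col);

    if (typed_alias != NULL && strcmp(typed_alias, rte_alias) != 0)
        d += simple_levenshtein(typed_alias, rte_alias);
    return d;
}
```

With Robert's example, alvaro.herrera against roberto.correra scores 3 under the fixed penalty (2 for the column plus 1), while the summed scheme adds the full alvaro/roberto distance and so pushes that candidate further away. alvara.correra against alvaro.correra scores 1 under either scheme.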

--
Peter Geoghegan

Attachments:

0001-Levenshtein-distance-column-HINT.patch (text/x-patch; charset=US-ASCII)
From 91087191ba49e5fed7fdfa43f98deb009c2b3e0e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@heroku.com>
Date: Wed, 12 Nov 2014 15:31:37 -0800
Subject: [PATCH] Levenshtein distance column HINT

Add a new HINT -- a guess as to what column the user might have intended
to reference, to be shown in various contexts where an
ERRCODE_UNDEFINED_COLUMN error is raised.  The user will see this HINT
when he or she fat-fingers a column reference in an ad-hoc SQL query, or
incorrectly pluralizes or fails to pluralize a column reference, or
incorrectly omits or includes an underscore or other punctuation
character.

The HINT suggests a column in the range table with the lowest
Levenshtein distance, or the tied-for-best pair of matching columns in
the event of there being exactly two equally likely candidates (these
may come from multiple RTEs, or the same RTE).  Limiting to two the
number of cases where multiple equally likely suggestions are all
offered at once (i.e.  giving no hint when the number of equally likely
candidates exceeds two) is a measure against suggestions that are of low
quality in an absolute sense.

A further, final measure is taken against suggestions that are of low
absolute quality:  If the distance exceeds a normalized distance
threshold, no suggestion is given.
---
 src/backend/parser/parse_expr.c           |   9 +-
 src/backend/parser/parse_func.c           |   2 +-
 src/backend/parser/parse_relation.c       | 325 +++++++++++++++++++++++++++---
 src/backend/utils/adt/levenshtein.c       |   9 +
 src/include/parser/parse_relation.h       |  20 +-
 src/test/regress/expected/alter_table.out |   3 +
 src/test/regress/expected/join.out        |  38 ++++
 src/test/regress/sql/join.sql             |  24 +++
 8 files changed, 392 insertions(+), 38 deletions(-)

diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 4a8aaf6..a77a3a0 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -621,7 +621,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field2);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
@@ -666,7 +667,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field3);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
@@ -724,7 +726,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field4);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index 9ebd3fd..472e15e 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -1779,7 +1779,7 @@ ParseComplexProjection(ParseState *pstate, char *funcname, Node *first_arg,
 									 ((Var *) first_arg)->varno,
 									 ((Var *) first_arg)->varlevelsup);
 		/* Return a Var if funcname matches a column, else NULL */
-		return scanRTEForColumn(pstate, rte, funcname, location);
+		return scanRTEForColumn(pstate, rte, funcname, location, NULL);
 	}
 
 	/*
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 478584d..e6adee1 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include <ctype.h>
+#include <limits.h>
 
 #include "access/htup_details.h"
 #include "access/sysattr.h"
@@ -520,6 +521,69 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
 }
 
 /*
+ * updateFuzzyAttrMatchState
+ *	  Using Levenshtein distance, consider if column is best fuzzy match.
+ */
+static void
+updateFuzzyAttrMatchState(FuzzyAttrMatchState *fuzzystate, const char *actual,
+						  const char *match, int attnum)
+{
+	int		columndistance;
+
+	/*
+	 * Outright reject dropped columns, which can appear here with apparent
+	 * empty actual names, per remarks within scanRTEForColumn().
+	 */
+	if (strcmp(actual, "") == 0)
+		columndistance = INT_MAX;
+	else
+		/* Use standard costs for Levenshtein distance */
+		columndistance = varstr_levenshtein_less_equal(actual, strlen(actual),
+													   match, strlen(match),
+													   1, 1, 1,
+													   fuzzystate->distance);
+
+	if (columndistance < fuzzystate->distance)
+	{
+		/* Store new lowest observed distance for RTE */
+		fuzzystate->distance = columndistance;
+		fuzzystate->first = attnum;
+		fuzzystate->second = InvalidAttrNumber;
+	}
+	else if (columndistance == fuzzystate->distance)
+	{
+		/*
+		 * This match distance may equal a prior match within this same
+		 * range table.  When that happens, the prior match may also be
+		 * given, but only if there is no more than two equally distant
+		 * matches from the RTE (in turn, our caller will only accept
+		 * two equally distant matches overall).
+		 */
+		if (AttributeNumberIsValid(fuzzystate->second))
+		{
+			/* Too many RTE-level matches */
+			fuzzystate->first = fuzzystate->second = InvalidAttrNumber;
+			/* Clearly, distance is too low a bar (for *any* RTE) */
+			fuzzystate->distance = columndistance - 1;
+		}
+		else if (AttributeNumberIsValid(fuzzystate->first))
+		{
+			/* Record as provisional second match for RTE */
+			fuzzystate->second = attnum;
+		}
+		else
+		{
+			/*
+			 * Record as provisional first match (this can occasionally
+			 * occur because previous lowest distance was "too low a
+			 * bar", rather than being associated with a real match)
+			 */
+			fuzzystate->first = attnum;
+		}
+	}
+}
+
+/*
  * scanRTEForColumn
  *	  Search the column names of a single RTE for the given name.
  *	  If found, return an appropriate Var node, else return NULL.
@@ -527,10 +591,22 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
  *
  * Side effect: if we find a match, mark the RTE as requiring read access
  * for the column.
+ *
+ * For those callers that will settle for a fuzzy match (for the purposes of
+ * building diagnostic messages), we match the column attribute whose name has
+ * the lowest Levenshtein distance from colname.  Such callers should not rely
+ * on the return value (even when there is an exact match), nor should they
+ * expect the usual side effect (unless there is an exact match).  This hardly
+ * matters in practice, since an error is imminent.
+ *
+ * If there are two or more attributes in the range table entry tied for
+ * closest, or if there are no matches, accurately report the shortest distance
+ * found overall while not setting a closest attribute.  Note that we never
+ * consider system column names when performing fuzzy matching.
  */
 Node *
 scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
-				 int location)
+				 int location, FuzzyAttrMatchState *fuzzystate)
 {
 	Node	   *result = NULL;
 	int			attnum = 0;
@@ -548,12 +624,16 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
 	 * Should this somehow go wrong and we try to access a dropped column,
 	 * we'll still catch it by virtue of the checks in
 	 * get_rte_attribute_type(), which is called by make_var().  That routine
-	 * has to do a cache lookup anyway, so the check there is cheap.
+	 * has to do a cache lookup anyway, so the check there is cheap.  Callers
+	 * interested in finding match with shortest distance need to defend
+	 * against this directly, though.
 	 */
 	foreach(c, rte->eref->colnames)
 	{
+		const char *attcolname = strVal(lfirst(c));
+
 		attnum++;
-		if (strcmp(strVal(lfirst(c)), colname) == 0)
+		if (strcmp(attcolname, colname) == 0)
 		{
 			if (result)
 				ereport(ERROR,
@@ -566,6 +646,14 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
 			markVarForSelectPriv(pstate, var, rte);
 			result = (Node *) var;
 		}
+
+		/*
+		 * Consider updating fuzzy state passed by callers concerned with
+		 * diagnostic messages.  Fuzzy state will be set for the best (or joint
+		 * best) matching colname observed so far.
+		 */
+		if (fuzzystate != NULL)
+			updateFuzzyAttrMatchState(fuzzystate, attcolname, colname, attnum);
 	}
 
 	/*
@@ -642,7 +730,8 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
 				continue;
 
 			/* use orig_pstate here to get the right sublevels_up */
-			newresult = scanRTEForColumn(orig_pstate, rte, colname, location);
+			newresult = scanRTEForColumn(orig_pstate, rte, colname, location,
+										 NULL);
 
 			if (newresult)
 			{
@@ -668,8 +757,15 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
 
 /*
  * searchRangeTableForCol
- *	  See if any RangeTblEntry could possibly provide the given column name.
- *	  If so, return a pointer to the RangeTblEntry; else return NULL.
+ *	  See if any RangeTblEntry could possibly provide the given column name (or
+ *	  find the best match available).  Returns state with relevant details.
+ *
+ * Column name may be matched fuzzily;  we provide the closest column(s) if
+ * there was not an exact match.  Caller can depend on returned state to find
+ * right attribute.  If first attribute is InvalidAttrNumber, but corresponding
+ * RTE is set, that indicates an exact match (i.e. column name is present, but
+ * presumably not visible).  However, if the wrong alias was specified by user,
+ * the first match attribute *is* set.
  *
  * This is different from colNameToVar in that it considers every entry in
  * the ParseState's rangetable(s), not only those that are currently visible
@@ -678,10 +774,16 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
  * matches, but only one will be returned).  This must be used ONLY as a
  * heuristic in giving suitable error messages.  See errorMissingColumn.
  */
-static RangeTblEntry *
-searchRangeTableForCol(ParseState *pstate, char *colname, int location)
+static FuzzyAttrMatchState *
+searchRangeTableForCol(ParseState *pstate, const char *alias, char *colname,
+					   int location)
 {
 	ParseState *orig_pstate = pstate;
+	FuzzyAttrMatchState *state = palloc(sizeof(FuzzyAttrMatchState));
+
+	state->distance = INT_MAX;
+	state->rsecond = state->rfirst = NULL;
+	state->second = state->first = InvalidAttrNumber;
 
 	while (pstate != NULL)
 	{
@@ -689,15 +791,136 @@ searchRangeTableForCol(ParseState *pstate, char *colname, int location)
 
 		foreach(l, pstate->p_rtable)
 		{
-			RangeTblEntry *rte = (RangeTblEntry *) lfirst(l);
+			RangeTblEntry	   *rte = (RangeTblEntry *) lfirst(l);
+			FuzzyAttrMatchState	rtestate;
+			bool				wrongalias;
 
-			if (scanRTEForColumn(orig_pstate, rte, colname, location))
-				return rte;
+			/*
+			 * Typically, it is not useful to look for matches within join
+			 * RTEs;  they effectively duplicate other RTEs for our purposes,
+			 * and if a match is chosen from a join RTE, an unhelpful alias is
+			 * displayed in the final diagnostic message.
+			 */
+			if (rte->rtekind == RTE_JOIN)
+				continue;
+
+			/*
+			 * Get single best match (or pair of joint best matches, or no
+			 * match) from each RTE -- the best two columns ultimately
+			 * suggested may or may not both be from the same RTE.
+			 *
+			 * Initialize RTE's distance to INT_MAX (and not RT state's current
+			 * lowest distance) to ensure that per-RTE penalties do not distort
+			 * per-RT costing.
+			 */
+			rtestate.distance = INT_MAX;
+			rtestate.rsecond = rtestate.rfirst = NULL;
+			rtestate.second = rtestate.first = InvalidAttrNumber;
+			scanRTEForColumn(orig_pstate, rte, colname, location, &rtestate);
+
+			/* Avoid totally non-matching RTEs (e.g. no RTE attributes) */
+			if (!AttributeNumberIsValid(rtestate.first))
+				continue;
+
+			/* Was alias provided by user that does not match entry's alias? */
+			wrongalias = (alias && strcmp(alias, rte->eref->aliasname) != 0);
+
+			if (rtestate.distance == 0)
+			{
+				/*
+				 * Exact match (for "wrong alias" or "wrong level" cases).
+				 *
+				 * Only consider first element for RTE, because there can only
+				 * be one exact match -- it doesn't seem worth considering the
+				 * case where there are multiple exact matches, so we're done.
+				 */
+				state->rfirst = rte;
+				state->first = wrongalias? rtestate.first : InvalidAttrNumber;
+				state->rsecond = NULL;
+				state->second = InvalidAttrNumber;
+
+				return state;
+			}
+
+			/*
+			 * Charge extra (for inexact matches only) when an alias was
+			 * specified that differs from what might have been used to
+			 * correctly qualify this RTE's closest column
+			 */
+			if (wrongalias)
+				rtestate.distance += varstr_levenshtein(alias,
+														strlen(alias),
+														rte->eref->aliasname,
+														strlen(rte->eref->aliasname),
+														1, 1, 1);
+
+			if (rtestate.distance < state->distance)
+			{
+				/*
+				 * New, uncontested best match RTE, with 1 or 2 best match
+				 * columns
+				 */
+				state->distance = rtestate.distance;
+
+				state->rfirst = rte;
+				state->first = rtestate.first;
+				state->rsecond =
+					AttributeNumberIsValid(rtestate.second)? rte: NULL;
+				state->second = rtestate.second;
+			}
+			else if (rtestate.distance == state->distance)
+			{
+				/*
+				 * Can't have 3 or more matches at same distance.
+				 *
+				 * It's useful to provide two matches for the common case where
+				 * two range tables have single equidistant candidates, as when
+				 * an unqualified (and therefore would-be ambiguous) column
+				 * name is specified which is also misspelled by the user --
+				 * there is probably a foreign key relationship between
+				 * tables/RTEs.  It's also possible to usefully give two column
+				 * suggestions originating from the same RTE, which may be
+				 * useful when an alias strongly suggests that RTE, while there
+				 * are 2 somewhat close matches.
+				 *
+				 * However, when there are more than 2 equally distant matches,
+				 * that's probably because the matches are not useful at all,
+				 * so don't suggest anything.
+				 */
+				if (AttributeNumberIsValid(state->second) ||
+					AttributeNumberIsValid(rtestate.second))
+				{
+					/* 3 or more equidistant matches -- RTE is uninteresting */
+					state->rsecond = state->rfirst = NULL;
+					state->second = state->first = InvalidAttrNumber;
+				}
+				else
+				{
+					/* Record as provisional second match for RT */
+					Assert(state->rfirst != NULL &&
+						   AttributeNumberIsValid(state->first));
+					Assert(state->rsecond == NULL &&
+						   !AttributeNumberIsValid(state->second));
+					state->rsecond = rte;
+					state->second = rtestate.first;
+				}
+			}
 		}
 
 		pstate = pstate->parentParseState;
 	}
-	return NULL;
+
+	/*
+	 * Final, absolute quality test:  distance must be less than a normalized
+	 * threshold in order to avoid completely ludicrous suggestions
+	 */
+	if (state->distance > strlen(colname) / 2)
+	{
+		state->rsecond = state->rfirst = NULL;
+		state->second = state->first = InvalidAttrNumber;
+	}
+
+	return state;
 }
 
 /*
@@ -2862,34 +3085,70 @@ void
 errorMissingColumn(ParseState *pstate,
 				   char *relname, char *colname, int location)
 {
-	RangeTblEntry *rte;
+	FuzzyAttrMatchState	   *state;
+	char				   *closestfirst = NULL;
 
 	/*
-	 * If relname was given, just play dumb and report it.  (In practice, a
-	 * bad qualification name should end up at errorMissingRTE, not here, so
-	 * no need to work hard on this case.)
+	 * Search the entire rtable looking for possible matches.  If we find one,
+	 * emit a hint about it.
+	 *
+	 * TODO: improve this code (and also errorMissingRTE) to mention using
+	 * LATERAL if appropriate.
 	 */
-	if (relname)
-		ereport(ERROR,
-				(errcode(ERRCODE_UNDEFINED_COLUMN),
-				 errmsg("column %s.%s does not exist", relname, colname),
-				 parser_errposition(pstate, location)));
+	state = searchRangeTableForCol(pstate, relname, colname, location);
 
 	/*
-	 * Otherwise, search the entire rtable looking for possible matches.  If
-	 * we find one, emit a hint about it.
+	 * In practice a bad qualification name should end up at errorMissingRTE,
+	 * not here, so no need to work hard on this case.
 	 *
-	 * TODO: improve this code (and also errorMissingRTE) to mention using
-	 * LATERAL if appropriate.
+	 * Extract closest col string for best match, if any.
+	 *
+	 * Infer an exact match referenced despite not being visible from the fact
+	 * that an attribute number was not present in state passed back -- this is
+	 * what is reported when !closestfirst.  There might also be an exact match
+	 * that was qualified with an incorrect alias, in which case closestfirst
+	 * will be set (so hint is the same as generic fuzzy case).
 	 */
-	rte = searchRangeTableForCol(pstate, colname, location);
-
-	ereport(ERROR,
-			(errcode(ERRCODE_UNDEFINED_COLUMN),
-			 errmsg("column \"%s\" does not exist", colname),
-			 rte ? errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
-						   colname, rte->eref->aliasname) : 0,
-			 parser_errposition(pstate, location)));
+	if (state->rfirst && AttributeNumberIsValid(state->first))
+		closestfirst = strVal(list_nth(state->rfirst->eref->colnames,
+									   state->first - 1));
+
+	if (!state->rsecond)
+	{
+		/*
+		 * Handle case where there is zero or one column suggestions to hint,
+		 * including exact matches referenced but not visible.
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_COLUMN),
+				 relname?
+				 errmsg("column %s.%s does not exist", relname, colname):
+				 errmsg("column \"%s\" does not exist", colname),
+				 state->rfirst? closestfirst?
+				 errhint("Perhaps you meant to reference the column \"%s\".\"%s\".",
+						 state->rfirst->eref->aliasname, closestfirst):
+				 errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
+						 colname, state->rfirst->eref->aliasname): 0,
+				 parser_errposition(pstate, location)));
+	}
+	else
+	{
+		/* Handle case where there are two equally useful column hints */
+		char				   *closestsecond;
+
+		closestsecond = strVal(list_nth(state->rsecond->eref->colnames,
+										state->second - 1));
+
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_COLUMN),
+				 relname?
+				 errmsg("column %s.%s does not exist", relname, colname):
+				 errmsg("column \"%s\" does not exist", colname),
+				 errhint("Perhaps you meant to reference the column \"%s\".\"%s\" or the column \"%s\".\"%s\".",
+						 state->rfirst->eref->aliasname, closestfirst,
+						 state->rsecond->eref->aliasname, closestsecond),
+				 parser_errposition(pstate, location)));
+	}
 }
 
 
diff --git a/src/backend/utils/adt/levenshtein.c b/src/backend/utils/adt/levenshtein.c
index a8670e9..8d565c6 100644
--- a/src/backend/utils/adt/levenshtein.c
+++ b/src/backend/utils/adt/levenshtein.c
@@ -95,6 +95,15 @@ varstr_levenshtein(const char *source, int slen, const char *target, int tlen,
 #define STOP_COLUMN m
 #endif
 
+	/*
+	 * A common use for Levenshtein distance is to match attributes when building
+	 * diagnostic, user-visible messages.  Restrict the size of
+	 * MAX_LEVENSHTEIN_STRLEN at compile time so that this is guaranteed to
+	 * work.
+	 */
+	StaticAssertStmt(NAMEDATALEN <= MAX_LEVENSHTEIN_STRLEN,
+					 "Levenshtein hinting mechanism restricts NAMEDATALEN");
+
 	m = pg_mbstrlen_with_len(source, slen);
 	n = pg_mbstrlen_with_len(target, tlen);
 
diff --git a/src/include/parser/parse_relation.h b/src/include/parser/parse_relation.h
index d8b9493..b587abc 100644
--- a/src/include/parser/parse_relation.h
+++ b/src/include/parser/parse_relation.h
@@ -16,6 +16,24 @@
 
 #include "parser/parse_node.h"
 
+
+/*
+ * Support for fuzzily matching column.
+ *
+ * This is for building diagnostic messages, where non-exact matching
+ * attributes are suggested to the user.  The struct's fields may be facets of
+ * a particular RTE, or of an entire range table, depending on context.
+ */
+typedef struct
+{
+	int				distance;	/* Weighted distance (lowest so far) */
+	RangeTblEntry  *rfirst;		/* RTE of first */
+	AttrNumber		first;		/* Closest attribute so far */
+	RangeTblEntry  *rsecond;	/* RTE of second */
+	AttrNumber		second;		/* Second closest attribute so far */
+} FuzzyAttrMatchState;
+
+
 extern RangeTblEntry *refnameRangeTblEntry(ParseState *pstate,
 					 const char *schemaname,
 					 const char *refname,
@@ -35,7 +53,7 @@ extern RangeTblEntry *GetRTEByRangeTablePosn(ParseState *pstate,
 extern CommonTableExpr *GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte,
 			 int rtelevelsup);
 extern Node *scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte,
-				 char *colname, int location);
+				 char *colname, int location, FuzzyAttrMatchState *fuzzystate);
 extern Node *colNameToVar(ParseState *pstate, char *colname, bool localonly,
 			 int location);
 extern void markVarForSelectPriv(ParseState *pstate, Var *var,
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index d233710..51db1b6 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -536,6 +536,7 @@ create table atacc1 ( test int );
 -- add a check constraint (fails)
 alter table atacc1 add constraint atacc_test1 check (test1>3);
 ERROR:  column "test1" does not exist
+HINT:  Perhaps you meant to reference the column "atacc1"."test".
 drop table atacc1;
 -- something a little more complicated
 create table atacc1 ( test int, test2 int, test3 int);
@@ -1342,6 +1343,7 @@ select f1 from c1;
 ERROR:  column "f1" does not exist
 LINE 1: select f1 from c1;
                ^
+HINT:  Perhaps you meant to reference the column "c1"."f2".
 drop table p1 cascade;
 NOTICE:  drop cascades to table c1
 create table p1 (f1 int, f2 int);
@@ -1355,6 +1357,7 @@ select f1 from c1;
 ERROR:  column "f1" does not exist
 LINE 1: select f1 from c1;
                ^
+HINT:  Perhaps you meant to reference the column "c1"."f2".
 drop table p1 cascade;
 NOTICE:  drop cascades to table c1
 create table p1 (f1 int, f2 int);
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 2501184..1bb810d 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2222,6 +2222,12 @@ select * from t1 left join t2 on (t1.a = t2.a);
  200 | 1000 | 200 | 2001
 (5 rows)
 
+-- Test matching of column name with wrong alias
+select t1.x from t1 join t3 on (t1.a = t3.x);
+ERROR:  column t1.x does not exist
+LINE 1: select t1.x from t1 join t3 on (t1.a = t3.x);
+               ^
+HINT:  Perhaps you meant to reference the column "t3"."x".
 --
 -- regression test for 8.1 merge right join bug
 --
@@ -3415,6 +3421,38 @@ select * from
 (0 rows)
 
 --
+-- Test hints given on incorrect column references are useful
+--
+select t1.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestipn
+ERROR:  column t1.uunique1 does not exist
+LINE 1: select t1.uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t1"."unique1".
+select t2.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+ERROR:  column t2.uunique1 does not exist
+LINE 1: select t2.uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t2"."unique1".
+select uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+ERROR:  column "uunique1" does not exist
+LINE 1: select uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t1"."unique1" or the column "t2"."unique1".
+--
+-- Take care to reference the correct RTE
+--
+select atts.relid::regclass, s.* from pg_stats s join
+    pg_attribute a on s.attname = a.attname and s.tablename =
+    a.attrelid::regclass::text join (select unnest(indkey) attnum,
+    indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+    schemaname != 'pg_catalog';
+ERROR:  column atts.relid does not exist
+LINE 1: select atts.relid::regclass, s.* from pg_stats s join
+               ^
+--
 -- Test LATERAL
 --
 select unique2, x.*
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 718e1d9..ca7f966 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -397,6 +397,10 @@ insert into t2a values (200, 2001);
 
 select * from t1 left join t2 on (t1.a = t2.a);
 
+-- Test matching of column name with wrong alias
+
+select t1.x from t1 join t3 on (t1.a = t3.x);
+
 --
 -- regression test for 8.1 merge right join bug
 --
@@ -1051,6 +1055,26 @@ select * from
   int8_tbl x join (int4_tbl x cross join int4_tbl y(ff)) j on q1 = f1; -- ok
 
 --
+-- Test hints given on incorrect column references are useful
+--
+
+select t1.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestipn
+select t2.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+select uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+
+--
+-- Take care to reference the correct RTE
+--
+
+select atts.relid::regclass, s.* from pg_stats s join
+    pg_attribute a on s.attname = a.attname and s.tablename =
+    a.attrelid::regclass::text join (select unnest(indkey) attnum,
+    indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+    schemaname != 'pg_catalog';
+--
 -- Test LATERAL
 --
 
-- 
1.9.1

#140Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#139)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Sat, Dec 20, 2014 at 3:17 PM, Peter Geoghegan <pg@heroku.com> wrote:

Attached patch implements this scheme.

I had another thought: "NAMEDATALEN + 1" is a better representation of
"infinity" for matching purposes than INT_MAX. I probably should have
made that change, too. It would then not have been necessary to
"#include <limits.h>". I think that this is a useful
belt-and-suspenders precaution against integer overflow. It almost
certainly won't matter, since it's very unlikely that the best match
within an RTE will end up being a dropped column, but we might as well
do it that way (Levenshtein distance is costed in multiples of code
point changes, but the maximum density is 1 byte per codepoint).
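
To make the overflow concern concrete, here is a minimal standalone sketch. It is not code from the patch: NAMEDATALEN is redefined locally using PostgreSQL's default value of 64, and FUZZY_INFINITY, add_alias_penalty() and penalty_would_overflow() are hypothetical names used only for illustration. The point is that a sentinel just above the longest possible identifier still compares as "worse than any real match", while leaving plenty of headroom when an alias penalty is later added to it.

```c
#include <limits.h>

/* Assumption: PostgreSQL's default NAMEDATALEN, redefined so this stands alone */
#define NAMEDATALEN 64

/* Hypothetical sentinel: larger than any distance between two identifiers */
#define FUZZY_INFINITY (NAMEDATALEN + 1)

/*
 * Hypothetical helper: report whether adding an alias penalty to a column
 * distance would overflow a signed int (undefined behavior in C).
 */
static int
penalty_would_overflow(int column_distance, int alias_penalty)
{
	return alias_penalty > 0 && column_distance > INT_MAX - alias_penalty;
}

/* Hypothetical helper: the addition itself, safe only if the check passes */
static int
add_alias_penalty(int column_distance, int alias_penalty)
{
	return column_distance + alias_penalty;
}
```

With INT_MAX as the sentinel, any positive penalty trips the check; with FUZZY_INFINITY the sum stays small and ordinary comparisons keep working.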

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#141Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#140)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Sat, Dec 20, 2014 at 7:30 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Sat, Dec 20, 2014 at 3:17 PM, Peter Geoghegan <pg@heroku.com> wrote:

Attached patch implements this scheme.

I had another thought: "NAMEDATALEN + 1" is a better representation of
"infinity" for matching purposes than INT_MAX. I probably should have
made that change, too. It would then not have been necessary to
"#include <limits.h>". I think that this is a useful
belt-and-suspenders precaution against integer overflow. It almost
certainly won't matter, since it's very unlikely that the best match
within an RTE will end up being a dropped column, but we might as well
do it that way (Levenshtein distance is costed in multiples of code
point changes, but the maximum density is 1 byte per codepoint).

Good idea.

Looking over the latest patch, I think we could simplify the code so
that you don't need multiple FuzzyAttrMatchState objects. Instead of
creating a separate one for each RTE and then merging them, just have
one. When you find an inexact-RTE name match, set a field inside the
FuzzyAttrMatchState -- maybe with a name like rte_penalty -- to the
Levenshtein distance between the RTEs. Then call scanRTEForColumn()
and pass down the same state object. Now let
updateFuzzyAttrMatchState() work out what it needs to do based on the
observed inter-column distance and the currently-in-force RTE penalty.
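
A rough standalone sketch of that single-state idea follows. Only the rte_penalty field name comes from the suggestion above; FuzzyState, fuzzy_init(), fuzzy_update(), fuzzy_scan_rel(), levenshtein() and MAX_NAME are hypothetical stand-ins, plain C strings replace RTEs and attribute numbers, and the Levenshtein routine is a simple unit-cost version rather than the one in core.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAME 64				/* assumed upper bound on identifier length */

typedef struct
{
	int			distance;		/* best weighted distance seen so far */
	int			rte_penalty;	/* penalty for the relation being scanned */
	const char *first_rel;		/* relation/column of best match, or NULL */
	const char *first_col;
	const char *second_rel;		/* relation/column of tied match, or NULL */
	const char *second_col;
} FuzzyState;

/* Plain unit-cost Levenshtein distance using two rows of the DP table */
static int
levenshtein(const char *s, const char *t)
{
	int			prev[MAX_NAME + 1], curr[MAX_NAME + 1];
	int			m = (int) strlen(s), n = (int) strlen(t);
	int			i, j;

	if (m > MAX_NAME || n > MAX_NAME)
		return MAX_NAME + 1;	/* treat over-long input as "no match" */
	for (j = 0; j <= n; j++)
		prev[j] = j;
	for (i = 1; i <= m; i++)
	{
		curr[0] = i;
		for (j = 1; j <= n; j++)
		{
			int			sub = prev[j - 1] + (s[i - 1] != t[j - 1]);
			int			del = prev[j] + 1;
			int			ins = curr[j - 1] + 1;
			int			best = sub < del ? sub : del;

			curr[j] = best < ins ? best : ins;
		}
		memcpy(prev, curr, sizeof(int) * (n + 1));
	}
	return prev[n];
}

static void
fuzzy_init(FuzzyState *st)
{
	st->distance = INT_MAX;
	st->rte_penalty = 0;
	st->first_rel = st->first_col = NULL;
	st->second_rel = st->second_col = NULL;
}

/* Fold one candidate column into the single shared state */
static void
fuzzy_update(FuzzyState *st, const char *rel, const char *col,
			 const char *wanted)
{
	int			d = levenshtein(col, wanted) + st->rte_penalty;

	if (d < st->distance)
	{
		/* New uncontested best match */
		st->distance = d;
		st->first_rel = rel;
		st->first_col = col;
		st->second_rel = st->second_col = NULL;
	}
	else if (d == st->distance)
	{
		if (st->first_col != NULL && st->second_col == NULL)
		{
			/* Second match at the same distance */
			st->second_rel = rel;
			st->second_col = col;
		}
		else
		{
			/* Three or more ties: suggest nothing at this distance */
			st->first_rel = st->first_col = NULL;
			st->second_rel = st->second_col = NULL;
		}
	}
}

/* Set the penalty once per relation, then feed every column to the state */
static void
fuzzy_scan_rel(FuzzyState *st, const char *user_alias, const char *rel,
			   const char **cols, int ncols, const char *wanted)
{
	int			i;

	st->rte_penalty = user_alias ? levenshtein(user_alias, rel) : 0;
	for (i = 0; i < ncols; i++)
		fuzzy_update(st, rel, cols[i], wanted);
}
```

Scanning relations "t1" and "t2" (each with columns unique1, unique2, two) for "uunique1" should leave only t1.unique1 when the user wrote the alias "t1", and both t1.unique1 and t2.unique1 when no alias was given, which is the behavior the regression tests in the earlier patch expect.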

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#142Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#141)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Mon, Dec 22, 2014 at 5:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Looking over the latest patch, I think we could simplify the code so
that you don't need multiple FuzzyAttrMatchState objects. Instead of
creating a separate one for each RTE and then merging them, just have
one. When you find an inexact-RTE name match, set a field inside the
FuzzyAttrMatchState -- maybe with a name like rte_penalty -- to the
Levenshtein distance between the RTEs. Then call scanRTEForColumn()
and pass down the same state object. Now let
updateFuzzyAttrMatchState() work out what it needs to do based on the
observed inter-column distance and the currently-in-force RTE penalty.

I'm afraid I don't follow. I think doing things that way makes things
less clear. Merging is useful because it allows us to consider that an
exact match might exist, which this searchRangeTableForCol() is
already tasked with today. We now look for the best match
exhaustively, or magically return immediately in the event of an exact
match, without caring about the alias correctness or distance.

Having a separate object makes this pattern apparent from the top
level, within searchRangeTableForCol(). I feel that's better.
updateFuzzyAttrMatchState() is the wrong place to put that, because
that task rightfully belongs in searchRangeTableForCol(), where the
high level diagnostic-report-generating control flow lives.

To put it another way, creating a separate object obfuscates
scanRTEForColumn(), since it's the only client of
updateFuzzyAttrMatchState(). scanRTEForColumn() is a very important
function, and right now I am only making it slightly less clear by
tasking it with caring about distance of names on top of strict binary
equality of attribute names. I don't want to push it any further.
--
Peter Geoghegan

#143Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#142)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Mon, Dec 22, 2014 at 4:34 PM, Peter Geoghegan <pg@heroku.com> wrote:

To put it another way, creating a separate object obfuscates
scanRTEForColumn(), since it's the only client of
updateFuzzyAttrMatchState().

Excuse me. I mean *not* creating a separate object -- having a unified
state representation for the entire range-table, rather than having
one per RTE and merging them one by one into an overall/final range
table object.

--
Peter Geoghegan

#144Michael Paquier
michael.paquier@gmail.com
In reply to: Peter Geoghegan (#143)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Dec 23, 2014 at 9:43 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Mon, Dec 22, 2014 at 4:34 PM, Peter Geoghegan <pg@heroku.com> wrote:

To put it another way, creating a separate object obfuscates
scanRTEForColumn(), since it's the only client of
updateFuzzyAttrMatchState().

Excuse me. I mean *not* creating a separate object -- having a unified
state representation for the entire range-table, rather than having
one per RTE and merging them one by one into an overall/final range
table object.

Patch moved to CF 2015-02.
--
Michael

#145Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#142)
1 attachment(s)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Mon, Dec 22, 2014 at 7:34 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Mon, Dec 22, 2014 at 5:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Looking over the latest patch, I think we could simplify the code so
that you don't need multiple FuzzyAttrMatchState objects. Instead of
creating a separate one for each RTE and then merging them, just have
one. When you find an inexact-RTE name match, set a field inside the
FuzzyAttrMatchState -- maybe with a name like rte_penalty -- to the
Levenshtein distance between the RTEs. Then call scanRTEForColumn()
and pass down the same state object. Now let
updateFuzzyAttrMatchState() work out what it needs to do based on the
observed inter-column distance and the currently-in-force RTE penalty.

I'm afraid I don't follow. I think doing things that way makes things
less clear. Merging is useful because it allows us to consider that an
exact match might exist, which this searchRangeTableForCol() is
already tasked with today. We now look for the best match
exhaustively, or magically return immediately in the event of an exact
match, without caring about the alias correctness or distance.

Having a separate object makes this pattern apparent from the top
level, within searchRangeTableForCol(). I feel that's better.
updateFuzzyAttrMatchState() is the wrong place to put that, because
that task rightfully belongs in searchRangeTableForCol(), where the
high level diagnostic-report-generating control flow lives.

To put it another way, creating a separate object obfuscates
scanRTEForColumn(), since it's the only client of
updateFuzzyAttrMatchState(). scanRTEForColumn() is a very important
function, and right now I am only making it slightly less clear by
tasking it with caring about distance of names on top of strict binary
equality of attribute names. I don't want to push it any further.

I don't buy this. What you're essentially doing is using the
FuzzyAttrMatchState object in two ways that are not entirely
compatible with each other - updateFuzzyAttrMatchState doesn't set the
RTE fields, so searchRangeTableForCol has to do it. So there's an
unspoken contract that in some parts of the code, you can rely on
those fields being set, and in others, you can't. That pretty much
defeats the whole point of making the state its own object, AFAICS.
Furthermore, you end up with two copies of the state-combining logic,
one in updateFuzzyAttrMatchState and a second in searchRangeTableForCol.
That's ugly and unnecessary.

I decided to rework this patch myself today; my version is attached.
I believe that this version is significantly easier to understand than
yours, both as to the code and the comments. I put quite a bit of
work into both. I also suspect it's more efficient, because it avoids
computing the Levenshtein distances for column names when we already
know that those column names can't possibly be sufficiently-good
matches for us to care about the details; and when it does compute the
Levenshtein distance it keeps the max-distance threshold as low as
possible. That may not really matter much, but it can't hurt. More
importantly, essentially all of the fuzzy-matching logic is now
isolated in updateFuzzyAttrMatchState(); the volume of change in
scanRTEForColumn is the same as in your version, but the volume of
change in searchRangeTableForCol is quite a bit less, so the patch is
smaller overall.
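
As an illustration of the kind of pruning described here, and not necessarily how the attached patch does it, the sketch below decides whether a candidate can be rejected before any Levenshtein computation. MAX_FUZZY_DISTANCE is taken from the patch; can_skip_candidate() and its parameters are hypothetical. It relies on the fact that, with unit costs, the distance between two strings is at least the difference in their lengths (byte lengths are used, so this assumes single-byte characters).

```c
#include <stdlib.h>
#include <string.h>

#define MAX_FUZZY_DISTANCE 3	/* value used in the attached patch */

/*
 * Hypothetical helper: return 1 if "candidate" cannot possibly match
 * "wanted" at least as well as the best match seen so far, so that the
 * full Levenshtein computation can be skipped.
 */
static int
can_skip_candidate(const char *candidate, const char *wanted,
				   int best_so_far, int rte_penalty)
{
	/* Never look further than the global cap or the current best */
	int			cap = best_so_far < MAX_FUZZY_DISTANCE ?
		best_so_far : MAX_FUZZY_DISTANCE;

	/* What remains for the column name once the RTE penalty is paid */
	int			budget = cap - rte_penalty;

	/* Length difference is a lower bound on unit-cost edit distance */
	int			lendiff = abs((int) strlen(candidate) - (int) strlen(wanted));

	return budget < 0 || lendiff > budget;
}
```

The same budget can then serve as the max-distance argument to a bounded Levenshtein routine, so the bound tightens as better matches are found.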

I'm prepared to commit this version if nobody finds a problem with it.
It passes the additional regression tests you wrote.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

column-hint-rmh.patch (binary/octet-stream)
diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 7829bcb..130e52b 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -556,7 +556,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field2);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										0, NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
@@ -601,7 +602,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field3);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										0, NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
@@ -659,7 +661,8 @@ transformColumnRef(ParseState *pstate, ColumnRef *cref)
 				colname = strVal(field4);
 
 				/* Try to identify as a column of the RTE */
-				node = scanRTEForColumn(pstate, rte, colname, cref->location);
+				node = scanRTEForColumn(pstate, rte, colname, cref->location,
+										0, NULL);
 				if (node == NULL)
 				{
 					/* Try it as a function call on the whole row */
diff --git a/src/backend/parser/parse_func.c b/src/backend/parser/parse_func.c
index a200804..53bbaec 100644
--- a/src/backend/parser/parse_func.c
+++ b/src/backend/parser/parse_func.c
@@ -1779,7 +1779,7 @@ ParseComplexProjection(ParseState *pstate, char *funcname, Node *first_arg,
 									 ((Var *) first_arg)->varno,
 									 ((Var *) first_arg)->varlevelsup);
 		/* Return a Var if funcname matches a column, else NULL */
-		return scanRTEForColumn(pstate, rte, funcname, location);
+		return scanRTEForColumn(pstate, rte, funcname, location, 0, NULL);
 	}
 
 	/*
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index f416fc2..80daeb9 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -33,6 +33,8 @@
 #include "utils/syscache.h"
 
 
+#define MAX_FUZZY_DISTANCE				3
+
 static RangeTblEntry *scanNameSpaceForRefname(ParseState *pstate,
 						const char *refname, int location);
 static RangeTblEntry *scanNameSpaceForRelid(ParseState *pstate, Oid relid,
@@ -520,6 +522,101 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
 }
 
 /*
+ * updateFuzzyAttrMatchState
+ *	  Using Levenshtein distance, consider if column is best fuzzy match.
+ */
+static void
+updateFuzzyAttrMatchState(int fuzzy_rte_penalty,
+						  FuzzyAttrMatchState *fuzzystate, RangeTblEntry *rte,
+						  const char *actual, const char *match, int attnum)
+{
+	int		columndistance;
+	int		matchlen;
+
+	/* Bail before computing the Levenshtein distance if there's no hope. */
+	if (fuzzy_rte_penalty > fuzzystate->distance)
+		return;
+
+	/*
+	 * Outright reject dropped columns, which can appear here with apparent
+	 * empty actual names, per remarks within scanRTEForColumn().
+	 */
+	if (actual[0] == '\0')
+		return;
+
+	/* Use Levenshtein to compute match distance. */
+	matchlen = strlen(match);
+	columndistance =
+		varstr_levenshtein_less_equal(actual, strlen(actual), match, matchlen,
+									  1, 1, 1,
+									  fuzzystate->distance + 1
+										- fuzzy_rte_penalty);
+
+	/*
+	 * If more than half the characters are different, don't treat it as a
+	 * match, to avoid making ridiculous suggestions.
+	 */
+	if (columndistance > matchlen / 2)
+		return;
+
+	/*
+	 * From this point on, we can ignore the distinction between the
+	 * RTE-name distance and the column-name distance.
+	 */
+	columndistance += fuzzy_rte_penalty;
+
+	/*
+	 * If the new distance is less than or equal to that of the best match
+	 * found so far, update fuzzystate.
+	 */
+	if (columndistance < fuzzystate->distance)
+	{
+		/* Store new lowest observed distance for RTE */
+		fuzzystate->distance = columndistance;
+		fuzzystate->rfirst = rte;
+		fuzzystate->first = attnum;
+		fuzzystate->rsecond = NULL;
+		fuzzystate->second = InvalidAttrNumber;
+	}
+	else if (columndistance == fuzzystate->distance)
+	{
+		/*
+		 * This match distance may equal a prior match within this same
+		 * range table.  When that happens, the prior match may also be
+		 * given, but only if there are no more than two equally distant
+		 * matches from the RTE (in turn, our caller will only accept
+		 * two equally distant matches overall).
+		 */
+		if (AttributeNumberIsValid(fuzzystate->second))
+		{
+			/* Too many RTE-level matches */
+			fuzzystate->rfirst = NULL;
+			fuzzystate->first = InvalidAttrNumber;
+			fuzzystate->rsecond = NULL;
+			fuzzystate->second = InvalidAttrNumber;
+			/* Clearly, distance is too low a bar (for *any* RTE) */
+			fuzzystate->distance = columndistance - 1;
+		}
+		else if (AttributeNumberIsValid(fuzzystate->first))
+		{
+			/* Record as provisional second match for RTE */
+			fuzzystate->rsecond = rte;
+			fuzzystate->second = attnum;
+		}
+		else if (fuzzystate->distance <= MAX_FUZZY_DISTANCE)
+		{
+			/*
+			 * Record as provisional first match (this can occasionally
+			 * occur because previous lowest distance was "too low a
+			 * bar", rather than being associated with a real match)
+			 */
+			fuzzystate->rfirst = rte;
+			fuzzystate->first = attnum;
+		}
+	}
+}
+
+/*
  * scanRTEForColumn
  *	  Search the column names of a single RTE for the given name.
  *	  If found, return an appropriate Var node, else return NULL.
@@ -527,10 +624,14 @@ GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte, int rtelevelsup)
  *
  * Side effect: if we find a match, mark the RTE as requiring read access
  * for the column.
+ *
+ * Additional side effect: if fuzzystate is non-NULL, check non-system columns
+ * for an approximate match and update fuzzystate accordingly.
  */
 Node *
 scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
-				 int location)
+				 int location, int fuzzy_rte_penalty,
+				 FuzzyAttrMatchState *fuzzystate)
 {
 	Node	   *result = NULL;
 	int			attnum = 0;
@@ -548,12 +649,16 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
 	 * Should this somehow go wrong and we try to access a dropped column,
 	 * we'll still catch it by virtue of the checks in
 	 * get_rte_attribute_type(), which is called by make_var().  That routine
-	 * has to do a cache lookup anyway, so the check there is cheap.
+	 * has to do a cache lookup anyway, so the check there is cheap.  Callers
+	 * interested in finding match with shortest distance need to defend
+	 * against this directly, though.
 	 */
 	foreach(c, rte->eref->colnames)
 	{
+		const char *attcolname = strVal(lfirst(c));
+
 		attnum++;
-		if (strcmp(strVal(lfirst(c)), colname) == 0)
+		if (strcmp(attcolname, colname) == 0)
 		{
 			if (result)
 				ereport(ERROR,
@@ -566,6 +671,11 @@ scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte, char *colname,
 			markVarForSelectPriv(pstate, var, rte);
 			result = (Node *) var;
 		}
+
+		/* Updating fuzzy match state, if provided. */
+		if (fuzzystate != NULL)
+			updateFuzzyAttrMatchState(fuzzy_rte_penalty, fuzzystate,
+									  rte, attcolname, colname, attnum);
 	}
 
 	/*
@@ -642,7 +752,8 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
 				continue;
 
 			/* use orig_pstate here to get the right sublevels_up */
-			newresult = scanRTEForColumn(orig_pstate, rte, colname, location);
+			newresult = scanRTEForColumn(orig_pstate, rte, colname, location,
+										 0, NULL);
 
 			if (newresult)
 			{
@@ -668,8 +779,8 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
 
 /*
  * searchRangeTableForCol
- *	  See if any RangeTblEntry could possibly provide the given column name.
- *	  If so, return a pointer to the RangeTblEntry; else return NULL.
+ *	  See if any RangeTblEntry could possibly provide the given column name (or
+ *	  find the best match available).  Returns state with relevant details.
  *
  * This is different from colNameToVar in that it considers every entry in
  * the ParseState's rangetable(s), not only those that are currently visible
@@ -677,11 +788,31 @@ colNameToVar(ParseState *pstate, char *colname, bool localonly,
  * and it may give ambiguous results (there might be multiple equally valid
  * matches, but only one will be returned).  This must be used ONLY as a
  * heuristic in giving suitable error messages.  See errorMissingColumn.
+ *
+ * This function is also different in that it will consider approximate
+ * matches -- if the user entered an alias/column pair that is only slightly
+ * different from a valid pair, we may be able to infer what they meant to
+ * type and provide a reasonable hint.
+ *
+ * The FuzzyAttrMatchState will have 'rfirst' pointing to the best RTE
+ * containing the most promising match for the alias and column name.  If
+ * the alias and column names match exactly, 'first' will be InvalidAttrNumber;
+ * otherwise, it will be the attribute number for the match.  In the latter
+ * case, 'rsecond' may point to a second, equally close approximate match,
+ * and 'second' will contain the attribute number for the second match.
  */
-static RangeTblEntry *
-searchRangeTableForCol(ParseState *pstate, char *colname, int location)
+static FuzzyAttrMatchState *
+searchRangeTableForCol(ParseState *pstate, const char *alias, char *colname,
+					   int location)
 {
 	ParseState *orig_pstate = pstate;
+	FuzzyAttrMatchState *fuzzystate = palloc(sizeof(FuzzyAttrMatchState));
+
+	fuzzystate->distance = MAX_FUZZY_DISTANCE + 1;
+	fuzzystate->rfirst = NULL;
+	fuzzystate->rsecond = NULL;
+	fuzzystate->first = InvalidAttrNumber;
+	fuzzystate->second = InvalidAttrNumber;
 
 	while (pstate != NULL)
 	{
@@ -689,15 +820,51 @@ searchRangeTableForCol(ParseState *pstate, char *colname, int location)
 
 		foreach(l, pstate->p_rtable)
 		{
-			RangeTblEntry *rte = (RangeTblEntry *) lfirst(l);
+			RangeTblEntry	   *rte = (RangeTblEntry *) lfirst(l);
+			int					fuzzy_rte_penalty = 0;
 
-			if (scanRTEForColumn(orig_pstate, rte, colname, location))
-				return rte;
+			/*
+			 * Typically, it is not useful to look for matches within join
+			 * RTEs; they effectively duplicate other RTEs for our purposes,
+			 * and if a match is chosen from a join RTE, an unhelpful alias is
+			 * displayed in the final diagnostic message.
+			 */
+			if (rte->rtekind == RTE_JOIN)
+				continue;
+
+			/*
+			 * If the user didn't specify an alias, then matches against one
+			 * RTE are as good as another.  But if the user did specify an
+			 * alias, then we want at least a fuzzy - and preferably an exact
+			 * - match for the range table entry.
+			 */
+			if (alias != NULL)
+				fuzzy_rte_penalty =
+					varstr_levenshtein(alias, strlen(alias),
+									   rte->eref->aliasname,
+									   strlen(rte->eref->aliasname),
+									   1, 1, 1);
+
+			/*
+			 * Scan for a matching column; if we find an exact match, we're
+			 * done.  Otherwise, update fuzzystate.
+			 */
+			if (scanRTEForColumn(orig_pstate, rte, colname, location,
+								 fuzzy_rte_penalty, fuzzystate)
+					&& fuzzy_rte_penalty == 0)
+			{
+				fuzzystate->rfirst = rte;
+				fuzzystate->first = InvalidAttrNumber;
+				fuzzystate->rsecond = NULL;
+				fuzzystate->second = InvalidAttrNumber;
+				return fuzzystate;
+			}
 		}
 
 		pstate = pstate->parentParseState;
 	}
-	return NULL;
+
+	return fuzzystate;
 }
 
 /*
@@ -2860,34 +3027,67 @@ void
 errorMissingColumn(ParseState *pstate,
 				   char *relname, char *colname, int location)
 {
-	RangeTblEntry *rte;
+	FuzzyAttrMatchState	   *state;
+	char				   *closestfirst = NULL;
 
 	/*
-	 * If relname was given, just play dumb and report it.  (In practice, a
-	 * bad qualification name should end up at errorMissingRTE, not here, so
-	 * no need to work hard on this case.)
+	 * Search the entire rtable looking for possible matches.  If we find one,
+	 * emit a hint about it.
+	 *
+	 * TODO: improve this code (and also errorMissingRTE) to mention using
+	 * LATERAL if appropriate.
 	 */
-	if (relname)
-		ereport(ERROR,
-				(errcode(ERRCODE_UNDEFINED_COLUMN),
-				 errmsg("column %s.%s does not exist", relname, colname),
-				 parser_errposition(pstate, location)));
+	state = searchRangeTableForCol(pstate, relname, colname, location);
 
 	/*
-	 * Otherwise, search the entire rtable looking for possible matches.  If
-	 * we find one, emit a hint about it.
+	 * Extract closest col string for best match, if any.
 	 *
-	 * TODO: improve this code (and also errorMissingRTE) to mention using
-	 * LATERAL if appropriate.
+	 * Infer an exact match referenced despite not being visible from the fact
+	 * that an attribute number was not present in state passed back -- this is
+	 * what is reported when !closestfirst.  There might also be an exact match
+	 * that was qualified with an incorrect alias, in which case closestfirst
+	 * will be set (so hint is the same as generic fuzzy case).
 	 */
-	rte = searchRangeTableForCol(pstate, colname, location);
-
-	ereport(ERROR,
-			(errcode(ERRCODE_UNDEFINED_COLUMN),
-			 errmsg("column \"%s\" does not exist", colname),
-			 rte ? errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
-						   colname, rte->eref->aliasname) : 0,
-			 parser_errposition(pstate, location)));
+	if (state->rfirst && AttributeNumberIsValid(state->first))
+		closestfirst = strVal(list_nth(state->rfirst->eref->colnames,
+									   state->first - 1));
+
+	if (!state->rsecond)
+	{
+		/*
+		 * Handle case where there is at most one column suggestion to hint,
+		 * including exact matches referenced but not visible.
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_COLUMN),
+				 relname ?
+				 errmsg("column %s.%s does not exist", relname, colname):
+				 errmsg("column \"%s\" does not exist", colname),
+				 state->rfirst ? closestfirst ?
+				 errhint("Perhaps you meant to reference the column \"%s\".\"%s\".",
+						 state->rfirst->eref->aliasname, closestfirst):
+				 errhint("There is a column named \"%s\" in table \"%s\", but it cannot be referenced from this part of the query.",
+						 colname, state->rfirst->eref->aliasname): 0,
+				 parser_errposition(pstate, location)));
+	}
+	else
+	{
+		/* Handle case where there are two equally useful column hints */
+		char				   *closestsecond;
+
+		closestsecond = strVal(list_nth(state->rsecond->eref->colnames,
+										state->second - 1));
+
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_COLUMN),
+				 relname ?
+				 errmsg("column %s.%s does not exist", relname, colname):
+				 errmsg("column \"%s\" does not exist", colname),
+				 errhint("Perhaps you meant to reference the column \"%s\".\"%s\" or the column \"%s\".\"%s\".",
+						 state->rfirst->eref->aliasname, closestfirst,
+						 state->rsecond->eref->aliasname, closestsecond),
+				 parser_errposition(pstate, location)));
+	}
 }
 
 
diff --git a/src/backend/utils/adt/levenshtein.c b/src/backend/utils/adt/levenshtein.c
index 3669adc..f6e2ca6 100644
--- a/src/backend/utils/adt/levenshtein.c
+++ b/src/backend/utils/adt/levenshtein.c
@@ -95,6 +95,15 @@ varstr_levenshtein(const char *source, int slen, const char *target, int tlen,
 #define STOP_COLUMN m
 #endif
 
+	/*
+	 * A common use for Levenshtein distance is to match attributes when building
+	 * diagnostic, user-visible messages.  Restrict the size of
+	 * MAX_LEVENSHTEIN_STRLEN at compile time so that this is guaranteed to
+	 * work.
+	 */
+	StaticAssertStmt(NAMEDATALEN <= MAX_LEVENSHTEIN_STRLEN,
+					 "Levenshtein hinting mechanism restricts NAMEDATALEN");
+
 	m = pg_mbstrlen_with_len(source, slen);
 	n = pg_mbstrlen_with_len(target, tlen);
 
diff --git a/src/include/parser/parse_relation.h b/src/include/parser/parse_relation.h
index c886335..b2f804a 100644
--- a/src/include/parser/parse_relation.h
+++ b/src/include/parser/parse_relation.h
@@ -16,6 +16,24 @@
 
 #include "parser/parse_node.h"
 
+
+/*
+ * Support for fuzzily matching columns.
+ *
+ * This is for building diagnostic messages, where non-exact matching
+ * attributes are suggested to the user.  The struct's fields may be facets of
+ * a particular RTE, or of an entire range table, depending on context.
+ */
+typedef struct
+{
+	int				distance;	/* Weighted distance (lowest so far) */
+	RangeTblEntry  *rfirst;		/* RTE of first */
+	AttrNumber		first;		/* Closest attribute so far */
+	RangeTblEntry  *rsecond;	/* RTE of second */
+	AttrNumber		second;		/* Second closest attribute so far */
+} FuzzyAttrMatchState;
+
+
 extern RangeTblEntry *refnameRangeTblEntry(ParseState *pstate,
 					 const char *schemaname,
 					 const char *refname,
@@ -35,7 +53,8 @@ extern RangeTblEntry *GetRTEByRangeTablePosn(ParseState *pstate,
 extern CommonTableExpr *GetCTEForRTE(ParseState *pstate, RangeTblEntry *rte,
 			 int rtelevelsup);
 extern Node *scanRTEForColumn(ParseState *pstate, RangeTblEntry *rte,
-				 char *colname, int location);
+				 char *colname, int location,
+				 int fuzzy_rte_penalty, FuzzyAttrMatchState *fuzzystate);
 extern Node *colNameToVar(ParseState *pstate, char *colname, bool localonly,
 			 int location);
 extern void markVarForSelectPriv(ParseState *pstate, Var *var,
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index d233710..51db1b6 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -536,6 +536,7 @@ create table atacc1 ( test int );
 -- add a check constraint (fails)
 alter table atacc1 add constraint atacc_test1 check (test1>3);
 ERROR:  column "test1" does not exist
+HINT:  Perhaps you meant to reference the column "atacc1"."test".
 drop table atacc1;
 -- something a little more complicated
 create table atacc1 ( test int, test2 int, test3 int);
@@ -1342,6 +1343,7 @@ select f1 from c1;
 ERROR:  column "f1" does not exist
 LINE 1: select f1 from c1;
                ^
+HINT:  Perhaps you meant to reference the column "c1"."f2".
 drop table p1 cascade;
 NOTICE:  drop cascades to table c1
 create table p1 (f1 int, f2 int);
@@ -1355,6 +1357,7 @@ select f1 from c1;
 ERROR:  column "f1" does not exist
 LINE 1: select f1 from c1;
                ^
+HINT:  Perhaps you meant to reference the column "c1"."f2".
 drop table p1 cascade;
 NOTICE:  drop cascades to table c1
 create table p1 (f1 int, f2 int);
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index ca3a17b..1e3fe07 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2222,6 +2222,12 @@ select * from t1 left join t2 on (t1.a = t2.a);
  200 | 1000 | 200 | 2001
 (5 rows)
 
+-- Test matching of column name with wrong alias
+select t1.x from t1 join t3 on (t1.a = t3.x);
+ERROR:  column t1.x does not exist
+LINE 1: select t1.x from t1 join t3 on (t1.a = t3.x);
+               ^
+HINT:  Perhaps you meant to reference the column "t3"."x".
 --
 -- regression test for 8.1 merge right join bug
 --
@@ -3434,6 +3440,38 @@ select * from
 (0 rows)
 
 --
+-- Test hints given on incorrect column references are useful
+--
+select t1.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestipn
+ERROR:  column t1.uunique1 does not exist
+LINE 1: select t1.uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t1"."unique1".
+select t2.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+ERROR:  column t2.uunique1 does not exist
+LINE 1: select t2.uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t2"."unique1".
+select uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+ERROR:  column "uunique1" does not exist
+LINE 1: select uunique1 from
+               ^
+HINT:  Perhaps you meant to reference the column "t1"."unique1" or the column "t2"."unique1".
+--
+-- Take care to reference the correct RTE
+--
+select atts.relid::regclass, s.* from pg_stats s join
+    pg_attribute a on s.attname = a.attname and s.tablename =
+    a.attrelid::regclass::text join (select unnest(indkey) attnum,
+    indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+    schemaname != 'pg_catalog';
+ERROR:  column atts.relid does not exist
+LINE 1: select atts.relid::regclass, s.* from pg_stats s join
+               ^
+--
 -- Test LATERAL
 --
 select unique2, x.*
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 6005476..7a08bdf 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -397,6 +397,10 @@ insert into t2a values (200, 2001);
 
 select * from t1 left join t2 on (t1.a = t2.a);
 
+-- Test matching of column name with wrong alias
+
+select t1.x from t1 join t3 on (t1.a = t3.x);
+
 --
 -- regression test for 8.1 merge right join bug
 --
@@ -1060,6 +1064,26 @@ select * from
   int8_tbl x join (int4_tbl x cross join int4_tbl y(ff)) j on q1 = f1; -- ok
 
 --
+-- Test hints given on incorrect column references are useful
+--
+
+select t1.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t1" suggestipn
+select t2.uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, prefer "t2" suggestion
+select uunique1 from
+  tenk1 t1 join tenk2 t2 on t1.two = t2.two; -- error, suggest both at once
+
+--
+-- Take care to reference the correct RTE
+--
+
+select atts.relid::regclass, s.* from pg_stats s join
+    pg_attribute a on s.attname = a.attname and s.tablename =
+    a.attrelid::regclass::text join (select unnest(indkey) attnum,
+    indexrelid from pg_index i) atts on atts.attnum = a.attnum where
+    schemaname != 'pg_catalog';
+--
 -- Test LATERAL
 --
 
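
The tie handling in updateFuzzyAttrMatchState() in the patch above can be sketched in isolation roughly as follows. This is a simplified illustration, not the committed code: MatchState, match_state_init() and match_state_offer() are hypothetical names, and candidates are represented by plain strings rather than by an RTE pointer plus attribute number.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FUZZY_DISTANCE 3

typedef struct
{
	int			distance;	/* lowest combined distance seen so far */
	const char *first;		/* best match so far, or NULL */
	const char *second;		/* equally good second match, or NULL */
} MatchState;

static void
match_state_init(MatchState *st)
{
	assert(st != NULL);
	/* Start just above the cap so only distances <= cap can be recorded. */
	st->distance = MAX_FUZZY_DISTANCE + 1;
	st->first = NULL;
	st->second = NULL;
}

static void
match_state_offer(MatchState *st, const char *name, int distance)
{
	assert(st != NULL);

	if (distance < st->distance)
	{
		/* Strictly better: forget any earlier candidates. */
		st->distance = distance;
		st->first = name;
		st->second = NULL;
	}
	else if (distance == st->distance)
	{
		if (st->second != NULL)
		{
			/* Third tie: suggest nothing and demand a strictly closer match. */
			st->first = NULL;
			st->second = NULL;
			st->distance = distance - 1;
		}
		else if (st->first != NULL)
			st->second = name;	/* provisional second match */
		else if (st->distance <= MAX_FUZZY_DISTANCE)
			st->first = name;	/* earlier bar was lowered after a 3-way tie */
	}
}
```

Offering two candidates at distance 1 leaves both recorded, which corresponds to the "or the column" form of the hint; offering a third candidate at the same distance clears both and lowers the bar to 0, so no hint is produced unless something strictly closer turns up.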
#146Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#145)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Mar 10, 2015 at 11:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> I'm prepared to commit this version if nobody finds a problem with it.
> It passes the additional regression tests you wrote.

Looks good to me. Thanks.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#147Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#146)
Re: Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Mar 10, 2015 at 4:03 PM, Peter Geoghegan <pg@heroku.com> wrote:

> On Tue, Mar 10, 2015 at 11:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> I'm prepared to commit this version if nobody finds a problem with it.
>> It passes the additional regression tests you wrote.
>
> Looks good to me. Thanks.

OK, committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company