Re: Permute underscore separated components of columns before fuzzy matching

Started by Arne Roland over 2 years ago. 6 messages. List: hackers
#1 Arne Roland
arne.roland@malkut.net

Hi!

Mikhail Gribkov <youzhick(at)gmail(dot)com> writes:

Honestly I'm not entirely sure fixing only two switched words is worth the
effort, but the declared goal is clearly achieved.

I think the patch is good to go, although you need to fix code formatting.

I took a brief look at this.  I concur that we shouldn't need to be
hugely concerned about the speed of this code path.  However, we *do*
need to be concerned about its maintainability, and I think the patch
falls down badly there: it adds a chunk of very opaque and essentially
undocumented code, that people will need to reverse-engineer anytime
they are studying this function.  That could be alleviated perhaps
with more work on comments, but I have to wonder whether it's worth
carrying this logic at all.  It's a rather strange behavior to add,
and I wonder if many users will want it.

I encounter this problem all the time. I don't know whether my clients
are representative, but I see it whenever developers show me their
code base.
It's an issue for column names and table names alike. I have personally
spent hours watching developers try various permutations.
They rarely request this feature. Usually they are too embarrassed about
not knowing their object names to request anything at that point.
But I want the database I support to be gentle and helpful to
the user under these circumstances.

Regarding complexity: I think the permutation matrix is the part that is
easiest to get wrong. I had an off-by-one bug when I first wrote it down.
I tried to explain the conceptual approach better with a longer comment
than before.

                /*
                 * Only consider mirroring permutations, since the three
                 * simple rotations are already (or will be for a later
                 * underscore_current) covered above.
                 *
                 * The entries of the permutation matrix tell us where we
                 * should copy the three segments to.
                 * The zeroth dimension iterates over the permutations,
                 * while the first dimension iterates over the positions
                 * the three segments are permuted to.
                 * Considering the string A_B_C, the three segments are:
                 * - before the initial underscore sections (A)
                 * - between the underscore sections (B)
                 * - after the later underscore sections (C)
                 */
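
To make the comment above concrete, here is a minimal standalone sketch. It is not the code from the patch: it assumes exactly three non-empty segments separated by single underscores, and the names mirror_perms and permute_segments are hypothetical, introduced only for illustration. Each matrix entry is read here as "which source segment goes into this output slot"; for the three mirroring permutations (plain swaps) this gives the same result as reading the entry as a destination.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical permutation matrix (not taken from the patch).
 * Row = permutation, column = output slot, entry = index of the source
 * segment (0 = A, 1 = B, 2 = C) copied into that slot.  Only the three
 * mirroring permutations are listed; the rotations are assumed to be
 * handled elsewhere.
 */
static const int mirror_perms[3][3] = {
    {1, 0, 2},                  /* B_A_C */
    {0, 2, 1},                  /* A_C_B */
    {2, 1, 0}                   /* C_B_A */
};

/*
 * Write the permutation "perm" of the three segments of "src" into "dst".
 * u1 and u2 are the offsets of the two underscores in src (u1 < u2).
 * dst must have room for strlen(src) + 1 bytes.
 */
static void
permute_segments(const char *src, size_t u1, size_t u2,
                 const int perm[3], char *dst)
{
    size_t      len = strlen(src);
    size_t      start[3];
    size_t      seglen[3];
    size_t      out = 0;
    int         i;

    start[0] = 0;
    seglen[0] = u1;
    start[1] = u1 + 1;
    seglen[1] = u2 - u1 - 1;
    start[2] = u2 + 1;
    seglen[2] = len - u2 - 1;

    for (i = 0; i < 3; i++)
    {
        int         s = perm[i];

        memcpy(dst + out, src + start[s], seglen[s]);
        out += seglen[s];
        if (i < 2)
            dst[out++] = '_';   /* re-insert the separator */
    }
    dst[out] = '\0';
}
```

For "foo_bar_baz" the underscores sit at offsets 3 and 7, so calling permute_segments with mirror_perms[0] yields "bar_foo_baz". Keeping the permutations in a table rather than in hand-written copy logic is the design choice the comment describes; the table is also the place where an indexing slip is easiest to make.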

If anything is still unclear or confusing, I'd appreciate feedback about
what exactly.
I can't promise to be helpful if something breaks. But I had
practically forgotten how I did it, and I still found it easy to extend
as described below. It would have been embarrassing otherwise. Still,
this gives me hope that others will be able to do the same.
I certainly want the code to be simple, with no need to reverse-engineer
anything. Please let me know if any hard-to-understand bits
are left.

One thing that struck me is that no care is being taken for adjacent
underscores (that is, "foo__bar" and similar cases).  It seems
unlikely that treating the zero-length substring between the
underscores as a word to permute is helpful; moreover, it adds
an edge case that the string-moving logic could easily get wrong.
I wonder if the code should treat any number of consecutive
underscores as a single separator.  (Somewhat related: I think it
will behave oddly when the first or last character is '_', since the
outer loop ignores those positions.)

I wasn't sure how copying zero-length strings, i.e. doing nothing, could
cause any future bug. And I still don't see that.

There is one point I agree with: doing this seems rarely helpful. I
changed the code so that it treats sections as delimited by an arbitrary
number of underscores.
So it never permutes zero-length strings within the name. I also added
functionality to skip the zero-length cases if we encounter them
at the end of the string.
So afaict there should be no zero-length swaps left. Please let me know
whether this is more to your liking.
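
As an illustration of that behavior, the following is a minimal sketch under my own assumptions and not the patch's implementation. It shows one way to split a name into non-empty sections while treating any run of underscores as a single separator and ignoring leading and trailing runs. The type section and the function split_sections are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One section of a name: offset and length within the original string. */
typedef struct
{
    size_t      start;
    size_t      len;
} section;

/*
 * Split "name" into non-empty sections.  Any run of consecutive
 * underscores counts as a single separator, and leading or trailing
 * runs produce no section.  Stores at most max_sections entries in
 * "out" and returns the number stored.
 */
static int
split_sections(const char *name, section *out, int max_sections)
{
    size_t      i = 0;
    int         n = 0;

    while (name[i] != '\0' && n < max_sections)
    {
        size_t      begin;

        /* skip a (possibly empty) run of separators */
        while (name[i] == '_')
            i++;
        if (name[i] == '\0')
            break;

        /* consume one non-empty section */
        begin = i;
        while (name[i] != '\0' && name[i] != '_')
            i++;
        out[n].start = begin;
        out[n].len = i - begin;
        n++;
    }
    return n;
}
```

With this splitting, "foo__bar" and "_foo_bar__" both produce exactly two sections, so a later permutation step never sees a zero-length section.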

I also replaced the hard limit on underscores with more nuanced limits
on the number of permutations to try before giving up.

And it would be much more convenient to work with your patch if every next
version file will have a unique name (maybe something like "_v2", "_v3"
etc. suffixes)

Please.  It's very confusing when there are multiple identically-named
patches in a thread.

Sorry, I started with this, because I confused cf bot in the past about
whether the patches should be applied on top of each other or not.

For me the cf-bot logic is a bit opaque there. But you are right,
confusing patch readers is definitely worse. I'll try to do that. I hope
the attached format is better.

One question about pgindent: I struggled a bit with getting the right
version of bsd_indent. I found versions labeled 2.2.1 and 2.1.1, but
apparently we work with 2.1.2. Where can I get that?

Regards
Arne

Attachments:

0001-fuzzy_underscore_permutation_v3.patch (text/x-patch; charset=UTF-8; +150 -23)

#2 Peter Smith
smithpb2250@gmail.com
In reply to: Arne Roland (#1)

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
like there were CFbot test failures last time it was run [2]. Please
have a look and post an updated version if necessary.

======
[1]: https://commitfest.postgresql.org/46/4282/
[2]: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4282

Kind Regards,
Peter Smith.

#3 Arne Roland
arne.roland@malkut.net
In reply to: Peter Smith (#2)

Thank you for bringing that to my attention. Is there a way to subscribe
to cf-bot failures?

Apparently I confused myself with my naming. I attached a patch that
fixes the bug (at least in my cassert test-world run).

Regards
Arne

On 2024-01-22 06:38, Peter Smith wrote:

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
like there were CFbot test failures last time it was run [2]. Please
have a look and post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/4282/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4282

Kind Regards,
Peter Smith.

Attachments:

0001-fuzzy_underscore_permutation_v4.patch (text/x-patch; charset=UTF-8; +150 -23)

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Arne Roland (#3)

Arne Roland <arne.roland@malkut.net> writes:

Thank you for bringing that to my attention. Is there a way to subscribe
to cf-bot failures?

I don't know of any push notification support in cfbot, but you
can bookmark the page with your own active patches, and check it
periodically:

http://commitfest.cputube.org/arne-roland.html

(For others, click on your own name in the main cfbot page's entry for
one of your patches to find out how it spelled your name for this
purpose.)

regards, tom lane

#5 Arne Roland
arne.roland@malkut.net
In reply to: Tom Lane (#4)

Thank you! I wasn't aware of the filter per person. It was quite simple
to integrate a web scraper into my custom push system.

Regarding the patch: I ran the 2.1.1 version of pg_bsd_indent now. I
hope that suffices. I removed the matrix declaration to make it C90
compliant. I attached the result.

Regards
Arne

On 2024-01-22 19:22, Tom Lane wrote:

Arne Roland <arne.roland@malkut.net> writes:

Thank you for bringing that to my attention. Is there a way to subscribe
to cf-bot failures?

I don't know of any push notification support in cfbot, but you
can bookmark the page with your own active patches, and check it
periodically:

http://commitfest.cputube.org/arne-roland.html

(For others, click on your own name in the main cfbot page's entry for
one of your patches to find out how it spelled your name for this
purpose.)

regards, tom lane

Attachments:

0001-fuzzy_underscore_permutation_v5.patch (text/x-patch; charset=UTF-8; +220 -28)

#6 Andrey Borodin
amborodin@acm.org
In reply to: Arne Roland (#5)

On 23 Jan 2024, at 09:42, Arne Roland <arne.roland@malkut.net> wrote:

<0001-fuzzy_underscore_permutation_v5.patch>

Mikhail, there’s a new patch version. May I ask you to review it?

Best regards, Andrey Borodin.