Copypasta in the PostgreSQL source

Started by David Fetterover 7 years ago7 messageshackers

david@fetter.org

over 7 years ago

Folks,

Please find attached a run of a tool that looks for duplicated tokens.
I've removed some things that seem like false positives, basically all
from the stemmer part of the source, but there's still a lot.

- Is it worth trying to track these down and factor them out?
- Would it make sense to make some kind of git commit trigger that at
least warns when a new one has been introduced?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Tom Lane

tgl@sss.pgh.pa.us

over 7 years ago

In reply to: David Fetter (#1)

Re: Copypasta in the PostgreSQL source

David Fetter <david@fetter.org> writes:

Please find attached a run of a tool that looks for duplicated tokens.
I've removed some things that seem like false positives, basically all
from the stemmer part of the source, but there's still a lot.

I thought you were talking about problems like "that that" typos,
but on looking at the file, what this is actually complaining
about is any duplicated code segments anywhere. I do not find
this helpful. Refactoring to the point that dozen-line code
stanzas never appear more than once would be incredibly invasive,
likely very bad for performance, and I don't think it'd improve
readability either.

- Would it make sense to make some kind of git commit trigger that at
least warns when a new one has been introduced?

Commit triggers are NOT the place for heuristics about code quality,
even if they're well-considered heuristics.

regards, tom lane

Alvaro Herrera

alvherre@2ndquadrant.com

over 7 years ago

In reply to: Tom Lane (#2)

Re: Copypasta in the PostgreSQL source

On 2018-Dec-17, Tom Lane wrote:

David Fetter <david@fetter.org> writes:

Please find attached a run of a tool that looks for duplicated tokens.
I've removed some things that seem like false positives, basically all
from the stemmer part of the source, but there's still a lot.

I thought you were talking about problems like "that that" typos,
but on looking at the file, what this is actually complaining
about is any duplicated code segments anywhere. I do not find
this helpful. Refactoring to the point that dozen-line code
stanzas never appear more than once would be incredibly invasive,
likely very bad for performance, and I don't think it'd improve
readability either.

Agreed. Skimming the report, we get some very silly duplications, such
as a function definition duplicating its prototype.

I agree that this is mostly unhelpful noise and we shouldn't spend too
much time on it.

On the other hand, I'm not clear on why do we need four copies of
number_of_ones.

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Tom Lane

tgl@sss.pgh.pa.us

over 7 years ago

In reply to: Alvaro Herrera (#3)

Re: Copypasta in the PostgreSQL source

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

On the other hand, I'm not clear on why do we need four copies of
number_of_ones.

Yeah, it might be time to move something like that into a common
location. Not sure where the threshold of pain is, though.

regards, tom lane

David Fetter

david@fetter.org

over 7 years ago

In reply to: Tom Lane (#2)

Re: Copypasta in the PostgreSQL source

On Mon, Dec 17, 2018 at 05:31:22PM -0500, Tom Lane wrote:

David Fetter <david@fetter.org> writes:

Please find attached a run of a tool that looks for duplicated
tokens. I've removed some things that seem like false positives,
basically all from the stemmer part of the source, but there's
still a lot.

I thought you were talking about problems like "that that" typos,
but on looking at the file, what this is actually complaining about
is any duplicated code segments anywhere. I do not find this
helpful. Refactoring to the point that dozen-line code stanzas
never appear more than once would be incredibly invasive, likely
very bad for performance, and I don't think it'd improve readability
either.

Is there a threshold, possibly much larger than a dozen lines, where
such refactoring would actually make sense?

- Would it make sense to make some kind of git commit trigger that
at least warns when a new one has been introduced?

Commit triggers are NOT the place for heuristics about code quality,
even if they're well-considered heuristics.

Where's a better place for such heuristics? I'd like to think that
there are opportunities for more automation than we currently have, on
that score. Maybe a `make` target?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Andres Freund

andres@anarazel.de

over 7 years ago

In reply to: David Fetter (#5)

Re: Copypasta in the PostgreSQL source

Hi,

On 2018-12-18 00:36:55 +0100, David Fetter wrote:

On Mon, Dec 17, 2018 at 05:31:22PM -0500, Tom Lane wrote:

David Fetter <david@fetter.org> writes:

Please find attached a run of a tool that looks for duplicated
tokens. I've removed some things that seem like false positives,
basically all from the stemmer part of the source, but there's
still a lot.

I thought you were talking about problems like "that that" typos,
but on looking at the file, what this is actually complaining about
is any duplicated code segments anywhere. I do not find this
helpful. Refactoring to the point that dozen-line code stanzas
never appear more than once would be incredibly invasive, likely
very bad for performance, and I don't think it'd improve readability
either.

Is there a threshold, possibly much larger than a dozen lines, where
such refactoring would actually make sense?

Sure, sometimes. But it seems unlikely that a report based on the
technology at hand here would be useful. If none of the code has changed
in recent times, it's not that usually worthy of a lot of attention to
adapt, unless we're talking much larger pieces of code. And a fair bit
of what's pointed out in that report is well known (e.g. that ecpg
duplicates a lot of code in a horrible manner). I think to be useful
it'd need to actually point out very similar code, that's starting to
diverge further - but that's obviously a much harder thing to achieve.

- Would it make sense to make some kind of git commit trigger that
at least warns when a new one has been introduced?

Commit triggers are NOT the place for heuristics about code quality,
even if they're well-considered heuristics.

Where's a better place for such heuristics? I'd like to think that
there are opportunities for more automation than we currently have, on
that score. Maybe a `make` target?

If we find useful targets, we can integrate them that way. I don't
consider this to be a worthy case though.

Greetings,

Andres Freund

Thomas Munro

thomas.munro@gmail.com

over 7 years ago

In reply to: Tom Lane (#4)

Re: Copypasta in the PostgreSQL source

On Tue, Dec 18, 2018 at 9:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

On the other hand, I'm not clear on why do we need four copies of
number_of_ones.

Yeah, it might be time to move something like that into a common
location. Not sure where the threshold of pain is, though.

Yeah. I think we need bitutils.c as discussed here:

/messages/by-id/CAEepm=3k++Ytf2LNQCvpP6m1=gY9zZHP_cfnn47=WTsoCrLCvA@mail.gmail.com

I sort of foundered on the the difficulty of using popcnt and
bitscan/log2 instructions/builtins without putting a runtime test and
a honking great function pointer in front of them. +1 for at least
deduplicating all these implementations as discussed in that thread.

--
Thomas Munro
http://www.enterprisedb.com

Copypasta in the PostgreSQL source

Attachments: