Copypasta in the PostgreSQL source

Started by David Fetterabout 7 years ago7 messages
#1David Fetter
david@fetter.org
1 attachment(s)

Folks,

Please find attached a run of a tool that looks for duplicated tokens.
I've removed some things that seem like false positives, basically all
from the stemmer part of the source, but there's still a lot.

- Is it worth trying to track these down and factor them out?
- Would it make sense to make some kind of git commit trigger that at
least warns when a new one has been introduced?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachments:

copypasta.txt.gzapplication/gzipDownload
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Fetter (#1)
Re: Copypasta in the PostgreSQL source

David Fetter <david@fetter.org> writes:

Please find attached a run of a tool that looks for duplicated tokens.
I've removed some things that seem like false positives, basically all
from the stemmer part of the source, but there's still a lot.

I thought you were talking about problems like "that that" typos,
but on looking at the file, what this is actually complaining
about is any duplicated code segments anywhere. I do not find
this helpful. Refactoring to the point that dozen-line code
stanzas never appear more than once would be incredibly invasive,
likely very bad for performance, and I don't think it'd improve
readability either.

- Would it make sense to make some kind of git commit trigger that at
least warns when a new one has been introduced?

Commit triggers are NOT the place for heuristics about code quality,
even if they're well-considered heuristics.

regards, tom lane

#3Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: Copypasta in the PostgreSQL source

On 2018-Dec-17, Tom Lane wrote:

David Fetter <david@fetter.org> writes:

Please find attached a run of a tool that looks for duplicated tokens.
I've removed some things that seem like false positives, basically all
from the stemmer part of the source, but there's still a lot.

I thought you were talking about problems like "that that" typos,
but on looking at the file, what this is actually complaining
about is any duplicated code segments anywhere. I do not find
this helpful. Refactoring to the point that dozen-line code
stanzas never appear more than once would be incredibly invasive,
likely very bad for performance, and I don't think it'd improve
readability either.

Agreed. Skimming the report, we get some very silly duplications, such
as a function definition duplicating its prototype.

I agree that this is mostly unhelpful noise and we shouldn't spend too
much time on it.

On the other hand, I'm not clear on why do we need four copies of
number_of_ones.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#3)
Re: Copypasta in the PostgreSQL source

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

On the other hand, I'm not clear on why do we need four copies of
number_of_ones.

Yeah, it might be time to move something like that into a common
location. Not sure where the threshold of pain is, though.

regards, tom lane

#5David Fetter
david@fetter.org
In reply to: Tom Lane (#2)
Re: Copypasta in the PostgreSQL source

On Mon, Dec 17, 2018 at 05:31:22PM -0500, Tom Lane wrote:

David Fetter <david@fetter.org> writes:

Please find attached a run of a tool that looks for duplicated
tokens. I've removed some things that seem like false positives,
basically all from the stemmer part of the source, but there's
still a lot.

I thought you were talking about problems like "that that" typos,
but on looking at the file, what this is actually complaining about
is any duplicated code segments anywhere. I do not find this
helpful. Refactoring to the point that dozen-line code stanzas
never appear more than once would be incredibly invasive, likely
very bad for performance, and I don't think it'd improve readability
either.

Is there a threshold, possibly much larger than a dozen lines, where
such refactoring would actually make sense?

- Would it make sense to make some kind of git commit trigger that
at least warns when a new one has been introduced?

Commit triggers are NOT the place for heuristics about code quality,
even if they're well-considered heuristics.

Where's a better place for such heuristics? I'd like to think that
there are opportunities for more automation than we currently have, on
that score. Maybe a `make` target?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#6Andres Freund
andres@anarazel.de
In reply to: David Fetter (#5)
Re: Copypasta in the PostgreSQL source

Hi,

On 2018-12-18 00:36:55 +0100, David Fetter wrote:

On Mon, Dec 17, 2018 at 05:31:22PM -0500, Tom Lane wrote:

David Fetter <david@fetter.org> writes:

Please find attached a run of a tool that looks for duplicated
tokens. I've removed some things that seem like false positives,
basically all from the stemmer part of the source, but there's
still a lot.

I thought you were talking about problems like "that that" typos,
but on looking at the file, what this is actually complaining about
is any duplicated code segments anywhere. I do not find this
helpful. Refactoring to the point that dozen-line code stanzas
never appear more than once would be incredibly invasive, likely
very bad for performance, and I don't think it'd improve readability
either.

Is there a threshold, possibly much larger than a dozen lines, where
such refactoring would actually make sense?

Sure, sometimes. But it seems unlikely that a report based on the
technology at hand here would be useful. If none of the code has changed
in recent times, it's not that usually worthy of a lot of attention to
adapt, unless we're talking much larger pieces of code. And a fair bit
of what's pointed out in that report is well known (e.g. that ecpg
duplicates a lot of code in a horrible manner). I think to be useful
it'd need to actually point out very similar code, that's starting to
diverge further - but that's obviously a much harder thing to achieve.

- Would it make sense to make some kind of git commit trigger that
at least warns when a new one has been introduced?

Commit triggers are NOT the place for heuristics about code quality,
even if they're well-considered heuristics.

Where's a better place for such heuristics? I'd like to think that
there are opportunities for more automation than we currently have, on
that score. Maybe a `make` target?

If we find useful targets, we can integrate them that way. I don't
consider this to be a worthy case though.

Greetings,

Andres Freund

#7Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Tom Lane (#4)
Re: Copypasta in the PostgreSQL source

On Tue, Dec 18, 2018 at 9:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

On the other hand, I'm not clear on why do we need four copies of
number_of_ones.

Yeah, it might be time to move something like that into a common
location. Not sure where the threshold of pain is, though.

Yeah. I think we need bitutils.c as discussed here:

/messages/by-id/CAEepm=3k++Ytf2LNQCvpP6m1=gY9zZHP_cfnn47=WTsoCrLCvA@mail.gmail.com

I sort of foundered on the the difficulty of using popcnt and
bitscan/log2 instructions/builtins without putting a runtime test and
a honking great function pointer in front of them. +1 for at least
deduplicating all these implementations as discussed in that thread.

--
Thomas Munro
http://www.enterprisedb.com