Greatest Common Divisor
I recently came across the need for a gcd function (actually I needed
lcm) and was surprised that we didn't have one.
So here one is, using the basic Euclidean algorithm. I made one for
smallint, integer, and bigint.
--
Vik Fearing +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Attachments:
gcd.0001.patch (text/x-patch; charset=UTF-8) +123 -0
Bonsoir Vik,
I recently came across the need for a gcd function (actually I needed
lcm) and was surprised that we didn't have one.
Why not.
So here one is, using the basic Euclidean algorithm. I made one for
smallint, integer, and bigint.
Should pg provide the LCM as well? Hmmm, probably not, too likely to
overflow.
Should there be a NUMERIC version as well? I'd say maybe yes.
I'm wondering what it should do on (N, 0) and (0, 0). Raise an error? Return
0? Return 1? Return N? There should be some logic and comments explaining
it.
I'd test with INT_MIN and INT_MAX.
Given that there is no overflow risk with the Euclidean descent, would
it make sense that the int2 version call the int4 implementation?
C modulo operator (%) is a pain because it is not positive remainder (2 %
-3 == -1 vs 2 % 3 == 2, AFAICR). It does not seem that fixing the sign
afterwards is the right thing to do. I'd rather turn both arguments
positive before doing the descent.
Which raises an issue with INT_MIN by the way, which has no positive counterpart:-(
Also, the usual approach is to exchange args so that the largest is first,
because there may be a software emulation if the hardware does not
implement modulo. At least it was the case with some sparc processors 25
years ago:-)
See for instance (the int min value is probably not well handled):
https://svn.cri.ensmp.fr/svn/linear/trunk/src/arithmetique/pgcd.c
Basically, welcome to arithmetic:-)
--
Fabien.
On 28/12/2019 19:15, Fabien COELHO wrote:
So here one is, using the basic Euclidean algorithm. I made one for
smallint, integer, and bigint.
Should pg provide the LCM as well? Hmmm, probably not, too likely to
overflow.
I decided against it for that reason.
Should there be a NUMERIC version as well? I'd say maybe yes.
I thought about that, too, but also decided against it for this patch.
I'm wondering what it should do on (N, 0) and (0, 0). Raise an error?
Return 0? Return 1? Return N? There should be some logic and comments
explaining it.
Well, gcd(N, 0) is N, and gcd(0, 0) is 0, so I don't see an issue here?
I'd test with INT_MIN and INT_MAX.
Okay, I'll add tests for those, instead of the pretty much random values
I have now.
Given that there is no overflow risk with the Euclidean descent, would
it make sense that the int2 version call the int4 implementation?
Meh.
C modulo operator (%) is a pain because it is not positive remainder
(2 % -3 == -1 vs 2 % 3 == 2, AFAICR).
This does not seem to be the case...
It does not seem that fixing the sign afterwards is the right thing to
do. I'd rather turn both arguments positive before doing the descent.
Why isn't it the right thing to do?
Which raises an issue with INT_MIN by the way, which has no positive:-(
That's an argument against abs-ing the input values. It's only an issue
with gcd(INT_MIN, INT_MIN) which currently returns INT_MIN. Any other
value with INT_MIN will be 1 or something lower than INT_MAX.
Also, the usual approach is to exchange args so that the largest is
first, because there may be a software emulation if the hardware does
not implement modulo. At least it was the case with some sparc
processors 25 years ago:-)
The args will exchange themselves.
Thanks for the review! Attached is a new patch that changes the
regression tests based on your comments (and another comment that I got
on irc to test gcd(b, a)).
Attachments:
gcd.0002.patch (text/x-patch; charset=UTF-8) +168 -0
Bonjour Vik,
Should there be a NUMERIC version as well? I'd say maybe yes.
I thought about that, too, but also decided against it for this patch.
Hmmm. ISTM that int functions are available for numeric?
I'm wondering what it should do on (N, 0) and (0, 0). Raise an error?
Return 0? Return 1? Return N? There should be some logic and comments
explaining it.
Well, gcd(N, 0) is N, and gcd(0, 0) is 0, so I don't see an issue here?
I think that there should be a comment.
I'd test with INT_MIN and INT_MAX.
Okay, I'll add tests for those, instead of the pretty much random values
I have now.
C modulo operator (%) is a pain because it is not positive remainder
(2 % -3 == -1 vs 2 % 3 == 2, AFAICR).
This does not seem to be the case...
Indeed, I tested quickly with Python, but it has yet another behavior, as
shown above. What a laugh!
So with C: 2 % -3 == 2, -2 % 3 == -2
Note that AFAICS there is no integer i so that 3 * i - (-2) == -2.
It does not seem that fixing the sign afterwards is the right thing to
do. I'd rather turn both arguments positive before doing the descent.
Why isn't it the right thing to do?
Because I do not trust C modulo as I had a lot of problems with it? :-)
If it works, but it should deserve a clear comment explaining why.
Which raises an issue with INT_MIN by the way, which has no positive:-(
That's an argument against abs-ing the input values. It's only an issue
with gcd(INT_MIN, INT_MIN) which currently returns INT_MIN.
That should be an error instead, because -INT_MIN cannot be represented?
Any other value with INT_MIN will be 1 or something lower than INT_MAX.
Looks ok.
Also, the usual approach is to exchange args so that the largest is
first, because there may be a software emulation if the hardware does
not implement modulo. At least it was the case with some sparc
processors 25 years ago:-)
The args will exchange themselves.
Yep, but after a possibly expensive software-emulated modulo operation?
--
Fabien.
On 29/12/2019 08:30, Fabien COELHO wrote:
I'm wondering what it should do on (N, 0) and (0, 0). Raise an error?
Return 0? Return 1? Return N? There should be some logic and comments
explaining it.
Well, gcd(N, 0) is N, and gcd(0, 0) is 0, so I don't see an issue here?
I think that there should be a comment.
Done.
It does not seem that fixing the sign afterwards is the right thing to
do. I'd rather turn both arguments positive before doing the descent.
Why isn't it the right thing to do?
Because I do not trust C modulo as I had a lot of problems with it? :-)
If it works, but it should deserve a clear comment explaining why.
Surely such a comment should be on the mod functions and not in this patch.
Which raises an issue with INT_MIN by the way, which has no positive:-(
That's an argument against abs-ing the input values. It's only an issue
with gcd(INT_MIN, INT_MIN) which currently returns INT_MIN.
That should be an error instead, because -INT_MIN cannot be represented?
Why should it error? Is INT_MIN not a valid divisor of INT_MIN? I
added a comment instead.
Also, the usual approach is to exchange args so that the largest is
first, because there may be a software emulation if the hardware does
not implement modulo. At least it was the case with some sparc
processors 25 years ago:-)
The args will exchange themselves.
Yep, but after a possibly expensive software-emulated modulo operation?
I'll just trust you on this. Swap added.
Thanks!
--
Vik
Attachments:
gcd.0003.patch (text/x-patch; charset=UTF-8) +207 -0
On 12/29/19 02:30, Fabien COELHO wrote:
C modulo operator (%) is a pain because it is not positive remainder
(2 % -3 == -1 vs 2 % 3 == 2, AFAICR).
This does not seem to be the case...
...
Because I do not trust C modulo as I had a lot of problems with it? :-)
If I recall correctly (and I'm traveling and away from those notes),
the exact semantics of C's % with negative operands was left
implementation-defined until, was it, C99 ?
So it might be ok to rely on the specified C99 behavior (whichever
behavior that is, he wrote, notelessly) for PG 12 and later, where
C99 is expected.
Regards,
-Chap
Hello,
Because I do not trust C modulo as I had a lot of problems with it?:-)
If I recall correctly (and I'm traveling and away from those notes),
the exact semantics of C's % with negative operands was left
implementation-defined until, was it, C99 ?
Indeed, my woes with C % started before that date:-)
By Googling the C99 spec, I found: "When integers are divided, the result
of the / operator is the algebraic quotient with any fractional part
discarded (aka truncation toward zero). If the quotient a/b is
representable, the expression (a/b)*b + a%b shall equal a."
Let a = 2 and b = -3, then a/b == 0 (-0.666 truncated toward zero), then
(a/b)*b + a%b == a
=> 0 * -3 + (2 % -3) == 2
=> 2 % -3 == 2
Then with a = -2, b = 3, then a/b == 0 (same as above), and the same
reasoning leads to
-2 % 3 == -2
Which is indeed what was produced with C, but not with Python.
The good news is that the absolute value of the modulo is the modulo in
the usual sense, which is what is needed for the Euclidean descent and
allows fixing the sign afterwards, as Vik was doing.
So it might be ok to rely on the specified C99 behavior (whichever
behavior that is, he wrote, notelessly) for PG 12 and later, where
C99 is expected.
Yep, probably with a comment.
--
Fabien.
Out of curiosity, what was the original use-case for this?
I'm not objecting to adding it, I'm just curious. In fact, I think
that if we do add this, then we should probably add lcm() at the same
time, since handling its overflow cases is sufficiently non-trivial to
justify not requiring users to have to implement it themselves.
I don't like the INT_MIN handling though:
select gcd(-2147483648,0);
gcd
-------------
-2147483648
(1 row)
select gcd(-2147483648,-2147483648);
gcd
-------------
-2147483648
(1 row)
Normally gcd() returns a positive integer, and gcd(a,0) = gcd(a,a) =
abs(a). But since abs(INT_MIN) cannot be represented as a 32-bit
integer, both those cases should throw an integer-out-of-range error.
In addition, the following case should produce 1, but for me it
produces an error. This is actually going to be platform-dependent as
it is currently implemented (see the comments in int4div and int4mod):
select gcd(-2147483648,-1);
ERROR: floating-point exception
DETAIL: An invalid floating-point operation was signaled. This
probably means an out-of-range result or an invalid operation, such as
division by zero.
so there needs to be some special-case INT_MIN handling at the start
to deal with these cases.
AFAIK, what gcd(0,0) should be is not well defined, but most common
implementations seem to return 0 (I checked Matlab, python and Java's
standard libraries). This seems like a reasonable extrapolation of the
rule gcd(a,0) = gcd(0,a) = gcd(a,a) = abs(a), so I don't have a
problem with doing the same here, but I think that it should be
documented (e.g., see https://www.mathworks.com/help/matlab/ref/gcd.html), if for no other reason than users might
expect it to be safe to divide by the result.
Regards,
Dean
Greetings,
* Dean Rasheed (dean.a.rasheed@gmail.com) wrote:
I'm not objecting to adding it, I'm just curious. In fact, I think
that if we do add this, then we should probably add lcm() at the same
time, since handling its overflow cases is sufficiently non-trivial to
justify not requiring users to have to implement it themselves.
I tend to agree with this.
Thanks,
Stephen
Stephen Frost <sfrost@snowman.net> writes:
* Dean Rasheed (dean.a.rasheed@gmail.com) wrote:
I'm not objecting to adding it, I'm just curious. In fact, I think
that if we do add this, then we should probably add lcm() at the same
time, since handling its overflow cases is sufficiently non-trivial to
justify not requiring users to have to implement it themselves.
I tend to agree with this.
Does this impact the decision about whether we need a variant for
numeric? I was leaning against that, primarily because (a)
it'd introduce a set of questions about what to do with non-integral
inputs, and (b) it'd make the patch quite a lot larger, I imagine.
But a variant of lcm() that returns numeric would have much more
resistance to overflow.
Maybe we could just define "lcm(bigint, bigint) returns numeric"
and figure that that covers all cases, but it feels slightly
weird. You couldn't do lcm(lcm(a,b),c) without casting.
I guess that particular use-case could be addressed with
"lcm(variadic bigint[]) returns numeric", but that's getting
really odd.
regards, tom lane
On Sat, Dec 28, 2019 at 12:15 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
Bonsoir Vik,
I recently came across the need for a gcd function (actually I needed
lcm) and was surprised that we didn't have one.
Why not.
Proliferation of code in the public namespace; it can displace code
that is written by others during the upgrade.
merlin
Normally gcd() returns a positive integer, and gcd(a,0) = gcd(a,a) =
abs(a). But since abs(INT_MIN) cannot be represented as a 32-bit
integer, both those cases should throw an integer-out-of-range error.
I'm also in favor of that option, rather than returning a negative
result.
About lcm(a, b): a / gcd(a, b) * b, at least if a & b are positive. If
not, some thoughts are needed:-)
Returning a NUMERIC as suggested by Tom would solve the overflow problem
by sending it back to the user who has to cast. This looks ok to me.
Maybe we could provide "int4 lcm(int2, int2)", "int8 lcm(int4, int4)", as
ISTM that there cannot be overflows on those (e.g. for the latter: lcm <=
a*b, a and b fit in 31 unsigned bits, 62 bits are needed, 63 are available).
--
Fabien.
On 2020-01-02 15:50, Dean Rasheed wrote:
Out of curiosity, what was the original use-case for this?
Yeah, I'm wondering, is this useful for any typical analytics or
business application? Otherwise, abstract algebra functionality seems a
bit out of scope.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
On 2020-01-02 15:50, Dean Rasheed wrote:
Out of curiosity, what was the original use-case for this?
Yeah, I'm wondering, is this useful for any typical analytics or
business application? Otherwise, abstract algebra functionality seems a
bit out of scope.
Nobody complained when we added sinh, cosh, tanh, asinh, acosh, atanh
last year, so I'm feeling skeptical of claims that gcd should be out
of scope.
Now, those functions were just exposing libc functionality, so there
wasn't a lot of code to write. There might be a good argument that
gcd isn't useful enough to justify the amount of code we'd have to
add (especially if we allow it to scope-creep into needing to deal
with "numeric" calculations). But I'm not on board with just
dismissing it as uninteresting.
regards, tom lane
On Fri, Jan 3, 2020 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Now, those functions were just exposing libc functionality, so there
wasn't a lot of code to write. There might be a good argument that
gcd isn't useful enough to justify the amount of code we'd have to
add (especially if we allow it to scope-creep into needing to deal
with "numeric" calculations). But I'm not on board with just
dismissing it as uninteresting.
Yeah. There's always the question with things like this as to whether
we ought to push certain things into contrib modules that are not
installed by default to avoid bloating the set of things built into
the core server. But it's hard to know where to draw the line. There's
no objective answer to the question of whether gcd() or sinh() is more
useful to have in core; each is more useful to people who need that
one but not the other, and trying to guess whether more or fewer
people need gcd() than sinh() seems like a fool's errand. Perhaps in
retrospect we would be better off having a 'math' extension where a
lot of this stuff lives, and people who want that extension can
install it and others need not bother. But, to try to create that now
and move things there would break upgrades for an exceedingly marginal
benefit. I don't really like the namespace pollution that comes with
accepting feature requests like this, but it's hard to argue that it's
a serious show-stopper or that the cure is any less bad than the
disease. And I'm sure that I'd be much more likely to use gcd() or
lcm() in a query than tanh()...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2020-Jan-03, Robert Haas wrote:
On Fri, Jan 3, 2020 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Now, those functions were just exposing libc functionality, so there
wasn't a lot of code to write. There might be a good argument that
gcd isn't useful enough to justify the amount of code we'd have to
add (especially if we allow it to scope-creep into needing to deal
with "numeric" calculations). But I'm not on board with just
dismissing it as uninteresting.
Yeah. There's always the question with things like this as to whether
we ought to push certain things into contrib modules that are not
installed by default to avoid bloating the set of things built into
the core server. But it's hard to know where to draw the line. There's
no objective answer to the question of whether gcd() or sinh() is more
useful to have in core;
The SQL standard's feature T622 requires trigonometric functions, while
it doesn't list gcd() or anything of the sort, so there's that.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jan 3, 2020 at 10:24 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jan 3, 2020 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Now, those functions were just exposing libc functionality, so there
wasn't a lot of code to write. There might be a good argument that
gcd isn't useful enough to justify the amount of code we'd have to
add (especially if we allow it to scope-creep into needing to deal
with "numeric" calculations). But I'm not on board with just
dismissing it as uninteresting.
Yeah. There's always the question with things like this as to whether
we ought to push certain things into contrib modules that are not
installed by default to avoid bloating the set of things built into
the core server. But it's hard to know where to draw the line.
Just stop doing it. It's very little extra work to package an item
into an extension and this protects your hapless users who might have
implemented a function called gcd() that does something different.
Ideally, the public namespace should contain (by default) only SQL
standard functions with all non-standard material in an appropriate
extension. Already released material is obviously problematic and
needs more thought but we ought to at least stop making the problem
worse if possible.
merlin
On 02/01/2020 16:12, Tom Lane wrote:
Stephen Frost <sfrost@snowman.net> writes:
* Dean Rasheed (dean.a.rasheed@gmail.com) wrote:
I'm not objecting to adding it, I'm just curious. In fact, I think
that if we do add this, then we should probably add lcm() at the same
time, since handling its overflow cases is sufficiently non-trivial to
justify not requiring users to have to implement it themselves.
I tend to agree with this.
Does this impact the decision about whether we need a variant for
numeric? I was leaning against that, primarily because (a)
it'd introduce a set of questions about what to do with non-integral
inputs, and (b) it'd make the patch quite a lot larger, I imagine.
But a variant of lcm() that returns numeric would have much more
resistance to overflow.
Maybe we could just define "lcm(bigint, bigint) returns numeric"
and figure that that covers all cases, but it feels slightly
weird. You couldn't do lcm(lcm(a,b),c) without casting.
I guess that particular use-case could be addressed with
"lcm(variadic bigint[]) returns numeric", but that's getting
really odd.
Okay. Here is a version that should handle everyone's comments.
gcd() is now strictly positive, so INT_MIN is no longer a valid result.
I added an lcm() function. It returns the same type as its arguments so
overflow is possible. I made this choice because int2mul returns int2
(and same for its friends). One can just cast to a bigger integer if
needed.
Because of that, I added a version of gcd() and lcm() for numeric. This
was my first time working with numeric so reviewers should pay extra
attention there, please.
Patch attached.
--
Vik Fearing
Attachments:
gcd.0004.patch (text/x-patch; charset=UTF-8) +800 -0
On Fri, Jan 3, 2020 at 1:10 PM Merlin Moncure <mmoncure@gmail.com> wrote:
Just stop doing it. It's very little extra work to package an item
into an extension and this protects your hapless users who might have
implemented a function called gcd() that does something different.
Ideally, the public namespace should contain (by default) only SQL
standard functions with all non-standard material in an appropriate
extension. Already released material is obviously problematic and
needs more thought but we ought to at least stop making the problem
worse if possible.
There are counter-arguments to that, though. Maintaining a lot of
extensions with only one or two functions in them is a nuisance.
Having things installed by default is convenient for wanting to use
them. Maintaining contrib code so that it works whether or not the SQL
definitions have been updated via ALTER EXTENSION .. UPDATE takes some
work and thought, and sometimes we screw it up.
I don't find any position on this topic to be without merit.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 1/3/20 1:46 PM, Robert Haas wrote:
On Fri, Jan 3, 2020 at 1:10 PM Merlin Moncure <mmoncure@gmail.com> wrote:
Just stop doing it. It's very little extra work to package an item
into an extension and this protects your hapless users who might have
implemented a function called gcd() that does something different.
...There are counter-arguments to that, though. Maintaining a lot of
extensions with only one or two functions in them is a nuisance.
Having things installed by default is convenient for wanting to use
them. Maintaining contrib code so that it works whether or not the SQL
definitions have been updated via ALTER EXTENSION .. UPDATE takes some
work and thought, and sometimes we screw it up.
Is there a middle ground staring us in the face, where certain things
could be added in core, but in a new schema like pg_math (pg_ !), so
if you want them you put them on your search path or qualify them
explicitly, and if you don't, you don't?
Regards,
-Chap