Re: AW: type conversion discussion
Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:
I think the topmost numeric-type needs to be numeric, since it is the
only type with arbitrary scale and precision.
Thus I think we would need:
int2,int4,int8,float4,float8,numeric
No, this is wrong because it contradicts SQL92: float + numeric must
yield float, not numeric.
But the above is still not correct, in the sense that e.g. int8 cannot be
converted to float4 without loss. In that sense I don't think a single
upward promotion entry is sufficient.
An important component of the second proposal is that the actual data
conversion is done in one step if possible. We will *consider* using
float4 before we consider float8, but if we end up using float8 then
we try to do a direct whatever-to-float8 conversion. So as long as the
right set of conversion operators are available, there's no unnecessary
precision loss.
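For concreteness, a rough sketch in Python of that resolution rule
(hypothetical names, not PostgreSQL source): the promotion chain is only
used to choose a target type, and the data itself is then converted with a
single direct cast.

    # Promotion chain from the proposal: int2->int4->int8->numeric->float4->float8
    PROMOTES_TO = {'int2': 'int4', 'int4': 'int8', 'int8': 'numeric',
                   'numeric': 'float4', 'float4': 'float8'}

    def resolve_target(source_type, available):
        """Return the least-promoted type in `available` reachable from source_type."""
        t = source_type
        while t is not None:
            if t in available:
                return t                # convert source -> t in one step
            t = PROMOTES_TO.get(t)
        raise LookupError('no conversion from ' + source_type)

    # If float4 and float8 versions both exist, a numeric argument picks float4;
    # if only float8 exists, the cast is numeric -> float8 directly, never
    # numeric -> float4 -> float8.
    print(resolve_target('numeric', {'float4', 'float8'}))   # -> 'float4'
    print(resolve_target('numeric', {'float8'}))             # -> 'float8'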
regards, tom lane
Peter Eisentraut <e99re41@DoCS.UU.SE> writes:
I think your plan looks good for the numerical land. (I'll ponder the oid
issues in a second.) For other type categories, perhaps not. Should a line
be promoted to a polygon so you can check if it contains a point? Or a
polygon to a box? Higher dimensions? :-)
Line->polygon, probably not. On the other hand I can certainly imagine
that box->polygon would be a reasonable promotion. The point of the
proposal is to make these choices table-driven, in any event; so they'd
be fairly easy to change if someone didn't like them.
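To make "table-driven" concrete, a hypothetical sketch (illustrative only)
of what such a table could look like for the geometric category:

    # Disputed choices become one-line table edits instead of parser logic.
    GEOMETRIC_PROMOTES_TO = {
        'box':     'polygon',   # a box is a special case of a polygon
        'line':    None,        # line -> polygon deliberately not offered
        'polygon': None,        # top of this category
    }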
[ enumerates cases where casting is needed ]
e) The function is overloaded for many types, amongst which is text. Then
call the text version. I believe this would currently fail, which I'd
consider a deficiency.
This seems to be the only case that's really worth debating. Is it
better to fail (drawing attention to the ambiguity) than to make a
default assumption? I tend to agree that we want a default, but
reasonable people might disagree.
The fact that an oid is also a number should be an implementation detail.
Could be. A version or three ago you actually did have to write
... where oid = 1234::oid
if you wanted to refer to a specific row by OID. However, while it
might be logically purer to insist that OIDs are not numbers, it's just
too damn handy to be laxer about the distinction. I've used both the
old and new behavior and I much prefer the new. If you want an actual
argument for it: I doubt that ordinary users touch OIDs at all, and
the ones who do probably know what they're doing. You might see some
inconsistency between my position on OIDs and my position on booleans
(where I *don't* want cast-free conversions), but I draw the distinction
from experience about what sorts of operations are useful and how many
real-life errors can be caught by hard-line error checks.
regards, tom lane
I wrote:
But the above is still not correct, in the sense that e.g. int8
cannot be converted to float4 without loss. In that sense I don't
think a single upward promotion entry is sufficient.
An important component of the second proposal is that the actual data
conversion is done in one step if possible. We will *consider* using
float4 before we consider float8, but if we end up using float8 then
we try to do a direct whatever-to-float8 conversion. So as long as the
right set of conversion operators are available, there's no unnecessary
precision loss.
After further thought I see that there is still a risk here, which
depends on the presence or absence of specific functions. Suppose that
we offer cos(float4) and cos(float8), but not cos(numeric). With the
proposal as given, the system would execute cos(numericVar) as
cos(float4(numericVar)) which is probably not the most desirable
choice --- but that would be the "least promoted" alternative.
Considering this example, I think that the proposed numeric hierarchy
needs to be altered. Instead of
int2 -> int4 -> int8 -> numeric -> float4 -> float8
perhaps we want
int2 -> int4 -> int8 -> numeric -> float8
float4 -> float8
That is, float4 promotes to float8 but nothing else promotes to float4.
This still satisfies the SQL92 rule that mixed exact/inexact
computations yield inexact results --- but those results will always be
done in float8 now, never in float4. The only way to get a float4
computation is to start from float4 variables or use explicit casts.
That's still not entirely satisfactory because simple examples like
WHERE float4var < 4.4;
won't be done the way we want: the constant will promote to float8
and then you'll get float4var::float8 < 4.4::float8 which is not
able to use a float4 index.
A sneaky way around that is to make the hierarchy
int2 -> int4 -> int8 -> numeric -> float8 -> float4
which is nonintuitive as hell, but would make mixed exact/float8
calculations do the right thing. But a mixed float8/float4
computation would be done in float4 which is not so desirable.
My inclination at this point is that we want the auto promotion
hierarchy to look like
int2 -> int4 -> int8 -> numeric -> float8
float4 -> float8
but perhaps to use a different method for assigning types to numeric
literals, such that a literal can be coerced to float4 if there are
other float4s present, even though we wouldn't do that for nonliterals.
(This could maybe be done by initially assigning literals an
UNKNOWNNUMERIC data type, which then gets resolved to a specific type,
much like we do for string literals.) A tad ugly, but I'm beginning to
doubt we can get *all* the behaviors we want without any special cases.
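As a sketch of how UNKNOWNNUMERIC might resolve (the rule below is an
assumption drawn from the analogy to string literals, not an
implementation):

    NUMERIC_TYPES = {'int2', 'int4', 'int8', 'numeric', 'float4', 'float8'}

    def resolve_literal(literal_type, other_operand_type):
        # A typeless numeric literal adopts the other operand's numeric type;
        # anything already typed keeps its type and promotes normally.
        if literal_type == 'unknownnumeric' and other_operand_type in NUMERIC_TYPES:
            return other_operand_type
        return literal_type

    # WHERE float4var < 4.4: the literal resolves to float4, so a float4
    # index stays usable, while a float4 *variable* still promotes to float8.
    print(resolve_literal('unknownnumeric', 'float4'))   # -> 'float4'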
regards, tom lane
Tom Lane writes:
The fact that an oid is also a number should be an implementation detail.
Could be. A version or three ago you actually did have to write
... where oid = 1234::oid
if you wanted to refer to a specific row by OID. However, while it
might be logically purer to insist that OIDs are not numbers, it's just
too damn handy to be laxer about the distinction.
Definitely. But wouldn't three (or six) extra `=' operators be the road of
least resistance or clearest separation? Not sure.
I doubt that ordinary users touch OIDs at all, and the ones who do
probably know what they're doing.
Certain elements around these parts actively advocate using oids for keys
or even unsigned numbers (*shudder*). I wouldn't be so sure about this
statement at all.
One thing to keep in mind in any case is that oids might not be int4-like
forever; eventually we might want int8, or the unsigned version thereof.
--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden
Peter Eisentraut <peter_e@gmx.net> writes:
if you wanted to refer to a specific row by OID. However, while it
might be logically purer to insist that OIDs are not numbers, it's just
too damn handy to be laxer about the distinction.
Definitely. But wouldn't three (or six) extra `=' operators be the road of
least resistance or clearest separation? Not sure.
Actually, that's what we've got now: "oid = 1234" gets parsed into the
oideqint4 operator. What bugs me about that is the shenanigans the
optimizer has to pull to use an index on the oid column. I'm hoping
that we can clean up this mess enough so that the operator delivered by
the parser is the same thing the column's index claims to use in the
first place.
One thing to keep in mind in any case is that oids might not be int4-like
forever; eventually we might want int8, or the unsigned version thereof.
Agreed, but with any luck that case will work transparently too: the
constant will just get promoted up to int8 before we apply the operator.
regards, tom lane
Tom Lane writes:
If there are multiple possibilities, we choose the one which is the
"least promoted" by some yet-to-be-determined metric.
A "metric" is often one step away from "unpredictable behaviour", at least
for users.
But if you look in practice then there are not really any existing
functions that could benefit from this, at least not if the other
proposals go through and/or a few function entries are added or removed.
Let's say you have a function foo(float8, int2) and one foo(float4, int8)
and you call it with (you guessed it) float4 and int2. Which do you take?
If there is a good reason that these two functions exist separate in the
way they are then the decision should probably not be made by some
casting-cost metric but by the user. If there is no good reason that the
functions are like this then perhaps the implementor should be made
aware of the (useless?) ambiguity and maybe provide a grand unified
(float8, int8) version.
If you do want to support this sort of ambiguity resolution then the
metric should IMHO be how many arguments you had to cast away from the
input type at all. That *might* give reasonable behaviour for users,
though I'm not completely sure.
--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden
Peter Eisentraut <peter_e@gmx.net> writes:
But if you look in practice then there are not really any existing
functions that could benefit from this,
While looking at existing functions can help detect problems in a
proposal like this, the standard set of functions is *not* the be-all
and end-all; the whole point is that we are trying to build an
extensible system. So I'm pretty suspicious of any argument that
begins in the above fashion ;-)
Let's say you have a function foo(float8, int2) and one foo(float4, int8)
and you call it with (you guessed it) float4 and int2. Which do you take?
A good point; I wouldn't object to returning an error if we determine
that there are multiple equally-good possibilities. But, again, the
sticky question is equally good according to what metric?
If you do want to support this sort of ambiguity resolution then the
metric should IMHO be how many arguments you had to cast away from the
input type at all.
Most of the cases we will be dealing with in practice are operators of
one or two arguments, so a metric that only has two or three possible
values (respectively) is not going to be very useful... especially
since the exact-match case is not interesting. With the above metric
we'd never be able to resolve any ambiguous unary-operator cases at all!
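The objection in miniature (illustrative types only):

    def casts_needed(call_args, signature):
        # the proposed metric: how many arguments had to leave their input type
        return sum(1 for a, s in zip(call_args, signature) if a != s)

    candidates = [('float4',), ('float8',), ('numeric',)]   # three cos() variants, say
    call = ('int4',)                                        # cos(int4col)

    print({sig: casts_needed(call, sig) for sig in candidates})
    # every candidate scores 1 -- a three-way tie the metric cannot break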
regards, tom lane
Tom Lane writes:
perhaps we want
int2 -> int4 -> int8 -> numeric -> float8
float4 -> float8
In a parallel email you mentioned that your promotion tree idea will give
the system well-understood (single) inheritance semantics, with which I
agree 100%. But that is only true if the upward conversion always works,
which it won't because not every numeric "is-a" float8, and strictly
speaking, neither is int8. This is unclean at best, but might even cause
genuine failures if some promotion metric decided on a float8 function
over a numeric function because it would generate the "least casting" on
the other function attributes.
So it would have to be more like this
int2 -> int4 -> int8 -> numeric
float4 -> float8 -> numeric
This tree is "correct" in the above sense but has a number of obvious
problems.
float[x] + numeric would now yield numeric. The solution is making an
explicit float8+numeric function. Okay, so at the end it's actually more
like 8 functions, but that's a price I'd be willing to pay. (Perhaps the
commutator mechanism could be extended to cover different types as well.)
Incidentally, this would also enable some cases to work that wouldn't now,
e.g. if N is a numeric outside the range of float8 and F is some float8,
then N - F would currently fail, but it need not, depending on how it's
implemented.
The other problem is that integers would never implicitly be promoted to
floats. This is sensible behaviour from a numerical analysis point of view
but probably not acceptable for many. However, given that there is
numeric, any int/float operations would be promoted to numeric/numeric,
which is in any case the right thing to do. The only thing is to provide
numeric functions.
The alternative is to use a non-tree lattice for type promotion
         - float4 -- float8 -
          /         /         \
int2 --- int4 ---- int8 ----- numeric
but that would introduce a world of problems which we probably best avoid
(as long as possible).
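To see the sort of problem, here is one reading of that diagram
(int4 -> float4, int8 -> float8, float8 -> numeric, plus the integer
chain) as a sketch; the representation is hypothetical:

    PARENTS = {'int2': ['int4'], 'int4': ['int8', 'float4'],
               'int8': ['numeric', 'float8'], 'float4': ['float8'],
               'float8': ['numeric'], 'numeric': []}

    def paths(frm, to, path=()):
        # enumerate every upward promotion path from `frm` to `to`
        path = path + (frm,)
        if frm == to:
            yield path
        for p in PARENTS[frm]:
            yield from paths(p, to, path)

    for p in paths('int4', 'numeric'):
        print(' -> '.join(p))
    # int4 -> int8 -> numeric                (stays exact throughout)
    # int4 -> int8 -> float8 -> numeric      (detours through inexact types)
    # int4 -> float4 -> float8 -> numeric    (likewise)

With several parents per type, the same pair of operands can meet along
paths with different semantics, so some extra rule must pick among them.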
That's still not entirely satisfactory because simple examples like
WHERE float4var < 4.4;
won't be done the way we want: the constant will promote to float8
and then you'll get float4var::float8 < 4.4::float8 which is not
able to use a float4 index.
Could this do it?
unknownnumeric -> float4 -> float8 -> numeric
(Assuming `unknownnumeric' only represents literals with decimal points.
Others should probably be the "best fit" integer type.)
--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden
Tom Lane writes:
Let's say you have a function foo(float8, int2) and one foo(float4, int8)
and you call it with (you guessed it) float4 and int2. Which do you take?
A good point; I wouldn't object to returning an error if we determine
that there are multiple equally-good possibilities. But, again, the
sticky question is equally good according to what metric?
IMO that metric should be "existence". Anything else is bound to be
non-obvious. Surely some breadth-first or depth-first search through the
imaginary casting tree would yield reasonable results, but it's still
confusing. I don't see any good reasons why one would have such
ambiguously overloaded functions, though I'd be very interested to see
one.
In fact one might consider preventing *creation* of such setups. That's
what programming languages would do. If I'm not mistaken then what I'm
saying is that overloaded functions must form a lattice when ordered
according to the elsewhere proposed promotion hierarchy of their
arguments. That ought to be a doable thing to check for and then we could
also use lattice concepts to find the best fitting function. Gotta work
this out in detail though.
--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden
I wrote:
If I'm not mistaken then what I'm saying is that overloaded functions
must form a lattice [...]
Indeed, I was. This attacks the problem from a different side, namely not
allowing setups that are ambiguous in the first place.
In detail:
If T1 and T2 are datatypes then T1 < T2 iff T1 is allowed to be implicitly
converted to T2, according to the promotion scheme du jour. If neither T1
promotes to T2 nor vice versa then T1 and T2 are not comparable.
Let A and B be functions with argument types (a1, a2, ... an) and (b1, b2,
... bn) respectively. Then we say that A < B iff ai < bi for all i=1..n.
If neither A < B nor B < A then A and B are not comparable. (That would
include such cases as foo(int2, float8) and foo(int8, float4).)
Define a partition on the set of all functions. Two functions F(f1, f2,
... fm) and G(g1, g2, ... gn) are in the same equivalence class iff "F" =
"G" (same name), m = n, and fi and gi are comparable for all i = 1..m. Now
I propose that you always ensure (preferably at function creation time)
that each equivalence class is a lattice under the partial order described
in the previous paragraph.
This helps because... Given a function call Q that you want to resolve you
first identify its equivalence class. Then there are three possibilities:
1) Q is not comparable to any other function in the equivalence class.
That means that the provided set of functions is incomplete. Proof: Assume
there was a function P which could handle call Q. Then, since you can only
cast "up", the arguments of P must at each position be >= those of Q. But
then P > Q.
2) Q is greater than every element of the equivalence class that it is
comparable to. This also means that the set of provided functions is
unable to handle Q.
3) There exists at least one function P in the equivalence class such that
Q <= P. Consider the set A of all functions in the equivalence class that
are >= Q. Since the class is a lattice, A has a (unique) greatest lower
bound. That's the function you call.
Raise your hand if you are lost...
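In code, the procedure might look roughly like this (a sketch: the
promotion order is the int/float hierarchy from earlier in the thread, and
treating the function to call as the unique least element of A is my
reading of glb(A)):

    PROMOTES_TO = {'int2': 'int4', 'int4': 'int8', 'int8': 'numeric',
                   'numeric': 'float8', 'float4': 'float8'}

    def le(t1, t2):          # T1 <= T2 iff T1 implicitly promotes to T2
        while t1 is not None:
            if t1 == t2:
                return True
            t1 = PROMOTES_TO.get(t1)
        return False

    def sig_le(s1, s2):      # A <= B iff ai <= bi for all i
        return len(s1) == len(s2) and all(le(a, b) for a, b in zip(s1, s2))

    def resolve(call, functions):
        A = [f for f in functions if sig_le(call, f)]   # candidates >= Q
        if not A:
            raise LookupError('cases 1/2: no candidate can handle the call')
        for f in A:                                     # case 3
            if all(sig_le(f, g) for g in A):
                return f
        raise LookupError('class is not a lattice: ambiguous call')

    fns = [('int2', 'int2'), ('int2', 'float8'), ('float8', 'float8')]
    print(resolve(('int2', 'int4'), fns))   # -> ('int2', 'float8')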
An example, say you define foo(int2, float8) and now want to add foo(int8,
float4), then the system would reject that because of potential ambiguity
and require adding foo(int8, float8) and foo(int2, float4) before creating
foo(int8, float4); or you would perhaps realize what you are doing and
instead just provide foo(int8, float8), or whatever you really wanted in
the first place.
(Actually, the requirement that you add foo(int8, float8) is useless,
since there is no down-casting and we don't use the least upper bound
property of lattices, so probably in practice one could settle for a
"half-lattice" requirement.)
Now I realize that this is a pretty deep proposal, but at least it's a
solidly founded one (I think), it prevents the programmer-user from doing
unwise things, and it avoids any unpredictable "metrics".
--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden
Peter Eisentraut <peter_e@gmx.net> writes:
I propose that you always ensure (preferably at function creation time)
that each equivalence class is a lattice under the partial order described
in the previous paragraph.
Er ... weren't you just objecting to metric-based resolution on the
grounds that it'd be unintelligible to users? This has got that
problem ten times worse. In fact, I'm not sure *you* understand it.
3) There exists at least one function P in the equivalence class such that
Q <= P. Consider the set A of all functions in the equivalence class that
are >= Q. Since the class is a lattice, A has a (unique) greatest lower
bound. That's the function you call.
I don't think so. The lattice property only says that the set A has a
glb within the equivalence class. AFAICT it doesn't promise that the
glb will be >= Q, so you can't necessarily use the glb as the function
to call. The reason why is that Q isn't necessarily part of the set of
given functions, so it's not necessarily one of the lower bounds that
the lattice property says there is a greatest of.
If that wasn't what you had in mind, then you are confusing a lattice
on the set of *all possible* functions with a lattice over the set of
functions that actually are present ... and it looks to me like
different parts of your argument assume different things.
An example, say you define foo(int2, float8) and now want to add foo(int8,
float4), then the system would reject that because of potential ambiguity
If your equivalence set contains just the four functions (abbreviating
in the obvious way)
f(i2, i2)
f(i2, f8)
f(i8, f4)
f(f8, f8)
then each two-element subset of this set has a unique glb and lub within
the set, which makes it a lattice if I'm reading my dusty old abstract
algebra textbook correctly. There may be a property that makes things
work the way you suggest but it's not the lattice property.
A more general comment is that mathematical purity is only one of the
considerations here, and not necessarily even one of the most important
considerations ;-)
regards, tom lane
Peter Eisentraut <peter_e@gmx.net> writes:
So it would have to be more like this
int2 -> int4 -> int8 -> numeric
float4 -> float8 -> numeric
This tree is "correct" in the above sense but has a number of obvious
problems.
The most fundamental of which is that it violates SQL92: combining
exact and approximate numerics is supposed to yield approximate
(ie, float), not exact.
float[x] + numeric would now yield numeric. The solution is making an
explicit float8+numeric function. Okay, so at the end it's actually more
like 8 functions, but that's a price I'd be willing to pay. (Perhaps the
commutator mechanism could be extended to cover different types as well.)
Make that 8 functions for each and every one of the numeric operators.
I don't think that's reasonable... especially since those operators
cannot cause the overflow problem to go away. (The SQL guys probably
did not foresee people implementing NUMERIC with wider range than FLOAT
;-) ... but the fact that we did so doesn't give us license to ignore
that aspect of the spec ...)
numeric, any int/float operations would be promoted to numeric/numeric,
which is in any case the right thing to do.
No it isn't. See above.
regards, tom lane
Tom Lane writes:
(The SQL guys probably did not foresee people implementing NUMERIC
with wider range than FLOAT ;-) ... but the fact that we did so
doesn't give us license to ignore that aspect of the spec ...)
I think that must have been it; why else would they (implicitly) rank
floats above numerics? If we submit to that notion, then I agree with the
promotion tree you suggested.
The problem remains that upward casting will not be guaranteed to work all
the time, which is something that needs to be addressed, in potentially
unpretty ways: not every casting decision is necessarily a linear
ladder-climb; it might be affected by other casting decisions going on in
parallel. (The workaround here could be to convert numerics that are wider
than floats to `infinity' :-)
--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden
Tom Lane writes:
Er ... weren't you just objecting to metric-based resolution on the
grounds that it'd be unintelligible to users?
Well, since punting in the face of multiple possibilities isn't really an
improvement over the status quo, I kept looking. The key difference
between this (or any similarly-minded) proposal and the context in which
you originally mentioned the metrics is that mine works at function
creation time, not at call resolution time.
A function creation is usually a well-controlled event, so an error
message along the lines of "if you create this function then you create an
ambiguity with function x because a call y could not be resolved
deterministically" is helpful. "Cannot resolve call x between y and z" at
call time is annoying (get the programmer on the phone and ask him to
clean up his mess). Part of my proposal was a method for statically
determining ambiguities before it's too late.
Secondly, yes, lattices and related business are more complicated than,
say, a breadth-first search. But that complexity need only be internal.
Externally, it tends to give intuitive behaviour because there is never an
ambiguity, whereas a BFS would presumably rely on the order of the
arguments or other such incidental things.
3) There exists at least one function P in the equivalence class such that
Q <= P. Consider the set A of all functions in the equivalence class that
are >= Q. Since the class is a lattice, A has a (unique) greatest lower
bound. That's the function you call.
I don't think so. The lattice property only says that the set A has a
glb within the equivalence class. AFAICT it doesn't promise that the
glb will be >= Q, so you can't necessarily use the glb as the function
to call.
Since all functions in A are >=Q by definition, Q is at least _a_ lower
bound on A. The glb(A) is also a lower bound on A, and since it's the
greatest it must also be >=Q.
The case where glb(A)=Q is when the call Q matches one existing signature
exactly, in that case you call that function. Otherwise the glb represents
the "next function up", and if there is an algorithm to find the glb of a
poset then there is also an algorithm to resolve function calls.
A more general comment is that mathematical purity is only one of the
considerations here, and not necessarily even one of the most important
considerations ;-)
Well, at least it was fun for me to work it out. :-)
No, seriously, what are the considerations?
* ease of implementation
* efficiency
* predictability
* intuitiveness
* verifiability
* scalability
* simplicity
I think I got at least half of that covered, with no contenders yet at the
surface.
--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden
Peter Eisentraut <peter_e@gmx.net> writes:
I don't think so. The lattice property only says that the set A has a
glb within the equivalence class. AFAICT it doesn't promise that the
glb will be >= Q, so you can't necessarily use the glb as the function
to call.
Since all functions in A are >=Q by definition, Q is at least _a_ lower
bound on A. The glb(A) is also a lower bound on A, and since it's the
greatest it must also be >=Q.
No, you're not catching my point. glb(A) is the greatest lower bound
*within the set of available functions*. Q, the requested call
signature, is *not* in that set (if it were then we'd not have any
ambiguity to resolve, because there's an exact match). The fact that
the set of available functions forms a lattice gives you no guarantee
whatever that glb(A) >= Q, because Q is not constrained by the lattice
property.
regards, tom lane
Peter Eisentraut <peter_e@gmx.net> writes:
(The workaround here could be to convert numerics that are wider
than floats to `infinity' :-)
Hmm, that's actually not a bad idea. Infinity is a pretty portable
notion these days, what with nearly everyone toeing the IEEE float
line ... so we could have numeric->float generate an infinity if possible
and only resort to an elog() on machines without IEEE infinity.
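A small illustration of the idea, with Python's decimal module standing in
for NUMERIC: under IEEE 754 semantics an out-of-range conversion can
quietly yield +inf instead of an error.

    from decimal import Decimal

    wide = Decimal('1e1000')     # fits in numeric, far outside float8's range
    print(float(wide))           # inf
    print(float(wide) == float('inf'))   # True -- though 1e1000 is not infinity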
OTOH, no mathematician will accept the notion that 1e1000 is the
same as infinity ;-). Mathematical purity would probably favor
the elog.
Comments anyone?
regards, tom lane
One thing that has gotten lost here: is there any market at all for
putting in some line of defence against (a to-be-determined degree of)
ambiguity at function creation time, to reduce the possible problems
(implementation and user-side) at call time? What do you think?
Tom Lane writes:
glb(A) is the greatest lower bound *within the set of available
functions*.
Correct.
Q, the requested call signature, is *not* in that set
Correct.
The fact that the set of available functions forms a lattice gives you
no guarantee whatever that glb(A) >= Q, because Q is not constrained
by the lattice property.
I know. I don't use the lattice property to deduce the fact that
glb(A)>=Q. I use the lattice property to derive the existence of glb(A).
The result glb(A)>=Q comes from
1. Q is a lower bound on A (by definition of A)
2. glb(A) is a lower bound on A (by definition of glb)
3. glb(A)>=Q (by definition of "greatest")
Recall that A was defined as the set of functions >=Q in Q's equivalence
class, and was guaranteed to be non-empty by treating the other cases
separately.
I think it works. :) In all but the most complicated cases this really
decays to the obvious behaviour, but on the other hand it scales
infinitely.
--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden