Reducing the overhead of NUMERIC data
Currently, the overhead of the NUMERIC datatype is 8 bytes. Each value is
stored on disk as:
typedef struct NumericData
{
    int32    varlen;        /* Variable size (std varlena header) */
    int16    n_weight;      /* Weight of 1st digit */
    uint16   n_sign_dscale; /* Sign + display scale */
    char     n_data[1];     /* Digits (really array of NumericDigit) */
} NumericData;
Let's see if we can reduce that:
varlen is int32 to match the standard varlena header. However, the max
number of digits of the datatype is less than the threshold at which
values get toasted. So no NUMERIC values ever get toasted - in which
case, why worry about matching the size of varlena? Let's reduce it to 2
bytes, which still gives us up to 1000 digits, as we have now.
n_weight seems to exist because we do not store trailing zeroes. So
1000000 is stored as 1 with a weight of 6. My experience is that large
numbers of trailing zeroes do not occur with any frequency in real
measurement or financial data, and that this is an over-optimization.
This is probably a hangover from the original algorithm, rather than a
conscious design goal for PostgreSQL?
n_sign_dscale shows us where the decimal point is. We could actually
store a marker representing the decimal point, which would cost us 0.5
byte rather than 2 bytes. Since we have 4 bits to represent a decimal
number, that leaves a few bits spare to represent either a decimal-
point-and-positive-sign or a decimal-point-and-negative-sign marker. (We
would still need to store trailing zeroes even after the decimal point.)
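For illustration, here is a minimal sketch of how such nibble packing might look. The marker nibble values (0xA for decimal-point-with-positive-sign, 0xB for decimal-point-with-negative-sign) and the function name are hypothetical, chosen only to make the 0.5-byte cost concrete:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical nibble encoding: 0-9 are decimal digits, 0xA marks the
 * decimal point of a positive number, 0xB that of a negative number. */
static int pack_digits(const char *s, unsigned char *out)
{
    int n = 0;                        /* nibbles written so far */
    int neg = (*s == '-');
    if (neg)
        s++;
    for (; *s; s++) {
        int nib = (*s == '.') ? (neg ? 0xB : 0xA) : (*s - '0');
        if (n % 2 == 0)
            out[n / 2] = nib << 4;    /* high nibble first */
        else
            out[n / 2] |= nib;
        n++;
    }
    return (n + 1) / 2;               /* bytes used */
}
```

So "12.34" packs into three bytes {0x12, 0xA3, 0x40}: the decimal point and sign together cost half a byte. Note, though, that a value with no fractional part has nowhere to put its sign under this scheme.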
So, assuming I have this all correct, we could reduce the on-disk
storage for the NUMERIC datatype to the following struct. This gives an
overhead of just 2.5 bytes, plus the loss of the trailing-zeroes
optimization, which I assess as having almost no value anyway in
99.9999% of data values (literally...).
typedef struct NumericData
{
    int16    varlen;        /* Variable size (2-byte header) */
    char     n_data[1];     /* Digits (really array of NumericDigit) */
} NumericData;
with the sign/decimal point stored as 0.5 byte within the digit data.
A saving of 5.5 bytes per NUMERIC column per row could make an
interesting difference to row length in large tables.
The above would require a reasonable amount of change to the existing
datatype code, so I'm not suggesting it's a simple change to an .h file.
But the changes as proposed would seem to be able to be made to the
existing NUMERIC type, rather than invent another similar one.
Is my thinking straight on this?
Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
varlen is int32 to match the standard varlena header. However, the max
number of digits of the datatype is less than the threshold at which
values get toasted. So no NUMERIC values ever get toasted - in which
case, why worry about matching the size of varlena - lets reduce it to 2
bytes which still gives us up to 1000 digits as we have now.
Because that will require an extra case in the code that disassembles
tuples, which will slow down *everything* even in databases that don't
contain one single NUMERIC value. I think you need more than "let's
save 2 bytes in NUMERICs" to justify that.
n_weight seems to exist because we do not store trailing zeroes. So
1000000 is stored as 1 with a weight of 6. My experience is that large
numbers of trailing zeroes do not occur with any frequency in real
measurement or financial data and that this is an over-optimization.
That seems debatable. Keep in mind that not storing extra zeroes cuts
computation time as well as storage.
n_sign_dscale shows us where the decimal point is. We could actually
store a marker representing the decimal point, which would cost us 0.5
byte rather than 2 bytes. Since we have 4 bits to represent a decimal
number, that leaves a few bits spare to represent either a decimal-
point-and-positive-sign and decimal-point-and-negative-sign. (We would
still need to store trailing zeroes even after the decimal point).
This is completely bogus. How are you going to remember the sign except
by always storing a marker? ISTM this proposal just moves the
sign/decimalpoint overhead from one place to another, ie, somewhere in
the NumericDigit array instead of in a fixed field.
Also, you can't drop dscale without abandoning the efficient base-10000
representation, at least not unless you want people complaining that the
database shows NUMERIC(3) data with four decimal places.
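To make that concrete: one base-10000 digit always expands to four decimal digits, and dscale is what lets the output routine trim back to the declared scale. A toy sketch (the function name and shape are mine, not the backend's):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* A base-10000 fractional digit always carries four decimal digits;
 * dscale tells the display code how many of them the user declared. */
static void show(int int_part, int frac_digit, int dscale,
                 char *out, size_t len)
{
    char frac[5];
    snprintf(frac, sizeof frac, "%04d", frac_digit);    /* 2000 -> "2000" */
    snprintf(out, len, "%d.%.*s", int_part, dscale, frac);
}
```

With dscale = 1, the value 1.2 (stored as fractional digit 2000) prints as "1.2"; without dscale you are stuck printing all four digits, which is exactly the complaint above.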
It might be reasonable to restrict the range of NUMERIC to the point
that we could fit the weight/sign/dscale into 2 bytes instead of 4,
thereby saving 2 bytes per NUMERIC. I'm not excited about the other
aspects of this, though.
regards, tom lane
On Tue, Nov 01, 2005 at 09:22:17PM +0000, Simon Riggs wrote:
varlen is int32 to match the standard varlena header. However, the max
number of digits of the datatype is less than the threshold at which
values get toasted. So no NUMERIC values ever get toasted - in which
case, why worry about matching the size of varlena - lets reduce it to 2
bytes which still gives us up to 1000 digits as we have now.
The other ideas may have merit, I don't know. But this one is a
no-goer. The backend currently recognises three forms of Datum:
- Fixed length, by value:
  integers, chars, anything short enough to fit in a word
- Fixed length, by reference:
  datetime, etc, anything that's fixed length but too long for a word
- Variable length:
  anything variable: text, varchar(), etc.
The last all, without exception, have a varlena header. This makes the
code easy, because all variable length values look the same for
copying, loading, storing, etc.
You are proposing a fourth type, say VARLENA2, which looks a lot like a
varlena but isn't. I think the sheer volume of code that would need
to be checked is huge. Also, things like pg_attribute would need
changing, because you have to represent this new state somehow.
I seriously doubt this is going to happen. Your other possible
optimisations have other issues.
n_weight seems to exist because we do not store trailing zeroes. So
1000000 is stored as 1 with a weight of 6. My experience is that large
numbers of trailing zeroes do not occur with any frequency in real
measurement or financial data and that this is an over-optimization.
This is probably a hang over from the original algorithm, rather than a
conscious design goal for PostgreSQL?
But if you are storing large numbers then it's helpful. Whether it's
worth the cost...
n_sign_dscale shows us where the decimal point is. We could actually
store a marker representing the decimal point, which would cost us 0.5
byte rather than 2 bytes. Since we have 4 bits to represent a decimal
number, that leaves a few bits spare to represent either a decimal-
point-and-positive-sign and decimal-point-and-negative-sign. (We would
still need to store trailing zeroes even after the decimal point).
Consider the algorithm: A number is stored as base + exponent. To
multiply two numbers you can multiply the bases and add the exponents.
OTOH, if you store the decimal inside the data, now you have to extract
it again before you can do any calculating. So you've traded CPU time
for disk space. Is diskspace cheaper or more expensive than CPU?
Debatable I guess.
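The point can be sketched in a few lines: with a mantissa-plus-exponent form mirroring the base-10000 idea, multiplication never has to locate a decimal point (the names here are illustrative, not the backend's):

```c
#include <assert.h>

/* value = mant * 10000^weight, mirroring the base-10000 representation */
typedef struct {
    long mant;
    int  weight;
} num;

static num num_mul(num a, num b)
{
    /* multiply mantissas, add exponents - no decimal point to extract */
    num r = { a.mant * b.mant, a.weight + b.weight };
    return r;
}
```

For example, 1.5 is {15000, -1} and 20000 is {2, 1}; their product is {30000, 0}, i.e. 30000, with no digit shuffling.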
So, assuming I have this all correct, means we could reduce the on disk
storage for NUMERIC datatypes to the following struct. This gives an
overhead of just 2.5 bytes, plus the loss of the optimization of
trailing zeroes, which I assess as having almost no value anyway in
99.9999% of data values (literally...).
Actually, I have a table with a column declared as numeric(12,4)
because there has to be 4 decimal places. As it turns out, the decimal
places are mostly zero so the optimisation works for me.
Interesting ideas, but there's a lot of hurdles to jump I think...
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.
On Tue, Nov 01, 2005 at 04:54:11PM -0500, Tom Lane wrote:
It might be reasonable to restrict the range of NUMERIC to the point
that we could fit the weight/sign/dscale into 2 bytes instead of 4,
thereby saving 2 bytes per NUMERIC. I'm not excited about the other
aspects of this, though.
FWIW, most databases I've used limit NUMERIC to 38 digits, presumably to
fit length info into 1 or 2 bytes. So there's something to be said for a
small numeric type that has less overhead and a large numeric (what we
have today).
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Tue, Nov 01, 2005 at 11:16:58PM +0100, Martijn van Oosterhout wrote:
Consider the algorithm: A number is stored as base + exponent. To
multiply two numbers you can multiply the bases and add the exponents.
OTOH, if you store the decimal inside the data, now you have to extract
it again before you can do any calculating. So you've traded CPU time
for disk space. Is diskspace cheaper or more expensive than CPU?
Debatable I guess.
Well, I/O bandwidth is much more expensive than either CPU or disk
space...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Martijn van Oosterhout <kleptog@svana.org> writes:
You are proposing a fourth type, say VARLENA2 which looks a lot like a
varlena but it's not. I think the sheer volume of code that would need
to be checked is huge. Also, things like pg_attribute would need
changing because you have to represent this new state somehow.
It wouldn't be an impossible amount of code --- for precedent see back
when we made cstring into a full-fledged datatype (cstring is already
a fourth option in your list BTW). That patch wasn't all that large IIRC.
The issue in my mind is the performance implications of adding an
additional case to places that are already hotspots. There were
compelling functional reasons to pay that price to make cstring work,
but "save 2 bytes per numeric" doesn't seem like it rises to that level.
Maybe if we had a few other datatypes that could also use the feature.
[ thinks... ] inet/cidr comes to mind but I don't see any others.
The case seems a bit weak :-(
regards, tom lane
On Tue, 2005-11-01 at 16:54 -0500, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
varlen is int32 to match the standard varlena header. However, the max
number of digits of the datatype is less than the threshold at which
values get toasted. So no NUMERIC values ever get toasted - in which
case, why worry about matching the size of varlena - lets reduce it to 2
bytes which still gives us up to 1000 digits as we have now.
Because that will require an extra case in the code that disassembles
tuples, which will slow down *everything* even in databases that don't
contain one single NUMERIC value. I think you need more than "let's
save 2 bytes in NUMERICs" to justify that.
OK
n_weight seems to exist because we do not store trailing zeroes. So
1000000 is stored as 1 with a weight of 6. My experience is that large
numbers of trailing zeroes do not occur with any frequency in real
measurement or financial data and that this is an over-optimization.
That seems debatable. Keep in mind that not storing extra zeroes cuts
computation time as well as storage.
Check what % difference this makes. 2 bytes on everything makes more
difference than a 1 byte saving on a few percent of values.
n_sign_dscale shows us where the decimal point is. We could actually
store a marker representing the decimal point, which would cost us 0.5
byte rather than 2 bytes. Since we have 4 bits to represent a decimal
number, that leaves a few bits spare to represent either a decimal-
point-and-positive-sign and decimal-point-and-negative-sign. (We would
still need to store trailing zeroes even after the decimal point).
This is completely bogus. How are you going to remember the sign except
by always storing a marker? ISTM this proposal just moves the
sign/decimalpoint overhead from one place to another, ie, somewhere in
the NumericDigit array instead of in a fixed field.
That is exactly my proposal. Thus an overhead of 0.5 bytes rather than 2
bytes, as I explained....but
Also, you can't drop dscale without abandoning the efficient base-10000
representation, at least not unless you want people complaining that the
database shows NUMERIC(3) data with four decimal places.
... I take it I have misunderstood the storage format.
It might be reasonable to restrict the range of NUMERIC to the point
that we could fit the weight/sign/dscale into 2 bytes instead of 4,
thereby saving 2 bytes per NUMERIC. I'm not excited about the other
aspects of this, though.
That seems easily doable - it seemed like something would stick.
Restricting total number of digits to 255 and maxscale of 254 would
allow that saving, yes?
We can then have a BIGNUMERIC which would allow up to 1000 digits for
anybody out there that ever got that high. I'm sure there's a few %, so
I won't dismiss you entirely, guys...
Best Regards, Simon Riggs
On Tue, Nov 01, 2005 at 05:40:35PM -0500, Tom Lane wrote:
Martijn van Oosterhout <kleptog@svana.org> writes:
You are proposing a fourth type, say VARLENA2 which looks a lot like a
varlena but it's not. I think the sheer volume of code that would need
to be checked is huge. Also, things like pg_attribute would need
changing because you have to represent this new state somehow.
It wouldn't be an impossible amount of code --- for precedent see back
when we made cstring into a full-fledged datatype (cstring is already
a fourth option in your list BTW). That patch wasn't all that large IIRC.
The issue in my mind is the performance implications of adding an
additional case to places that are already hotspots. There were
compelling functional reasons to pay that price to make cstring work,
but "save 2 bytes per numeric" doesn't seem like it rises to that level.
Maybe if we had a few other datatypes that could also use the feature.
[ thinks... ] inet/cidr comes to mind but I don't see any others.
The case seems a bit weak :-(
Would varchar(255) fit into that case? There's a heck of a lot of people
who use that as "well, dunno how big this is so I'll just use 255". A
better use case would be places where you know you'll only need 10-20
characters; saving 2 bytes in those cases would likely be worth it. This
would work for char as well (and given that people are probably not in
the habit of defining very large char's it would probably be even more
useful there).
Of course that means either yet another varchar/char type, or we have
some automatic cut-over for fields defined over a certain size...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes:
FWIW, most databases I've used limit NUMERIC to 38 digits, presumably to
fit length info into 1 or 2 bytes. So there's something to be said for a
small numeric type that has less overhead and a large numeric (what we
have today).
I don't think it'd be worth having 2 types. Remember that the weight is
measured in base-10k digits. Suppose for instance
sign 1 bit
weight 7 bits (-64 .. +63)
dscale 8 bits (0..255)
This gives us a dynamic range of 1e-256 to 1e255 as well as the ability
to represent up to 255 displayable fraction digits. Does anyone know
any real database applications where that's not enough?
(I'm neglecting NaN here in supposing we need only 1 bit for sign,
but we could have a special encoding for NaN. Perhaps disallow the
weight = -64 case and use that to signal NaN.)
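That layout fits naturally into a uint16; a sketch of the packing (the +64 bias on weight and the function names are my own, added so the signed weight stores cleanly):

```c
#include <assert.h>
#include <stdint.h>

/* sign: 1 bit, weight: 7 bits (-64..63, stored biased), dscale: 8 bits */
static uint16_t pack_header(int sign, int weight, int dscale)
{
    return (uint16_t)((sign & 1) << 15)
         | (uint16_t)(((weight + 64) & 0x7F) << 8)
         | (uint16_t)(dscale & 0xFF);
}

static void unpack_header(uint16_t h, int *sign, int *weight, int *dscale)
{
    *sign   = (h >> 15) & 1;
    *weight = (int)((h >> 8) & 0x7F) - 64;
    *dscale = h & 0xFF;
}
```

Storing the weight biased sidesteps any twos-complement fiddling, and the weight = -64 code point stays free to signal NaN as suggested.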
regards, tom lane
On Tue, 2005-11-01 at 23:16 +0100, Martijn van Oosterhout wrote:
lots of useful things, thank you.
So, assuming I have this all correct, means we could reduce the on disk
storage for NUMERIC datatypes to the following struct. This gives an
overhead of just 2.5 bytes, plus the loss of the optimization of
trailing zeroes, which I assess as having almost no value anyway in
99.9999% of data values (literally...).
Actually, I have a table with a column declared as numeric(12,4)
because there has to be 4 decimal places. As it turns out, the decimal
places are mostly zero so the optimisation works for me.
Of course it fits some data. The point is whether it is useful for most
people's data.
My contention is that *most* (but definitely nowhere near all) NUMERIC
data is either financial or measured data. That usually means it has
digits that follow Benford's Law - which for this discussion is a
variant on a uniform random distribution.
Optimizing for trailing zeroes just isn't worth the very minimal
benefits, in most cases. It doesn't really matter that it saves on
storage and processing time in those cases - Amdahl's Law says we can
ignore that saving because the optimized case is not prevalent enough
for us to care.
Anybody like to work out a piece of SQL to perform data profiling and
derive the distribution of values with trailing zeroes? I'd be happy to
be proved wrong with an analysis of real data tables.
Best Regards, Simon Riggs
"Jim C. Nasby" <jnasby@pervasive.com> writes:
On Tue, Nov 01, 2005 at 05:40:35PM -0500, Tom Lane wrote:
Maybe if we had a few other datatypes that could also use the feature.
[ thinks... ] inet/cidr comes to mind but I don't see any others.
The case seems a bit weak :-(
Would varchar(255) fit into that case?
That's attractive at first thought, but not when you stop to consider
that most of the string-datatype support is built around the assumption
that text, varchar, and char share the same underlying representation.
You'd have to write a whole bunch of new code to support such a
datatype.
regards, tom lane
Simon Riggs <simon@2ndquadrant.com> writes:
Anybody like to work out a piece of SQL to perform data profiling and
derive the distribution of values with trailing zeroes?
Don't forget leading zeroes. And all-zero (we omit digits entirely in
that case). I don't think you can claim that zero isn't a common case.
regards, tom lane
On 11/1/05 2:38 PM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:
FWIW, most databases I've used limit NUMERIC to 38 digits, presumably to
fit length info into 1 or 2 bytes. So there's something to be said for a
small numeric type that has less overhead and a large numeric (what we
have today).
The 38 digit limit is the decimal size of a 128-bit signed integer. The
optimization has less to do with the size of the length info and more to do
with fast math and fixed structure size.
J. Andrew Rogers
On Tue, 2005-11-01 at 18:15 -0500, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
Anybody like to work out a piece of SQL to perform data profiling and
derive the distribution of values with trailing zeroes?
Don't forget leading zeroes. And all-zero (we omit digits entirely in
that case). I don't think you can claim that zero isn't a common case.
The question is: how common?
For INTEGERs I would accept that many are often zero. For NUMERIC, these
are seldom exactly zero, IMHO.
This is one of those issues where we need to run tests and take input.
We cannot decide this sort of thing just by debate alone. So, I'll leave
this as a less potentially fruitful line of enquiry.
Best Regards, Simon Riggs
On 11/2/05, Simon Riggs <simon@2ndquadrant.com> wrote:
On Tue, 2005-11-01 at 18:15 -0500, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
Anybody like to work out a piece of SQL to perform data profiling and
derive the distribution of values with trailing zeroes?
Don't forget leading zeroes. And all-zero (we omit digits entirely in
that case). I don't think you can claim that zero isn't a common case.
The question is: how common?
For INTEGERs I would accept that many are often zero. For NUMERIC, these
are seldom exactly zero, IMHO.
Seconded. My INTEGER data does have a quite a few zeros but most of
my NUMERIC columns hold debits and credits. Those are almost never
zero.
This is one of those issues where we need to run tests and take input.
We cannot decide this sort of thing just by debate alone. So, I'll leave
this as a less potentially fruitful line of enquiry.
Best Regards, Simon Riggs
--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org
I am not able to quickly find your numeric format, so I'll just throw
this in. MaxDB (I only mention this because the format and algorithms
are now under the GPL, so they can be reviewed by the public) uses a
nifty number format that allows the use of memcmp to compare two numbers
when they are in the same precision and scale. Basically, the first
byte contains the sign and number of digits in the number (number of
digits is complemented if the number is negative), then the next N bytes
contain the actual decimal digits, where N is the number of decimal
digits / 2 (so two decimal digits per byte). Trailing 0's are removed
to save space. So,
0 is stored as {128}
1 is stored as {193, 16}
1000 is stored as {196, 16}
1001 is stored as {196, 16, 1}
-1 is stored as {63, 144}
-1001 is stored as {60, 144}
Their storage allows for a max of 63 digits in a number, but it should
be no problem to increase the size to 2 bytes, thus allowing up to
16,385 digits.
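To illustrate the memcmp property, here is a simplified sketch in the same spirit; the byte values are my own scheme for small integers at a fixed scale, not MaxDB's actual encoding:

```c
#include <assert.h>
#include <string.h>

#define ENC_LEN 4   /* enough for the small examples here */

/* Simplified MaxDB-style encoding: first byte holds sign and digit count,
 * then packed decimal digits, complemented when negative, so that a plain
 * memcmp() gives numeric order. */
static void encode(int v, unsigned char *out)
{
    memset(out, 0, ENC_LEN);
    if (v == 0) {
        out[0] = 128;                     /* zero is a single marker byte */
        return;
    }
    int neg = (v < 0);
    unsigned int a = neg ? (unsigned int) -v : (unsigned int) v;
    int digits[8], n = 0;
    while (a) {                           /* collect least-significant first */
        digits[n++] = a % 10;
        a /= 10;
    }
    out[0] = neg ? 128 - n : 128 + n;     /* sign + digit count */
    for (int i = 0; i < n; i++) {
        int d = digits[n - 1 - i];        /* emit most-significant first */
        if (neg)
            d = 9 - d;                    /* complement for negatives */
        out[1 + i / 2] |= (i % 2) ? d : (d << 4);  /* two digits per byte */
    }
}
```

Under this scheme encode(-1001) < encode(-1) < encode(0) < encode(1) < encode(1001) byte-wise, which is the whole trick.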
The advantages are:
- ability to memcmp two numbers.
- compact storage (can be made more compact if you choose to
save hex digits instead of decimal, but I'm not sure you want to do
that).
The disadvantages are as follows:
- this format does not remember the database definition for the
number (that is, no precision or scale); numeric functions must be told
what they are. It would be nice if the number kept track of that as
well...
- comparing two numbers that are not the same precision and
scale means converting one or both (if both precision and scale are
different you may have to convert both)
- calculations (addition, subtraction, etc) require functions to
extract the digits and do the calculation a digit at a time.
- I do not know of any trig functions, so they would need to be
written
If any one is interested, I would be happy to discuss this further.
Mike Pollard
SUPRA Server SQL Engineering and Support
Cincom Systems, Inc
On Tue, 2005-11-01 at 17:55 -0500, Tom Lane wrote:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
FWIW, most databases I've used limit NUMERIC to 38 digits, presumably to
fit length info into 1 or 2 bytes. So there's something to be said for a
small numeric type that has less overhead and a large numeric (what we
have today).
I don't think it'd be worth having 2 types. Remember that the weight is
measured in base-10k digits. Suppose for instance
sign 1 bit
weight 7 bits (-64 .. +63)
dscale 8 bits (0..255)
This gives us a dynamic range of 1e-256 to 1e255 as well as the ability
to represent up to 255 displayable fraction digits. Does anyone know
any real database applications where that's not enough?
(I'm neglecting NaN here in supposing we need only 1 bit for sign,
but we could have a special encoding for NaN. Perhaps disallow the
weight = -64 case and use that to signal NaN.)
I've coded a short patch to do this, which is the result of two
alternate patches and some thinking, but maybe not enough yet.
The patch given here is different on two counts from above:
This sets...
#define NUMERIC_MAX_PRECISION 64
since
#define NUMERIC_MAX_RESULT_SCALE (NUMERIC_MAX_PRECISION * 2)
We don't seem to be able to use all of the bits actually available to us
in the format. Perhaps we need to decouple these now? Previously, we had
room for 14 bits, which gave a maximum of 16384. We were using
NUMERIC_MAX of 1000, so doubling it didn't give problems.
The above on-disk format showed sign & weight together, whereas the
current code has sign and dscale together. Trying to put sign and weight
together is somewhat difficult, since weight is itself a signed value.
I coded it up that way around, which is reasonably straightforward but
harder than the patch enclosed here. But AFAICS - which isn't that far
normally, I grant you - doing things that way around would require some
twos-complement work to get things correct when weight is negative. That
worries me.
IMHO we should accept the step down in maximum numeric precision (down
to "only" 64 digits) rather than put extra processing into every
manipulation of a NUMERIC datatype. With luck, I've misunderstood and we
can have both performance and precision.
If not, I commend 64 digits to you as sufficient for every imaginable
purpose - saving 2 bytes off every numeric column. (And still 28 decimal
places more accurate than Oracle).
Best Regards, Simon Riggs
Attachment: shortnumeric.patch (text/x-patch, +78/-78)
Simon Riggs <simon@2ndquadrant.com> writes:
On Tue, 2005-11-01 at 17:55 -0500, Tom Lane wrote:
I don't think it'd be worth having 2 types. Remember that the weight is
measured in base-10k digits. Suppose for instance
sign 1 bit
weight 7 bits (-64 .. +63)
dscale 8 bits (0..255)
I've coded a short patch to do this, which is the result of two
alternate patches and some thinking, but maybe not enough yet.
What your patch does is
sign 2 bits
weight 8 bits (-128..127)
dscale 6 bits (0..63)
which is simply pretty lame: weight effectively has a factor of 8 more
dynamic range than dscale in this representation. What's the point of
being able to represent 1 * 10000 ^ -128 (ie, 10^-512) if the dscale
only lets you show 63 fractional digits? You've got to allocate the
bits in a saner fashion. Yes, that takes a little more work.
Also, since the internal (unpacked) calculation representation has a
much wider dynamic range than this, it'd probably be appropriate to add
some range checks to the code that forms a packed value from unpacked.
regards, tom lane
On Wed, Nov 02, 2005 at 08:48:25AM +0000, Simon Riggs wrote:
On Tue, 2005-11-01 at 18:15 -0500, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
Anybody like to work out a piece of SQL to perform data profiling and
derive the distribution of values with trailing zeroes?
Don't forget leading zeroes. And all-zero (we omit digits entirely in
that case). I don't think you can claim that zero isn't a common case.
The question is: how common?
For INTEGERs I would accept that many are often zero. For NUMERIC, these
are seldom exactly zero, IMHO.
This is one of those issues where we need to run tests and take input.
We cannot decide this sort of thing just by debate alone. So, I'll leave
this as a less potentially fruitful line of enquiry.
Is it worth coming up with some script that users can run against a
table to provide us with real data?
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Wed, 2005-11-02 at 13:46 -0500, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
On Tue, 2005-11-01 at 17:55 -0500, Tom Lane wrote:
I don't think it'd be worth having 2 types. Remember that the weight is
measured in base-10k digits. Suppose for instance
sign 1 bit
weight 7 bits (-64 .. +63)
dscale 8 bits (0..255)
I've coded a short patch to do this, which is the result of two
alternate patches and some thinking, but maybe not enough yet.
What your patch does is
Thanks for checking this so quickly.
sign 2 bits
OK, thats just a mistake in my second patch. Thats easily corrected.
Please ignore that for now.
weight 8 bits (-128..127)
dscale 6 bits (0..63)
which is simply pretty lame: weight effectively has a factor of 8 more
dynamic range than dscale in this representation. What's the point of
being able to represent 1 * 10000 ^ -128 (ie, 10^-512) if the dscale
only lets you show 63 fractional digits? You've got to allocate the
bits in a saner fashion. Yes, that takes a little more work.
I wasn't trying to claim the bit assignment made sense. My point was
that the work to mangle the two fields together to make it make sense
looked like it would take more CPU (since the standard representation of
signed integers is different for +ve and -ve values). It is the extra
CPU I'm worried about, not the wasted bits on the weight. Spending CPU
cycles on *all* numerics just so we can have numbers with > +/-64
decimal places doesn't seem a good trade. Hence I stuck the numeric sign
back on the dscale, and so dscale and weight seem out of balance.
So, AFAICS, the options are:
0. (current cvstip) Numeric range up to 1000, with an additional 2 bytes
per column value
1. Numeric range up to 128, but with overhead to extract the last bit
2. Numeric range up to 64
I'm suggesting we choose (2).... other views are welcome.
(I'll code it whichever way we decide.)
Also, since the internal (unpacked) calculation representation has a
much wider dynamic range than this, it'd probably be appropriate to add
some range checks to the code that forms a packed value from unpacked.
Well, there already is one that does that, otherwise I would have added
one as you suggest. (The unpacked code has int values, whereas the
previous packed format used u/int16 values).
Best Regards, Simon Riggs