A space-efficient, user-friendly way to store categorical data

Started by Andrew Kanealmost 8 years ago12 messages
#1Andrew Kane
andrew@chartkick.com

Hi,

I'm hoping to get feedback on an idea for a new data type to allow for
efficient storage of text values while keeping reads and writes
user-friendly. Suppose you want to store categorical data like current city
for users. There will be a long list of cities, and many users will have
the same city. Some options are:

- Use a text column
- Use an enum column - saves space, but labels must be set ahead of time
- Create another table for cities (normalize) - saves space, but
complicates reads and writes

A better option could be a new "dynamic enum" type, which would have
similar storage requirements as an enum, but instead of labels being
declared ahead of time, they would be added as data is inserted.

It'd be great to hear what others think of this (or if I'm missing
something). Another direction could be to deduplicate values for TOAST-able
data types.

Thanks,
Andrew

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Kane (#1)
Re: A space-efficient, user-friendly way to store categorical data

Andrew Kane <andrew@chartkick.com> writes:

A better option could be a new "dynamic enum" type, which would have
similar storage requirements as an enum, but instead of labels being
declared ahead of time, they would be added as data is inserted.

You realize, of course, that it's possible to add labels to an enum type
today. (Removing them is another story.)

You haven't explained exactly what you have in mind that is going to be
able to duplicate the advantages of the current enum implementation
without its disadvantages, so it's hard to evaluate this proposal.

regards, tom lane

#3Andrew Dunstan
andrew.dunstan@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: A space-efficient, user-friendly way to store categorical data

On Mon, Feb 12, 2018 at 9:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew Kane <andrew@chartkick.com> writes:

A better option could be a new "dynamic enum" type, which would have
similar storage requirements as an enum, but instead of labels being
declared ahead of time, they would be added as data is inserted.

You realize, of course, that it's possible to add labels to an enum type
today. (Removing them is another story.)

You haven't explained exactly what you have in mind that is going to be
able to duplicate the advantages of the current enum implementation
without its disadvantages, so it's hard to evaluate this proposal.

This sounds rather like the idea I have been tossing around in my head
for a while, and in sporadic discussions with a few people, for a
dictionary object. The idea is to have an append-only list of labels
which would not obey transactional semantics, and would thus help us
avoid the pitfalls of enums - there wouldn't be any rollback of an
addition. The use case would be for a jsonb representation which
would replace object keys with the oid value of the corresponding
dictionary entry rather like enums now. We could have a per-table
dictionary which in most typical json use cases would be very small,
and we know from some experimental data that the compression in space
used from such a change would often be substantial.

This would have to be modifiable dynamically rather than requiring
explicit additions to the dictionary, to be of practical use for the
jsonb case, I believe.

I hadn't thought about this as a sort of super enum that was usable
directly by users, but it makes sense.

I have no idea how hard or even possible it would be to implement.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Andrew Dunstan (#3)
Re: A space-efficient, user-friendly way to store categorical data

On Mon, Feb 12, 2018 at 12:24 PM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:

On Mon, Feb 12, 2018 at 9:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew Kane <andrew@chartkick.com> writes:

A better option could be a new "dynamic enum" type, which would have
similar storage requirements as an enum, but instead of labels being
declared ahead of time, they would be added as data is inserted.

You realize, of course, that it's possible to add labels to an enum type
today. (Removing them is another story.)

You haven't explained exactly what you have in mind that is going to be
able to duplicate the advantages of the current enum implementation
without its disadvantages, so it's hard to evaluate this proposal.

This sounds rather like the idea I have been tossing around in my head
for a while, and in sporadic discussions with a few people, for a
dictionary object. The idea is to have an append-only list of labels
which would not obey transactional semantics, and would thus help us
avoid the pitfalls of enums - there wouldn't be any rollback of an
addition. The use case would be for a jsonb representation which
would replace object keys with the oid value of the corresponding
dictionary entry rather like enums now. We could have a per-table
dictionary which in most typical json use cases would be very small,
and we know from some experimental data that the compression in space
used from such a change would often be substantial.

This would have to be modifiable dynamically rather than requiring
explicit additions to the dictionary, to be of practical use for the
jsonb case, I believe.

I hadn't thought about this as a sort of super enum that was usable
directly by users, but it makes sense.

I have no idea how hard or even possible it would be to implement.

I have had thoughts over the years about something similar, but going
the other way and hiding it from the end user. If you could declare a
column to have a special compressed property (independently of the
type) then it could either automatically maintain a dictionary, or at
least build a new dictionary for your when you next run some kind of
COMPRESS operation. There would be no user visible difference except
footprint. In ancient DB2 they had a column property along those
lines called "VALUE COMPRESSION" (they also have a row-level version,
and now they have much more advanced kinds of adaptive compression
that I haven't kept up with). In some ways it'd be a bit like toast
with shared entries, but I haven't seriously looked into how such a
thing might be implemented.

--
Thomas Munro
http://www.enterprisedb.com

#5Joe Conway
mail@joeconway.com
In reply to: Thomas Munro (#4)
Re: A space-efficient, user-friendly way to store categorical data

On 02/11/2018 10:06 PM, Thomas Munro wrote:

On Mon, Feb 12, 2018 at 12:24 PM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:

On Mon, Feb 12, 2018 at 9:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew Kane <andrew@chartkick.com> writes:

A better option could be a new "dynamic enum" type, which would have
similar storage requirements as an enum, but instead of labels being
declared ahead of time, they would be added as data is inserted.

You realize, of course, that it's possible to add labels to an enum type
today. (Removing them is another story.)

You haven't explained exactly what you have in mind that is going to be
able to duplicate the advantages of the current enum implementation
without its disadvantages, so it's hard to evaluate this proposal.

This sounds rather like the idea I have been tossing around in my head
for a while, and in sporadic discussions with a few people, for a
dictionary object. The idea is to have an append-only list of labels
which would not obey transactional semantics, and would thus help us
avoid the pitfalls of enums - there wouldn't be any rollback of an
addition. The use case would be for a jsonb representation which
would replace object keys with the oid value of the corresponding
dictionary entry rather like enums now. We could have a per-table
dictionary which in most typical json use cases would be very small,
and we know from some experimental data that the compression in space
used from such a change would often be substantial.

This would have to be modifiable dynamically rather than requiring
explicit additions to the dictionary, to be of practical use for the
jsonb case, I believe.

I hadn't thought about this as a sort of super enum that was usable
directly by users, but it makes sense.

I have no idea how hard or even possible it would be to implement.

I have had thoughts over the years about something similar, but going
the other way and hiding it from the end user. If you could declare a
column to have a special compressed property (independently of the
type) then it could either automatically maintain a dictionary, or at
least build a new dictionary for your when you next run some kind of
COMPRESS operation. There would be no user visible difference except
footprint. In ancient DB2 they had a column property along those
lines called "VALUE COMPRESSION" (they also have a row-level version,
and now they have much more advanced kinds of adaptive compression
that I haven't kept up with). In some ways it'd be a bit like toast
with shared entries, but I haven't seriously looked into how such a
thing might be implemented.

For what it is worth, there is a similar concept in R called "factors".
When categorical data is stored in a data.frame (R language equivalent
to relations) it is transparently and automatically converted. I believe
this is both for storage compression and to facilitate some of the
analytics. In R you can also explicitly specify to *not* convert strings
to factors as a performance optimization, because that conversion does
have a noticeable impact during ingestion and is not always needed.

I can also envision usefulness of this type of mechanism in other
security related scenarios.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

#6Mark Dilger
hornschnorter@gmail.com
In reply to: Andrew Kane (#1)
Re: A space-efficient, user-friendly way to store categorical data

On Feb 10, 2018, at 7:46 PM, Andrew Kane <andrew@chartkick.com> wrote:

Hi,

I'm hoping to get feedback on an idea for a new data type to allow for efficient storage of text values while keeping reads and writes user-friendly. Suppose you want to store categorical data like current city for users. There will be a long list of cities, and many users will have the same city. Some options are:

- Use a text column
- Use an enum column - saves space, but labels must be set ahead of time
- Create another table for cities (normalize) - saves space, but complicates reads and writes

A better option could be a new "dynamic enum" type, which would have similar storage requirements as an enum, but instead of labels being declared ahead of time, they would be added as data is inserted.

It'd be great to hear what others think of this (or if I'm missing something). Another direction could be to deduplicate values for TOAST-able data types.

I have already done this and use it in production. My implementation is
specific to my needs, and would have to be made more general purpose if
it were ever to be accepted by this community, but I'll lay out the
basic design for your consideration.

I maintain multiple mappings of text to/from Oid.

Some of the names, typically the most common ones that I am likely to
encounter, are known at compile time. I have a quick hardcoded lookup
that maps these names to their compile time assigned Oid. Conversions
from the name to Oid or visa versa happen without entailing any shared
lock overhead.

For each mapping, I maintain a separate catalog table. The compile time
known values are inserted into the catalog with DATA lines, per the
usual mechanism for populating catalog tables during bootstrap.

For now, keeping the functions that map Oids to and from compile time
known strings synchronized with the DATA lines is a manual PITA process,
except that I almost never need to extend the mappings, and as such have
not yet bothered to write a perl script for managing it.

For each mapping, I also have a cache. See syscache.h. I have
functions defined that convert between names and Oids that check the
hardcoded values, then the cache, then the full catalog table if not
found in the cache. It all works transactionally and plays well with
other processes that are adding or looking up values simultaneously.
There are more than just 'get' and 'set' functions defined, because
sometimes when you 'get' a value, you want it to be assigned an entry if
it does not exist yet, and sometimes you don't.

For a few of the mappings, where I know the list will never get too
large, I map from text to one or two bytes rather than to a full four
byte Oid. This helps, because I have other tables with multiple columns
that are mappings of this sort, such that I can pack two or three
mapping columns into just 4 bytes. Whether you'd get a similar payoff
depends a lot on your table structure. Since I'm using syscache, I have
to store a full four bytes there, but the tables that reference that
don't have to spend a full four bytes. To make the conversion easier, I
have types smalloid (2 byte) and tinyoid (1 byte) defined, and an Oid
range above 10000 that is reserved for their use, with a mapping to/from
the lower range [0..255] or [0..65535] automatically performed when
casting to/from Oid.

The main reason I prefer this over the built-in enum type is that for
the data distribution I am handling, I have greater than 90% of the
lookups succeed without looking beyond the compile-time known values.
(For some mappings, it's really more like 100%, with the possibility
of run-time added values being a theoretical possibility that I don't
want to exclude, but as yet has never happened in practice.)

The primary drawback of my implementation from the point of view of
making it general purpose is that you need to know something about your
data distribution before you compile postgres. AFAIK, most postgres
users don't compile their own copy, but use an RPM or some such, and
that puts my type of solution out of reach for them.

The second most obvious problem with my implementation is that it lacks
any means for removing values from the mappings. Once added, they are
never removed. For me, this is a good thing, because it means I don't
need to create indexes referencing a master table of mappings to
guarantee data integrity, and for my workload, none of the mappings are
the kind of thing where mappings data would ever get expired.

I am in a really small niche situation that would likely not be
applicable for most other users.

Extending this to help the community as a whole in the form of something
general purpose seems difficult, which is why I never tried to
contribute it.

Could you perhaps tell me how similar your situation is to mine, and
whether my approach is anything like what you were contemplating? I am
curious to hear an answer to Tom's question, upthread, about why you
would not just use the enumeration type that postgres already supplies.

Mark Dilger

#7Andrew Kane
andrew@chartkick.com
In reply to: Mark Dilger (#6)
Re: A space-efficient, user-friendly way to store categorical data

Thanks everyone for the feedback. The current enum implementation requires
you to create a new type and add labels outside a transaction prior to an
insert.

-- on table creation
CREATE TYPE city AS ENUM ();
CREATE TABLE "users" ("city" city);

-- on insert
ALTER TYPE city ADD VALUE IF NOT EXISTS 'Chicago';
BEGIN;
INSERT INTO "users" ("city") VALUES ('Chicago');
COMMIT;

What would be ideal:

-- on table creation
CREATE TABLE "users" ("city" dynamic_enum);

-- on insert
BEGIN;
INSERT INTO "users" ("city") VALUES ('Chicago');
COMMIT;

Since enums have a fixed number of labels, this type of feature may be
better off as a property you could add to text columns (as Thomas
mentions). This would avoid issues with hitting the max number of labels.

- Andrew

On Mon, Feb 12, 2018 at 4:06 PM, Mark Dilger <hornschnorter@gmail.com>
wrote:

Show quoted text

On Feb 10, 2018, at 7:46 PM, Andrew Kane <andrew@chartkick.com> wrote:

Hi,

I'm hoping to get feedback on an idea for a new data type to allow for

efficient storage of text values while keeping reads and writes
user-friendly. Suppose you want to store categorical data like current city
for users. There will be a long list of cities, and many users will have
the same city. Some options are:

- Use a text column
- Use an enum column - saves space, but labels must be set ahead of time
- Create another table for cities (normalize) - saves space, but

complicates reads and writes

A better option could be a new "dynamic enum" type, which would have

similar storage requirements as an enum, but instead of labels being
declared ahead of time, they would be added as data is inserted.

It'd be great to hear what others think of this (or if I'm missing

something). Another direction could be to deduplicate values for TOAST-able
data types.

I have already done this and use it in production. My implementation is
specific to my needs, and would have to be made more general purpose if
it were ever to be accepted by this community, but I'll lay out the
basic design for your consideration.

I maintain multiple mappings of text to/from Oid.

Some of the names, typically the most common ones that I am likely to
encounter, are known at compile time. I have a quick hardcoded lookup
that maps these names to their compile time assigned Oid. Conversions
from the name to Oid or visa versa happen without entailing any shared
lock overhead.

For each mapping, I maintain a separate catalog table. The compile time
known values are inserted into the catalog with DATA lines, per the
usual mechanism for populating catalog tables during bootstrap.

For now, keeping the functions that map Oids to and from compile time
known strings synchronized with the DATA lines is a manual PITA process,
except that I almost never need to extend the mappings, and as such have
not yet bothered to write a perl script for managing it.

For each mapping, I also have a cache. See syscache.h. I have
functions defined that convert between names and Oids that check the
hardcoded values, then the cache, then the full catalog table if not
found in the cache. It all works transactionally and plays well with
other processes that are adding or looking up values simultaneously.
There are more than just 'get' and 'set' functions defined, because
sometimes when you 'get' a value, you want it to be assigned an entry if
it does not exist yet, and sometimes you don't.

For a few of the mappings, where I know the list will never get too
large, I map from text to one or two bytes rather than to a full four
byte Oid. This helps, because I have other tables with multiple columns
that are mappings of this sort, such that I can pack two or three
mapping columns into just 4 bytes. Whether you'd get a similar payoff
depends a lot on your table structure. Since I'm using syscache, I have
to store a full four bytes there, but the tables that reference that
don't have to spend a full four bytes. To make the conversion easier, I
have types smalloid (2 byte) and tinyoid (1 byte) defined, and an Oid
range above 10000 that is reserved for their use, with a mapping to/from
the lower range [0..255] or [0..65535] automatically performed when
casting to/from Oid.

The main reason I prefer this over the built-in enum type is that for
the data distribution I am handling, I have greater than 90% of the
lookups succeed without looking beyond the compile-time known values.
(For some mappings, it's really more like 100%, with the possibility
of run-time added values being a theoretical possibility that I don't
want to exclude, but as yet has never happened in practice.)

The primary drawback of my implementation from the point of view of
making it general purpose is that you need to know something about your
data distribution before you compile postgres. AFAIK, most postgres
users don't compile their own copy, but use an RPM or some such, and
that puts my type of solution out of reach for them.

The second most obvious problem with my implementation is that it lacks
any means for removing values from the mappings. Once added, they are
never removed. For me, this is a good thing, because it means I don't
need to create indexes referencing a master table of mappings to
guarantee data integrity, and for my workload, none of the mappings are
the kind of thing where mappings data would ever get expired.

I am in a really small niche situation that would likely not be
applicable for most other users.

Extending this to help the community as a whole in the form of something
general purpose seems difficult, which is why I never tried to
contribute it.

Could you perhaps tell me how similar your situation is to mine, and
whether my approach is anything like what you were contemplating? I am
curious to hear an answer to Tom's question, upthread, about why you
would not just use the enumeration type that postgres already supplies.

Mark Dilger

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Kane (#7)
Re: A space-efficient, user-friendly way to store categorical data

Andrew Kane <andrew@chartkick.com> writes:

Thanks everyone for the feedback. The current enum implementation requires
you to create a new type and add labels outside a transaction prior to an
insert.

Right ...

Since enums have a fixed number of labels, this type of feature may be
better off as a property you could add to text columns (as Thomas
mentions). This would avoid issues with hitting the max number of labels.

... but you're not saying how you'd avoid the need for prior commit of the
labels. The sticking point for enums is that once a value has gotten into
a btree index, we can't ever lose the ability to compare that value to
others, or the index will be broken. So inserting an uncommitted value
into user tables has to be prevented.

Maybe there's a way to assign the labels so that they can be compared
without reference to any outside data, but it's not clear to me how
that would work.

regards, tom lane

#9Mark Dilger
hornschnorter@gmail.com
In reply to: Andrew Kane (#7)
Re: A space-efficient, user-friendly way to store categorical data

On Feb 12, 2018, at 5:08 PM, Andrew Kane <andrew@chartkick.com> wrote:

Thanks everyone for the feedback. The current enum implementation requires you to create a new type and add labels outside a transaction prior to an insert.

-- on table creation
CREATE TYPE city AS ENUM ();
CREATE TABLE "users" ("city" city);

-- on insert
ALTER TYPE city ADD VALUE IF NOT EXISTS 'Chicago';
BEGIN;
INSERT INTO "users" ("city") VALUES ('Chicago');
COMMIT;

What would be ideal:

-- on table creation
CREATE TABLE "users" ("city" dynamic_enum);

-- on insert
BEGIN;
INSERT INTO "users" ("city") VALUES ('Chicago');
COMMIT;

Since enums have a fixed number of labels, this type of feature may be better off as a property you could add to text columns (as Thomas mentions). This would avoid issues with hitting the max number of labels.

In your proposed feature, what happens if I create two tables:

CREATE TABLE myusers (city dynamic_enum);
CREATE TABLE yourusers (city dynamic_enum);

Do you imagine that myusers and yourusers are referring to the
same enum or to two different enums? Are the enums stored in
a new table within pg_catalog, or are they stored in something akin
to a toast table? If you insert billions of rows into a table, but only
have 30 distinct values, can you quickly query for all 30 distinct enum
values, or would you have to walk billions of rows to find them all?

mark

#10Mark Dilger
hornschnorter@gmail.com
In reply to: Tom Lane (#8)
Re: A space-efficient, user-friendly way to store categorical data

On Feb 12, 2018, at 6:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew Kane <andrew@chartkick.com> writes:

Thanks everyone for the feedback. The current enum implementation requires
you to create a new type and add labels outside a transaction prior to an
insert.

Right ...

Since enums have a fixed number of labels, this type of feature may be
better off as a property you could add to text columns (as Thomas
mentions). This would avoid issues with hitting the max number of labels.

... but you're not saying how you'd avoid the need for prior commit of the
labels. The sticking point for enums is that once a value has gotten into
a btree index, we can't ever lose the ability to compare that value to
others, or the index will be broken. So inserting an uncommitted value
into user tables has to be prevented.

Maybe there's a way to assign the labels so that they can be compared
without reference to any outside data, but it's not clear to me how
that would work.

When I implemented this, I wrote the comparators to work on the Oid for
the value, not the string representation. That works fine. If you want to
sort the data on the stringified version, cast to text first. That works well
enough for me, since I'm typically not interested in what sort order is used,
as long as it is deterministic and works for indexing, group by, and so forth.

#11Andrew Kane
andrew@chartkick.com
In reply to: Mark Dilger (#10)
Re: A space-efficient, user-friendly way to store categorical data

They'd refer to separate enums.

I originally thought an enum was a good comparison for this feature, but
I'm no longer sure that it is. A text-based ordering would be desired
rather than the label index.

A better comparison may be a two-column lookup table:

-- create
CREATE TABLE cities (id bigserial primary key, name text)
CREATE UNIQUE INDEX ON cities (name);
CREATE TABLE users (city_id bigint);

-- insert
BEGIN;
INSERT INTO cities (name) VALUES ('Chicago') ON CONFLICT (name) DO NOTHING
RETURNING id;
INSERT INTO users (city_id) VALUES (<city id returned from earlier>);
COMMIT;

-- select
SELECT * FROM users FROM users INNER JOIN cities ON cities.id =
users.city_id WHERE name = 'Chicago';

Ideally, the lookup table could be maintained by Postgres to make reads and
writes easier.

-- create
CREATE TABLE users (city text DEDUPED);

-- insert
INSERT INTO users (city) VALUES ('Chicago');

-- query
SELECT * FROM users WHERE city = 'Chicago';

I'm not really sure the best place to store this lookup table.

- Andrew

On Mon, Feb 12, 2018 at 7:11 PM, Mark Dilger <hornschnorter@gmail.com>
wrote:

Show quoted text

On Feb 12, 2018, at 6:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew Kane <andrew@chartkick.com> writes:

Thanks everyone for the feedback. The current enum implementation

requires

you to create a new type and add labels outside a transaction prior to

an

insert.

Right ...

Since enums have a fixed number of labels, this type of feature may be
better off as a property you could add to text columns (as Thomas
mentions). This would avoid issues with hitting the max number of

labels.

... but you're not saying how you'd avoid the need for prior commit of

the

labels. The sticking point for enums is that once a value has gotten

into

a btree index, we can't ever lose the ability to compare that value to
others, or the index will be broken. So inserting an uncommitted value
into user tables has to be prevented.

Maybe there's a way to assign the labels so that they can be compared
without reference to any outside data, but it's not clear to me how
that would work.

When I implemented this, I wrote the comparators to work on the Oid for
the value, not the string representation. That works fine. If you want to
sort the data on the stringified version, cast to text first. That works
well
enough for me, since I'm typically not interested in what sort order is
used,
as long as it is deterministic and works for indexing, group by, and so
forth.

#12Andres Freund
andres@anarazel.de
In reply to: Andrew Dunstan (#3)
Re: A space-efficient, user-friendly way to store categorical data

Hi,

On 2018-02-12 09:54:29 +1030, Andrew Dunstan wrote:

The idea is to have an append-only list of labels
which would not obey transactional semantics, and would thus help us
avoid the pitfalls of enums - there wouldn't be any rollback of an
addition.

FWIW, I think we can resolve the issues of enum transactionality. While
not trivial I think the solution is to allow to make additions of enum
values non-transactional (by basically inserting them with immediate
visibility) *while* in a transaction, and to change deletion of values
to set a 'deleted' column to true.

With some additional effort we could even make additions transactional
by storing the xid that added the enum label along the row. Visibility
for index comparisons would regard labels as visible regardless of that
xid, but when creating new values we'd make a visibility check. The big
issue with that is that autovacuum has to know about it...

Greetings,

Andres Freund