On columnar storage
We hope to have a chance to discuss this during the upcoming developer
unconference in Ottawa. Here are some preliminary ideas to shed some
light on what we're trying to do.
I've been trying to figure out a plan to enable native column stores
(CS or "colstore") for Postgres. Motivations:
* avoid the 32 TB limit for tables
* avoid the 1600 column limit for tables
* increased performance
There already are some third-party CS implementations for Postgres; some
of these work on top of the FDW interface, others are simply proprietary
forks. Since I don't have access to any of their code, there's not much
I can learn from them. If people with insider knowledge of them can chime
in, perhaps we can work together -- collaboration is very welcome.
We're not interested in perpetuating the idea that a CS needs to go
through the FDW mechanism. Even though that buys a lot of implementation
simplicity, it's almost certain to introduce too many limitations.
Simply switching all our code to use columnar storage rather than
row-based storage is unlikely to go well. We're aiming at letting some
columns of tables be part of a CS, while other parts would continue to
be in the heap. At the same time, we're aiming at opening the way for
different CS implementations instead of trying to provide a single
one-size-fits-all one.
There are several parts to this:
1. the CSM API
2. Cataloguing column stores
3. Query processing: rewriter, optimizer, executor
The Column Store Manager API
----------------------------
Since we want to have pluggable implementations, we need to have a
registry of store implementations. I propose we add a catalog
pg_cstore_impl with OID, name, and a bunch of function references to
"open" a store, "getvalue" from it, "getrows" (to which we pass a qual
and get a bunch of tuple IDs back), "putvalue".
This is in line with our procedural language support.
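To make the proposed registry more concrete, here's a toy C sketch of what a
CSM routine table could look like, with a trivial in-memory implementation
behind it. Only the names "open", "getvalue", "getrows" and "putvalue" come
from the proposal; the signatures, types and the ToyStore are invented for
illustration and bear no resemblance to what the real catalog-backed
dispatch would look like.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins: a TupleId plays the role of a CTID, Datum32 of a datum. */
typedef uint64_t TupleId;
typedef int32_t  Datum32;

/* Hypothetical routine table; one pg_cstore_impl row would reference
 * functions with roughly these shapes. */
typedef struct ColumnStoreRoutine
{
    void   *(*open)(void);                          /* "open" a store */
    Datum32 (*getvalue)(void *store, TupleId tid);  /* fetch one value */
    void    (*putvalue)(void *store, TupleId tid, Datum32 val);
    /* "getrows": evaluate a qual inside the store, return matching TIDs */
    size_t  (*getrows)(void *store, Datum32 qual_gt,
                       TupleId *out, size_t outsz);
} ColumnStoreRoutine;

/* A toy in-memory implementation, dense-indexed by TID. */
#define TOY_MAX 128
typedef struct { Datum32 vals[TOY_MAX]; } ToyStore;

static ToyStore toy_singleton;

static void *toy_open(void)
{
    memset(&toy_singleton, 0, sizeof(toy_singleton));
    return &toy_singleton;
}

static Datum32 toy_getvalue(void *s, TupleId tid)
{
    return ((ToyStore *) s)->vals[tid];
}

static void toy_putvalue(void *s, TupleId tid, Datum32 v)
{
    ((ToyStore *) s)->vals[tid] = v;
}

/* Return TIDs of all values greater than qual_gt (a hard-coded qual kind). */
static size_t toy_getrows(void *s, Datum32 qual_gt, TupleId *out, size_t outsz)
{
    size_t n = 0;
    for (TupleId t = 0; t < TOY_MAX && n < outsz; t++)
        if (((ToyStore *) s)->vals[t] > qual_gt)
            out[n++] = t;
    return n;
}

/* The "registry" entry; in the real design this is a pg_cstore_impl row. */
static const ColumnStoreRoutine toy_impl = {
    toy_open, toy_getvalue, toy_putvalue, toy_getrows
};
```

The point of routing everything through a struct of function references is
the same as with procedural languages: the core only knows the interface,
and each implementation row supplies its own functions.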
One critical detail is what will be used to identify a heap row when
talking to a CS implementation. There are two main possibilities:
1. use CTIDs
2. use some logical tuple identifier
Using CTIDs is simpler. One disadvantage is that every UPDATE of a row
needs to let the CS know about the new location of the tuple, so that
the value is known to be associated with the new tuple location as well
as the old one. This needs to happen even if the value of the column
itself has not changed.
Using logical tuple identifiers solves this problem: an update does not
change the LTID, so the colstore needn't be involved unless the
attribute(s) stored in the colstore are being changed. The downside is
that the logical tuple identifier must come from somewhere. We could
use some user attribute, if there's something appropriate, but it's
probably not a good idea to simply use whatever primary key the user has
specified. (Also, an UPDATE that changes the primary key would be
troublesome.) We could instead offer the choice of an autogenerated
value that's not user-visible; we currently don't have non-user-visible
columns, so this would be additional implementation effort.
Furthermore, we should think about interactions between this and the
IDENTITY stuff we currently have for replication -- my impression is
that IDENTITY does not necessarily represent a useful identifier for
column store purposes.
All in all, it seems prudent to limit the initial implementation to use
CTIDs only, and leave LTIDs for a later stage.
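The UPDATE problem described above can be shown with a toy sketch (not
Postgres code; all names here are invented for illustration): the store is
keyed by CTID, so when the heap tuple moves, the store must be told about
the new location even though the stored value itself didn't change.

```c
/* Toy column store keyed by CTID (here just a small int index). */
#define SLOTS 16
typedef struct
{
    int val[SLOTS];   /* the stored column value, per CTID */
    int used[SLOTS];  /* whether a value exists for this CTID */
} CtidKeyedStore;

static void cs_put(CtidKeyedStore *cs, int ctid, int v)
{
    cs->val[ctid] = v;
    cs->used[ctid] = 1;
}

static int cs_has(const CtidKeyedStore *cs, int ctid)
{
    return cs->used[ctid];
}

static int cs_get(const CtidKeyedStore *cs, int ctid)
{
    return cs->val[ctid];
}

/* A heap UPDATE moves the row from old_ctid to new_ctid. Even when the
 * colstore column is untouched by the UPDATE, the store must re-key the
 * value; the old entry lingers until the old heap version is vacuumed. */
static void heap_update_relocate(CtidKeyedStore *cs, int old_ctid, int new_ctid)
{
    cs_put(cs, new_ctid, cs_get(cs, old_ctid));
    cs->used[old_ctid] = 0;
}
```

With LTIDs the relocate step would simply not exist, which is exactly the
trade-off discussed above.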
Cataloguing Column Stores
-------------------------
Each table with columns in a separate store will have relhasstore=t.
This tells relcache entry construction to obtain rows from pg_cstore
for that table. The new catalog pg_cstore looks like this:
cstname | cststoreid | cstrelid | cstnatts | cstatts
cstname is the store name, unique within each relation.
cststoreid is the OID of the pg_cstore_impl row.
cstrelid is the OID of the table that this cstore is for.
cstnatts is the number of columns in the store.
cstatts is an array of the attnums contained in this store.
This design permits having multiple stores for a table, and one or
more columns in a store. We will focus on the case that a table has a
single column store, and a column store has a single column, because we
believe this simplifies several things.
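Following the FormData_* convention used for other catalogs, the pg_cstore
row could look roughly like the struct below. This is only a sketch: the
field names are from the proposal, but the types are simplified stand-ins
(a real catalog would use NameData, Oid and int2vector), and the lookup
helper is invented here.

```c
typedef unsigned int Oid;
#define NAMEDATALEN 64
#define MAX_STORE_COLS 32   /* arbitrary toy bound; real code has no such array */

/* One row of the proposed pg_cstore catalog. */
typedef struct FormData_pg_cstore
{
    char cstname[NAMEDATALEN];    /* store name, unique within the relation */
    Oid  cststoreid;              /* OID of the pg_cstore_impl row */
    Oid  cstrelid;                /* OID of the owning table */
    int  cstnatts;                /* number of columns in the store */
    int  cstatts[MAX_STORE_COLS]; /* attnums contained in this store */
} FormData_pg_cstore;

/* Is the given attnum handled by this store? This is the kind of check
 * the rewriter needs when deciding which Vars to redirect. */
static int cstore_contains_attnum(const FormData_pg_cstore *cs, int attnum)
{
    for (int i = 0; i < cs->cstnatts; i++)
        if (cs->cstatts[i] == attnum)
            return 1;
    return 0;
}
```

Since the design permits several stores per table, relcache construction
would collect all pg_cstore rows with a matching cstrelid.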
Query Processing
----------------
Rewriter
Parsing occurs as currently. During query rewrite, specifically at the
bottom of the per-relation loop in fireRIRrules(), we will modify the
query tree: each relation RTE containing a colstore will be replaced
with a JoinExpr containing the relation as left child and the colstore
as right child (1). The colstore RTE will be of a new RTEKind. For
each such change, all Var nodes that point to attnums stored in the
colstore will be modified so that they reference the RTE of the colstore
instead (2).
(1) This seems very similar to what convert_ANY_sublink_to_join() does.
(2) This is very similar to what ChangeVarNodes() does, except that we
modify only some of the Var nodes pointing to the relation, not all of them.
The addition of the new RTE will use a new function
addRangeTableEntryForColstore(). This seems a bit odd, because that's a
parse-time function, but I see that we're already using -ForSubquery in
convert_ANY_sublink_to_join() so it's probably okay.
Another thing worth mentioning is that the colstore and the relation
both need a junk attribute which is used to join them. This will be
the CTID for now, as indicated above.
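The implicit join the rewriter would create can be pictured with a toy
sketch (again not Postgres code): the heap side carries the row-resident
columns plus the CTID junk attribute, and the colstore side is matched on
that CTID to supply its column. All struct and function names below are
illustrative.

```c
#define MAX_OUT 32

typedef struct { int ctid; int a; } HeapRow;   /* heap-resident part */
typedef struct { int ctid; int b; } StoreRow;  /* colstore part */
typedef struct { int a; int b; } JoinedRow;    /* what the user sees */

/* Nested-loop join on the CTID junk attribute: reassemble each logical
 * row from its heap half and its colstore half. */
static int join_on_ctid(const HeapRow *heap, int nheap,
                        const StoreRow *store, int nstore,
                        JoinedRow *out)
{
    int n = 0;
    for (int i = 0; i < nheap && n < MAX_OUT; i++)
        for (int j = 0; j < nstore; j++)
            if (heap[i].ctid == store[j].ctid)
            {
                out[n].a = heap[i].a;
                out[n].b = store[j].b;
                n++;
                break;
            }
    return n;
}
```

In the real planner the join strategy would of course be chosen by the
optimizer rather than hard-coded as a nested loop.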
Planner
I still haven't formed any clear idea of what needs to change in the
planner to handle column stores specifically. So far I have the idea
that we will need to distinguish them from other RTE kinds by using a new
value RELOPT_COLSTORE for RelOptKind. I think it's not okay to use
RELOPT_BASEREL, because a colstore is not a full-blown relation; and
"otherrel" also seems not appropriate because we do need them to appear
in the join tree, as described above, which current otherrels
(Append/MergeAppend children) do not. Some things such as inner join
removal should work on colstore RTEs just like on plain relations.
Unless the optimization problems are shown to be easily solvable, we
will probably disallow having column stores on inherited relations (esp.
partitioned relations). We can improve the optimizer later to allow for
this, after the basics are in place.
Executor
There are a few ideas on a couple of new executor nodes that will need
to operate on colstore RTEs. For example, we can delay accessing
columns in a store until we need to project them; if rows from the main
relation are filtered somewhere upstream, any columns in the store
needn't be accessed for the filtered rows. We have been using the name
LateColumnMaterialization for this.
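The idea can be shown with a toy sketch (not actual executor code; the
names and the fetch counter are invented): the heap-side filter runs first,
and the store is only consulted for rows that survive it, so the fetch
count stays proportional to the output, not the input.

```c
#define NROWS 8

static int g_fetches;  /* counts how many times the store was accessed */

/* Fetch one value from the (toy) column store, counting the access. */
static int store_fetch(const int *colstore, int row)
{
    g_fetches++;
    return colstore[row];
}

/* Late materialization: apply the heap qual first; fetch colstore values
 * only for rows that pass. Returns the number of output rows. */
static int late_materialize(const int *heap_col, const int *colstore,
                            int heap_qual_min, int *out)
{
    int n = 0;
    for (int row = 0; row < NROWS; row++)
        if (heap_col[row] >= heap_qual_min)        /* filter on heap data */
            out[n++] = store_fetch(colstore, row); /* only now touch store */
    return n;
}
```

An eager approach would have performed NROWS store fetches regardless of
the filter's selectivity; deferring the fetch is the whole point of the
LateColumnMaterialization node.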
A completely different consideration is offloading some query quals to a
lone column store scan, which can be combined using BitmapOr or
BitmapAnd with other quals on the relation; the idea is something very
similar to IndexHeapScan in that it takes a qual and produces a list of
TIDs. We're calling this BitmapColumnScan.
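A toy sketch of that combination (illustrative only; a real implementation
would use TIDBitmap machinery, not a 32-bit word): the store evaluates its
qual internally and hands back a bitmap of matching TIDs, which can then be
ANDed with bitmaps produced by other quals before any heap access happens.

```c
#include <stdint.h>

/* BitmapColumnScan: evaluate a qual (here hard-coded as "value > threshold")
 * inside the column store and return a bitmap with bit t set for each
 * matching TID t. Limited to 32 rows in this toy. */
static uint32_t bitmap_column_scan(const int *colstore, int nrows, int threshold)
{
    uint32_t bm = 0;
    for (int t = 0; t < nrows; t++)
        if (colstore[t] > threshold)
            bm |= (uint32_t) 1 << t;
    return bm;
}

/* BitmapAnd: only TIDs satisfying both quals survive to the heap fetch. */
static uint32_t bitmap_and(uint32_t a, uint32_t b)
{
    return a & b;
}
```

The parallel to a bitmap index scan is direct: qual in, TID set out, and the
heap is visited once, only for TIDs in the combined bitmap.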
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under grant
agreement n° 318633.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Jun 11, 2015 at 4:03 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
I've been trying to figure out a plan to enable native column stores
(CS or "colstore") for Postgres. Motivations:
* avoid the 32 TB limit for tables
* avoid the 1600 column limit for tables
* increased performance
And better compression ratio.
We're not interested in perpetuating the idea that a CS needs to go
through the FDW mechanism.
Agreed. It is cleaner to add a ColumnScan node which does a scan
against a columnar table, and possibly a ColumnIndexScan for seeks
on an indexed columnar table.
Since we want to have pluggable implementations, we need to have a
registry of store implementations.
A real native implementation, where the columnar store sits on par with
the heap, would give us arbitrary flexibility to control performance and
transactions, without worrying about compatibility with the interface
you defined below.
One critical detail is what will be used to identify a heap row when
talking to a CS implementation. There are two main possibilities:
1. use CTIDs
2. use some logical tuple identifier
I like the concept of a half-row, half-columnar table: the row part is
good for SELECT * and updates, and the columnar part for other purposes.
Popular columnar-only tables use position alignment, which is virtual
(no storage), to associate each column value. CTIDs are still needed,
but not for this purpose. An alternative is:
1. allow column groups, where several columns are physically stored together;
2. handle updates through a separate row-store table associated with
each columnar table.
Query Processing
----------------
If we treat columnar storage as a first-class citizen like the heap, we
can model it after the heap, which enables much more natural changes in
the parser, rewriter, planner and executor.
Regards,
Qingqing
On 06/11/2015 04:03 PM, Alvaro Herrera wrote:
We hope to have a chance to discuss this during the upcoming developer
unconference in Ottawa. Here are some preliminary ideas to shed some
light on what we're trying to do.
Added to:
https://wiki.postgresql.org/wiki/PgCon_2015_Developer_Unconference#Topics
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh,
* Josh Berkus (josh@agliodbs.com) wrote:
On 06/11/2015 04:03 PM, Alvaro Herrera wrote:
We hope to have a chance to discuss this during the upcoming developer
unconference in Ottawa. Here are some preliminary ideas to shed some
light on what we're trying to do.
Added to:
https://wiki.postgresql.org/wiki/PgCon_2015_Developer_Unconference#Topics
I believe it was already there?
Look for 'Native Columnar Storage'.
Thanks!
Stephen
On Fri, Jun 12, 2015 at 4:33 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
We hope to have a chance to discuss this during the upcoming developer
unconference in Ottawa. Here are some preliminary ideas to shed some
light on what we're trying to do.
I've been trying to figure out a plan to enable native column stores
(CS or "colstore") for Postgres. Motivations:
* avoid the 32 TB limit for tables
* avoid the 1600 column limit for tables
* increased performance
There already are some third-party CS implementations for Postgres; some
of these work on top of the FDW interface, others are simply proprietary
forks. Since I don't have access to any of their code, it's not much I
can learn from them. If people with insider knowledge on them can chime
in, perhaps we can work together -- collaboration is very welcome.
We're not interested in perpetuating the idea that a CS needs to go
through the FDW mechanism. Even if there's a lot of simplicity of
implementation, it's almost certain to introduce too many limitations.
Simply switching all our code to use columnar storage rather than
row-based storage is unlikely to go well. We're aiming at letting some
columns of tables be part of a CS, while other parts would continue to
be in the heap. At the same time, we're aiming at opening the way for
different CS implementations instead of trying to provide a single
one-size-fits-all one.
There are several parts to this:
1. the CSM API
2. Cataloguing column stores
3. Query processing: rewriter, optimizer, executor
I think another important point is the format of the column stores: is
the page format used by index/heap reused, and how are they organised?
The Column Store Manager API
----------------------------
Since we want to have pluggable implementations, we need to have a
registry of store implementations. I propose we add a catalog
pg_cstore_impl with OID, name, and a bunch of function references to
"open" a store, "getvalue" from it, "getrows" (to which we pass a qual
and get a bunch of tuple IDs back), "putvalue".
This is in line with our procedural language support.
One critical detail is what will be used to identify a heap row when
talking to a CS implementation. There are two main possibilities:
1. use CTIDs
2. use some logical tuple identifier
Using CTIDs is simpler. One disadvantage is that every UPDATE of a row
needs to let the CS know about the new location of the tuple, so that
the value is known associated with the new tuple location as well as the
old. This needs to happen even if the value of the column itself is not
changed.
Isn't this somewhat similar to an index segment?
Will the column store obey a snapshot model similar to current heap
tuples, and if so, will it derive the transaction information from the
heap tuple?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 11, 2015 at 7:03 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
I've been trying to figure out a plan to enable native column stores
(CS or "colstore") for Postgres. Motivations:
* avoid the 32 TB limit for tables
* avoid the 1600 column limit for tables
* increased performance
To me, it feels like there are two different features here that would
be better separated. First, there's the idea of having a table that
gets auto-joined to other tables whenever you access it, so that the
user sees one really wide table but really the data is segregated by
column groups under the hood. That's a neat idea. Second, there's
the idea of a way of storing tuples that is different from
PostgreSQL's usual mechanism - i.e. a storage manager API. I
understand your concerns about going through the FDW API so maybe
that's not the right way to do it, but it seems to me that in the end
you are going to end up with something awfully similar to that + a
local relfilenode + WAL logging support. I'm not clear on why you
want to make the column store API totally different from the FDW API;
there may be a reason, but I don't immediately see it.
Each of these two features is independently useful. If you just had
the first feature, you could use the existing table format as your
columnar store. I'm sure it's possible to do a lot better in some
cases, but there could easily be cases where that's a really big win,
because the existing table format has far more sophisticated indexing
capabilities than any columnar store is likely to have in an early
version. The second capability, of course, opens up all sorts of
interesting possibilities, like compressed read-only storage or
index-organized tables. And it would also let you have an
"all-columnar" table, similar to what Citus's cstore_fdw does, which
doesn't seem like it would be supported in your design, and could be a
pretty big win for certain kinds of tables.
BTW, I'm not sure if it's a good idea to call all of this stuff
"cstore". The name is intuitive, but cstore_fdw has enough traction
already that it may create some confusion.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jun 11, 2015 at 6:03 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
We hope to have a chance to discuss this during the upcoming developer
unconference in Ottawa. Here are some preliminary ideas to shed some
light on what we're trying to do.
Quick thought. We already support out of line columnar storage in a
fashion with 'toast'. Obviously that's a long way from where you want
to go, but have you ruled out extending that system?
merlin
Amit Kapila wrote:
On Fri, Jun 12, 2015 at 4:33 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
There are several parts to this:
1. the CSM API
2. Cataloguing column stores
3. Query processing: rewriter, optimizer, executor
I think another important point is about the format of column stores, in
Page format used by index/heap and how are they organised?
Not really. That stuff is part of the column store implementation
itself; we're not tackling that part just yet. Eventually there might
be an implementation using ORC or other formats. That doesn't matter at
this point -- we only need something that implements the specified API.
One critical detail is what will be used to identify a heap row when
talking to a CS implementation. There are two main possibilities:
1. use CTIDs
2. use some logical tuple identifier
Using CTIDs is simpler. One disadvantage is that every UPDATE of a row
needs to let the CS know about the new location of the tuple, so that
the value is known associated with the new tuple location as well as the
old. This needs to happen even if the value of the column itself is not
changed.
Isn't this somewhat similar to index segment?
Not sure what you mean with "index segment". A column store is not an
index -- it is the primary storage for the column in question. The heap
does not have a copy of the data.
Will the column store obey snapshot model similar to current heap tuples,
if so will it derive the transaction information from heap tuple?
Yes, visibility will be tied to the heap tuple -- a value is accessed
only when its corresponding heap row has already been determined to be
visible. One interesting point that arises from this is about vacuum:
when are we able to remove a value from the store? I have some
not-completely-formed ideas about this.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Amit Kapila wrote:
Will the column store obey snapshot model similar to current heap tuples,
if so will it derive the transaction information from heap tuple?
Yes, visibility will be tied to the heap tuple -- a value is accessed
only when its corresponding heap row has already been determined to be
visible. One interesting point that raises from this is about vacuum:
when are we able to remove a value from the store? I have some
not-completely-formed ideas about this.
Hm. This seems not terribly ambitious --- mightn't a column store
extension wish to store tables *entirely* in the column store, rather
than tying them to a perhaps-vestigial heap table? Perhaps that's a
necessary restriction to get to something implementable, but considering
that the proposal mentions pluggable column stores I should think you'd
not want to tie it down that much.
I can't help thinking that this could tie in with the storage level API
that I was waving my arms about last year. Or maybe not --- the goals
are substantially different --- but I think we ought to reflect on that
rather than just doing a narrow hack for column stores used in the
particular way you're describing here.
regards, tom lane
Hi,
On 06/12/15 15:56, Merlin Moncure wrote:
On Thu, Jun 11, 2015 at 6:03 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
We hope to have a chance to discuss this during the upcoming
developer unconference in Ottawa. Here are some preliminary ideas
to shed some light on what we're trying to do.
Quick thought. We already support out of line columnar storage in a
fashion with 'toast'. Obviously that's a long way from where you want
to go, but have you ruled out extending that system?
I doubt it would be a simple extension of TOAST, because it's somewhat
the inverse of a columnar store. TOAST splits a single value into many
pieces, but a colstore needs to store values from multiple rows together.
I wouldn't really call TOAST columnar storage.
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Merlin Moncure wrote:
On Thu, Jun 11, 2015 at 6:03 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
We hope to have a chance to discuss this during the upcoming developer
unconference in Ottawa. Here are some preliminary ideas to shed some
light on what we're trying to do.
Quick thought. We already support out of line columnar storage in a
fashion with 'toast'. Obviously that's a long way from where you want
to go, but have you ruled out extending that system?
TOAST uses pointers in the heap row. A toasted column is still present
in the heap -- there's no way to get across the 1600-column limitation
if we rely on anything stored in the heap. Hence the design of the
feature at hand is that the columnar storage somehow "points" to the
heap, rather than the other way around.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jun 11, 2015 at 7:03 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
We hope to have a chance to discuss this during the upcoming developer
unconference in Ottawa. Here are some preliminary ideas to shed some
light on what we're trying to do.
I've been trying to figure out a plan to enable native column stores
(CS or "colstore") for Postgres. Motivations:
* avoid the 32 TB limit for tables
* avoid the 1600 column limit for tables
* increased performance
Are you looking to avoid all hardware-based limits, or would using a 64
bit row pointer be possible? That would give you 2^64 or 1.8E19 unique
rows over whatever granularity/uniqueness you use (per table, per
database, etc.)
--
Mike Nolan.
Tom Lane wrote:
I can't help thinking that this could tie in with the storage level API
that I was waving my arms about last year. Or maybe not --- the goals
are substantially different --- but I think we ought to reflect on that
rather than just doing a narrow hack for column stores used in the
particular way you're describing here.
I can't seem to remember this proposal you mention. Care to be more
specific? Perhaps a link to archives is enough.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Tom Lane wrote:
I can't help thinking that this could tie in with the storage level API
that I was waving my arms about last year. Or maybe not --- the goals
are substantially different --- but I think we ought to reflect on that
rather than just doing a narrow hack for column stores used in the
particular way you're describing here.
I can't seem to remember this proposal you mention. Care to be more
specific? Perhaps a link to archives is enough.
I never got to the point of having a concrete proposal, but there was a
discussion about it at last year's PGCon unconference; were you there?
Anyway the idea was to try to cut a clearer line between heap storage
and the upper levels of the system, particularly the catalog/DDL code
that we have so much of. Based on Salesforce's experience so far,
it's near impossible to get rid of HeapTuple as the lingua franca for
representing rows in the upper system levels, so we've not really tried;
but it would be nice if the DDL code weren't so much in bed with
heap-specific knowledge, like the wired-into-many-places assumption that
row insert and update actions require index updates but deletions don't.
We're also not very happy with the general assumption that a TID is an
adequate row identifier (as our storage engine does not have TIDs), so
I'm a bit disappointed to see you doubling down on that restriction
rather than trying to lift it.
Now much of this pain only comes into play if one is trying to change
the underlying storage format for system catalogs, which I gather is
not considered in your proposal. But if you want one format for catalogs
and another for user tables then you have issues like how do you guarantee
atomic commit and crash safety across multiple storage engines. That way
lies a mess, especially if you're trying to keep the engines at arm's
length, which is what a pluggable architecture implies. MySQL is a
cautionary example we should be keeping in mind while thinking about this.
I don't really have any answers in this area, just questions.
regards, tom lane
Tom Lane wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Amit Kapila wrote:
Will the column store obey snapshot model similar to current heap tuples,
if so will it derive the transaction information from heap tuple?
Yes, visibility will be tied to the heap tuple -- a value is accessed
only when its corresponding heap row has already been determined to be
visible. One interesting point that raises from this is about vacuum:
when are we able to remove a value from the store? I have some
not-completely-formed ideas about this.
Hm. This seems not terribly ambitious --- mightn't a column store
extension wish to store tables *entirely* in the column store, rather
than tying them to a perhaps-vestigial heap table?
Well, yes, it might, but that opens a huge can of worms. What heapam
offers is not just tuple storage, but a lot of functionality on top of
that -- in particular, tuple locking and visibility. I am certainly not
considering re-implementing any of that. We might eventually go there,
but we will *additionally* need different implementations of those
things, and I'm pretty sure that will be painful, so I'm trying to stay
away from that.
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Tom Lane wrote:
I can't help thinking that this could tie in with the storage level API
that I was waving my arms about last year. Or maybe not --- the goals
are substantially different --- but I think we ought to reflect on that
rather than just doing a narrow hack for column stores used in the
particular way you're describing here.
I can't seem to remember this proposal you mention. Care to be more
specific? Perhaps a link to archives is enough.
I never got to the point of having a concrete proposal, but there was a
discussion about it at last year's PGCon unconference; were you there?
No, regrettably I wasn't there.
Anyway the idea was to try to cut a clearer line between heap storage
and the upper levels of the system, particularly the catalog/DDL code
that we have so much of. Based on Salesforce's experience so far,
it's near impossible to get rid of HeapTuple as the lingua franca for
representing rows in the upper system levels, so we've not really tried;
but it would be nice if the DDL code weren't so much in bed with
heap-specific knowledge, like the wired-into-many-places assumption that
row insert and update actions require index updates but deletions don't.
Agreed on both counts. As far as catalog code goes, removing direct
mapping from HeapTuple to C structs would require a huge rewrite of tons
of code. Unless we're considering rewriting small pieces of specific
catalog handling at a time, it seems unlikely that we will have columns
from system catalogs in column stores. (It seems reasonable that as
soon as we have column stores, we can have particular catalog code to
work with columnar storage, but I don't think there's much need for that
currently.)
I agree with your second point also --- it might be good to have a layer
in between, and it seems not completely unreasonable. It would require
touching lots of places but not hugely transforming things. (I think
it's not in the scope of the things I'm currently after, though.)
We're also not very happy with the general assumption that a TID is an
adequate row identifier (as our storage engine does not have TIDs), so
I'm a bit disappointed to see you doubling down on that restriction
rather than trying to lift it.
Well, in the general design, there is room for different tuple
identifiers. I'm just not implementing it for the first version.
Now much of this pain only comes into play if one is trying to change
the underlying storage format for system catalogs, which I gather is
not considered in your proposal. But if you want one format for catalogs
and another for user tables then you have issues like how do you guarantee
atomic commit and crash safety across multiple storage engines. That way
lies a mess, especially if you're trying to keep the engines at arms'
length which is what a pluggable architecture implies. MySQL is a
cautionary example we should be keeping in mind while thinking about this.
Right. I don't want a separate "storage engine" that needs to
reimplement transactions (as is the case in MySQL), or visibility rules.
I don't want to have a different format for tables or catalogs; both
would still be based on the current heapam API. I simply want to extend
the API so that I can have some columns in a separate place.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Robert Haas wrote:
On Thu, Jun 11, 2015 at 7:03 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
I've been trying to figure out a plan to enable native column stores
(CS or "colstore") for Postgres. Motivations:
* avoid the 32 TB limit for tables
* avoid the 1600 column limit for tables
* increased performance
To me, it feels like there are two different features here that would
be better separated. First, there's the idea of having a table that
gets auto-joined to other tables whenever you access it, so that the
user sees one really wide table but really the data is segregated by
column groups under the hood. That's a neat idea.
Thanks. (It also seems pretty tricky to implement.)
Second, there's the idea of a way of storing tuples that is different
from PostgreSQL's usual mechanism - i.e. a storage manager API. I
understand your concerns about going through the FDW API so maybe
that's not the right way to do it, but it seems to me that in the end
you are going to end up with something awfully similar to that + a
local relfilenode + WAL logging support. I'm not clear on why you
want to make the column store API totally different from the FDW API;
there may be a reason, but I don't immediately see it.
I just don't see that the FDW API is such a good fit for what I'm trying
to do. Anything using the FDW API needs to implement its own visibility
checking, for instance. I want to avoid that, because it's its own
complex problem. Also, it doesn't look like the FDW API supports things
that I want to do (neither my proposed LateColumnMaterialization nor my
proposed BitmapColumnScan). I would have to extend the FDW API, then
contort my stuff so that it fits in the existing API; then I will need
to make sure that existing FDWs are not broken by the changes I would
propose elsewhere. Round peg, square hole is all I see here. All in
all, this seems like too much additional work, just to make two things
that are really addressing different problems go through the same code.
You're correct about "local WAL logging". We will need a solution to
that problem. I was hoping to defer that until we had something like
Alexander Korotkov's proposed pluggable WAL stuff.
> Each of these two features is independently useful. If you just had
> the first feature, you could use the existing table format as your
> columnar store. I'm sure it's possible to do a lot better in some
> cases, but there could easily be cases where that's a really big win,
> because the existing table format has far more sophisticated indexing
> capabilities than any columnar store is likely to have in an early
> version.
Yeah, sure, it's pretty likely that the first experimental colstore
implementation will just be based on existing infrastructure.
> The second capability, of course, opens up all sorts of
> interesting possibilities, like compressed read-only storage or
> index-organized tables. And it would also let you have an
> "all-columnar" table, similar to what Citus's cstore_fdw does, which
> doesn't seem like it would be supported in your design, and could be a
> pretty big win for certain kinds of tables.
Well, I would like to know about the use cases that the Citus stuff is
good at, so that we can make sure they work reasonably under my proposed
design. Maybe I have to require vacuuming so that the all-visible bits
are set, so that a column scan can skip visibility checking for most of
the underlying heap tuples to produce a large aggregation report. That
seems pretty reasonable to me.
> BTW, I'm not sure if it's a good idea to call all of this stuff
> "cstore". The name is intuitive, but cstore_fdw has enough traction
> already that it may create some confusion.
I'm not thinking of calling anything user-visible with the name
"cstore". It's just a development term. Surely we're not reserving
names from what is used in third party code.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 06/13/15 00:07, Michael Nolan wrote:
> On Thu, Jun 11, 2015 at 7:03 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > We hope to have a chance to discuss this during the upcoming developer
> > unconference in Ottawa. Here are some preliminary ideas to shed some
> > light on what we're trying to do.
> >
> > I've been trying to figure out a plan to enable native column stores
> > (CS or "colstore") for Postgres. Motivations:
> >
> > * avoid the 32 TB limit for tables
> > * avoid the 1600 column limit for tables
> > * increased performance
>
> Are you looking to avoid all hardware-based limits, or would using a 64
> bit row pointer be possible? That would give you 2^64 or 1.8E19 unique
> rows over whatever granularity/uniqueness you use (per table, per
> database, etc.)
> --
> Mike Nolan.
I don't think the number of tuples is the main problem here, it's the
number of pages a single relation can have. Looking at the numbers of
rows as a direct function of TID size is misleading, because the TID is
split into two fixed parts - page number (32b) and tuple number (16b).
For the record, 2^48 is 281,474,976,710,656, which ought to be enough for
anybody, but we waste a large part of that because we assume there might
be up to 2^16 tuples per page, although the actual limit is way lower
(~290 for 8kB pages, and ~1200 for 32kB pages).
So we can only have ~4 billion pages, which is where the 32TB limit
comes from (with 32kB pages it's 128TB).
Longer TIDs are one straightforward way to work around this limit,
assuming you add the bits to the 'page number' field. Adding 16 bits
(thus using 64-bit pointers) would increase the limit 2^16-times to
about 2048 petabytes (with 8kB pages). But that of course comes with a
cost, because you have to keep those larger TIDs in indexes etc.
Another option might be to split the 48 bits differently, by moving 5
bits to the page number part of TID (so that we expect ~2048 tuples per
page at most). That'd increase the limit to 1PB (4PB with 32kB pages).
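A quick back-of-the-envelope check of these numbers (plain Python, just
arithmetic on the TID bit splits discussed above; not PostgreSQL code):

```python
# Sanity-check the size limits implied by the TID (page number,
# tuple number) bit split: the maximum table size is simply
# 2^(page-number bits) * page size.

def max_table_bytes(page_number_bits, page_size):
    return (2 ** page_number_bits) * page_size

TB = 1024 ** 4
PB = 1024 ** 5

# Current 48-bit TID: 32-bit page number, 16-bit tuple number.
assert max_table_bytes(32, 8 * 1024) == 32 * TB     # the 32TB limit
assert max_table_bytes(32, 32 * 1024) == 128 * TB   # 128TB with 32kB pages

# Moving 5 bits from the tuple number to the page number
# (~2048 tuples per page at most):
assert max_table_bytes(37, 8 * 1024) == 1 * PB      # 1PB
assert max_table_bytes(37, 32 * 1024) == 4 * PB     # 4PB with 32kB pages

# 64-bit TID, i.e. 16 extra bits in the page-number field:
assert max_table_bytes(48, 8 * 1024) == 2048 * PB   # ~2048 petabytes
```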
The column store approach is somewhat orthogonal to this, because it
splits the table vertically into multiple pieces, each stored in a
separate relfilenode and thus using a separate sequence of page numbers.
And of course, the usual 'horizontal' partitioning has a very similar
effect (separate filenodes).
regards
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 06/14/15 17:22, Alvaro Herrera wrote:
> Robert Haas wrote:
> > ...
> >
> > Second, there's the idea of a way of storing tuples that is different
> > from PostgreSQL's usual mechanism - i.e. a storage manager API. I
> > understand your concerns about going through the FDW API so maybe
> > that's not the right way to do it, but it seems to me that in the end
> > you are going to end up with something awfully similar to that + a
> > local relfilenode + WAL logging support. I'm not clear on why you
> > want to make the column store API totally different from the FDW API;
> > there may be a reason, but I don't immediately see it.
>
> I just don't see that the FDW API is such a good fit for what I'm trying
> to do. Anything using the FDW API needs to implement its own visibility
> checking, for instance. I want to avoid that, because it's its own
> complex problem. Also, it doesn't look like the FDW API supports things
> that I want to do (neither my proposed LateColumnMaterialization nor my
> proposed BitmapColumnScan). I would have to extend the FDW API, then
> contort my stuff so that it fits in the existing API; then I will need
> to make sure that existing FDWs are not broken by the changes I would
> propose elsewhere. Round peg, square hole is all I see here. All in
> all, this seems like too much additional work, just to make two things
> that are really addressing different problems go through the same code.
>
> You're correct about "local WAL logging". We will need a solution to
> that problem. I was hoping to defer that until we had something like
> Alexander Korotkov's proposed pluggable WAL stuff.
Probably worth mentioning that we don't expect the column store API to
be used only by extensions, but also by code within core, which can use
the current WAL infrastructure etc.
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-06-14 17:36, Tomas Vondra wrote:
> On 06/14/15 17:22, Alvaro Herrera wrote:
> > [...]
> >
> > You're correct about "local WAL logging". We will need a solution to
> > that problem. I was hoping to defer that until we had something like
> > Alexander Korotkov's proposed pluggable WAL stuff.
>
> Probably worth mentioning that we don't expect the column store API to
> be used only by extensions, but also by code within core, which can use
> the current WAL infrastructure etc.
I think it would even be ok if the initial implementation had
restrictions such that extensions can't use it. I think we want an
implementation in core first; support for extensions can come later.
But it's still worth having an API, since a single implementation
probably won't fit every purpose (see indexes).
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2015-06-11 20:03:16 -0300, Alvaro Herrera wrote:
> Rewriter
>
> Parsing occurs as currently. During query rewrite, specifically at the
> bottom of the per-relation loop in fireRIRrules(), we will modify the
> query tree: each relation RTE containing a colstore will be replaced
> with a JoinExpr containing the relation as left child and the colstore
> as right child (1). The colstore RTE will be of a new RTEKind. For
> each such change, all Var nodes that point to attnums stored in the
> colstore will be modified so that they reference the RTE of the
> colstore instead (2).
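As an illustration only, the two-step transformation quoted above could
be modeled roughly like this (plain Python stand-ins for the parse-tree
nodes; none of these classes or names are PostgreSQL's actual C
structures):

```python
# Toy model of the proposed rewrite: a relation RTE becomes a join of
# its heap part (left) and its column store (right), and Vars for
# column-store attributes are repointed at the colstore RTE.

class RTE:
    def __init__(self, name, kind, columns):
        self.name = name
        self.kind = kind          # "relation", or the proposed new "colstore" kind
        self.columns = columns    # column names this RTE provides

class JoinExpr:
    def __init__(self, larg, rarg):
        self.larg = larg          # (1) relation (heap part) as left child
        self.rarg = rarg          # (1) colstore as right child

class Var:
    def __init__(self, rte, attname):
        self.rte = rte
        self.attname = attname

def rewrite_rte(rte, colstore_columns, vars_in_query):
    """Replace a relation RTE by a heap/colstore join, repointing Vars."""
    heap = RTE(rte.name, "relation",
               [c for c in rte.columns if c not in colstore_columns])
    cstore = RTE(rte.name + "_cs", "colstore", colstore_columns)
    # (2) Vars that reference attributes stored in the colstore now
    # reference the colstore RTE instead of the original relation.
    for v in vars_in_query:
        if v.rte is rte and v.attname in colstore_columns:
            v.rte = cstore
    return JoinExpr(heap, cstore)
```

So a query over a table t(a, b, c) with column b in a colstore would,
under this sketch, end up scanning the heap for (a, c) joined against
the colstore for b.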
FWIW, I think this is a pretty bad place to tackle this. For one I think
we shouldn't add more stuff using the rewriter unless it's clearly the
best interface. For another, doing things in the rewriter will make
optimizing things much harder - the planner will have to reconstruct
knowledge of which joins are column store joins, and such.
Why do you want to do things there?
Greetings,
Andres Freund