Zedstore - compressed in-core columnar storage

Started by Ashwin Agrawal, about 7 years ago, 123 messages, pgsql-hackers
#1 Ashwin Agrawal
aagrawal@pivotal.io

Heikki and I have been hacking for the past few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and an initial
implementation of Zedstore, a compressed in-core columnar store (table
access method). The patch is attached, along with a link to the github
branch [1] to follow along.

The objective is to gather feedback on the design and approach. The
implementation has the core basic pieces working, but is not close to
complete.

A big thank you to Andres, Haribabu and team for the table access method
APIs. We leveraged the APIs to implement zedstore, and they proved to be
in very good shape. We had to enhance them only minimally, and in general
didn't have to touch the executor much.

Motivations / Objectives

* Performance improvement for queries selecting a subset of columns
(reduced IO).
* Reduced on-disk footprint compared to heap tables: shorter tuple
headers, plus compression of similar-typed data.
* Be a first-class citizen in the Postgres architecture (table data can
just independently live in columnar storage).
* Fully MVCC compliant.
* All indexes supported.
* Hybrid row-column store, where some columns are stored together, and
others separately. Provide flexibility of granularity on how to
divide the columns. Columns accessed together can be stored
together.
* Provide better control over bloat (similar to zheap).
* Eliminate the need for separate toast tables.
* Faster add/drop column, or changing a column's data type, by avoiding
a full rewrite of the table.

High-level Design - B-trees for the win!
========================================

To start simple, let's ignore the column store aspect for a moment and
consider it as a compressed row store. The column store is a natural
extension of this concept, explained in the next section.

The basic on-disk data structure is a B-tree, indexed by TID. The B-tree
is a great data structure: fast and versatile. Note that this does not
refer to the existing B-tree indexes, but to a net-new, separate B-tree
used for table data storage.

TID - logical row identifier:
TID is just a 48-bit row identifier. The traditional division into
block and offset numbers is meaningless. In order to find a tuple with
a given TID, one must always descend the B-tree. A logical TID provides
the flexibility to move tuples around to different pages when page
splits or page merges are performed.

The internal pages of the B-tree are super simple and boring. Each
internal page just stores an array of TID/downlink pairs. Let's
focus on the leaf level. Leaf blocks have a short uncompressed header,
followed by B-tree items. Two kinds of items exist:

- a plain item, holding one tuple or one datum, with an uncompressed payload
- a "container item", holding multiple plain items, with a compressed payload

+-----------------------------
| Fixed-size page header:
|
| LSN
| TID low and hi key (for Lehman & Yao B-tree operations)
| left and right page pointers
|
| Items:
|
| TID | size | flags | uncompressed size | lastTID | payload (container item)
| TID | size | flags | uncompressed size | lastTID | payload (container item)
| TID | size | flags | undo pointer | payload (plain item)
| TID | size | flags | undo pointer | payload (plain item)
| ...
|
+----------------------------

Row store
---------

The tuples are stored one after another, sorted by TID. For each
tuple, we store its 48-bit TID, an undo record pointer, and the actual
tuple data, uncompressed.

In uncompressed form, the page can be arbitrarily large. But after
compression, it must fit into a physical 8k block. If, on insert or
update of a tuple, the page can no longer be compressed below 8k, the
page is split. Note that because TIDs are logical rather than physical
identifiers, we can freely move tuples from one physical page to
another during a page split. A tuple's TID never changes.

The buffer cache caches compressed blocks. Likewise, WAL-logging,
full-page images, etc. work on compressed blocks. Decompression is done
on the fly in backend-private memory, as and when needed while reading.
For some encodings, like run-length or delta encoding, tuples can be
constructed directly from the compressed data.

Column store
------------

A column store uses the same structure, but with *multiple* B-trees,
one for each column, all indexed by TID. The B-trees for all columns
are stored in the same physical file.

A metapage at block 0 has links to the roots of the B-trees. Leaf
pages look the same, but instead of storing whole tuples, they store
just a single attribute. To reconstruct a row with a given TID, a scan
descends the B-trees of all the columns using that TID, and
fetches all the attributes. Likewise, a sequential scan walks all the
B-trees in lockstep.

So, in summary, you can imagine Zedstore as a forest of B-trees, one
for each column, all indexed by TIDs.

This way of laying out the data also easily allows for a hybrid
row-column store, where some columns are stored together, and others
have a dedicated B-tree. We still need user-facing syntax for
specifying how to group the columns.

Main reasons for storing data this way
--------------------------------------

* Lay out the data/tuples in a mapped fashion, instead of keeping the
logical-to-physical mapping separate from the actual data. That is,
keep the metadata and data logically in a single stream within the
file, avoiding the need for separate forks/files to store metadata
and data.

* Stick to fixed-size physical blocks. Variable-size blocks require
increased maintenance of the logical-to-physical mapping, and place
restrictions on the concurrency of writes and reads to files. Hence
we adopt compression to fit fixed-size blocks, instead of the other
way around.

MVCC
----
MVCC works very similarly to zheap. Undo record pointers
are used to implement MVCC. Transaction information is not directly
stored with the data. In zheap, there's a small, fixed number of
"transaction slots" on each page, but zedstore stores an undo pointer
with each item directly; in normal cases, the compression squeezes this
down to almost nothing.

Implementation
==============

Insert:
Inserting a new row splits the row into datums. Then, for the first
column, we decide which block to insert it into, pick a TID for it,
and write an undo record. The rest of the columns are inserted using
the same TID and point to the same undo position.

Compression:
Items are added to the B-tree in uncompressed form. If the page is full
and a new item can't be added, compression kicks in. The existing
uncompressed items (plain items) of the page are passed to the
compressor. Already-compressed items are added back as is. The page is
rewritten with the compressed data, with the new item added to it. If
the item still can't be added even after compression, the page is split.

Toast:
When an overly large datum is stored, it is divided into chunks, and
each chunk is stored on a dedicated toast page within the same
physical file. The toast pages of a datum form a list; each page has
next/prev pointers.

Select:
A property is added to the table AM to convey whether the AM leverages
column projection for scans. While scanning a table whose AM has this
property, the executor parses the plan, leveraging the target list and
quals to find the columns required by the query. This list is passed
down to the AM on beginscan. Zedstore uses this column projection list
to pull data only from the selected columns. A virtual tuple table slot
is used to pass back the datums for the subset of columns.

The current table AM API requires enhancement here to pass the column
projection down to the AM. The patch showcases two different ways to do
this.

* For sequential scans, a new beginscan_with_column_projection() API
was added. The executor checks the AM property; if the AM leverages
column projection it uses this new API, else the normal beginscan()
API.

* For index scans, instead of modifying the begin-scan API, a new API
was added specifically to pass the column projection list after
calling begin scan (to populate the scan descriptor) but before
fetching the tuples.

Index Support:
Building an index also leverages the columnar storage and scans only
the columns required to build the index. Indexes work pretty much as
they do for heap tables: data is inserted into the table, and the TID
of the tuple gets stored in the index. On an index scan, the required
columns' B-trees are scanned for the given TID, and the datums are
passed back using a virtual tuple.

Page Format:
A ZedStore table contains different kinds of pages, all in the same
file: the metapage, per-attribute B-tree internal and leaf pages, UNDO
log pages, and toast pages. Each page type has its own distinct data
storage format.

Block 0 is always a metapage. It contains the block numbers of the
other data structures stored within the file, like the per-attribute
B-trees, and the UNDO log.

Enhancements to design:
=======================

Instead of compressing all the tuples on a page in one batch, we could
store a small "dictionary", e.g. in page header or meta-page, and use
it to compress each tuple separately. That could make random reads and
updates of individual tuples faster.

When adding a column, we just need to create a new B-tree for the newly
added column and link it to the metapage. No existing content needs to
be rewritten.

When a column is dropped, we can scan the B-tree of that column, and
immediately mark all its pages as free in the FSM. But we don't
actually have to scan the leaf level: all leaf tuples have a downlink
in the parent, so we can scan just the internal pages. Unless the
column is very wide, that's only a small fraction of the data. That
makes the space immediately reusable for new insertions, but it won't
return the space to the operating system. In order to do that, we'd
still need to defragment, moving pages from the end of the file closer
to the beginning, and truncate the file.

In this design, we only cache compressed pages in the page cache. If
we want to cache uncompressed pages instead, or in addition to that,
we need to invent a whole new kind of buffer cache that can deal
with variable-size blocks.

If you do a lot of updates, the file can get fragmented, with lots of
unused space on pages. Losing the correlation between TIDs and
physical order is also bad, because it makes SeqScans slow, as
they're not actually doing sequential I/O anymore. We can write a
defragmenter to fix things up: half-empty pages can be merged, and
pages can be moved to restore the TID/physical correlation. This format
doesn't have the same MVCC problems with moving tuples around that the
Postgres heap does, so it can fairly easily be done online.

Min/max values can be stored per block, to easily skip scanning a
block whose column values don't fall in the query's range.

Notes about current patch
=========================

Basic (core) functionality is implemented to showcase and play with.

Two compression algorithms are supported: Postgres's pg_lzcompress and
LZ4. Compiling the server with --with-lz4 enables LZ4 compression for
zedstore; otherwise pg_lzcompress is the default. LZ4 is definitely
much faster at compressing and decompressing.

Not all the table AM APIs are implemented. Functionality not
implemented yet will ERROR out as not supported. A zedstore table can
be created using the command:

CREATE TABLE <name> (column listing) USING zedstore;

Bulk loads can be performed using COPY. INSERT, SELECT, UPDATE and
DELETE work. B-tree indexes can be created, and B-tree and bitmap index
scans work. The test in src/test/regress/sql/zedstore.sql showcases all
the functionality currently working. Updates are currently implemented
as "cold": they always create new items and are not performed in place.

TIDs currently can't leverage the full 48-bit range, but instead are
limited to values that are considered valid ItemPointers. Also,
MaxHeapTuplesPerPage poses restrictions on the values they can
currently have. Refer to [7] for details.

Extremely basic UNDO logging has been implemented, just for the MVCC
perspective. MVCC is missing tuple locks right now. Plus, it doesn't
actually perform any undo yet. No WAL logging exists currently, hence
it's not crash-safe either.

Helper functions are provided to find how many pages of each type are
present in a zedstore table, and to find the compression ratio.

The test mentioned in the thread "Column lookup in a row performance"
[6] is a good example query for zedstore; locally on a laptop, using
LZ4, it shows:

postgres=# SELECT AVG(i199) FROM (select i199 from layout offset 0) x; -- heap
avg
---------------------
500000.500000000000
(1 row)

Time: 4679.026 ms (00:04.679)

postgres=# SELECT AVG(i199) FROM (select i199 from zlayout offset 0) x; -- zedstore
avg
---------------------
500000.500000000000
(1 row)

Time: 379.710 ms

Important note:
---------------
The planner has not been modified yet to leverage the columnar
storage. Hence, plans using the "physical tlist" optimization, or
similar optimizations that are good for row stores, currently fail to
leverage the columnar nature. That is why the subquery with OFFSET 0
is needed above: to disable the optimization and scan only the
required column.

The current proposal and discussion are focused on the AM-layer work
first. Hence, we currently and intentionally skip discussing planner
or executor "feature" enhancements such as vectorized execution and
that family of features.

Previous discussions and implementations of column stores (vertical
cluster index [2], in-core columnar storage [3] and [4], cstore_fdw
[5]) were referred to, to distill the objectives and come up with a
design and implementation that avoids the concerns raised earlier.
Learnings from the Greenplum Database column store were also leveraged
while designing and implementing this.

Credit: The design is mostly the brainchild of Heikki, or actually his
epiphany, to be exact. I acted as an idea-bouncing board and
contributed enhancements. We are both having a lot of fun writing the
code for this.

References
[1] https://github.com/greenplum-db/postgres/tree/zedstore
[2] /messages/by-id/CAJrrPGfaC7WC9NK6PTTy6YN-NN+hCy8xOLAh2doYhVg5d6HsAA@mail.gmail.com
[3] /messages/by-id/20150611230316.GM133018@postgresql.org
[4] /messages/by-id/20150831225328.GM2912@alvherre.pgsql
[5] https://github.com/citusdata/cstore_fdw
[6] /messages/by-id/CAOykqKfko-n5YiBJtk-ocVdp+j92Apu5MJBwbGGh4awRY5NCuQ@mail.gmail.com
[7] /messages/by-id/d0fc97bd-7ec8-2388-e4a6-0fda86d71a43@iki.fi

Attachments:

v1-0001-Zedstore-compressed-in-core-columnar-storage.patch (text/x-patch)
#2 Andres Freund
andres@anarazel.de
In reply to: Ashwin Agrawal (#1)
Re: Zedstore - compressed in-core columnar storage

Hi,

On 2019-04-08 17:27:05 -0700, Ashwin Agrawal wrote:

Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method).

That's very cool.

Motivations / Objectives

* Performance improvement for queries selecting subset of columns
(reduced IO).
* Reduced on-disk footprint compared to heap table. Shorter tuple
headers and also leveraging compression of similar type data
* Be first-class citizen in the Postgres architecture (tables data can
just independently live in columnar storage)
* Fully MVCC compliant
* All Indexes supported
* Hybrid row-column store, where some columns are stored together, and
others separately. Provide flexibility of granularity on how to
divide the columns. Columns accessed together can be stored
together.
* Provide better control over bloat (similar to zheap)
* Eliminate need for separate toast tables
* Faster add / drop column or changing data type of column by avoiding
full rewrite of the table.

Is storage going through the bufmgr.c or separately?

In uncompressed form, the page can be arbitrarily large. But after
compression, it must fit into a physical 8k block. If on insert or
update of a tuple, the page cannot be compressed below 8k anymore, the
page is split. Note that because TIDs are logical rather than physical
identifiers, we can freely move tuples from one physical page to
another during page split. A tuple's TID never changes.

When does compression happen? After each modification of the expanded
"page"? Are repeated expansions prevented somehow, e.g. if I
insert/delete rows into the same page one-by-one?

A metapage at block 0, has links to the roots of the B-trees. Leaf
pages look the same, but instead of storing the whole tuple, stores
just a single attribute. To reconstruct a row with given TID, scan
descends down the B-trees for all the columns using that TID, and
fetches all attributes. Likewise, a sequential scan walks all the
B-trees in lockstep.

Does the size of the metapage limit the number of column [groups]? Or is
there some overflow / tree of trees / whatnot happening?

Insert:
Inserting a new row, splits the row into datums. Then for first column
decide which block to insert the same to, and pick a TID for it, and
write undo record for the same. Rest of the columns are inserted using
that same TID and point to same undo position.

Is there some buffering? Without that it seems like retail inserts are
going to be pretty slow?

Property is added to Table AM to convey if column projection is
leveraged by AM for scans. While scanning tables with AM leveraging
this property, executor parses the plan. Leverages the target list and
quals to find the required columns for query. This list is passed down
to AM on beginscan. Zedstore uses this column projection list to only
pull data from selected columns. Virtual tuple table slot is used to
pass back the datums for subset of columns.

Current table am API requires enhancement here to pass down column
projection to AM. The patch showcases two different ways for the same.

* For sequential scans added new beginscan_with_column_projection()
API. Executor checks AM property and if it leverages column
projection uses this new API else normal beginscan() API.

* For index scans instead of modifying the begin scan API, added new
API to specifically pass column projection list after calling begin
scan to populate the scan descriptor but before fetching the tuples.

FWIW, I don't quite think this is the right approach. I've only a vague
sketch of this in my head, but I think we should want a general API to
pass that down to *any* scan. Even for heap, not deforming leading
columns that are uninteresting, but precede relevant columns, would be
quite a noticeable performance win. I don't think the projection list
is the right approach for that.

Extremely basic UNDO logging has be implemented just for MVCC
perspective. MVCC is missing tuple lock right now. Plus, doesn't
actually perform any undo yet. No WAL logging exist currently hence
its not crash safe either.

Have you looked at the undo APIs developed for zheap, as discussed on
the list? Seems important that they're suitable for this too.

Test mentioned in thread "Column lookup in a row performance" [6],
good example query for zedstore locally on laptop using lz4 shows

postgres=# SELECT AVG(i199) FROM (select i199 from layout offset 0) x; --
heap
avg
---------------------
500000.500000000000
(1 row)

Time: 4679.026 ms (00:04.679)

postgres=# SELECT AVG(i199) FROM (select i199 from zlayout offset 0) x; --
zedstore
avg
---------------------
500000.500000000000
(1 row)

Time: 379.710 ms

Well, I'm not sure I'm actually impressed by that. What does the
performance look like if you select i0 instead?

Important note:
---------------
Planner has not been modified yet to leverage the columnar
storage. Hence, plans using "physical tlist" optimization or such good
for row store miss out to leverage the columnar nature
currently. Hence, can see the need for subquery with OFFSET 0 above to
disable the optimization and scan only required column.

I'm more and more thinking that we should just nix the physical tlist
stuff and start afresh.

Congrats again, this is cool stuff.

- Andres

#3 Ashwin Agrawal
aagrawal@pivotal.io
In reply to: Andres Freund (#2)
Re: Zedstore - compressed in-core columnar storage

On Mon, Apr 8, 2019 at 6:04 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2019-04-08 17:27:05 -0700, Ashwin Agrawal wrote:

Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method).

That's very cool.

Motivations / Objectives

* Performance improvement for queries selecting subset of columns
(reduced IO).
* Reduced on-disk footprint compared to heap table. Shorter tuple
headers and also leveraging compression of similar type data
* Be first-class citizen in the Postgres architecture (tables data can
just independently live in columnar storage)
* Fully MVCC compliant
* All Indexes supported
* Hybrid row-column store, where some columns are stored together, and
others separately. Provide flexibility of granularity on how to
divide the columns. Columns accessed together can be stored
together.
* Provide better control over bloat (similar to zheap)
* Eliminate need for separate toast tables
* Faster add / drop column or changing data type of column by avoiding
full rewrite of the table.

Is storage going through the bufmgr.c or separately?

Yes, below the access method it's pretty much the same as heap. All
reads and writes flow via the buffer cache. The implementation sits
nicely in between, modifying just the access method code to change how
data is stored in pages; above and below the AM, everything basically
behaves like the heap code.

In uncompressed form, the page can be arbitrarily large. But after
compression, it must fit into a physical 8k block. If on insert or
update of a tuple, the page cannot be compressed below 8k anymore, the
page is split. Note that because TIDs are logical rather than physical
identifiers, we can freely move tuples from one physical page to
another during page split. A tuple's TID never changes.

When does compression happen? After each modification of the expanded
"page"? Are repeated expansions prevented somehow, e.g. if I
insert/delete rows into the same page one-by-one?

Compression is performed on new data only if the page becomes full;
until then, uncompressed data is added to the page. If data cannot be
added to the page even after compression, then a page split is
performed. Already-compressed data is not compressed again on the next
insert; a new compressed block is created for the newly added
uncompressed items.

The line of thought we have for delete is to not free the space as
soon as the delete is performed, but instead to delay, and reuse the
deleted space on the next insertion to the page.

A metapage at block 0, has links to the roots of the B-trees. Leaf
pages look the same, but instead of storing the whole tuple, stores
just a single attribute. To reconstruct a row with given TID, scan
descends down the B-trees for all the columns using that TID, and
fetches all attributes. Likewise, a sequential scan walks all the
B-trees in lockstep.

Does the size of the metapage limit the number of column [groups]? Or is
there some overflow / tree of trees / whatnot happening?

The design doesn't limit the number of columns, as we can have a chain
of meta-pages to store the required metadata, with page 0 still being
the start of the chain.

Insert:
Inserting a new row, splits the row into datums. Then for first column
decide which block to insert the same to, and pick a TID for it, and
write undo record for the same. Rest of the columns are inserted using
that same TID and point to same undo position.

Is there some buffering? Without that it seems like retail inserts are
going to be pretty slow?

Yes, regular buffer cache.

Property is added to Table AM to convey if column projection is
leveraged by AM for scans. While scanning tables with AM leveraging
this property, executor parses the plan. Leverages the target list and
quals to find the required columns for query. This list is passed down
to AM on beginscan. Zedstore uses this column projection list to only
pull data from selected columns. Virtual tuple table slot is used to
pass back the datums for subset of columns.

Current table am API requires enhancement here to pass down column
projection to AM. The patch showcases two different ways for the same.

* For sequential scans added new beginscan_with_column_projection()
API. Executor checks AM property and if it leverages column
projection uses this new API else normal beginscan() API.

* For index scans instead of modifying the begin scan API, added new
API to specifically pass column projection list after calling begin
scan to populate the scan descriptor but before fetching the tuples.

FWIW, I don't quite think this is the right approach. I've only a vague
sketch of this in my head, but I think we should want a general API to
pass that down to *any* scan. Even for heap, not deforming leading
columns that a uninteresting, but precede relevant columns, would be
quite a noticable performance win. I don't think the projection list is
the right approach for that.

Sure, would love to hear more on it, and we can enhance it to make it
more usable for AMs.

Extremely basic UNDO logging has be implemented just for MVCC
perspective. MVCC is missing tuple lock right now. Plus, doesn't
actually perform any undo yet. No WAL logging exist currently hence
its not crash safe either.

Have you looked at the undo APIs developed for zheap, as discussed on
the list? Seems important that they're suitable for this too.

Not in detail yet, but yes, the plan is to leverage the same common
framework and undo log API as zheap. Will look into the details. With
the current zedstore implementation, the requirements for the undo are
pretty clear.

Test mentioned in thread "Column lookup in a row performance" [6],
good example query for zedstore locally on laptop using lz4 shows

postgres=# SELECT AVG(i199) FROM (select i199 from layout offset 0) x; --
heap
avg
---------------------
500000.500000000000
(1 row)

Time: 4679.026 ms (00:04.679)

postgres=# SELECT AVG(i199) FROM (select i199 from zlayout offset 0) x; -- zedstore
avg
---------------------
500000.500000000000
(1 row)

Time: 379.710 ms

Well, I'm not sure I'm actually impressed by that. What does the
performance look like if you select i0 instead?

Just for a quick test I used 100 columns instead of 200 (with 200 the
results would diverge more); this is what it reports:

postgres=# SELECT AVG(i0) FROM (select i0 from layout offset 0) x; -- heap
avg
------------------------
1.00000000000000000000
(1 row)

Time: 183.865 ms
postgres=# SELECT AVG(i0) FROM (select i0 from zlayout offset 0) x; -- zedstore
avg
------------------------
1.00000000000000000000
(1 row)

Time: 47.624 ms

#4 Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Ashwin Agrawal (#1)
Re: Zedstore - compressed in-core columnar storage

Hi,

On 09.04.2019 3:27, Ashwin Agrawal wrote:

Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to
follow along.

Thank you for publishing this patch. IMHO Postgres is really missing
normal support of columnar store and table access method
API is the best way of integrating it.

I wanted to compare the memory footprint and performance of zedstore
with the standard Postgres heap and my VOPS extension.
As test data I used the TPC-H benchmark (actually only the lineitem
table), generated with the tpch-dbgen utility at scale factor 10 (~8Gb
database). I attached the script which I used to populate the data
(you have to download, build and run tpch-dbgen yourself; you can also
comment out the code related to VOPS).
Unfortunately I failed to load the data into zedstore:

postgres=# insert into zedstore_lineitem_projection (select
l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag::"char",l_linestatus::"char"
from lineitem);
psql: ERROR:  compression failed. what now?
Time: 237804.775 ms (03:57.805)

Then I tried to check if there is something in the
zedstore_lineitem_projection table:

postgres=# select count(*) from zedstore_lineitem_projection;
psql: WARNING:  terminating connection because of crash of another
server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
psql: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 145710.828 ms (02:25.711)

The backend consumed 16GB of RAM and 16GB of swap and was killed by
the OOM killer (undo log?).
A subsequent attempt to run the same command failed with the following
error:

postgres=# select count(*) from zedstore_lineitem_projection;
psql: ERROR:  unexpected level encountered when descending tree

So the only thing I can do at this moment is report the size of the
tables on disk:

postgres=# select pg_relation_size('lineitem');
 pg_relation_size
------------------
      10455441408
(1 row)

postgres=# select pg_relation_size('lineitem_projection');
 pg_relation_size
------------------
       3129974784
(1 row)

postgres=# select pg_relation_size('vops_lineitem_projection');
 pg_relation_size
------------------
       1535647744
(1 row)

postgres=# select pg_relation_size('zedstore_lineitem_projection');
 pg_relation_size
------------------
       2303688704
(1 row)

But I do not know how much data was actually loaded into the zedstore
table... Actually, the main question is why this table is not empty if
the INSERT statement failed?

Please let me know if I can somehow help you to reproduce and
investigate the problem.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

vstore_bench.sql (application/sql)
#5 Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Konstantin Knizhnik (#4)
Re: Zedstore - compressed in-core columnar storage

On 09.04.2019 17:09, Konstantin Knizhnik wrote:

Hi,

On 09.04.2019 3:27, Ashwin Agrawal wrote:

Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to
follow along.

Thank you for publishing this patch. IMHO Postgres is really missing
normal support of columnar store and table access method
API is the best way of integrating it.

I wanted to compare memory footprint and performance of zedstore with
standard Postgres heap and my VOPS extension.
As test data I used TPC-H benchmark (actually only one lineitem table
generated with tpch-dbgen utility with scale factor 10 (~8Gb database).
I attached script which I have use to populate data (you have to to
download, build and run tpch-dbgen yourself, also you can comment code
related with VOPS).
Unfortunately I failed to load data in zedstore:

postgres=# insert into zedstore_lineitem_projection (select
l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag::"char",l_linestatus::"char"
from lineitem);
psql: ERROR:  compression failed. what now?
Time: 237804.775 ms (03:57.805)

Then I try to check if there is something in
zedstore_lineitem_projection table:

postgres=# select count(*) from zedstore_lineitem_projection;
psql: WARNING:  terminating connection because of crash of another
server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
psql: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 145710.828 ms (02:25.711)

The backend consumed 16GB of RAM and 16Gb of swap and was killed by the OOM
killer (undo log?).
A subsequent attempt to run the same command failed with the
following error:

postgres=# select count(*) from zedstore_lineitem_projection;
psql: ERROR:  unexpected level encountered when descending tree

So the only thing I can do at this moment is report the sizes of the tables
on disk:

postgres=# select pg_relation_size('lineitem');
 pg_relation_size
------------------
      10455441408
(1 row)

postgres=# select pg_relation_size('lineitem_projection');
 pg_relation_size
------------------
       3129974784
(1 row)

postgres=# select pg_relation_size('vops_lineitem_projection');
 pg_relation_size
------------------
       1535647744
(1 row)

postgres=# select pg_relation_size('zedstore_lineitem_projection');
 pg_relation_size
------------------
       2303688704
(1 row)

But I do not know how much data was actually loaded into the zedstore table...
Actually, the main question is: why is this table not empty if the INSERT
statement failed?

Please let me know if I can somehow help you to reproduce and
investigate the problem.

Looks like the original problem was caused by the internal postgres
compressor: I had not configured Postgres to use lz4.
When I configured Postgres --with-lz4, the data was correctly inserted into
the zedstore table, but it looks like it is not compressed at all:

postgres=# select pg_relation_size('zedstore_lineitem_projection');
 pg_relation_size
------------------
       9363010640

No wonder zedstore shows the worst results:

lineitem                        6240.261 ms
lineitem_projection             5390.446 ms
zedstore_lineitem_projection   23310.341 ms
vops_lineitem_projection         439.731 ms

An updated version of vstore_bench.sql is attached (sorry, there were some
errors in the previous version of this script).

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

vstore_bench.sql (application/sql)
#6Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Konstantin Knizhnik (#5)
Re: Zedstore - compressed in-core columnar storage

On 09/04/2019 18:00, Konstantin Knizhnik wrote:

On 09.04.2019 17:09, Konstantin Knizhnik wrote:

standard Postgres heap and my VOPS extension.
As test data I used the TPC-H benchmark (actually only the lineitem table,
generated with the tpch-dbgen utility at scale factor 10, an ~8Gb database).
I attached the script which I used to populate the data (you have to
download, build and run tpch-dbgen yourself; you can also comment out the
code related to VOPS).

Cool, thanks!

Unfortunately I failed to load data in zedstore:

postgres=# insert into zedstore_lineitem_projection (select
l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag::"char",l_linestatus::"char"
from lineitem);
psql: ERROR:  compression failed. what now?
Time: 237804.775 ms (03:57.805)

Yeah, it's still early days, it will crash and burn in a lot of cases.
We wanted to publish this early, to gather ideas and comments on the
high level design, and to validate that the table AM API that's in v12
is usable.

Looks like the original problem was caused by the internal postgres
compressor: I had not configured Postgres to use lz4.
When I configured Postgres --with-lz4, the data was correctly inserted into
the zedstore table, but it looks like it is not compressed at all:

postgres=# select pg_relation_size('zedstore_lineitem_projection');
 pg_relation_size
------------------
       9363010640

The single-insert codepath isn't very optimized yet. If you populate the
table with large "INSERT ... SELECT ...", you end up with a huge undo
log. Try loading it with COPY.

You can also see how many pages of each type there are with:

select count(*), pg_zs_page_type('zedstore_lineitem_projection', g)
from generate_series(0, pg_table_size('zedstore_lineitem_projection')
/ 8192 - 1) g group by 2;

- Heikki

#7Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Heikki Linnakangas (#6)
Re: Zedstore - compressed in-core columnar storage

On 09.04.2019 18:08, Heikki Linnakangas wrote:

On 09/04/2019 18:00, Konstantin Knizhnik wrote:

On 09.04.2019 17:09, Konstantin Knizhnik wrote:

standard Postgres heap and my VOPS extension.
As test data I used the TPC-H benchmark (actually only the lineitem table,
generated with the tpch-dbgen utility at scale factor 10, an ~8Gb database).
I attached the script which I used to populate the data (you have to
download, build and run tpch-dbgen yourself; you can also comment out the
code related to VOPS).

Cool, thanks!

Unfortunately I failed to load data in zedstore:

postgres=# insert into zedstore_lineitem_projection (select
l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag::"char",l_linestatus::"char"

from lineitem);
psql: ERROR:  compression failed. what now?
Time: 237804.775 ms (03:57.805)

Yeah, it's still early days, it will crash and burn in a lot of cases.
We wanted to publish this early, to gather ideas and comments on the
high level design, and to validate that the table AM API that's in v12
is usable.

Looks like the original problem was caused by the internal postgres
compressor: I had not configured Postgres to use lz4.
When I configured Postgres --with-lz4, the data was correctly inserted into
the zedstore table, but it looks like it is not compressed at all:

postgres=# select pg_relation_size('zedstore_lineitem_projection');
   pg_relation_size
------------------
         9363010640

The single-insert codepath isn't very optimized yet. If you populate
the table with large "INSERT ... SELECT ...", you end up with a huge
undo log. Try loading it with COPY.

You can also see how many pages of each type there are with:

select count(*), pg_zs_page_type('zedstore_lineitem_projection', g)
  from generate_series(0,
pg_table_size('zedstore_lineitem_projection') / 8192 - 1) g group by 2;

- Heikki

postgres=# copy zedstore_lineitem from '/mnt/data/lineitem.tbl'
delimiter '|' csv;
COPY 59986052
Time: 232802.257 ms (03:52.802)
postgres=# select pg_relation_size('zedstore_lineitem');
 pg_relation_size
------------------
      10346504192
(1 row)
postgres=# select count(*), pg_zs_page_type('zedstore_lineitem', g)
  from generate_series(0, pg_table_size('zedstore_lineitem') / 8192 -
1) g group by 2;
  count  | pg_zs_page_type
---------+-----------------
       1 | META
 1262308 | BTREE
     692 | UNDO
(3 rows)

And now the performance is much worse:
Time: 99819.476 ms (01:39.819)

It is strange, because the main advantage of a columnar store is that it
has to fetch only the accessed columns.
What I see is that in non-parallel mode (max_parallel_workers_per_gather = 0)
the backend consumes about 11GB of memory. That fits in my desktop RAM (16GB)
and the time is ~58 seconds.
But once I start 4 parallel workers, they cause huge swapping:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28195 knizhnik  20   0 11.823g 6.553g 5.072g D   7.6 42.2   0:17.19 postgres
28074 knizhnik  20   0 11.848g 6.726g 5.223g D   7.3 43.3   4:14.96 postgres
28192 knizhnik  20   0 11.854g 6.586g 5.075g D   7.3 42.4   0:17.18 postgres
28193 knizhnik  20   0 11.870g 6.594g 5.064g D   7.3 42.4   0:17.19 postgres
28194 knizhnik  20   0 11.854g 6.589g 5.078g D   7.3 42.4   0:17.09 postgres

which is also strange because data should be present in shared buffers.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Konstantin Knizhnik (#4)
Re: Zedstore - compressed in-core columnar storage

On 2019-Apr-09, Konstantin Knizhnik wrote:

On 09.04.2019 3:27, Ashwin Agrawal wrote:

Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to
follow along.

Thank you for publishing this patch. IMHO Postgres is really missing proper
support for columnar storage

Yep.

and the table access method API is the best way of integrating it.

This is not surprising, considering that columnar store is precisely the
reason for starting the work on table AMs.

We should certainly look into integrating some sort of columnar storage
in mainline. Not sure which of zedstore or VOPS is the best candidate,
or maybe we'll have some other proposal. My feeling is that having more
than one is not useful; if there are optimizations to one that can be
borrowed from the other, let's do that instead of duplicating effort.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Alvaro Herrera (#8)
Re: Zedstore - compressed in-core columnar storage

On 09.04.2019 18:51, Alvaro Herrera wrote:

On 2019-Apr-09, Konstantin Knizhnik wrote:

On 09.04.2019 3:27, Ashwin Agrawal wrote:

Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to
follow along.

Thank you for publishing this patch. IMHO Postgres is really missing proper
support for columnar storage

Yep.

and the table access method API is the best way of integrating it.

This is not surprising, considering that columnar store is precisely the
reason for starting the work on table AMs.

We should certainly look into integrating some sort of columnar storage
in mainline. Not sure which of zedstore or VOPS is the best candidate,
or maybe we'll have some other proposal. My feeling is that having more
than one is not useful; if there are optimizations to one that can be
borrowed from the other, let's do that instead of duplicating effort.

There are two different aspects:
1. Store format.
2. Vector execution.

1. VOPS uses a mixed format, similar to Apache Parquet.
Tuples are stored vertically, but only inside one page.
It tries to minimize the trade-offs between true horizontal and true
vertical storage:
the first is most optimal for selecting all rows, while the second is best
for selecting a small subset of rows.
To make this approach more efficient, it is better to use a larger page
size - the default Postgres 8k page is not enough.

From my point of view such a format is better than pure vertical storage,
which will be very inefficient if a query accesses a larger number of columns.
This problem can be somewhat addressed by creating projections: grouping
several columns together. But that requires more space for storing
multiple projections.
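The I/O trade-off described above can be sketched with a toy model (the sizes
and layout here are hypothetical illustrations, not VOPS's actual on-disk
format):

```python
# Toy comparison of bytes touched when scanning ONE column out of many,
# for a row-major layout vs. a PAX-style (column-within-page) layout.
# All sizes are made up for illustration.

ROWS = 1000   # tuples in the table
COLS = 8      # columns per tuple
WIDTH = 8     # bytes per datum

# Row-major: each tuple's datums are adjacent, so a one-column scan
# still reads through every tuple, i.e. the whole table.
row_major_bytes_scanned = ROWS * COLS * WIDTH

# PAX-style: within each page the datums of one column are adjacent,
# so a one-column scan reads only that column's stripe (though, unlike
# a pure column store, every page is still visited).
pax_bytes_scanned = ROWS * 1 * WIDTH

print(row_major_bytes_scanned, pax_bytes_scanned)
```

Under these toy numbers the single-column scan touches 8x fewer bytes with
the PAX-style layout, which is the effect the mixed format is after.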

2. No matter which format we choose, to take full advantage of the
vertical representation we need to use vector operations.
And the Postgres executor doesn't support them now. This is why VOPS uses
some hacks, which is definitely not good and does not work in all cases.
Zedstore does not use such hacks and ... this is why it can never reach
VOPS performance.

The right solution is to add vector operation support to the Postgres
planner and executor.
But that is much harder than developing the columnar store itself.
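The contrast between tuple-at-a-time and vectorized execution can be sketched
roughly like this (a toy model only; neither function is actual Postgres or
VOPS executor code, and real vectorized executors work on typed arrays in C,
often with SIMD):

```python
# Toy contrast of the two execution styles for a simple aggregate.

data = list(range(1_000_000))

def sum_tuple_at_a_time(rows):
    # One "executor call" per row: per-tuple overhead is paid for
    # every single value processed.
    total = 0
    for row in rows:
        total += row
    return total

def sum_vectorized(rows, batch_size=1024):
    # Operate on whole batches: per-batch overhead is amortized over
    # batch_size values, which is what makes vector execution fast.
    total = 0
    for i in range(0, len(rows), batch_size):
        total += sum(rows[i:i + batch_size])
    return total

# Both styles compute the same answer; only the call pattern differs.
assert sum_tuple_at_a_time(data) == sum_vectorized(data)
```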

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#10Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Konstantin Knizhnik (#5)
Re: Zedstore - compressed in-core columnar storage

On 09/04/2019 18:00, Konstantin Knizhnik wrote:

Looks like the original problem was caused by the internal postgres
compressor: I had not configured Postgres to use lz4.
When I configured Postgres --with-lz4, the data was correctly inserted into
the zedstore table, but it looks like it is not compressed at all:

postgres=# select pg_relation_size('zedstore_lineitem_projection');
 pg_relation_size
------------------
       9363010640

No wonder zedstore shows the worst results:

lineitem                        6240.261 ms
lineitem_projection             5390.446 ms
zedstore_lineitem_projection   23310.341 ms
vops_lineitem_projection         439.731 ms

An updated version of vstore_bench.sql is attached (sorry, there were some
errors in the previous version of this script).

I tried this quickly, too. With default work_mem and no parallelism, and
1 gb table size, it seems that the query chooses a different plan with
heap and zedstore, with a sort+group for zedstore and hash agg for heap.
There's no ANALYZE support in zedstore yet, and we haven't given much
thought to parallelism either. With work_mem='1GB' and no parallelism,
both queries use a hash agg, and the numbers are much closer than what
you saw, about 6 s for heap, and 9 s for zedstore.

- Heikki

#11Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Heikki Linnakangas (#10)
Re: Zedstore - compressed in-core columnar storage

On 09.04.2019 19:19, Heikki Linnakangas wrote:

On 09/04/2019 18:00, Konstantin Knizhnik wrote:

Looks like the original problem was caused by the internal postgres
compressor: I had not configured Postgres to use lz4.
When I configured Postgres --with-lz4, the data was correctly inserted into
the zedstore table, but it looks like it is not compressed at all:

postgres=# select pg_relation_size('zedstore_lineitem_projection');
   pg_relation_size
------------------
         9363010640

No wonder zedstore shows the worst results:

lineitem                        6240.261 ms
lineitem_projection             5390.446 ms
zedstore_lineitem_projection   23310.341 ms
vops_lineitem_projection         439.731 ms

An updated version of vstore_bench.sql is attached (sorry, there were some
errors in the previous version of this script).

I tried this quickly, too. With default work_mem and no parallelism,
and 1 gb table size, it seems that the query chooses a different plan
with heap and zedstore, with a sort+group for zedstore and hash agg
for heap. There's no ANALYZE support in zedstore yet, and we haven't
given much thought to parallelism either. With work_mem='1GB' and no
parallelism, both queries use a hash agg, and the numbers are much
closer than what you saw, about 6 s for heap, and 9 s for zedstore.

- Heikki

Yes, you were right. The plan for zedstore uses GroupAggregate instead
of HashAggregate.
Increasing work_mem forces the optimizer to use HashAggregate in all cases.
But it doesn't prevent memory overflow in my case.
And that is very strange to me, because there are just 4 groups in the
result, so it should not consume much memory.

Yet another strange thing is that the size of the zedstore table is 10Gb
according to pg_relation_size.
The Q1 query accesses only a subset of the "lineitem" columns, not
touching the largest ones (with text).
I have configured 12Gb of shared buffers, and all 11Gb of them are used!
Looks like all columns are fetched from disk.
And it looks like, besides these 11Gb of shared data, the backend (and each
parallel worker) also consumes several gigabytes of heap memory.
As a result, the total amount of memory used during parallel query execution
with 4 workers exceeds 20GB and causes terrible swapping on my system.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#12Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#8)
Re: Zedstore - compressed in-core columnar storage

On Tue, Apr 9, 2019 at 11:51 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

This is not surprising, considering that columnar store is precisely the
reason for starting the work on table AMs.

We should certainly look into integrating some sort of columnar storage
in mainline. Not sure which of zedstore or VOPS is the best candidate,
or maybe we'll have some other proposal. My feeling is that having more
than one is not useful; if there are optimizations to one that can be
borrowed from the other, let's do that instead of duplicating effort.

I think that conclusion may be premature. There seem to be a bunch of
different ways of doing columnar storage, so I don't know how we can
be sure that one size will fit all, or that the first thing we accept
will be the best thing.

Of course, we probably do not want to accept a ton of storage manager
implementations in core. I think if people propose implementations
that are poor quality, or missing important features, or don't have
significantly different use cases from the ones we've already got,
it's reasonable to reject those. But I wouldn't be prepared to say
that if we have two significantly different column stores that are both
awesome code with a complete feature set and significantly disjoint
use cases, we should reject the second one just because it is also a
column store. I think that won't get out of control because few
people will be able to produce really high-quality implementations.

This stuff is hard, which I think is also why we only have 6.5 index
AMs in core after many, many years. And our standards have gone up
over the years - not all of those would pass muster if they were
proposed today.

BTW, can I express a small measure of disappointment that the thing under
discussion on this thread was chosen to be called
"zedstore"? That seems to invite confusion with "zheap", especially
in parts of the world where the last letter of the alphabet is
pronounced "zed," where people are going to say zed-heap and
zed-store. Brr.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13Ashwin Agrawal
aagrawal@pivotal.io
In reply to: Robert Haas (#12)
Re: Zedstore - compressed in-core columnar storage

On Tue, Apr 9, 2019 at 11:29 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Apr 9, 2019 at 11:51 AM Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

This is not surprising, considering that columnar store is precisely the
reason for starting the work on table AMs.

We should certainly look into integrating some sort of columnar storage
in mainline. Not sure which of zedstore or VOPS is the best candidate,
or maybe we'll have some other proposal. My feeling is that having more
than one is not useful; if there are optimizations to one that can be
borrowed from the other, let's do that instead of duplicating effort.

I think that conclusion may be premature. There seem to be a bunch of
different ways of doing columnar storage, so I don't know how we can
be sure that one size will fit all, or that the first thing we accept
will be the best thing.

Of course, we probably do not want to accept a ton of storage manager
implementations is core. I think if people propose implementations
that are poor quality, or missing important features, or don't have
significantly different use cases from the ones we've already got,
it's reasonable to reject those. But I wouldn't be prepared to say
that if we have two significantly different column store that are both
awesome code with a complete feature set and significantly disjoint
use cases, we should reject the second one just because it is also a
column store. I think that won't get out of control because few
people will be able to produce really high-quality implementations.

This stuff is hard, which I think is also why we only have 6.5 index
AMs in core after many, many years. And our standards have gone up
over the years - not all of those would pass muster if they were
proposed today.

+1

BTW, can I express a small measure of disappointment that the thing under
discussion on this thread was chosen to be called
"zedstore"? That seems to invite confusion with "zheap", especially
in parts of the world where the last letter of the alphabet is
pronounced "zed," where people are going to say zed-heap and
zed-store. Brr.

I'm surprised it is felt this thread would initiate the invitation to
confusion. Based on past internal and meetup discussions over quite some
time now, the confusion already exists for the zheap pronunciation, for the
reason mentioned: the last letter of the alphabet is not pronounced the
same universally. Hence we explicitly called it zedstore, to learn from
that and make the pronunciation universal worldwide for the new thing at
least.

#14Ashwin Agrawal
aagrawal@pivotal.io
In reply to: Konstantin Knizhnik (#9)
Re: Zedstore - compressed in-core columnar storage

On Tue, Apr 9, 2019 at 9:13 AM Konstantin Knizhnik <
k.knizhnik@postgrespro.ru> wrote:

On 09.04.2019 18:51, Alvaro Herrera wrote:

On 2019-Apr-09, Konstantin Knizhnik wrote:

On 09.04.2019 3:27, Ashwin Agrawal wrote:

Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to
follow along.

Thank you for publishing this patch. IMHO Postgres is really missing proper
support for columnar storage

Yep.

and the table access method API is the best way of integrating it.

This is not surprising, considering that columnar store is precisely the
reason for starting the work on table AMs.

We should certainly look into integrating some sort of columnar storage
in mainline. Not sure which of zedstore or VOPS is the best candidate,
or maybe we'll have some other proposal. My feeling is that having more
than one is not useful; if there are optimizations to one that can be
borrowed from the other, let's do that instead of duplicating effort.

There are two different aspects:
1. Store format.
2. Vector execution.

1. VOPS is using mixed format, something similar with Apache parquet.
Tuples are stored vertically, but only inside one page.
It tries to minimize trade-offs between true horizontal and true
vertical storage:
first is most optimal for selecting all rows, while second - for
selecting small subset of rows.
To make this approach more efficient, it is better to use large page
size - default Postgres 8k pages is not enough.

From my point of view such format is better than pure vertical storage
which will be very inefficient if query access larger number of columns.
This problem can be somehow addressed by creating projections: grouping
several columns together. But it requires more space for storing
multiple projections.

Right, storing all the columns in a single page doesn't give any savings
on IO.

2. Doesn't matter which format we choose, to take all advantages of

vertical representation we need to use vector operations.
And Postgres executor doesn't support them now. This is why VOPS is
using some hacks, which is definitely not good and not working in all
cases.
zedstore is not using such hacks and ... this is why it never can reach
VOPS performance.

Vectorized execution is orthogonal to the storage format. It can even be
applied to a row store with performance gains. Similarly, a column store
without vectorized execution also gives benefits: performance gains, better
compression ratios, and such. A column store clubbed with vectorized
execution is a lot more performant, agreed. Zedstore is currently focused
on having the AM piece in place, which fits the postgres ecosystem and
supports all the features heap does.

#15Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Ashwin Agrawal (#13)
Re: Zedstore - compressed in-core columnar storage

On 09/04/2019 23:24, Ashwin Agrawal wrote:

BTW, can I express a small measure of disappointment that the name for
the thing under discussion on this thread chose to be called
"zedstore"?  That seems to invite confusion with "zheap", especially
in parts of the world where the last letter of the alphabet is
pronounced "zed," where people are going to say zed-heap and
zed-store. Brr.

I'm surprised it is felt this thread would initiate the invitation to
confusion. Based on past internal and meetup discussions over quite some
time now, the confusion already exists for the zheap pronunciation, for the
reason mentioned: the last letter of the alphabet is not pronounced the
same universally. Hence we explicitly called it zedstore, to learn from
that and make the pronunciation universal worldwide for the new thing at
least.

Yeah, you can blame me for the name. It's a pun on zheap. I'm hoping we
come up with a better name before this matures; I'm thinking it could be
just "column store" or something like that in the end, but it's good to
have a more unique name during development.

- Heikki

#16Peter Geoghegan
In reply to: Heikki Linnakangas (#15)
Re: Zedstore - compressed in-core columnar storage

C-Tree?

Peter Geoghegan
(Sent from my phone)

#17Amit Kapila
amit.kapila16@gmail.com
In reply to: Ashwin Agrawal (#1)
Re: Zedstore - compressed in-core columnar storage

On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal <aagrawal@pivotal.io> wrote:

Heikki and I have been hacking recently for few weeks to implement
in-core columnar storage for PostgreSQL. Here's the design and initial
implementation of Zedstore, compressed in-core columnar storage (table
access method). Attaching the patch and link to github branch [1] to
follow along.

The objective is to gather feedback on design and approach to the
same. The implementation has core basic pieces working but not close
to complete.

Big thank you to Andres, Haribabu and team for the table access method
API's. Leveraged the API's for implementing zedstore, and proves API
to be in very good shape. Had to enhance the same minimally but
in-general didn't had to touch executor much.

Motivations / Objectives

* Performance improvement for queries selecting subset of columns
(reduced IO).
* Reduced on-disk footprint compared to heap table. Shorter tuple
headers and also leveraging compression of similar type data
* Be first-class citizen in the Postgres architecture (tables data can
just independently live in columnar storage)
* Fully MVCC compliant
* All Indexes supported
* Hybrid row-column store, where some columns are stored together, and
others separately. Provide flexibility of granularity on how to
divide the columns. Columns accessed together can be stored
together.
* Provide better control over bloat (similar to zheap)
* Eliminate need for separate toast tables
* Faster add / drop column or changing data type of column by avoiding
full rewrite of the table.

High-level Design - B-trees for the win!
========================================

To start simple, let's ignore column store aspect for a moment and
consider it as compressed row store. The column store is natural
extension of this concept, explained in next section.

The basic on-disk data structure leveraged is a B-tree, indexed by
TID. BTree being a great data structure, fast and versatile. Note this
is not referring to existing Btree indexes, but instead net new
separate BTree for table data storage.

TID - logical row identifier:
TID is just a 48-bit row identifier. The traditional division into
block and offset numbers is meaningless. In order to find a tuple with
a given TID, one must always descend the B-tree. Having logical TID
provides flexibility to move the tuples around different pages on page
splits or page merges can be performed.

The internal pages of the B-tree are super simple and boring. Each
internal page just stores an array of TID and downlink pairs. Let's
focus on the leaf level. Leaf blocks have short uncompressed header,
followed by btree items. Two kinds of items exist:

- plain item, holds one tuple or one datum, uncompressed payload
- a "container item", holds multiple plain items, compressed payload

+-----------------------------
| Fixed-size page header:
|
| LSN
| TID low and hi key (for Lehman & Yao B-tree operations)
| left and right page pointers
|
| Items:
|
| TID | size | flags | uncompressed size | lastTID | payload (container item)
| TID | size | flags | uncompressed size | lastTID | payload (container item)
| TID | size | flags | undo pointer | payload (plain item)
| TID | size | flags | undo pointer | payload (plain item)
| ...
|
+----------------------------

Row store
---------

The tuples are stored one after another, sorted by TID. For each
tuple, we store its 48-bit TID, a undo record pointer, and the actual
tuple data uncompressed.

Storing an undo record pointer with each tuple can take quite a lot of
space in cases where you can't compress them. Have you thought about how
you will implement the multi-locker scheme in this design? In zheap,
we have used undo for the same, and it is easy to imagine when you have
separate transaction slots for each transaction. I am not sure how
you will implement the same here.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#18Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Amit Kapila (#17)
Re: Zedstore - compressed in-core columnar storage

On 10/04/2019 09:29, Amit Kapila wrote:

On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal <aagrawal@pivotal.io> wrote:

Row store
---------

The tuples are stored one after another, sorted by TID. For each
tuple, we store its 48-bit TID, a undo record pointer, and the actual
tuple data uncompressed.

Storing an undo record pointer with each tuple can take quite a lot of
space in cases where you can't compress them.

Yeah. This does depend on compression to eliminate the unused fields
quite heavily at the moment. But you could have a flag in the header to
indicate "no undo pointer needed", and just leave it out when it's not
needed.

Have you thought about how you will implement the multi-locker scheme in
this design? In zheap, we have used undo for the same, and it is easy
to imagine when you have separate transaction slots for each
transaction. I am not sure how you will implement the same here.

I've been thinking that the undo record would store all the XIDs
involved. So if there are multiple lockers, the UNDO record would store
a list of XIDs. Alternatively, I suppose you could store multiple UNDO
pointers for the same tuple.
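A minimal sketch of the first idea (an undo record that lists all locker
XIDs, so one undo pointer per tuple can represent multiple lockers) could
look like this; the names and fields are hypothetical, not zedstore's
actual undo record format:

```python
# Hypothetical sketch: one undo record per tuple version, carrying the
# inserting XID plus a list of all transactions currently locking it.
from dataclasses import dataclass, field
from typing import List

@dataclass
class UndoRecord:
    prev_undo_ptr: int          # chain to the prior version's undo record
    xmin: int                   # inserting transaction
    locker_xids: List[int] = field(default_factory=list)  # current lockers

    def add_locker(self, xid: int) -> None:
        # Record a new locker; idempotent if the XID is already present.
        if xid not in self.locker_xids:
            self.locker_xids.append(xid)

# One tuple keeps a single undo pointer, yet several lockers fit:
rec = UndoRecord(prev_undo_ptr=0, xmin=100)
rec.add_locker(101)
rec.add_locker(102)
```

The alternative mentioned (multiple undo pointers per tuple) would instead
widen the per-tuple header, which is the cost this layout avoids.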

- Heikki

#19Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Heikki Linnakangas (#18)
Re: Zedstore - compressed in-core columnar storage

On 10.04.2019 10:25, Heikki Linnakangas wrote:

On 10/04/2019 09:29, Amit Kapila wrote:

On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal <aagrawal@pivotal.io>
wrote:

Row store
---------

The tuples are stored one after another, sorted by TID. For each
tuple, we store its 48-bit TID, a undo record pointer, and the actual
tuple data uncompressed.

Storing an undo record pointer with each tuple can take quite a lot of
space in cases where you can't compress them.

Yeah. This does depend on compression to eliminate the unused fields
quite heavily at the moment. But you could have a flag in the header
to indicate "no undo pointer needed", and just leave it out when it's
not needed.

Have you thought about how you will implement the multi-locker scheme in
this design?  In zheap, we have used undo for the same, and it is easy
to imagine when you have separate transaction slots for each
transaction.  I am not sure how you will implement the same here.

I've been thinking that the undo record would store all the XIDs
involved. So if there are multiple lockers, the UNDO record would
store a list of XIDs. Alternatively, I suppose you could store
multiple UNDO pointers for the same tuple.

- Heikki

I am also a little bit confused about UNDO records and MVCC support in
Zedstore. Actually, a columnar store is mostly needed for analytics on
read-only or append-only data. One of the disadvantages of Postgres is the
rather large per-record space overhead caused by MVCC.
It may be critical if you want to store huge timeseries with a relatively
small number of columns (like measurements from some sensor).
It would be nice to have a storage format which reduces this overhead when
it is not needed (the data is not updated).

Right now, even without UNDO pages, the size of zedstore is larger than the
size of the original Postgres table.
That seems very strange.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#20Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Konstantin Knizhnik (#19)
Re: Zedstore - compressed in-core columnar storage

On 10/04/2019 10:38, Konstantin Knizhnik wrote:

I am also a little bit confused about UNDO records and MVCC support in
Zedstore. A columnar store is mostly needed for analytics on
read-only or append-only data. One of the disadvantages of Postgres is
the rather large per-record space overhead caused by MVCC.
It may be critical if you want to store huge timeseries with a relatively
small number of columns (like measurements of some sensor).
It would be nice to have a storage format which reduces this overhead
when it is not needed (when the data is not updated).

Sure. Definitely something we need to optimize.

Right now, even without UNDO pages, the size of zedstore is larger than
the size of the original Postgres table.
That seems very strange.

If you have a table with a lot of columns, but each column is small,
e.g. lots of boolean columns, the item headers that zedstore currently
stores for each datum take up a lot of space. We'll need to squeeze
those harder to make this competitive. Instead of storing a header for
each datum, if a group of consecutive tuples have the same visibility
information, we could store the header just once, with an array of the
datums, for example.
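A back-of-the-envelope comparison of the two layouts (a sketch with assumed sizes: a 16-byte item header and 1-byte boolean datums; the real header size in the patch may differ):

```python
# Compare per-datum item headers against one shared header per run of
# consecutive datums with identical visibility information.
HEADER_BYTES = 16   # assumed per-item header size
DATUM_BYTES = 1     # e.g. a boolean column

def size_per_datum_headers(n):
    # one header in front of every datum
    return n * (HEADER_BYTES + DATUM_BYTES)

def size_grouped(n, group_size):
    # one header per group, followed by an array of datums
    groups = -(-n // group_size)  # ceiling division
    return groups * HEADER_BYTES + n * DATUM_BYTES

# 1000 booleans: 17000 bytes with per-datum headers,
# but only 1160 bytes with one header per 100-datum group.
assert size_per_datum_headers(1000) == 17000
assert size_grouped(1000, 100) == 1160
```

Under these assumptions the shared-header layout cuts the space for small datums by an order of magnitude, which is why it matters most for tables with many narrow columns.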

- Heikki
