ZSON, PostgreSQL extension for compressing JSONB

Started by Aleksander Alekseev · over 9 years ago · 14 messages · general
#1Aleksander Alekseev
aleksander@timescale.com

Hello.

I've just uploaded ZSON extension on GitHub:

https://github.com/afiskon/zson

ZSON learns from your typical JSONB documents and creates a dictionary
of the strings that are frequently used across all of them. After that
you can use the ZSON type to compress documents using this dictionary.
When the documents' schema evolves and compression becomes inefficient,
you can re-learn on new documents. New documents will then be compressed
with the new dictionary, while old documents are still decompressed
using the old one.

In some cases ZSON can save half of your disk space and give you about
10% more TPS, though everything depends on your data and workload.
Memory is saved as well. For more details see README.md.
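The idea can be sketched in a few lines of Python (a toy illustration only; the function names and the "#N" text markers are made up here, not ZSON's actual C implementation or on-disk format):

```python
import json
from collections import Counter

# Toy sketch of ZSON's approach: learn the most frequent top-level keys,
# replace them with short codes, and keep every dictionary version so
# that old documents remain decodable after re-learning.
dictionaries = {}  # version -> list of frequent strings

def learn(docs, version, size=65536):
    freq = Counter(key for doc in docs for key in doc)
    dictionaries[version] = [s for s, _ in freq.most_common(size)]

def compress(doc, version):
    words = dictionaries[version]
    # "#N" marks a dictionary reference; a real format would use binary
    # codes and would also handle nested keys and string values.
    body = {("#%d" % words.index(k) if k in words else k): v
            for k, v in doc.items()}
    return version, json.dumps(body)

def decompress(version, payload):
    words = dictionaries[version]
    restore = lambda k: words[int(k[1:])] if k.startswith("#") else k
    return {restore(k): v for k, v in json.loads(payload).items()}
```

Re-learning just bumps the version: new rows store the new version number, while old rows keep the version they were compressed with.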

Please don't hesitate to ask any questions. Any feedback and pull
requests are welcome too!

--
Best regards,
Aleksander Alekseev

#2Alan Gano
alangano@gmail.com
In reply to: Aleksander Alekseev (#1)
Re: ZSON, PostgreSQL extension for compressing JSONB

I like this, seeing that the keys of JSON docs are replicated in every
record.

It makes my old-school DBA sense start to itch.


#3Simon Riggs
simon@2ndQuadrant.com
In reply to: Aleksander Alekseev (#1)
Re: ZSON, PostgreSQL extension for compressing JSONB


Very good. Oleg had mentioned that dictionary compression was being considered.

It would be useful to be able to define compression dictionaries for
many use cases.

Will you be submitting this to core?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Oleg Bartunov
oleg@sai.msu.su
In reply to: Simon Riggs (#3)
Re: ZSON, PostgreSQL extension for compressing JSONB


It would be useful to be able to define compression dictionaries for
many use cases.

Dictionary compression is a different project; we'll publish it after
testing.


#5Aleksander Alekseev
aleksander@timescale.com
In reply to: Simon Riggs (#3)
Re: ZSON, PostgreSQL extension for compressing JSONB

Hello, Simon.

Thanks for your interest in this project!

Will you be submitting this to core?

I could align ZSON with the PostgreSQL code style; I only need to run
pgindent and write a few comments. Do you think the community would be
interested in adding it to /contrib/? Or is the problem ZSON solves a
bit too specific for that?

--
Best regards,
Aleksander Alekseev

#6Dorian Hoxha
dorian.hoxha@gmail.com
In reply to: Aleksander Alekseev (#5)
Re: ZSON, PostgreSQL extension for compressing JSONB

@Aleksander
~everyone wants lower data storage and some kind of compression.
Could this be made to retrain automatically during ANALYZE (does that
make sense?), creating a new dictionary only if it differs from the
previous one?


#7Michael Paquier
michael@paquier.xyz
In reply to: Aleksander Alekseev (#5)
Re: ZSON, PostgreSQL extension for compressing JSONB


I find the references to pglz quite troubling, particularly by reading
that this data type visibly uses its own compression logic.
--
Michael


#8Vitaly Burovoy
vitaly.burovoy@gmail.com
In reply to: Dorian Hoxha (#6)
Re: ZSON, PostgreSQL extension for compressing JSONB


~everyone wants lower data storage and wants some kind of compression.
Can this be made to automatically retrain when analyzing (makes sense?)?
And create a new dictionary only if it changes compared to the last one.

ANALYZE does not scan the whole table, and it does not modify its
records. Creating a new dictionary for compression purposes would
require updating the rows of the original table to replace strings with
references to the dictionary.

--
Best regards,
Vitaly Burovoy


#9Aleksander Alekseev
aleksander@timescale.com
In reply to: Dorian Hoxha (#6)
Re: ZSON, PostgreSQL extension for compressing JSONB

~everyone wants lower data storage and wants some kind of compression.
Can this be made to automatically retrain when analyzing (makes sense?)?
And create a new dictionary only if it changes compared to the last one.

It's an interesting idea. However, I doubt it could be automated in a
one-size-fits-all manner. Some users would like to re-learn during
ANALYZE, others during VACUUM, once a month, or via triggers and some
sort of heuristics.

That said, I see no reason not to accept pull requests implementing
different re-learning automation strategies. It's just not a priority
for me personally.

--
Best regards,
Aleksander Alekseev

#10Aleksander Alekseev
aleksander@timescale.com
In reply to: Michael Paquier (#7)
Re: ZSON, PostgreSQL extension for compressing JSONB


I find the references to pglz quite troubling, particularly by reading
that this data type visibly uses its own compression logic.

ZSON just replaces frequently used strings with 16-bit codes. It also
prepends PGLZ_HINT_SIZE zero bytes (32 by default; it can also be 0) to
make it more likely that the data will be compressed using PGLZ. After
all, who would use ZSON for small documents? That's all.

I hope this explains the references to PGLZ.
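A byte-level sketch of that scheme in Python (the 0x00/0x01 markers and the sample dictionary are my own invention for illustration; only PGLZ_HINT_SIZE comes from the description above):

```python
import struct

PGLZ_HINT_SIZE = 32                        # zero bytes prepended as a pglz hint
STRINGS = ["user_id", "status", "active"]  # learned dictionary (illustrative)

def encode(tokens):
    out = bytearray(PGLZ_HINT_SIZE)        # easily compressible zero prefix
    for t in tokens:
        if t in STRINGS:
            # 0x01 marker, then the 16-bit dictionary code
            out += b"\x01" + struct.pack(">H", STRINGS.index(t))
        else:
            # 0x00 marker, 16-bit length, then the literal bytes
            raw = t.encode()
            out += b"\x00" + struct.pack(">H", len(raw)) + raw
    return bytes(out)

def decode(data):
    i, tokens = PGLZ_HINT_SIZE, []
    while i < len(data):
        marker = data[i]
        (val,) = struct.unpack_from(">H", data, i + 1)
        i += 3
        if marker == 1:
            tokens.append(STRINGS[val])
        else:
            tokens.append(data[i:i + val].decode())
            i += val
    return tokens
```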

--
Best regards,
Aleksander Alekseev

#11Simon Riggs
simon@2ndQuadrant.com
In reply to: Aleksander Alekseev (#5)
Re: ZSON, PostgreSQL extension for compressing JSONB


CREATE COMPRESSION DICTIONARY public.simple_dict (
    WORDS = my_json_schema
);

Then use the dictionary by referencing the DictionaryId within the
datatype modifier, e.g. ZSON(357).

That way we can use this for TEXT, XML, JSON, etc. as ZTEXT, ZXML, ZJSON.

So it seems like a generally useful thing to me.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#12Aleksander Alekseev
aleksander@timescale.com
In reply to: Simon Riggs (#11)
Re: ZSON, PostgreSQL extension for compressing JSONB


Good idea!

What about an evolving schema for JSON/JSONB/XML? For instance, adding
or removing keys in new versions of the application. UPDATE
COMPRESSION DICTIONARY?

--
Best regards,
Aleksander Alekseev

#13Simon Riggs
simon@2ndQuadrant.com
In reply to: Aleksander Alekseev (#12)
Re: ZSON, PostgreSQL extension for compressing JSONB


What about an evolving schema for JSON/JSONB/XML? For instance, adding
or removing keys in new versions of the application. UPDATE
COMPRESSION DICTIONARY?

You can add to a dictionary, but not remove things. I'm not sure
that's a big issue.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#14Aleksander Alekseev
aleksander@timescale.com
In reply to: Aleksander Alekseev (#1)
Re: ZSON, PostgreSQL extension for compressing JSONB

Hello, Eduardo.

Why do you use dictionary compression and not zlib/lz4/bzip/any other?

Internally, PostgreSQL already has an LZ77-family algorithm, PGLZ. I
didn't try to replace it, only to supplement it. PGLZ compresses every
piece of data (JSONB documents in this case) independently. What I did
is remove the redundant data that exists *between* documents, which PGLZ
can't compress, since every single document usually uses each key and
similar strings (string tags in arrays, etc.) only once.
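This effect is easy to reproduce with a generic compressor: zlib compresses many near-identical small documents far better when given a shared preset dictionary (a crude stand-in here for ZSON's learned dictionary) than when each document is compressed independently, as PGLZ does:

```python
import json
import zlib

# 100 small JSONB-like documents that share keys and string values
docs = [json.dumps({"user_id": i, "status": "active",
                    "notifications": {"email": True, "sms": False}}).encode()
        for i in range(100)]

# Per-document compression: no cross-document redundancy can be exploited
independent = sum(len(zlib.compress(d)) for d in docs)

# Same documents compressed against a shared preset dictionary
zdict = docs[0]        # one sample document serves as a crude dictionary
shared = 0
for d in docs:
    comp = zlib.compressobj(zdict=zdict)
    shared += len(comp.compress(d) + comp.flush())

# shared comes out substantially smaller than independent for this corpus
print(independent, shared)
```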

Compress/Decompress speed?

In my observations, PGLZ has characteristics similar to GZIP. I didn't
benchmark ZSON encoding/decoding separately from the DBMS, because the
end user is interested only in TPS, which depends on IO, the number of
documents we can fit into memory, and other factors.

As I understand, postgresql must decompress before use.

Only if you try to read the document's fields. For deleting a tuple,
vacuuming, etc., there is no need to decompress the data.

Some compression algorithms (dictionary transforms where a token is a
word) allow searching for tokens/words directly in the compressed data:
the query token/word is translated into its dictionary entry, which is
then searched for in the compressed data. On top of that, replace,
substring, and other word-level string manipulations can be implemented.

Unfortunately, I doubt that the current ZSON implementation can use
these ideas. However, I must agree that it's a very interesting field of
research. I don't think anyone has tried to do something like this in
PostgreSQL yet.
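The word-level search described above can be sketched like this (a toy coder; none of this exists in ZSON):

```python
# A word-level dictionary coder: text becomes a sequence of integer
# codes, and a search only has to translate the query word once and scan
# the code sequence, never decompressing back to text.
vocab = {}

def encode(text):
    # assign each new word the next free code
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]

def contains(codes, word):
    code = vocab.get(word)          # translate the query, not the data
    return code is not None and code in codes

doc = encode("connection timeout while waiting for connection")
```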

My passion is compression; do you mind if I try other algorithms? For
that, some dict id numbers (>1024 or >1<<16 or <128, for example) would
indicate which compression algorithm is used, or the zson_header would
have to change to store that information. That way, each document could
be compressed with the best compressor (for size or decompression speed)
at idle times or on request.

By all means! Naturally, if you find a better encoding I would be happy
to merge the corresponding code into ZSON's repository.

Thanks for sharing and time.

Thanks for feedback and sharing your thoughts!

--
Best regards,
Aleksander Alekseev