Compressing large column by moving it to a unique table

Started by Adrian Garcia Badaracco · about 2 years ago · 6 messages · general
#1 Adrian Garcia Badaracco
adrian@adriangb.com

I am using Timescale so I'll be mentioning some timestamp stuff but I think
this is a general postgres question for the most part.

I have a table with some fixed, small columns (id, timestamp, etc) and a
large JSONB column (let's call it `attributes`). `attributes` has 1000s of
schemas, but given a schema, there's a lot of duplication. Across all rows,
more than 99% of the data is duplicated (as measured by `count(attributes)`
vs `count(distinct attributes)`).
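Concretely, that ratio can be measured with something like the following (the table name is a placeholder for the real table):

```sql
-- Rough duplication check: how much of `attributes` is redundant?
SELECT count(attributes)          AS total_rows,
       count(DISTINCT attributes) AS distinct_attributes,
       1.0 - count(DISTINCT attributes)::numeric / count(attributes)
                                  AS duplicate_fraction
FROM original;
```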

I can't normalize `attributes` into real columns because it is quite
variable (remember 1000s of schemas).

My best idea is to make a table like `(day timestamptz, hash text,
attributes jsonb)` and then in my original table replace `attributes` with
a reference to `new_table`. I can then make a view that joins them `select
original_table.timestamp, new_table.attributes from original join new_table
on (time_bucket('1 day', timestamp) = day AND original.hash =
new_table.hash)` or something like that. The idea of time bucketing into 1
day is to balance write and read speed (by relying on timescale to do
efficient time partitioning, data retention, etc.).
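A minimal sketch of that layout (names are assumed from the description above, and the unique constraint is an addition to make `(day, hash)` a proper key):

```sql
-- Hypothetical dictionary table: one row per distinct (day, attributes) pair.
CREATE TABLE new_table (
    day        timestamptz NOT NULL,
    hash       text        NOT NULL,
    attributes jsonb,
    UNIQUE (day, hash)
);

-- The original table keeps only the small columns plus the hash reference;
-- a view re-attaches attributes on read:
CREATE VIEW joined_view AS
SELECT o.timestamp, n.attributes
FROM original o
JOIN new_table n
  ON time_bucket('1 day', o.timestamp) = n.day
 AND o.hash = n.hash;
```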

I recognize this is essentially creating a key-value store in postgres and
also janky compression, so I am cautious about it.

Benchmarks showed huge wins (26x runtime, 100x buffers) for a selective
query on `attributes`:

```sql
SELECT count(*)
FROM joined_view
WHERE (attributes->'http.status_code')::int = 422
  AND start_timestamp > (now() - interval '2 day')
  AND start_timestamp < (now() - interval '1 day');
```

As expected, `SELECT * FROM joined_view` is slower than `SELECT * FROM
original` (5x runtime, 2x buffers), though both are slow to begin with.

What I was sad to realize is that `SELECT start_timestamp FROM
joined_view` is also slow, because the query planner can't know that the
join becomes a no-op when the `attributes` column isn't selected: there
might be more than one match for each left row, which would change the
result even though `attributes` isn't used. I don't know that this is a
deal breaker, but it's certainly a bummer.
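For what it's worth, PostgreSQL can eliminate an unused LEFT JOIN only when it can prove the inner side matches at most one row per outer row, which in practice means a unique index on the join keys. A hedged sketch, assuming the `new_table` layout described above:

```sql
-- With a unique index on the join keys, the planner knows each outer row
-- matches at most one dictionary row, so an unreferenced LEFT JOIN can be
-- removed entirely (this requires the view to use LEFT JOIN, not JOIN):
CREATE UNIQUE INDEX ON new_table (day, hash);

-- A query like this could then be planned without touching new_table:
SELECT start_timestamp
FROM joined_view
WHERE start_timestamp > now() - interval '2 day';
```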

Does anyone have any suggestions for this sort of situation?

Thank you,

Adrian

#2 Ron
ronljohnsonjr@gmail.com
In reply to: Adrian Garcia Badaracco (#1)
Re: Compressing large column by moving it to a unique table

On Mon, Feb 12, 2024 at 10:12 PM Adrian Garcia Badaracco <
adrian@adriangb.com> wrote:

> I am using Timescale so I'll be mentioning some timestamp stuff but I
> think this is a general postgres question for the most part.
>
> I have a table with some fixed, small columns (id, timestamp, etc) and a
> large JSONB column (let's call it `attributes`). `attributes` has 1000s of
> schemas, but given a schema, there's a lot of duplication. Across all rows,
> more than 99% of the data is duplicated (as measured by `count(attributes)`
> vs `count(distinct attributes)`).
>
> I can't normalize `attributes` into real columns because it is quite
> variable (remember 1000s of schemas).
>
> My best idea is to make a table like `(day timestamptz, hash text,
> attributes jsonb)` and then in my original table replace `attributes` with
> a reference to `new_table`.

Meaning that there are many fewer rows in new_table?

> I can then make a view that joins them `select original_table.timestamp,
> new_table.attributes from original join new_table on (time_bucket('1 day',
> timestamp) = day AND original.hash = new_table.hash)` or something like
> that. The idea of time bucketing into 1 day is to balance write and read
> speed (by relying on timescale to do efficient time partitioning, data
> retention, etc.).
>
> I recognize this is essentially creating a key-value store in postgres and
> also janky compression, so I am cautious about it.

If my interpretation (that there are many fewer rows in new_table) is
correct, then you've stumbled into the Second Normal Form of database
design: https://en.wikipedia.org/wiki/Second_normal_form#Example

#3 Adrian Garcia Badaracco
adrian@adriangb.com
In reply to: Ron (#2)
Re: Compressing large column by moving it to a unique table

Thank you for the reply Ron.

Yes, new_table has many fewer rows (<1% of the number in the original table).

Thanks for making me think of normalization; I hadn't seen it that way.
Although there is no theoretical relationship between the other columns in
the original table and the attributes column, in practice there is a strong
correlation, so I guess what I am trying to capture here is taking
advantage of that correlation, while not depending on it completely,
because it can be broken.

In any case, whatever theoretical framework is put around this solution, I
am also interested in the practical aspects, in particular the case of
selecting a subset of columns from the view that I know doesn't need the
join but the query planner thinks does.


#4 Ron
ronljohnsonjr@gmail.com
In reply to: Adrian Garcia Badaracco (#3)
Re: Compressing large column by moving it to a unique table

1. Show us the PG version, view definition, the exact query that's slow,
and the EXPLAIN (ANALYZE).
2. Presumably there's an index on each table's *_hash column?


#5 Adrian Garcia Badaracco
adrian@adriangb.com
In reply to: Ron (#4)
Re: Compressing large column by moving it to a unique table

I'm using PostgreSQL 15.5.

Here's a self-contained example. I included links to public pgMustard query
plans.

Gist link: https://gist.github.com/adriangb/05a01cca99a438de2a286945903c44f4

Also copied below for archiving:

```sql
-- Requires the pgcrypto extension for gen_random_bytes().
CREATE OR REPLACE FUNCTION random_bytes(length integer)
RETURNS bytea AS $$
DECLARE
  bytes bytea := '';
  i integer := 0;
BEGIN
  -- Generate up to 1024 bytes at a time using gen_random_bytes()
  WHILE i < length LOOP
    bytes := bytes || gen_random_bytes(least(1024, length - i));
    i := i + 1024;
  END LOOP;

  RETURN bytes;
END;
$$ LANGUAGE plpgsql;

DROP TABLE IF EXISTS original, original_hashed, dictionary_table CASCADE;
DROP VIEW IF EXISTS joined_view;

CREATE TABLE original AS (
  WITH large_random_bytes AS (
    SELECT n AS id, random_bytes(4096 + n) AS attributes
    FROM generate_series(1, 1000) n
  )
  SELECT
    -- An incrementing timestamp
    '2024-01-01 00:00:00.00+00'::timestamptz + (n || ' seconds')::interval AS start_timestamp,
    -- Another relatively small column, a short string derived from `n`
    'random_string_' || (n % 100) AS event_name,
    -- The attributes column
    lrb.attributes AS attributes
  FROM generate_series(0, 1000000) n
  JOIN large_random_bytes lrb ON n % 1000 = lrb.id
);
CREATE INDEX ON original (start_timestamp);

CREATE TABLE dictionary_table AS (
  SELECT DISTINCT
    time_bucket('1 day', start_timestamp) AS start_timestamp_range,
    attributes,
    md5(attributes) AS hash
  FROM original
);
CREATE INDEX ON dictionary_table (start_timestamp_range, hash);

CREATE TABLE original_hashed AS (
  SELECT
    start_timestamp,
    event_name,
    md5(attributes) AS hash
  FROM original
);
CREATE INDEX ON original_hashed (start_timestamp, time_bucket('1 day', start_timestamp), hash);

CREATE VIEW joined_view AS (
  SELECT
    original_hashed.start_timestamp,
    original_hashed.event_name,
    dictionary_table.attributes
  FROM original_hashed
  LEFT JOIN dictionary_table ON (
    time_bucket('1 day', original_hashed.start_timestamp) = dictionary_table.start_timestamp_range
    AND original_hashed.hash = dictionary_table.hash
  )
);

-- Select all data
explain (analyze, buffers, verbose, settings, format json)
SELECT *
FROM joined_view
WHERE start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1 day')
  AND start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2 day');
-- https://app.pgmustard.com/#/explore/f8b3b7d5-c236-4326-9a67-20fa4cdff4bc

explain (analyze, buffers, verbose, settings, format json)
SELECT *
FROM original
WHERE start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1 day')
  AND start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2 day');
-- https://app.pgmustard.com/#/explore/bcad3767-b8aa-4a37-acc7-7667266f29be

-- A relatively selective aggregation query
explain (analyze, buffers, verbose, settings, format json)
SELECT count(*)
FROM joined_view
WHERE get_byte(attributes, 4) < 100
  AND start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1 day')
  AND start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2 day');
-- https://app.pgmustard.com/#/explore/c5b68839-4c66-4b14-8de5-757bfc22fdb3

explain (analyze, buffers, verbose, settings, format json)
SELECT count(*)
FROM original
WHERE get_byte(attributes, 4) < 100
  AND start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1 day')
  AND start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2 day');
-- https://app.pgmustard.com/#/explore/831d9c4c-e911-4caf-a6c3-eee5343e44d8

-- Select only some columns
explain (analyze, buffers, verbose, settings, format json)
SELECT start_timestamp
FROM joined_view
WHERE start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1 day')
  AND start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2 day');
-- https://app.pgmustard.com/#/explore/5f617bfa-7816-4f8d-9fd7-24388c08978c

explain (analyze, buffers, verbose, settings, format json)
SELECT start_timestamp
FROM original
WHERE start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1 day')
  AND start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2 day');
-- https://app.pgmustard.com/#/explore/f05de3e8-ca8d-4404-ad33-69f103a3b85a

-- A relatively selective query
explain (analyze, buffers, verbose, settings, format json)
SELECT *
FROM joined_view
WHERE get_byte(attributes, 4) < 100
  AND start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1 day')
  AND start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2 day');
-- https://app.pgmustard.com/#/explore/a75b20bd-07d5-4402-a0d8-22419682307a

explain (analyze, buffers, verbose, settings, format json)
SELECT *
FROM original
WHERE get_byte(attributes, 4) < 100
  AND start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1 day')
  AND start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2 day');
-- https://app.pgmustard.com/#/explore/47ef84f0-a96e-4baa-af40-2ec241cbb6e2
```


#6 Ron
ronljohnsonjr@gmail.com
In reply to: Adrian Garcia Badaracco (#5)
Re: Compressing large column by moving it to a unique table

On original_hashed, I think I'd try moving start_timestamp into its own
index.
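A sketch of what that could look like, assuming the table definitions from the example upthread (this splits the wide composite index into two narrower ones; an interpretation of the suggestion, not benchmarked):

```sql
-- A standalone index for pure time-range scans on original_hashed:
CREATE INDEX ON original_hashed (start_timestamp);

-- A separate index matching the join keys into dictionary_table:
CREATE INDEX ON original_hashed (time_bucket('1 day', start_timestamp), hash);

-- (The original three-column index on (start_timestamp,
--  time_bucket('1 day', start_timestamp), hash) could then be dropped.)
```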
