ADD COLUMN ts tsvector GENERATED too slow

Started by Florents Tselai · almost 4 years ago · 9 messages · general
#1Florents Tselai
florents.tselai@gmail.com

I have a beefy server (40+ worker processes, 40GB+ shared buffers) and a table holding (key text, text text) of around 50M rows.
These are text fields extracted from 4-5 page PDFs each.

I’m adding the following generated column to keep up with tsvectors:

ALTER TABLE docs_text ADD COLUMN ts tsvector GENERATED ALWAYS AS (to_tsvector('simple', left(text, 1048575))) STORED;

I expect this to be slow, but it’s been running for 18hrs already and I certainly hope I’ve done something wrong and there’s a smarter way.

I thought about incremental updates and/or triggers but a generated col is a cleaner solution.
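For reference, the incremental approach I had in mind would look roughly like this (untested sketch; batch size arbitrary):

```sql
-- Plain column plus batched backfill, instead of a generated column:
-- each UPDATE touches only a slice of the table, so progress is visible
-- and the work can be paused and resumed.
ALTER TABLE docs_text ADD COLUMN ts tsvector;

UPDATE docs_text
SET ts = to_tsvector('simple', left(text, 1048575))
WHERE key IN (
    SELECT key FROM docs_text WHERE ts IS NULL LIMIT 10000
);
-- repeat until 0 rows are updated; a trigger would then maintain new rows
```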

#2Peter Eisentraut
peter_e@gmx.net
In reply to: Florents Tselai (#1)
Re: ADD COLUMN ts tsvector GENERATED too slow

On 06.07.22 10:42, Florents Tselai wrote:
> I have a beefy server (40+ worker processes, 40GB+ shared buffers) and a table holding (key text, text text) of around 50M rows.
> These are text fields extracted from 4-5 page PDFs each.
>
> I'm adding the following generated column to keep up with tsvectors:
>
> ALTER TABLE docs_text ADD COLUMN ts tsvector GENERATED ALWAYS AS (to_tsvector('simple', left(text, 1048575))) STORED;
>
> I expect this to be slow, but it's been running for 18hrs already and I certainly hope I've done something wrong and there's a smarter way.

Maybe it's stuck on a lock? ALTER TABLE / ADD COLUMN requires an
exclusive lock on the table.
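Something like this (untested sketch) would show whether that backend is waiting:

```sql
-- Find the ALTER TABLE backend and any locks it holds or waits on.
SELECT a.pid, a.state, a.wait_event_type, a.wait_event,
       l.locktype, l.mode, l.granted
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
WHERE a.query ILIKE '%ALTER TABLE docs_text%';
-- granted = false (or wait_event_type = 'Lock') means it is blocked.
```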

#3Florents Tselai
florents.tselai@gmail.com
In reply to: Peter Eisentraut (#2)
Re: ADD COLUMN ts tsvector GENERATED too slow

On 6 Jul 2022, at 12:38 PM, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:

> Maybe it's stuck on a lock? ALTER TABLE / ADD COLUMN requires an exclusive lock on the table.

Looking at pg_locks and pg_stat_activity, it doesn’t seem to be stuck behind something like an ACCESS SHARE lock or similar.

#4Florents Tselai
florents.tselai@gmail.com
In reply to: Florents Tselai (#3)
Re: ADD COLUMN ts tsvector GENERATED too slow

Also, FWIW, looking at the CPU% and MEM% activity in top, it looks like it is doing data-crunching work.


#5Francisco Olarte
folarte@peoplecall.com
In reply to: Florents Tselai (#4)
Re: ADD COLUMN ts tsvector GENERATED too slow

On Wed, 6 Jul 2022 at 11:55, Florents Tselai <florents.tselai@gmail.com> wrote:

> Also, fwiw looking at top the CPU% and MEM% activity, looks like it does data crunching work.

> On 06.07.22 10:42, Florents Tselai wrote:
>> I have a beefy server (40+ worker processes, 40GB+ shared buffers) and a table holding (key text, text text) of around 50M rows.
>> These are text fields extracted from 4-5 page PDFs each.

How big is your table? From your query it seems you expect more than
1M-1 bytes per value (the left(...)), but if you have very big text
columns it may be spending a lot of time fully decompressing / reading
them (I'm not sure whether left(...) on TOASTed values is optimized to
stop after reading enough). Also, it has to rewrite a lot of data to
insert the column, and if that takes a few milliseconds per row it adds
up: 50M rows * 1 ms/row = 50k seconds ≈ 13.9 hours for every millisecond
spent per row. So at 2 ms per row (which may be right for reading a big
row, calculating the vector and writing an even bigger row) it would
take more than a day to finish, which I would not rule out given you are
asking for a heavy thing.

If you have stopped it, I would try doing a 1000-row sample in a copy of
the table to get a speed estimate. Otherwise, with this query, I would
normally monitor the growth of the relation's disk files as an
indication of progress; I'm not sure there is anything else you could
look at without disturbing it.
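Something along these lines (untested sketch) would give a per-row estimate:

```sql
-- Time the same generated column on a 1000-row copy, then extrapolate.
CREATE TABLE docs_text_sample AS
SELECT key, text FROM docs_text LIMIT 1000;

\timing on
ALTER TABLE docs_text_sample
  ADD COLUMN ts tsvector
  GENERATED ALWAYS AS (to_tsvector('simple', left(text, 1048575))) STORED;
-- elapsed_ms / 1000 * 50e6 gives a rough total for the full table
```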

FWIW, I would consider high memory usage normal for this kind of query.
High CPU would depend on what you call high, but it wouldn't surprise me
if at least one CPU is running flat out detoasting and computing the
vectors. I do not know whether ALTER TABLE can go parallel.

Francisco Olarte.

#6Florents Tselai
florents.tselai@gmail.com
In reply to: Francisco Olarte (#5)
Re: ADD COLUMN ts tsvector GENERATED too slow

> On 6 Jul 2022, at 1:11 PM, Francisco Olarte <folarte@peoplecall.com> wrote:
>
> How big is your table? From your query it seems you expect more than
> 1M-1 bytes per value (the left(...)), but if you have very big text
> columns it may be spending a lot of time fully decompressing / reading
> them. Also, it has to rewrite a lot of data to insert the column:
> 50M rows * 1 ms/row = 50k seconds ≈ 13.9 hours for every millisecond
> spent per row, so at 2 ms per row it would take more than a day to
> finish, which I would not rule out given you are asking for a heavy
> thing.

50M+ rows and iirc pg_relation_size was north of 80GB or so.

> If you have stopped it, I would try doing a 1000-row sample in a copy of the

Haven’t stopped it, as I’m not convinced there’s an alternative to just
waiting for it to complete :/

> table to get a speed estimate. Otherwise, with this query, I would
> normally monitor the growth of the relation's disk files as an
> indication of progress; I'm not sure there is anything else you could
> look at without disturbing it.

> FWIW, I would consider high memory usage normal for this kind of
> query. High CPU would depend on what you call high, but it wouldn't
> surprise me if at least one CPU is running flat out detoasting and
> computing the vectors. I do not know whether ALTER TABLE can go parallel.

You’re probably right, a lot of the CPU usage could be detoasting.


#7Florents Tselai
florents.tselai@gmail.com
In reply to: Florents Tselai (#6)
Re: ADD COLUMN ts tsvector GENERATED too slow

> On 6 Jul 2022, at 1:35 PM, Florents Tselai <florents.tselai@gmail.com> wrote:
>
>> Otherwise, with this query, I would normally monitor the growth of
>> the relation's disk files as an indication of progress.

Actually, I monitored my disk usage and it was **definitely** working,
as it had already eaten up an additional 30% of my disk capacity.

Thus, I’ll have to fall back on my initial solution and use a GIN
expression index to get tsvectors on the fly.
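That fallback would be something like this (sketch):

```sql
-- Expression index: the tsvector is computed while building the index
-- and is not stored in the table, so no full rewrite of the heap.
CREATE INDEX docs_text_ts_idx ON docs_text
  USING GIN (to_tsvector('simple', left(text, 1048575)));

-- Queries must use the exact same expression to hit the index:
SELECT key FROM docs_text
WHERE to_tsvector('simple', left(text, 1048575)) @@ to_tsquery('simple', 'example');
```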


Thanks everyone for your comments.
You can consider this solved.

#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Florents Tselai (#7)
Re: ADD COLUMN ts tsvector GENERATED too slow

On 2022-Jul-06, Florents Tselai wrote:

> Actually, I monitored my disk usage and it was **definitely** working,
> as it had already eaten up an additional 30% of my disk capacity.

Adding a column like this requires creating a second copy of the table,
copying all the contents from the old table (plus the new column) into
the new one, then recreating all indexes from scratch on the new copy of
the table. If you have a lot of indexes, this can be significant.
Also, AFAIU all data has to be uncompressed on read, then compressed
back on write.

Note: the 80 GB from pg_relation_size() does *not* include the size of
TOAST data. You're copying a lot of additional data. See
pg_table_size().
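For example:

```sql
-- pg_relation_size counts only the main fork; pg_table_size adds TOAST
-- (where the big text values live); pg_total_relation_size adds indexes.
SELECT pg_size_pretty(pg_relation_size('docs_text'))       AS heap_only,
       pg_size_pretty(pg_table_size('docs_text'))          AS with_toast,
       pg_size_pretty(pg_total_relation_size('docs_text')) AS with_indexes;
```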

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Hay quien adquiere la mala costumbre de ser infeliz" (M. A. Evans)

#9Francisco Olarte
folarte@peoplecall.com
In reply to: Florents Tselai (#6)
Re: ADD COLUMN ts tsvector GENERATED too slow

Hi Florents:

On Wed, 6 Jul 2022 at 12:35, Florents Tselai <florents.tselai@gmail.com> wrote:

> 50M+ rows and iirc pg_relation_size was north of 80GB or so.

Somebody already mentioned pg_table_size, but anyway this gives you
1.6 KB per row, which I would suspect is extremely low given your PDF
content and the 1 MB truncation you preventively applied.

When working with this kind of size it really pays to do some good
ole "back of the envelope" calculations and/or some tests with small
batches. Bear in mind that if you have, say, 10 kB per row (which I
would call really low for PDF content) you are going to have 500 GB of
data; if you manage to process that at 100 MB per second it will take
more than an hour (and I think I'm estimating really low).
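The envelope math, spelled out (decimal units, same assumptions):

```sql
-- 50M rows at ~10 kB each, processed at 100 MB/s:
SELECT 50e6 * 10e3 / 1e9          AS est_total_gb,   -- 500 GB
       50e6 * 10e3 / 100e6 / 3600 AS est_hours;      -- ~1.39 hours
```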

FOS