Logical replication, need to reclaim big disk space
Hi,
we are moving away from our old binary data approach, migrating it from
bytea fields in a table to external storage (making the database smaller
and related operations faster and smarter).
In short, we have a job that runs in background and copies data from the
table to an external file and then sets the bytea field to NULL.
(UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id = <uuid>)
At the end of the operation, this results in a table that's less than
one tenth of its original size.
We have a multi-tenant architecture (100s of schemas with identical
architecture, all inheriting from public) and we are performing the task
on one table per schema.
The problem is: this is generating BIG table bloat, as you may imagine.
Running a VACUUM FULL on an ex-22GB table on a standalone test server is
almost immediate.
If I had only one server, I'd process one table at a time with a nightly
script, issuing a VACUUM FULL on tables that have already been processed.
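On a single server such a nightly script boils down to generating one VACUUM FULL per already-processed table; a sketch (schema and table names are hypothetical):

```python
def nightly_vacuum_statements(schemas, table, processed):
    """Build VACUUM FULL statements for the per-schema tables whose blob
    offload has already finished. Identifiers are quoted naively here;
    real code should use proper identifier quoting (e.g. psycopg2.sql)."""
    return [
        f'VACUUM FULL "{schema}"."{table}";'
        for schema in schemas
        if schema in processed
    ]
```

These would be run off-peak with autocommit, since VACUUM FULL takes an ACCESS EXCLUSIVE lock on each table for the whole rewrite.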
But I'm in a logical replication architecture (we are using a
multimaster system called pgEdge, but I don't think it will make big
difference, since it's based on logical replication), and I'm building a
test cluster.
I've been instructed to issue VACUUM FULL on both nodes, nightly, but
before proceeding I read in the docs that VACUUM FULL can disrupt logical
replication, so I'm a bit concerned about how to proceed. Rows are cleared
one at a time (one transaction per row, to confine errors to the record
that caused them).
I read about extensions like pg_squeeze, but I wonder whether they, too,
are dangerous for replication.
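For context, pg_squeeze rewrites a table in the background and only takes a short exclusive lock at the end, which is why it is often suggested instead of VACUUM FULL; whether the rewrite interacts safely with a given logical-replication setup is exactly the open question. A sketch that builds per-tenant squeeze calls (the two-argument form of squeeze.squeeze_table is an assumption here; check the signature in your pg_squeeze version):

```python
def squeeze_statements(schemas, table):
    """Build ad-hoc pg_squeeze invocations, one per tenant schema.
    The two-argument squeeze.squeeze_table form is assumed; some versions
    take extra arguments (clustering index, tablespaces)."""
    return [
        f"SELECT squeeze.squeeze_table('{schema}', '{table}');"
        for schema in schemas
    ]
```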
Thanks for your help.
Moreno.-
On 16/5/25 18:45, Moreno Andreo wrote:
Hi,
we are moving our old binary data approach, moving them from bytea
fields in a table to external storage (making database smaller and
related operations faster and smarter).
In short, we have a job that runs in background and copies data from
the table to an external file and then sets the bytea field to NULL.
(UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id = <uuid>)
This results, at the end of the operations, to a table that's less
than one tenth in size.
We have a multi-tenant architecture (100s of schemas with identical
architecture, all inheriting from public) and we are performing the
task on one table per schema.
So? toasted data are kept on separate TOAST tables, unless those bytea
cols are selected, you won't even touch them. I cannot understand what
you are trying to achieve here.
Years ago, when I made the mistake to go for a coffee and let my
developers "improvise" , the result was a design similar to what you are
trying to achieve. Years after, I am seriously considering moving those
data back to PostgreSQL.
The problem is: this is generating BIG table bloat, as you may imagine.
Running a VACUUM FULL on an ex-22GB table on a standalone test server
is almost immediate.
If I had only one server, I'll process a table a time, with a nightly
script, and issue a VACUUM FULL to tables that have already been
processed.
But I'm in a logical replication architecture (we are using a
multimaster system called pgEdge, but I don't think it will make big
difference, since it's based on logical replication), and I'm building
a test cluster.
So you use PgEdge, but you wanna lose all the benefits of multi-master,
since your binary data won't be replicated ...
I've been instructed to issue VACUUM FULL on both nodes, nightly, but
before proceeding I read on docs that VACUUM FULL can disrupt logical
replication, so I'm a bit concerned on how to proceed. Rows are
cleared one a time (one transaction, one row, to keep errors to the
record that issued them)
PgEdge is based on the old pg_logical, the old 2ndQuadrant extension,
not the native logical replication we have since pgsql 10. But I might
be mistaken.
I read about extensions like pg_squeeze, but I wonder if they are
still not dangerous for replication.
What's pgEdge's take on that, I mean the bytea thing you are trying to
achieve here?
Thanks for your help.
Moreno.-
On 16/05/25 21:33, Achilleas Mantzios wrote:
On 16/5/25 18:45, Moreno Andreo wrote:
Hi,
we are moving our old binary data approach, moving them from
bytea fields in a table to external storage (making database smaller
and related operations faster and smarter).
In short, we have a job that runs in background and copies data from
the table to an external file and then sets the bytea field to NULL.
(UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id = <uuid>)
This results, at the end of the operations, to a table that's less
than one tenth in size.
We have a multi-tenant architecture (100s of schemas with identical
architecture, all inheriting from public) and we are performing the
task on one table per schema.
So? toasted data are kept on separate TOAST tables, unless those bytea
cols are selected, you won't even touch them. I cannot understand what
you are trying to achieve here.
Years ago, when I made the mistake to go for a coffee and let my
developers "improvise" , the result was a design similar to what you
are trying to achieve. Years after, I am seriously considering moving
those data back to PostgreSQL.
The "related operations" I was talking about are backups and database
maintenance when needed, cluster/replica management, etc. With a smaller
database size they would be easier in timing and effort, right?
We are mostly talking about costs here. To call things by their names: I'm
moving bytea contents (85% of total data) to files in Google Cloud
Storage buckets, which cost a fraction of the disks holding my database
(on GCE, to be clear).
This data is not accessed frequently (just by the owner when he needs to
do it), so no need to keep it on expensive hardware.
I've already read in these years that keeping many big bytea fields in
databases is not recommended, but might have misunderstood this.
Another way would have been to move these tables to a different
tablespace, in cheaper storage, but it still would have been 3 times the
buckets cost.
Why are you considering moving those data back to database tables?
The problem is: this is generating BIG table bloat, as you may imagine.
Running a VACUUM FULL on an ex-22GB table on a standalone test server
is almost immediate.
If I had only one server, I'll process a table a time, with a nightly
script, and issue a VACUUM FULL to tables that have already been
processed.
But I'm in a logical replication architecture (we are using a
multimaster system called pgEdge, but I don't think it will make big
difference, since it's based on logical replication), and I'm
building a test cluster.
So you use PgEdge , but you wanna lose all the benefits of
multi-master , since your binary data won't be replicated ...
I don't think I need it to be replicated, since this data cannot be
"edited", so either it's there or it's been deleted. Buckets have
protections for data deletions or events like ransomware attacks and such.
Also, multi-master was an absolute requirement one year ago because of a
project we were building, but that project has been abandoned and now
simple logical replication would be enough. But let's do one thing at a time.
I've been instructed to issue VACUUM FULL on both nodes, nightly, but
before proceeding I read on docs that VACUUM FULL can disrupt logical
replication, so I'm a bit concerned on how to proceed. Rows are
cleared one a time (one transaction, one row, to keep errors to the
record that issued them)
PgEdge is based on the old pg_logical, the old 2ndQuadrant extension,
not the native logical replication we have since pgsql 10. But I might
be mistaken.
Don't know about this; it keeps running on the latest pg versions (we are
about to upgrade to 17.4, if I'm not wrong), but I'll ask.
I read about extensions like pg_squeeze, but I wonder if they are
still not dangerous for replication.
What's pgEdge take on that, I mean the bytea thing you are trying to
achieve here.
They are positive; they're the ones who suggested doing VACUUM FULL on
both nodes... I'm quite new to replication, so I'm looking for some advice here.
Thanks for your help.
Moreno.-
On 5/19/25 09:14, Moreno Andreo wrote:
On 16/05/25 21:33, Achilleas Mantzios wrote:
On 16/5/25 18:45, Moreno Andreo wrote:
Hi,
we are moving our old binary data approach, moving them from
bytea fields in a table to external storage (making database smaller
and related operations faster and smarter).
In short, we have a job that runs in background and copies data from
the table to an external file and then sets the bytea field to NULL.
(UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id = <uuid>)
This results, at the end of the operations, to a table that's less
than one tenth in size.
We have a multi-tenant architecture (100s of schemas with identical
architecture, all inheriting from public) and we are performing the
task on one table per schema.
So? toasted data are kept on separate TOAST tables, unless those bytea
cols are selected, you won't even touch them. I cannot understand what
you are trying to achieve here.
Years ago, when I made the mistake to go for a coffee and let my
developers "improvise" , the result was a design similar to what you
are trying to achieve. Years after, I am seriously considering moving
those data back to PostgreSQL.
The "related operations" I was talking about are backups and database
maintenance when needed, cluster/replica management, etc. With a
smaller database size they would be easier in timing and effort, right?
Ok, but you'll lose replica functionality for those blobs, which means
you don't care about them, correct me if I am wrong.
We are mostly talking about costs, here. To give things their names,
I'm moving bytea contents (85% of total data) to files into Google
Cloud Storage buckets, that has a fraction of the cost of the disks
holding my database (on GCE, to be clear ).
May I ask the size of the bytea data (uncompressed)?
This data is not accessed frequently (just by the owner when he needs
to do it), so no need to keep it on expensive hardware.
I've already read in these years that keeping many big bytea fields in
databases is not recommended, but might have misunderstood this.
Ok, I assume those are unimportant data, but let me ask, what is the
longevity or expected legitimacy of those? I haven't worked with those,
just reading: would you choose e.g. "Anywhere Cache storage"?
Another way would have been to move these tables to a different
tablespace, in cheaper storage, but it still would have been 3 times
the buckets cost.
Can you actually mount those Cloud Storage Buckets under a supported FS
in Linux and just move them to tablespaces backed by this storage?
Why are you considering to get data back to database tables?
Because now if we need to migrate from cloud to on-premise, or just
upgrade or move the specific server which holds those data I will have
an extra headache. Also this is a single point of failure, or best case
a cause for fragmented technology introduced just for the sake of
keeping things out of the DB.
The problem is: this is generating BIG table bloat, as you may imagine.
Running a VACUUM FULL on an ex-22GB table on a standalone test
server is almost immediate.
If I had only one server, I'll process a table a time, with a
nightly script, and issue a VACUUM FULL to tables that have already
been processed.
But I'm in a logical replication architecture (we are using a
multimaster system called pgEdge, but I don't think it will make big
difference, since it's based on logical replication), and I'm
building a test cluster.
So you use PgEdge , but you wanna lose all the benefits of multi-master ,
since your binary data won't be replicated ...
I don't think I need it to be replicated, since this data cannot be
"edited", so either it's there or it's been deleted. Buckets have
protections for data deletions or events like ransomware attacks and
such.
Also multi-master was an absolute requirement one year ago because of
a project we were building, but it has been abandoned and now a simple
logical replication would be enough, but let's do one thing a time.
Multi-master is cool, you can configure your pooler / clients to take
advantage of it for a fully load-balanced architecture, but if it's not a
strict requirement you can live without it, as so many of us do, and
employ other means of load balancing the reads.
I've been instructed to issue VACUUM FULL on both nodes, nightly,
but before proceeding I read on docs that VACUUM FULL can disrupt
logical replication, so I'm a bit concerned on how to proceed. Rows
are cleared one a time (one transaction, one row, to keep errors to
the record that issued them)
Mind sharing the specific doc?
PgEdge is based on the old pg_logical, the old 2ndQuadrant extension,
not the native logical replication we have since pgsql 10. But I
might be mistaken.
Don't know about this, it keeps running on latest pg versions (we are
about to upgrade to 17.4, if I'm not wrong), but I'll ask
I read about extensions like pg_squeeze, but I wonder if they are still
not dangerous for replication.
What's pgEdge take on that, I mean the bytea thing you are trying to
achieve here.
They are positive, it's they that suggested to do VACUUM FULL on both
nodes... I'm quite new to replication, so I'm searching some advise here.
As I told you, pgEdge logical replication (old 2ndquadrant BDR) !=
native logical replication. You may look here :
https://github.com/pgEdge/spock
If multi-master is not a must you could convert to vanilla postgresql
and focus on standard physical and logical replication.
Thanks for your help.
Moreno.-
On 19/05/25 14:41, Achilleas Mantzios wrote:
On 5/19/25 09:14, Moreno Andreo wrote:
On 16/05/25 21:33, Achilleas Mantzios wrote:
On 16/5/25 18:45, Moreno Andreo wrote:
Hi,
we are moving our old binary data approach, moving them from
bytea fields in a table to external storage (making database
smaller and related operations faster and smarter).
In short, we have a job that runs in background and copies data
from the table to an external file and then sets the bytea field to
NULL.
(UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id = <uuid>)
This results, at the end of the operations, to a table that's less
than one tenth in size.
We have a multi-tenant architecture (100s of schemas with identical
architecture, all inheriting from public) and we are performing the
task on one table per schema.
So? toasted data are kept on separate TOAST tables, unless those bytea
cols are selected, you won't even touch them. I cannot understand what
you are trying to achieve here.
Years ago, when I made the mistake to go for a coffee and let my
developers "improvise" , the result was a design similar to what you
are trying to achieve. Years after, I am seriously considering
moving those data back to PostgreSQL.
The "related operations" I was talking about are backups and database
maintenance when needed, cluster/replica management, etc. With a smaller
database size they would be easier in timing and effort, right?
Ok, but you'll lose replica functionality for those blobs, which means
you don't care about them, correct me if I am wrong.
I'm not saying I don't care about them; quite the opposite. They are
protected with Object Versioning and soft deletion, which should assure
good protection against e.g. ransomware, if someone manages to get in
there (and if that happens, we'll have bigger troubles than this).
We are mostly talking about costs, here. To give things their names,
I'm moving bytea contents (85% of total data) to files into Google
Cloud Storage buckets, that has a fraction of the cost of the disks
holding my database (on GCE, to be clear ).
May I ask the size of the bytea data (uncompressed) ?.
Single records vary from 150 KB to 80 MB; the grand total is more than
8.5 TB in a circa 10 TB data footprint.
This data is not accessed frequently (just by the owner when he needs
to do it), so no need to keep it on expensive hardware.
I've already read in these years that keeping many big bytea fields
in databases is not recommended, but might have misunderstood this.
Ok, I assume those are unimportant data, but let me ask, what is the
longevity or expected legitimacy of those ? I haven't worked with those
just reading: would you choose e.g. "Anywhere Cache storage"?
Absolutely not, this is *not* unimportant data, and we are using
Standard Storage at $0.02/GB/month + operations, which compared to
$0.17/GB/month for an SSD (or even more for the Hyperdisks we are using)
is a good price drop.
Another way would have been to move these tables to a different
tablespace, in cheaper storage, but it still would have been 3 times
the buckets cost.
can you actually mount those Cloud Storage Buckets under a supported FS
in linux and just move them to tablespaces backed by this storage ?
Never tried; I mounted it via FUSE and ran some simple operations in the
past, but I'm not sure it can handle database workloads in terms of I/O
bandwidth.
Why are you considering to get data back to database tables?
Because now if we need to migrate from cloud to on-premise, or just
upgrade or move the specific server which holds those data I will have
an extra headache. Also this is a single point of failure, or best
case a cause for fragmented technology introduced just for the sake of
keeping things out of the DB.
This is managed as a hierarchical disk structure, so the calling server
may be literally anywhere; it just needs an account (or a service
account) to get in there.
The problem is: this is generating BIG table bloat, as you may
imagine.
Running a VACUUM FULL on an ex-22GB table on a standalone test
server is almost immediate.
If I had only one server, I'll process a table a time, with a
nightly script, and issue a VACUUM FULL to tables that have already
been processed.
But I'm in a logical replication architecture (we are using a
multimaster system called pgEdge, but I don't think it will make
big difference, since it's based on logical replication), and I'm
building a test cluster.
So you use PgEdge , but you wanna lose all the benefits of multi-master ,
since your binary data won't be replicated ...
I don't think I need it to be replicated, since this data cannot be
"edited", so either it's there or it's been deleted. Buckets have
protections for data deletions or events like ransomware attacks and
such.
Also multi-master was an absolute requirement one year ago because of
a project we were building, but it has been abandoned and now a
simple logical replication would be enough, but let's do one thing a
time.
Multi-master is cool, you can configure your pooler / clients to take
advantage of this for full load balanced architecture, but if not a
strict requirement , you can live without it, as so many of us, and
employ other means of load balancing the reads.
That's what we are doing; it's a really cool feature, but I've
experienced (maybe because it uses the old pglogical extension) that the
replication is a bit fragile, especially when dealing with those bytea
fields. When I ingest big loads (say 25-30 GB or more), it has happened
to break replication. Recreating a replica from scratch with "normal
size" tables is not a big deal, since it can be achieved automatically
(they normally fit in shared memory and can be transferred by the
replicator), but you can imagine the effort and downtime necessary to
create a base backup, transfer it to the replica, build the DB, and
restart a 10-TB database (ATM we are running a 2-node cluster).
I've been instructed to issue VACUUM FULL on both nodes, nightly,
but before proceeding I read on docs that VACUUM FULL can disrupt
logical replication, so I'm a bit concerned on how to proceed. Rows
are cleared one a time (one transaction, one row, to keep errors to
the record that issued them)
Mind if you shared the specific doc ?
Of course I can't find it now with a quick search; I'll dig deeper. I
don't think I dreamed it :-).
PgEdge is based on the old pg_logical, the old 2ndQuadrant
extension, not the native logical replication we have since pgsql
10. But I might be mistaken.
Don't know about this, it keeps running on latest pg versions (we are
about to upgrade to 17.4, if I'm not wrong), but I'll ask
I read about extensions like pg_squeeze, but I wonder if they are still
not dangerous for replication.
What's pgEdge take on that, I mean the bytea thing you are trying to
achieve here.
They are positive, it's they that suggested to do VACUUM FULL on both
nodes... I'm quite new to replication, so I'm searching some advise here.
As I told you, pgEdge logical replication (old 2ndquadrant BDR) !=
native logical replication. You may look here: https://github.com/pgEdge/spock
If multi-master is not a must you could convert to vanilla postgresql
and focus on standard physical and logical replication.
No, multimaster is cool, but as I said, the project has been
discontinued and it's not a must anymore. This is the first step,
actually. We are planning to return to plain PostgreSQL, or CloudSQL for
PostgreSQL, using logical replication (which seems the more reliable of
the two). We created a test case for each option, and they seem to be OK
for now, even if I still have to do adequate stress tests. And when I do
the migration, I'd like to migrate plain data only and leave the blobs
where they are.
Thanks for your help.
Moreno.-
On 19/5/25 17:38, Moreno Andreo wrote:
On 19/05/25 14:41, Achilleas Mantzios wrote:
On 5/19/25 09:14, Moreno Andreo wrote:
On 16/05/25 21:33, Achilleas Mantzios wrote:
On 16/5/25 18:45, Moreno Andreo wrote:
Hi,
we are moving our old binary data approach, moving them from
bytea fields in a table to external storage (making database
smaller and related operations faster and smarter).
In short, we have a job that runs in background and copies data
from the table to an external file and then sets the bytea field
to NULL.
(UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id = <uuid>)
This results, at the end of the operations, to a table that's less
than one tenth in size.
We have a multi-tenant architecture (100s of schemas with
identical architecture, all inheriting from public) and we are
performing the task on one table per schema.
So? toasted data are kept on separate TOAST tables, unless those bytea
cols are selected, you won't even touch them. I cannot understand what
you are trying to achieve here.
Years ago, when I made the mistake to go for a coffee and let my
developers "improvise" , the result was a design similar to what
you are trying to achieve. Years after, I am seriously considering
moving those data back to PostgreSQL.
The "related operations" I was talking about are backups and
database maintenance when needed, cluster/replica management, etc.
With a smaller database size they would be easier in timing and
effort, right?
Ok, but you'll lose replica functionality for those blobs, which means
you don't care about them, correct me if I am wrong.
I'm not saying I don't care about them, the opposite, they are
protected with Object Versioning and soft deletion, this should assure
a good protection against e.g. ransomware, if someone manages to get
in there (and if this happens, we'll have bigger troubles than this).
PostgreSQL has become very popular because of people who care about their data.
We are mostly talking about costs, here. To give things their names,
I'm moving bytea contents (85% of total data) to files into Google
Cloud Storage buckets, that has a fraction of the cost of the disks
holding my database (on GCE, to be clear ).
May I ask the size of the bytea data (uncompressed) ?.
single records vary from 150k to 80 MB, the grand total is more than
8,5 TB in a circa 10 TB data footprint
This data is not accessed frequently (just by the owner when he
needs to do it), so no need to keep it on expensive hardware.
I've already read in these years that keeping many big bytea fields
in databases is not recommended, but might have misunderstood this.
Ok, I assume those are unimportant data, but let me ask, what is the
longevity or expected legitimacy of those ? I haven't worked with those
just reading: would you choose e.g. "Anywhere Cache storage"?
Absolutely not, this is *not* unimportant data, and we are using
Standard Storage, for 0,02$/GB/month + operations, that compared to a
0.17$/GB/month of an SSD or even more for the Hyperdisks we are using,
is a good price drop.
How about hosting your data in your own storage and spending $0/GB/month?
Another way would have been to move these tables to a different
tablespace, in cheaper storage, but it still would have been 3 times
the buckets cost.
can you actually mount those Cloud Storage Buckets under a supported FS
in linux and just move them to tablespaces backed by this storage ?
Never tried, I mounted this via FUSE and had some simple operations in
the past, but not sure it can handle database operations in terms of I/O
bandwidth
Why are you considering to get data back to database tables?
Because now if we need to migrate from cloud to on-premise, or just
upgrade or move the specific server which holds those data I will
have an extra headache. Also this is a single point of failure, or
best case a cause for fragmented technology introduced just for the
sake of keeping things out of the DB.
This is managed as an hierarchical disk structure, so the calling server
may be literally everywhere, it just needs an account (or a service
account) to get in there,
and you are locked into a proprietary solution, at their mercy of any
future increases in cost.
The problem is: this is generating BIG table bloat, as you may
imagine.
Running a VACUUM FULL on an ex-22GB table on a standalone test
server is almost immediate.
If I had only one server, I'll process a table a time, with a
nightly script, and issue a VACUUM FULL to tables that have
already been processed.
But I'm in a logical replication architecture (we are using a
multimaster system called pgEdge, but I don't think it will make
big difference, since it's based on logical replication), and I'm
building a test cluster.
So you use PgEdge , but you wanna lose all the benefits of multi-master ,
since your binary data won't be replicated ...
I don't think I need it to be replicated, since this data cannot be
"edited", so either it's there or it's been deleted. Buckets have
protections for data deletions or events like ransomware attacks and
such.
Also multi-master was an absolute requirement one year ago because
of a project we were building, but it has been abandoned and now a
simple logical replication would be enough, but let's do one thing a
time.
Multi-master is cool, you can configure your pooler / clients to take
advantage of this for full load balanced architecture, but if not a
strict requirement , you can live without it, as so many of us, and
employ other means of load balancing the reads.
That's what we are doing, it's a really cool feature, but I
experienced (maybe because it uses old pglogical extension) that the
replication is a bit fragile, especially when dealing with those bytea
fields (when I ingest big loads, say 25-30 GB or more), it happened to
break replication, and recreating a replica from scratch with "normal
size" tables is not a big deal, since it can be achieved
automatically, because they normally fit in shared memory and can be
transferred by the replicator, but you can imagine what would be the
effort and the downtime necessary to create a base backup, transfer it
to the replica, build the DB and restart a 10-TB database (ATM we are
running with a 2-node cluster).
Break this into batches; use modern techniques for robust data loading,
in smaller transactions if you have to.
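One way to cap transaction size during a big ingest is to batch rows by payload size rather than by row count; a sketch (thresholds and row shape are hypothetical):

```python
def batches(rows, max_bytes):
    """Group (row_id, size_bytes) pairs into batches whose total payload
    stays under max_bytes, so each batch can be loaded and committed in
    its own transaction instead of one huge 25-30 GB load."""
    batch, total = [], 0
    for row_id, size in rows:
        if batch and total + size > max_bytes:
            yield batch
            batch, total = [], 0
        batch.append(row_id)
        total += size
    if batch:
        yield batch
```

Smaller transactions also mean the logical-replication apply worker never has to replay one enormous change set in a single go.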
I've been instructed to issue VACUUM FULL on both nodes, nightly,
but before proceeding I read on docs that VACUUM FULL can disrupt
logical replication, so I'm a bit concerned on how to proceed.
Rows are cleared one a time (one transaction, one row, to keep
errors to the record that issued them)
Mind if you shared the specific doc ?
Obviously I can't find it from a quick search, I'll search deeper, I
don't think it went off a dream :-).
PgEdge is based on the old pg_logical, the old 2ndQuadrant extension,
not the native logical replication we have since pgsql 10. But I might
be mistaken.
Don't know about this, it keeps running on latest pg versions (we are
about to upgrade to 17.4, if I'm not wrong), but I'll ask
I read about extensions like pg_squeeze, but I wonder if they are still
not dangerous for replication.
What's pgEdge take on that, I mean the bytea thing you are trying to
achieve here.
They are positive, it's they that suggested to do VACUUM FULL on both
nodes... I'm quite new to replication, so I'm searching some advise here.
As I told you, pgEdge logical replication (old 2ndquadrant BDR) !=
native logical replication. You may look here: https://github.com/pgEdge/spock
If multi-master is not a must you could convert to vanilla postgresql
and focus on standard physical and logical replication.
No, multimaster is cool, but as I said, the project has been
discontinued and it's not a must anymore. This is the first step,
actually. We are planning to return to plain PostgreSQL, or CloudSQL
for PostgreSQL, using logical replication (that seems the most
reliable of the two). We created a test case for both the options, and
they seem to be OK for now, even if I have still to do adequate stress
tests. And when I'll do the migration, I'd like to be migrating plain
data only and leave blobs where they are.
As you wish. But this design has inherent data-infrastructure
fragmentation, as you understand.
Personally I like to let the DB take care of the data, and I take care
of the DB, not a plethora of extra systems that we need to keep
connected and consistent.
Thanks for your help.
Moreno.-
On 19/05/25 20:49, Achilleas Mantzios wrote:
On 19/5/25 17:38, Moreno Andreo wrote:
On 19/05/25 14:41, Achilleas Mantzios wrote:
On 5/19/25 09:14, Moreno Andreo wrote:
On 16/05/25 21:33, Achilleas Mantzios wrote:
On 16/5/25 18:45, Moreno Andreo wrote:
Hi,
we are moving our old binary data approach, moving them from
bytea fields in a table to external storage (making database
smaller and related operations faster and smarter).
In short, we have a job that runs in background and copies data
from the table to an external file and then sets the bytea field
to NULL.
(UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id = <uuid>)
This results, at the end of the operations, to a table that's
less than one tenth in size.
We have a multi-tenant architecture (100s of schemas with
identical architecture, all inheriting from public) and we are
performing the task on one table per schema.
So? toasted data are kept on separate TOAST tables, unless those bytea
cols are selected, you won't even touch them. I cannot understand what
you are trying to achieve here.
Years ago, when I made the mistake to go for a coffee and let my
developers "improvise" , the result was a design similar to what
you are trying to achieve. Years after, I am seriously considering
moving those data back to PostgreSQL.
The "related operations" I was talking about are backups and
database maintenance when needed, cluster/replica management, etc.
With a smaller database size they would be easier in timing and
effort, right?
Ok, but you'll lose replica functionality for those blobs, which means
you don't care about them, correct me if I am wrong.
I'm not saying I don't care about them, the opposite, they are
protected with Object Versioning and soft deletion, this should
assure a good protection against e.g. ransomware, if someone manages
to get in there (and if this happens, we'll have bigger troubles than
this).
PostgreSQL has become very popular because of ppl who care about their
data.
Yeah, it's always been famous for its robustness, and that's why I chose
PostgreSQL more than 10 years ago. In spite of how a "normal" user
treats his PC, we never had corruption (only where the FS/disk was
failing, but that's not PG's fault).
We are mostly talking about costs here. To give things their names, I'm
moving bytea contents (85% of total data) to files in Google Cloud
Storage buckets, which cost a fraction of the disks holding my database
(on GCE, to be clear).

May I ask the size of the bytea data (uncompressed)?

Single records vary from 150 KB to 80 MB; the grand total is more than
8.5 TB in a circa 10 TB data footprint.

This data is not accessed frequently (just by the owner when he needs
it), so there is no need to keep it on expensive hardware. I've also
read over the years that keeping many big bytea fields in databases is
not recommended, but I might have misunderstood this.

Ok, I assume those are unimportant data, but let me ask: what is the
longevity or expected legitimacy of those? I haven't worked with those,
just reading: would you choose e.g. "Anywhere Cache" storage?

Absolutely not, this is *not* unimportant data, and we are using
Standard Storage, at $0.02/GB/month plus operations, which, compared to
$0.17/GB/month for an SSD (or even more for the Hyperdisks we are
using), is a good price drop.

How about hosting your data in your own storage and spending $0/GB/month?
If we could host on our own hardware I'd not be here talking. Maybe we
would have a 10-node full-mesh multimaster architecture with barman
backup on 2 separate SANs.
But we are a small company that has to balance performance, consistency,
security and, last but not least, costs. And margins are tightening.

Another way would have been to move these tables to a different
tablespace on cheaper storage, but it still would have been 3 times the
buckets' cost.

Can you actually mount those Cloud Storage buckets under a supported FS
in Linux and just move them to tablespaces backed by that storage?

Never tried; I mounted one via FUSE for some simple operations in the
past, but I'm not sure it can handle database operations in terms of I/O
bandwidth.

Why are you considering getting the data back into database tables?
Because now if we need to migrate from cloud to on-premise, or just
upgrade or move the specific server which holds those data, I will have
an extra headache. Also this is a single point of failure, or at best a
cause of fragmented technology introduced just for the sake of keeping
things out of the DB.

This is managed as a hierarchical disk structure, so the calling server
may be literally anywhere; it just needs an account (or a service
account) to get in there.

And you are locked into a proprietary solution, and at their mercy for
any future increases in cost.

Since we cannot host on our own hardware, the only thing is to keep an
eye on costs and migrate (yeah, more work) when it's becoming expensive.
Every solution is proprietary if you want to run in the cloud. Even the
VMs where PostgreSQL is running.
The problem is: this is generating BIG table bloat, as you may
imagine.
Running a VACUUM FULL on an ex-22GB table on a standalone test
server is almost immediate.
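For picking candidates in a per-schema nightly pass, a hedged starting point is the dead-tuple counters in `pg_stat_user_tables` (the threshold below is purely illustrative):

```sql
-- Tables with the most reclaimable dead rows, largest first.
SELECT schemaname, relname, n_dead_tup,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
  FROM pg_stat_user_tables
 WHERE n_dead_tup > 100000          -- illustrative threshold
 ORDER BY n_dead_tup DESC;
```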
If I had only one server, I'd process one table at a time with a nightly
script, and issue a VACUUM FULL on tables that have already been
processed.

But I'm in a logical replication architecture (we are using a
multimaster system called pgEdge, but I don't think it makes a big
difference, since it's based on logical replication), and I'm building a
test cluster.

So you use pgEdge, but you want to lose all the benefits of
multi-master, since your binary data won't be replicated...

I don't think I need it to be replicated, since this data cannot be
"edited": either it's there or it's been deleted. Buckets have
protections against data deletion and events like ransomware attacks.
Also, multi-master was an absolute requirement one year ago because of a
project we were building, but that has been abandoned and now simple
logical replication would be enough; but let's do one thing at a time.

Multi-master is cool: you can configure your pooler/clients to take
advantage of it for a fully load-balanced architecture, but if it's not
a strict requirement you can live without it, as so many of us do, and
employ other means of load balancing the reads.

That's what we are doing; it's a really cool feature, but I've found
(maybe because it uses the old pglogical extension) that the replication
is a bit fragile, especially when dealing with those bytea fields: when
I ingest big loads, say 25-30 GB or more, it has happened to break
replication. Recreating a replica from scratch with "normal size" tables
is not a big deal, since it can be achieved automatically (they normally
fit in shared memory and can be transferred by the replicator), but you
can imagine the effort and downtime needed to create a base backup,
transfer it to the replica, build the DB and restart a 10 TB database
(ATM we are running a 2-node cluster).

Break this into batches; use modern techniques for robust data loading,
in smaller transactions if you have to.
Normally it's run via COPY commands; I can throttle COPY or break it
into batches. At the moment, while the schema is offline, we disconnect
replication from the bytea tables, feed them, wait for checkpoints to
return to normal, and then resume replication between tables before
putting the schema back online. This is safe, even if far from
optimized. It's a migration tool; it won't be used forever, just to move
customers from their current architecture to the new cloud one.
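As a sketch of the batching idea, assuming the load is pre-split into chunk files (the paths and names below are placeholders), each COPY can run in its own transaction so that a failure or replication hiccup is confined to one chunk:

```sql
-- One transaction per chunk: a failed chunk rolls back alone and can be retried.
BEGIN;
COPY tbl FROM '/staging/chunk_0001.csv' WITH (FORMAT csv);
COMMIT;

BEGIN;
COPY tbl FROM '/staging/chunk_0002.csv' WITH (FORMAT csv);
COMMIT;
```

(Server-side COPY FROM a file requires superuser or the `pg_read_server_files` role; psql's `\copy` is the client-side equivalent.)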
I've been instructed to issue VACUUM FULL on both nodes, nightly, but
before proceeding I read in the docs that VACUUM FULL can disrupt
logical replication, so I'm a bit concerned about how to proceed. Rows
are cleared one at a time (one transaction, one row, to confine errors
to the record that raised them).

Mind sharing the specific doc?

Obviously I can't find it with a quick search; I'll dig deeper, I don't
think it came out of a dream :-).

pgEdge is based on the old pglogical, the old 2ndQuadrant extension, not
the native logical replication we have had since PostgreSQL 10. But I
might be mistaken.

Don't know about this; it keeps running on the latest PG versions (we
are about to upgrade to 17.4, if I'm not wrong), but I'll ask.

I read about extensions like pg_squeeze, but I wonder if they too are
dangerous for replication.

What's pgEdge's take on that, I mean the bytea thing you are trying to
achieve here?

They are positive; it's they who suggested doing VACUUM FULL on both
nodes... I'm quite new to replication, so I'm looking for some advice
here.

As I told you, pgEdge logical replication (old 2ndQuadrant BDR) !=
native logical replication. You may look here:
https://github.com/pgEdge/spock

If multi-master is not a must, you could convert to vanilla PostgreSQL
and focus on standard physical and logical replication.

No, multi-master is cool, but as I said, the project has been
discontinued and it's not a must anymore. This is the first step,
actually. We are planning to return to plain PostgreSQL, or Cloud SQL
for PostgreSQL, using logical replication (which seems the more reliable
of the two). We created a test case for both options, and they seem to
be OK for now, even if I still have to do adequate stress tests. And
when I do the migration, I'd like to migrate plain data only and leave
the blobs where they are.

As you wish. But this design has inherent data-infrastructure
fragmentation, as you understand. Personally I like to let the DB take
care of the data, and I take care of the DB, not a plethora of extra
systems that we need to keep connected and consistent.
We followed this idea when the application (old version) was on customer
premises, so backups and operations were simple, and getting into
trouble (e.g. a customer deleting a directory from their PC) happened
very few times, just when they ran Disk Cleanup on Windows :-)

Now we host a full cloud solution, so we got rid of many potential
problems generated by the end user, but bumped into others, as you can
certainly imagine. We have to keep it consistent, fast and reliable,
while keeping an eye on costs.
You are right, but the more I work with this solution, the more I have
the impression of dealing with something heavy, hard to maintain because
of these rarely accessed files that make up most of my data. Maybe it's
just my impression, maybe I need some expertise in an area that's still
quite new for me.

At the moment this seems a good compromise between stability and costs.
Maybe in a year I'll be in your position (considering getting everything
back), but for now we are moving forward this way.
This being said, the original question :-)

Would VACUUM FULL be a risky operation? Does it have to be done on all
nodes, obviously at a low-traffic, low-access time (night)?
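For what it's worth, if the nightly VACUUM FULL route is taken, the statements for every tenant schema can be generated from the catalog and fed back to psql one statement at a time (`tbl` is a placeholder for the per-schema table being offloaded):

```sql
-- Emits one "VACUUM FULL schema.tbl;" per tenant schema.
SELECT format('VACUUM FULL %I.%I;', n.nspname, c.relname)
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE c.relname = 'tbl'     -- placeholder table name
   AND c.relkind = 'r';
-- e.g.: psql -Atc "<query above>" db | psql db
-- (autocommit runs each VACUUM FULL separately, one exclusive lock at a time)
```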
Thanks for your help.
Moreno.-
On 20/5/25 12:17, Moreno Andreo wrote:
At the moment that seems a good compromise between stability and
costs. Maybe in one year I'll be in your position (considering getting
everything back), but for now we are thinking forward in that way.
Makes perfect sense.
This been said, the original question :-)
Would be VACUUM FULL a risky operation? Has it to be done on all
nodes, obviously in a low-traffic and low-access timing (night)?
VACUUM affects the physical blocks. In a physical streaming replication
scenario that might (or might not) affect read-only queries on the hot
standby, depending on usage and settings. Normally I cannot see how a
VACUUM (plain or FULL) would interact with logical replication in any
way. But again, since you run something PgEdge-specific, you have to ask
them.
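On a native logical replication setup (pgEdge/Spock may expose different views), one hedged sanity check around a maintenance window like this is the replication-slot lag:

```sql
-- How far behind each logical slot's consumer is, in WAL bytes.
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
                                      confirmed_flush_lsn)) AS lag
  FROM pg_replication_slots
 WHERE slot_type = 'logical';
```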
On 20/05/25 12:58, Achilleas Mantzios wrote:
On 20/5/25 12:17, Moreno Andreo wrote:
This being said, the original question :-) Would VACUUM FULL be a risky
operation? Does it have to be done on all nodes, obviously at a
low-traffic time (night)?

VACUUM affects the physical blocks. In a physical streaming replication
scenario that might (or might not) affect read-only queries on the hot
standby, depending on usage and settings. Normally I cannot see how a
VACUUM (plain or FULL) would interact with logical replication in any
way. But again, since you run something PgEdge-specific, you have to ask
them.
Thanks. This makes me think I misread or misinterpreted something. They
had already suggested that I use VACUUM FULL on both nodes, but that
"thing" I read (or am convinced I read) made me think twice before
breaking everything. Two experts' concurring words are quite enough for
me.

I will start this evening and see what happens.

Thanks for the help and the very interesting discussion.
Thanks for your help.
Moreno.-