Rearchitecting for storage

Started by Matthew Pounsett over 6 years ago · 19 messages · general
#1Matthew Pounsett
matt@conundrum.com

I've recently inherited a database that is dangerously close to outgrowing
the available storage on its existing hardware. I'm looking for (pointers
to) advice on scaling the storage in a financially constrained
not-for-profit.

The current size of the DB's data directory is just shy of 23TB. When I
received the machine it's on, it was configured with 18x3TB drives in
RAID10 (9x 2-drive mirrors striped together) for about 28TB of available
storage. As a short term measure I've reconfigured them into RAID50 (3x
6-drive RAID5 arrays). This is obviously a poor choice for performance,
but it'll get us through until we figure out what to do about
upgrading/replacing the hardware. The host is constrained to 24x3TB
drives, so we can't get much of an upgrade by just adding/replacing disks.
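A back-of-the-envelope sketch of the usable capacity under the two layouts (raw TB, ignoring filesystem overhead):

```shell
# 18 x 3TB drives in the two layouts described above
raid10_tb=$((9 * 3))      # RAID10: 9 two-drive mirrors striped -> 9 data drives
raid50_tb=$((3 * 5 * 3))  # RAID50: 3 x 6-drive RAID5 -> 5 data drives per array
echo "RAID10: ${raid10_tb} TB usable; RAID50: ${raid50_tb} TB usable"
```

The performance cost of the second layout comes from RAID5's read-modify-write penalty on small writes.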

One of my anticipated requirements for any replacement we design is that I
should be able to do upgrades of Postgres for up to five years without
needing major upgrades to the hardware. My understanding of the standard
upgrade process is that this requires that the data directory be smaller
than the free storage (so that there is room to hold two copies of the data
directory simultaneously). I haven't got detailed growth statistics yet,
but given that the DB has grown to 23TB in 5 years, I should assume that it
could double in the next five years, requiring on the order of 100TB of
available storage to be able to do upgrades.
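The arithmetic behind that estimate, as a sketch (the doubling rate is an assumption, not a measurement):

```shell
current_tb=23
projected_tb=$((current_tb * 2))   # assume the DB doubles again in five years -> 46 TB
upgrade_tb=$((projected_tb * 2))   # a copy-based upgrade holds two copies at once -> 92 TB
echo "projected data: ${projected_tb} TB; copy-based upgrade needs: ~${upgrade_tb} TB"
```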

This seems to be right on the cusp of what is possible to fit in a single
chassis with a RAID10 configuration (at least, with commodity hardware),
which means we're looking at a pretty high cost:performance ratio. I'd like
to see if we can find designs that get that ratio down a bit, or a lot, but
I'm a general sysadmin, and the detailed effects on those choices are
outside of my limited DBA experience.

Are there good documents out there on sizing hardware for this sort of
mid-range storage requirement, that is neither big data, nor "small data"
able to fit on a single host? I'm hoping for an overview of the tradeoffs
between single-head setups, dual-head setups with a JBOD array, or whatever else
is advisable to consider these days. Corrections of any poor assumptions
exposed above are also quite welcome. :)

Thanks in advance for any assistance!

#2Kenneth Marshall
ktm@rice.edu
In reply to: Matthew Pounsett (#1)
Re: Rearchitecting for storage

Hi Matt,

On Thu, Jul 18, 2019 at 09:44:04AM -0400, Matthew Pounsett wrote:

I've recently inherited a database that is dangerously close to outgrowing
the available storage on its existing hardware. I'm looking for (pointers
to) advice on scaling the storage in a financially constrained
not-for-profit.

Have you considered using VDO compression for tables that are less
update-intensive? Using just compression you can get almost a 4X size
reduction. For a database, I would forgo the deduplication function.
You can then use a non-compressed tablespace for the heavier-I/O tables
and indexes.

One of my anticipated requirements for any replacement we design is that I
should be able to do upgrades of Postgres for up to five years without
needing major upgrades to the hardware. My understanding of the standard
upgrade process is that this requires that the data directory be smaller
than the free storage (so that there is room to hold two copies of the data
directory simultaneously). I haven't got detailed growth statistics yet,
but given that the DB has grown to 23TB in 5 years, I should assume that it
could double in the next five years, requiring 100TB of available storage
to be able to do updates.

The link option with pg_upgrade does not require 2X the space, since it
uses hard links instead of copying the files to the new cluster.
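For illustration, a link-mode upgrade would look something like this (paths and version numbers here are hypothetical; adjust for your packaging):

```shell
# Old and new clusters must live on the same filesystem for hard links to work.
# --link hard-links the data files instead of copying them, so only the
# rebuilt catalogs take extra space.
pg_upgrade \
    -b /usr/lib/postgresql/10/bin  -B /usr/lib/postgresql/11/bin \
    -d /var/lib/postgresql/10/main -D /var/lib/postgresql/11/main \
    --link
```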

Regards,
Ken

#3Matthew Pounsett
matt@conundrum.com
In reply to: Kenneth Marshall (#2)
Re: Rearchitecting for storage

On Thu, 18 Jul 2019 at 13:34, Kenneth Marshall <ktm@rice.edu> wrote:

Hi Matt,

Hi! Thanks for your reply.

Have you considered using VDO compression for tables that are less
update-intensive? Using just compression you can get almost a 4X size
reduction. For a database, I would forgo the deduplication function.
You can then use a non-compressed tablespace for the heavier-I/O tables
and indexes.

VDO is a RedHat-only thing, isn't it? We're not running RHEL... Debian.
Anyway, the bulk of the data (nearly 80%) is in a single table and its
indexes. ~6TB to the table, and ~12TB to its indices. Even if we switched
over to RedHat, there's no value in compressing lesser-used tables.

My understanding of the standard
upgrade process is that this requires that the data directory be smaller
than the free storage (so that there is room to hold two copies of the data
directory simultaneously).

The link option with pg_upgrade does not require 2X the space, since it
uses hard links instead of copying the files to the new cluster.

That would likely keep the extra storage requirements small, but still
non-zero. Presumably the upgrade would be unnecessary if it could be done
without rewriting files. Is there any rule of thumb for making sure one
has enough space available for the upgrade? I suppose that would come
down to what exactly needs to get rewritten, in what order, etc., but the
pg_upgrade docs don't seem to have that detail. For example, since we've
got an ~18TB table (including its indices), if that needs to be rewritten
then we're still looking at requiring significant extra storage. Recent
experience suggests postgres won't necessarily do things in the most
storage-efficient way.. we just had a reindex on that database fail (in
--single-user) because 17TB was insufficient free storage for the db to
grow into.

#4Rob Sargent
robjsargent@gmail.com
In reply to: Matthew Pounsett (#3)
Re: Rearchitecting for storage

That would likely keep the extra storage requirements small, but still non-zero. Presumably the upgrade would be unnecessary if it could be done without rewriting files. Is there any rule of thumb for making sure one has enough space available for the upgrade? I suppose that would come down to what exactly needs to get rewritten, in what order, etc., but the pg_upgrade docs don't seem to have that detail. For example, since we've got an ~18TB table (including its indices), if that needs to be rewritten then we're still looking at requiring significant extra storage. Recent experience suggests postgres won't necessarily do things in the most storage-efficient way.. we just had a reindex on that database fail (in --single-user) because 17TB was insufficient free storage for the db to grow into.

Can you afford to drop and re-create those 6 indices?

#5Andy Colson
andy@squeakycode.net
In reply to: Matthew Pounsett (#1)
Re: Rearchitecting for storage

On 7/18/19 8:44 AM, Matthew Pounsett wrote:

I've recently inherited a database that is dangerously close to outgrowing the available storage on its existing hardware.  I'm looking for (pointers to) advice on scaling the storage in a financially constrained not-for-profit.

The current size of the DB's data directory is just shy of 23TB.  When I received the machine it's on, it was configured with 18x3TB drives in RAID10 (9x 2-drive mirrors striped together) for about 28TB of available storage.  As a short term measure I've reconfigured them into RAID50 (3x 6-drive RAID5 arrays).  This is obviously a poor choice for performance, but it'll get us through until we figure out what to do about upgrading/replacing the hardware.  The host is constrained to 24x3TB drives, so we can't get much of an upgrade by just adding/replacing disks.

One of my anticipated requirements for any replacement we design is that I should be able to do upgrades of Postgres for up to five years without needing major upgrades to the hardware.  My understanding of the standard upgrade process is that this requires that the data directory be smaller than the free storage (so that there is room to hold two copies of the data directory simultaneously).  I haven't got detailed growth statistics yet, but given that the DB has grown to 23TB in 5 years, I should assume that it could double in the next five years, requiring 100TB of available storage to be able to do updates.

This seems to be right on the cusp of what is possible to fit in a single chassis with a RAID10 configuration (at least, with commodity hardware), which means we're looking at a pretty high cost:performance ratio.  I'd like to see if we can find designs that get that ratio down a bit, or a lot, but I'm a general sysadmin, and the detailed effects on those choices are outside of my limited DBA experience.

Are there good documents out there on sizing hardware for this sort of mid-range storage requirement, that is neither big data, nor "small data" able to fit on a single host?   I'm hoping for an overview of the tradeoffs between single head, dual-head setups with a JBOD array, or whatever else is advisable to consider these days.  Corrections of any poor assumptions exposed above are also quite welcome. :)

Thanks in advance for any assistance!

Now might be a good time to consider splitting the database onto multiple computers. It might be simpler with a mid-range database like this, and then your plan for the future is just "add more computers".

-Andy

#6Luca Ferrari
fluca1978@gmail.com
In reply to: Matthew Pounsett (#3)
Re: Rearchitecting for storage

On Thu, Jul 18, 2019 at 10:09 PM Matthew Pounsett <matt@conundrum.com> wrote:

That would likely keep the extra storage requirements small, but still non-zero. Presumably the upgrade would be unnecessary if it could be done without rewriting files. Is there any rule of thumb for making sure one has enough space available for the upgrade? I suppose that would come down to what exactly needs to get rewritten, in what order, etc., but the pg_upgrade docs don't seem to have that detail. For example, since we've got an ~18TB table (including its indices), if that needs to be rewritten then we're still looking at requiring significant extra storage. Recent experience suggests postgres won't necessarily do things in the most storage-efficient way.. we just had a reindex on that database fail (in --single-user) because 17TB was insufficient free storage for the db to grow into.

This could be trivial, but is there any chance you can partition the table
and/or archive unused records (at least temporarily)? An 18 TB table
quite frankly sounds like a good candidate to contain records no one
will be interested in for the near future.
In any case, if you can partition the table, chances are you can at
least do per-table backups, which could simplify maintenance of the
database.
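A minimal sketch of what that partitioning could look like (table and column names are invented here; declarative partitioning needs PostgreSQL 10+):

```shell
psql -d testdb <<'SQL'
-- Hypothetical schema: range-partition historical results by date.
CREATE TABLE results (
    run_date date NOT NULL,
    payload  text
) PARTITION BY RANGE (run_date);

CREATE TABLE results_2018 PARTITION OF results
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');
CREATE TABLE results_2019 PARTITION OF results
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');
SQL
```

Individual partitions can then be backed up, detached, or dropped on their own.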

In order of desperation, I would also check the log files (I mean
textual logs, not WALs), because I have occasionally found them taking
up a few GBs on my disk, and they can easily be archived to gain some
extra space.
Then I would go for some commodity NAS to attach as extra storage, at
least for the upgrade process.

If all of the above fails, I would probably drop all the indexes
to gain extra space, perform the upgrade, and then reindex (removing
the old cluster, in case it has not been upgraded with the link
option).

Luca

#7Matthew Pounsett
matt@conundrum.com
In reply to: Rob Sargent (#4)
Re: Rearchitecting for storage

On Thu, 18 Jul 2019 at 19:53, Rob Sargent <robjsargent@gmail.com> wrote:

That would likely keep the extra storage requirements small, but still

non-zero. Presumably the upgrade would be unnecessary if it could be done
without rewriting files. Is there any rule of thumb for making sure one
has enough space available for the upgrade? I suppose that would come
down to what exactly needs to get rewritten, in what order, etc., but the
pg_upgrade docs don't seem to have that detail. For example, since we've
got an ~18TB table (including its indices), if that needs to be rewritten
then we're still looking at requiring significant extra storage. Recent
experience suggests postgres won't necessarily do things in the most
storage-efficient way.. we just had a reindex on that database fail (in
--single-user) because 17TB was insufficient free storage for the db to
grow into.

Can you afford to drop and re-create those 6 indices?

Technically, yes. I don't see any reason we'd be prevented from doing
that. But, rebuilding them will take a long time. That's a lot of
downtime to incur any time we update the DB. I'd prefer to avoid it if I
can. For scale, the recent 'reindex database' that failed ran for nine
days before it ran out of room, and that was in single-user. Trying to do
that concurrently would take a lot longer, I imagine.

#8Matthew Pounsett
matt@conundrum.com
In reply to: Andy Colson (#5)
Re: Rearchitecting for storage

On Thu, 18 Jul 2019 at 21:59, Andy Colson <andy@squeakycode.net> wrote:

Now might be a good time to consider splitting the database onto multiple
computers. Might be simpler with a mid-range database, then your plan for
the future is "add more computers".

Hmm... yes. Range partitioning seems like a possible way forward. I
hadn't considered that yet. We might hold that back for when a dual-head
approach (server + jbod) can't scale anymore, but I think that's a long way
in the future.

#9Matthew Pounsett
matt@conundrum.com
In reply to: Luca Ferrari (#6)
Re: Rearchitecting for storage

On Fri, 19 Jul 2019 at 04:21, Luca Ferrari <fluca1978@gmail.com> wrote:

This could be trivial, but is there any chance you can partition the table
and/or archive unused records (at least temporarily)? An 18 TB table
quite frankly sounds like a good candidate to contain records no one
will be interested in for the near future.

Partitioning is a possibility. The whole database is historical test
results, stored specifically for doing comparisons over time, so I'm not
sure we can actually archive anything. Expiring old test data is a
discussion we might have to have, eventually.

In any case, if you can partition the table chances are you can at
least do a per-table backup that could simplify maintanance of the
database.

My current backup plan for this database is on-site replication, and a
monthly pg_dump from the standby to be copied off-site. Doing per-table
backups sounds like a great way to end up with an inconsistent backup, but
perhaps I misunderstand what you mean.

Another possibility is putting the server into backup mode and taking a
snapshot of the filesystem, but coordinating that across chassis (in the
case where partitioning is used) in such a way that the db is consistent
sounds like a hard problem... unless issuing pg_start_backup on the chassis
holding the master table coordinates backup mode on all the chassis holding
child tables at the same time? I haven't read enough on that yet.

In desperate order, I would check also the log files (I mean, textual
logs, not wals) because occasionally I found them requiring a few GBs
on my disk, and that can be easily archived to gain some more extra
space.
Then I would go for some commodity NAS to attach as extra storage, at
least for the upgrade process.

Okay. So I guess the short answer is no, nobody really knows how to judge
how much space is required for an upgrade? :)

Any logs we have are going to be a rounding error when compared to the
database itself. And buying storage last-minute because an upgrade failed
is exactly the sort of thing that a resource constrained not-for-profit
can't do. We really need to be able to plan this out long term so that we
get as much as possible out of every dollar.

If any of the following fails, I would probably drop all the indexes
to gain extra space, perform the upgrade, and then reindex (removing
the old cluster, in the case it has not been upgraded with the link
option).

Yeah, this sort of trial-and-error approach to getting upgrades done will
bother me, but it seems like it'll be necessary once we start growing into
whatever new home we get for the db.

Thanks very much for your time on this.

#10Kenneth Marshall
ktm@rice.edu
In reply to: Matthew Pounsett (#9)
Re: Rearchitecting for storage

Hi Matt,

On Fri, Jul 19, 2019 at 10:41:31AM -0400, Matthew Pounsett wrote:

On Fri, 19 Jul 2019 at 04:21, Luca Ferrari <fluca1978@gmail.com> wrote:

This could be trivial, but is there any chance you can partition the table
and/or archive unused records (at least temporarily)? An 18 TB table
quite frankly sounds like a good candidate to contain records no one
will be interested in for the near future.

Partitioning is a possibility. The whole database is historical test
results, stored specifically for doing comparisons over time, so I'm not
sure we can actually archive anything. Expiring old test data is a
discussion we might have to have, eventually.

This is a case where using a compressed filesystem can give you space
savings as well as faster read performance due to the compression
factor. In my case, a sequential scan of something in the compressed
tablespace runs almost 3X faster than on the non-compressed one.

Regards,
Ken

#11Matthew Pounsett
matt@conundrum.com
In reply to: Matthew Pounsett (#1)
Re: Rearchitecting for storage

On Thu, 18 Jul 2019 at 09:44, Matthew Pounsett <matt@conundrum.com> wrote:

I've recently inherited a database that is dangerously close to outgrowing
the available storage on its existing hardware. I'm looking for (pointers
to) advice on scaling the storage in a financially constrained
not-for-profit.

Thanks for your replies, everyone. Here's a quick summary of what I've got
out of this.

Although nobody really addressed the core question of the performance
tradeoffs in different storage architectures, perhaps the fact that nobody
mentioned them means there really aren't any. We'll proceed on the
assumption that externally attached storage really doesn't make a
difference. NAS storage seems like a poor choice to me, for performance
reasons, and nobody's really said anything to disabuse me of that notion.

We're going to have a look at the relative costs of single-head solutions
as well as dual head (server + jbod) setups, and see what gets us the most
growth for the least cost. We'll plan for enough storage to get us five
years of growth, and just accept that maybe in the 5th year we won't be
able to do in-place upgrades without dropping indexes.

Horizontal scalability through range partitioning sounds interesting, but I
don't think it's a cost-effective solution for us right now. As long as
it's possible for the db to fit in a single server (or jbod) using
commodity hardware, the incremental cost of adding more chassis (and
therefore more motherboards, more CPUs, more memory) isn't offset by a
reduced cost anywhere else (e.g. using cheaper drives). And that's not
even accounting for the increased operational cost of coordinating the DB
across multiple servers. It could be a useful approach if DB growth
outpaces historical averages and we need to add hardware before a normal
replacement cycle. It could also be useful at the end of that replacement
cycle if DB growth has outpaced commodity hardware improvements, and single
server solutions are no longer viable.

The DB server we inherited is currently a single instance, but once we
expand and have replication in place I'll have to do some testing to see if
LVM compression gives us any performance boost (or incurs an acceptable
performance cost). The big question there is whether the processing
required to do the compression is faster than the difference in read times
on the disk... I think that might be dependent on the data and how it's
accessed. It certainly seems like it could give us some benefits, but I
don't think it's an experiment I want to attempt with only a single
production copy of the DB; the downtime required to rebuild the DB server
for A+B comparisons would be unacceptable.

Thanks again everyone. This has been educational.

#12Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Matthew Pounsett (#9)
Re: Rearchitecting for storage

On 2019-07-19 10:41:31 -0400, Matthew Pounsett wrote:

Okay.  So I guess the short answer is no, nobody really knows how to
judge how much space is required for an upgrade?  :)

As I understand it, a pg_upgrade --link uses only negligible extra
space. It duplicates a bit of housekeeping information, but not your
data tables or indexes. Your 18 TB table will definitely not be duplicated
during the upgrade if you can use --link.

hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>

#13Matthew Pounsett
matt@conundrum.com
In reply to: Peter J. Holzer (#12)
Re: Rearchitecting for storage

On Fri, 19 Jul 2019 at 11:25, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:

On 2019-07-19 10:41:31 -0400, Matthew Pounsett wrote:

Okay. So I guess the short answer is no, nobody really knows how to
judge how much space is required for an upgrade? :)

As I understand it, a pg_upgrade --link uses only negligible extra
space. It duplicates a bit of housekeeping information, but not your
data tables or indexes. Your 18 TB table will definitely not be duplicated
during the upgrade if you can use --link.

The documentation for pg_upgrade --link says that the old copy is no longer
usable, which means it's modifying files that are linked. If it were only
modifying small housekeeping files, then it would be most efficient not to
link those, which would keep both copies of the db usable. That seems
incompatible with your suggestion that it doesn't need to modify the data
files. Depending on how it goes about doing that, it could mean a
significant short-term increase in storage requirements while the data is
being converted.

Going back to our recent 'reindex database' attempt, pgsql does not
necessarily do these things in the most storage-efficient manner; it seems
entirely likely that it would choose to use links to duplicate the data
directory, then create copies of each data file as it converts them over,
then link that back to the original for an atomic replacement. That could
eat up a HUGE amount of storage during the conversion process without the
start and end sizes being very different at all.

Sorry, but I can't reconcile your use of "as I understand it" with your use
of "definitely". It sounds like you're guessing, rather than speaking from
direct knowledge of the internals of pg_upgrade.

#14Jacob
In reply to: Matthew Pounsett (#7)
Re: Rearchitecting for storage

Matthew Pounsett <matt@conundrum.com> writes:

On Thu, 18 Jul 2019 at 19:53, Rob Sargent <robjsargent@gmail.com> wrote:

Can you afford to drop and re-create those 6 indices?

Technically, yes. I don't see any reason we'd be prevented from doing that. But, rebuilding them will take a long time. That's a lot of downtime to incur any time we update
the DB. I'd prefer to avoid it if I can. For scale, the recent 'reindex database' that failed ran for nine days before it ran out of room, and that was in single-user. Trying to do
that concurrently would take a lot longer, I imagine.

This may be a stupid question, but are you certain they are all used? It
wouldn't be the first time that I've seen someone create indexes and
then never use them. This script can tell you whether there are any indexes
that seem largely unused.

https://github.com/pgexperts/pgx_scripts/blob/master/indexes/unused_indexes.sql

If you can run your application without access to the indexes for a
while, you can create them concurrently in the background using "CREATE
INDEX CONCURRENTLY ...".
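As a sketch (the index and table names here are made up), the drop-and-recreate cycle would be:

```shell
psql -d testdb -c 'DROP INDEX IF EXISTS results_run_date_idx;'
# CONCURRENTLY avoids blocking writes, at the cost of a slower build;
# it cannot run inside a transaction block.
psql -d testdb -c 'CREATE INDEX CONCURRENTLY results_run_date_idx ON results (run_date);'
```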

Best regards,
Jacob

#15Jacob
In reply to: Matthew Pounsett (#3)
Re: Rearchitecting for storage

Matthew Pounsett <matt@conundrum.com> writes:

[...] Is there any rule of thumb for making sure one has enough space
available for the upgrade?

No, because it depends greatly on which version you are upgrading from
and which version you are upgrading to etc.

Perhaps you could carve out a slice of data, e.g. 1 GB and load it into
a test database and try to upgrade that. That would probably give you an
idea.

Also, you mentioned that your database contains historical test data¹,
so I would guess that one of the indexes is on timestamps? Maybe
you could live with a smaller BRIN index for the timestamps:
https://www.postgresql.org/docs/11/brin-intro.html - that could
potentially save some space, and may not have been something on the
radar when the database was first developed.
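A BRIN index stores one small summary per range of table blocks, so on data that is naturally ordered by time it can be orders of magnitude smaller than the equivalent btree. A sketch, with hypothetical names:

```shell
psql -d testdb -c 'CREATE INDEX results_run_date_brin ON results USING brin (run_date);'
```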

Best regards,
Jacob

¹) I think I know which kind of data based on your progress reports on
a DNS related list I'm subscribed to.

#16Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Matthew Pounsett (#13)
Re: Rearchitecting for storage

On 2019-07-19 11:37:52 -0400, Matthew Pounsett wrote:

On Fri, 19 Jul 2019 at 11:25, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:

On 2019-07-19 10:41:31 -0400, Matthew Pounsett wrote:

Okay.  So I guess the short answer is no, nobody really knows how to
judge how much space is required for an upgrade?  :)

As I understand it, a pg_upgrade --link uses only negligible extra
space. It duplicates a bit of housekeeping information, but not your
data tables or indexes. Your 18 TB table will definitely not be duplicated
during the upgrade if you can use --link.

The documentation for pg_upgrade --link says that the old copy is no longer
usable, which means it's modifying files that are linked.  If it were only
modifying small housekeeping files, then it would be most efficient not to link
those, which would keep both copies of the db usable.

This was discussed recently: The old database is made intentionally
unusable to prevent accidentally starting both (which would result in
data corruption).

That seems incompatible with your suggestion that it doesn't need to
modify the data files.  Depending on how it goes about doing that, it
could mean a significant short-term increase in storage requirements
while the data is being converted.  

Going back to our recent 'reindex database' attempt, pgsql does not
necessarily do these things in the most storage-efficient manner; it
seems entirely likely that it would choose to use links to duplicate
the data directory, then create copies of each data file as it
converts them over, then link that back to the original for an atomic
replacement.  That could eat up a HUGE amount of storage during the
conversion process without the start and end sizes being very
different at all.  

I can't really think of a scenario in which this would be the best
(or even a good) strategy to convert the database. I am quite confident
that pg_upgrade doesn't do that at present and reasonably confident that
it won't do it in the future.

Sorry, but I can't reconcile your use of "as I understand it" with
your use of "definitely".  It sounds like you're guessing, rather than
speaking from direct knowledge of the internals of pg_upgrade.

I don't have direct knowledge of the internals of pg_upgrade, but I
have upgraded a database of about 1 TB at least twice with --link. Since
I had much less than 1 TB of free space and the upgrade completed very
quickly, I am very confident that no user defined tables are copied. I
have also been on this mailing list for a few years and read quite a few
discussions about the usage of pg_upgrade in that time (though I may not
always have paid much attention to them).

hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>

#17Stephen Frost
sfrost@snowman.net
In reply to: Matthew Pounsett (#13)
Re: Rearchitecting for storage

Greetings,

* Matthew Pounsett (matt@conundrum.com) wrote:

On Fri, 19 Jul 2019 at 11:25, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:

On 2019-07-19 10:41:31 -0400, Matthew Pounsett wrote:

Okay. So I guess the short answer is no, nobody really knows how to
judge how much space is required for an upgrade? :)

As I understand it, a pg_upgrade --link uses only negligible extra
space. It duplicates a bit of housekeeping information, but not your
data tables or indexes. Your 18 TB table will definitely not be duplicated
during the upgrade if you can use --link.

The documentation for pg_upgrade --link says that the old copy is no longer
usable,

That's not entirely true- the old copy is only no longer usable *after*
you've started the new version of the DB against those data files. If
you haven't started the new major version of PG yet, then you can go
back to using the old version against those files.

which means it's modifying files that are linked.

No, it doesn't.

If it were only
modifying small housekeeping files, then it would be most efficient not to
link those, which would keep both copies of the db usable.

The catalog tables *aren't* linked. Both copies of the DB are usable-
but only until you start the DB against one of the versions. Once
you've started either the old version or the new version, you can't
switch. If you started the old version, then you could do another
pg_upgrade, of course, but you can't use the new version as there will
have been changes made to the catalog tables and control file (which
aren't linked) that would have to be accounted for in the new version's
catalog by pg_upgrade.

That seems
incompatible with your suggestion that it doesn't need to modify the data
files. Depending on how it goes about doing that, it could mean a
significant short-term increase in storage requirements while the data is
being converted.

No, that's not the case- link mode doesn't copy the data files, it just
rebuilds the catalog tables and fixes up things in the new database
cluster (clog, wal, et al, not the user data tables/indexes).

Going back to our recent 'reindex database' attempt, pgsql does not
necessarily do these things in the most storage-efficient manner; it seems
entirely likely that it would choose to use links to duplicate the data
directory, then create copies of each data file as it converts them over,
then link that back to the original for an atomic replacement. That could
eat up a HUGE amount of storage during the conversion process without the
start and end sizes being very different at all.

No, that isn't how pg_upgrade works.

Thanks,

Stephen

#18Luca Ferrari
fluca1978@gmail.com
In reply to: Matthew Pounsett (#9)
Re: Rearchitecting for storage

On Fri, Jul 19, 2019 at 4:41 PM Matthew Pounsett <matt@conundrum.com> wrote:

My current backup plan for this database is on-site replication, and a monthly pg_dump from the standby to be copied off-site. Doing per-table backups sounds like a great way to end up with an inconsistent backup, but perhaps I misunderstand what you mean.

Well, my idea was that, as I supposed and you confirmed, the
database also holds a lot of historical data that will not be updated in
the future. Therefore you could partition the table, back up the
historical partitions and, after having verified the backup, drop them.
This would leave you with a "partially" online system (partially because
it would have, say, only the last 10 years rather than the last 20), and
you would free up space for upgrading. After the upgrade is completed,
you can restore the oldest data and come back online with the full
dataset. In this scenario the backup is not inconsistent, since the old
data is supposed to stay frozen. If that is not true, my idea is
completely wrong.
I know this is the desperate-poor-man approach, but I quite frankly
have no other ideas if you are constrained on space, money, and time
(because the longer you wait, the harder the upgrade becomes, in my
opinion).
I would also look _now_ at a possible refactoring of the database in
order to gain, if possible, some extra space. I mean, sorry to be
harsh, but a database with one huge table has not been designed
efficiently, so chances are some columns can be shrunk (to the correct
data type, perhaps?) and that could give you some extra space.
But without having a better understanding of the scenario and the
context, I think I cannot help very much.

Sorry,
Luca

#19Luca Ferrari
fluca1978@gmail.com
In reply to: Matthew Pounsett (#3)
Re: Rearchitecting for storage

On Thu, Jul 18, 2019 at 10:09 PM Matthew Pounsett <matt@conundrum.com> wrote:

That would likely keep the extra storage requirements small, but still non-zero. Presumably the upgrade would be unnecessary if it could be done without rewriting files. Is there any rule of thumb for making sure one has enough space available for the upgrade? I suppose that would come down to what exactly needs to get rewritten, in what order, etc., but the pg_upgrade docs don't seem to have that detail. For example, since we've got an ~18TB table (including its indices), if that needs to be rewritten then we're still looking at requiring significant extra storage. Recent experience suggests postgres won't necessarily do things in the most storage-efficient way.. we just had a reindex on that database fail (in --single-user) because 17TB was insufficient free storage for the db to grow into.

I've done a test on a virtual machine of mine, with the following
three databases: one of 4.9 GB, one of 500 MB, one of 50 MB. I know
this is not even close to your environment; however, upgrading with
pg_upgrade from 10.9 to 11.4 _without_ the link option asked for about
85% extra space.

On a machine with a single database of 8.9 GB and a disk occupation,
as reported by df, of 64% (meaning 36% available) I was able to upgrade
from 10.9 to 11.4 without the link option. Space occupation increased
by 90%.
Using the link option on the same cluster required 1.1% extra space
(around 100 MB).
Of course, these are poor-man results, but they give you an idea of the
space required by pg_upgrade (which seems to be less than 100%, i.e. less than 2x).

Hope this helps.
Luca