How to store "blobs" efficiently for small and large sizes, with random access
Hi. I'd like some advice on storing blobs (millions of them), ranging from very small to large (> 1GB).
I know about bytea versus lo, and I have probably read most of what's out there about them :)
Upfront, I have to state that I'm not keen on lo, because of security considerations. We store blobs in many different schemas, and users can access some schemas and not others. So the fact that the lo table is unique for the whole database would allow users to see blobs from any schema, as I understand it. Right? OTOH, lo has random access, which I also need...
(I'm also not a fan of lo needing triggers for lifetime management,
nor of the requirement to have an explicit transaction to use lo.)
Here are the main requirements I need to fulfil:
1) store literally millions of rows, 1 "blob" per row (scientific data).
2) most "blobs" are smallish, below a few KBs. bytea's perfect for that.
3) yet many blobs are in the dozens of MBs. bytea's still OK.
4) but some blobs exceed the 1GB bytea limit. We found at least a dozen just in our test data, so clients will have them.
5) for accessing larger blobs, the API I must implement accesses contiguous chunks of the blobs. Thus I need random access.
I'm porting the backend of that API from Oracle to PostgreSQL. In Oracle we used RAW for smaller values, and SecureFile blobs for larger ones.
Oracle LOBs are similar to lo, I guess, providing random access, with higher performance at large sizes but slower than bytea for smaller ones. And they are a PITA to deal with, with locators, breaking pre-fetching, etc...
PostgreSQL bytea is much better and simpler, except limited to 1GB... bytea also has no direct random access, except via substr[ing], but how efficient and "random access" is that?
For example, SQLite stores large values in overflow pages, and has to pointer-chase the chain of those pages for large blobs, so its substr() "random access" is NOT O(1) and involves way too much IO (i.e. O(N)).
Do TOAST'ed large values (text or bytea) fare better in terms of substr() efficiency, closer to O(logN), incurring only the IO for what's actually requested by the substr() (modulo the 2K TOAST chunks at the extremities of the range)?
Some posts on the internet also peg lo as faster than bytea above 20MB, for example. Is that even true?
One post I read (from Daniel Verite, I think) hinted that the lo table is not that different from TOAST ones, with 2K chunks, thus I don't see why lo would be any faster than a TOAST'ed bytea, for example.
Any insights someone with knowledge of the internals can share on this?
At this point, I have not yet redone the extensive benchmarking we did a few years ago between Oracle and PostgreSQL, which compared lo versus bytea then. And benchmarks are hard to do without easy access to different kinds of servers with different storage backends, or cloud-hosted PG. It's too easy to get biased by a local setup, leading to a design that would not perform optimally in a different one. That's why I'm asking a more open-ended question to the experts on this list.
Because of #4 above, I need to either use lo (but see above, notably the security concern), or roll my own TOAST^2 (squared), as I call it, where I manually "shard" / chunk large blobs into an auxiliary table, itself TOAST'ed of course, with some threshold for chunks (e.g. 1 or 4 or 16 MB perhaps).
The latter keeps the blobs in the schema (good for security), lifetime is managed by FKs (as usual), and the sharding limits the ill effects of "emulating" random access with substr() if necessary.
I've already done things like this in SQLite land (which also has a 1GB limit for its text and blob types).
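To make that TOAST^2 scheme concrete, here is a minimal sketch of the chunking arithmetic (Python, purely illustrative; the chunk size, function names, and table layout are my assumptions, not a tested design):

```python
# Illustrative sketch of the "TOAST^2" chunking described above:
# one row per chunk in an auxiliary table, fixed chunk size.

def split_into_chunks(blob: bytes, chunk_size: int):
    """Yield (chunk_no, chunk) pairs, one per row of the auxiliary table."""
    assert chunk_size > 0
    for chunk_no in range((len(blob) + chunk_size - 1) // chunk_size):
        start = chunk_no * chunk_size
        yield chunk_no, blob[start:start + chunk_size]

def chunks_for_range(offset: int, length: int, chunk_size: int):
    """First and last chunk_no covering the byte range [offset, offset+length)."""
    assert length > 0
    return offset // chunk_size, (offset + length - 1) // chunk_size
```

Writes insert one row per yielded chunk; a random-access read fetches only the rows chunks_for_range() names and trims the edges client-side.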
So is this a terrible idea? What alternatives do I have?
I'd really appreciate some expert advice on the above, before I go too
far down the rabbit hole.
Thanks, --DD
PS: Another drawback of lo is that, because it's a single table, it's still subject to the 32 TB limit on a relation. The 4TB limit per lo is way more than we need, but the 32 TB limit may actually be more of an issue for our larger clients, which have thousands of projects, each with upwards of a few million of those blobs. bytea values being stored in different schemas (per-project, a design constraint) makes that limit per-project, which will be more than enough. For the sum of all projects, maybe not... I.e. with a real client case of 3K projects, that leaves an average of only 10GB of lo's per project (i.e. schema), which could very well be problematic...
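Spelling out the arithmetic behind that last concern, as a quick sanity check (the 3K-project figure is from the real client case above):

```python
# The single pg_largeobject relation is capped at 32 TB, and is shared by
# the whole database, i.e. by all ~3,000 project schemas in that client case.
TB = 1024**4
GB = 1024**3

relation_limit = 32 * TB
projects = 3000

avg_gb_per_project = relation_limit / projects / GB
# roughly 10.9 GB of large objects per project, on average, before the
# shared relation hits its limit
```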
First advice: don't do it. We started off storing blobs in the DB for “TX safety”, but backup/restore quickly became too cumbersome, so we ended up moving all blobs out and only storing references in the DB. This required us to build a “vacuum system” that cleans up the blob storage regularly, as ROLLBACK/crash can make it go out of sync.
We chose storing as LO because with it, streaming large blobs (without using much memory) actually worked, with JDBC at least.
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andreas@visena.com
www.visena.com
On Wed, Oct 19, 2022 at 12:17 PM Andreas Joseph Krogh <andreas@visena.com>
wrote:
First advice, don't do it. We started off storing blobs in DB for “TX
safety”
Not really an option, I'm afraid.
, but backup/restore quickly became too cumbersome so we ended up moving
all blobs out and only store reference in DB.
This required us to make a “vacuum system” that cleans up the blob-storage
regularly as ROLLBACK/crash can make it out of sync.
Note that the data is spread across many mostly independent schemas.
Might that ease the backup/restore? I'm not much of a DBA though...
We chose storing as LO because with it, streaming large blobs (not using
much memory) actually worked, with JDBC at least.
I'm in C++, with (I believe) efficient use of binary binds and results, and use of COPY as much as possible, so as good as it gets, I think (that I know of, with libpq), in terms of performance. Haven't looked at libpq's pipeline mode, new in v14, yet though.
Investigated Cursor vs Statement too, and it's a tradeoff between latency and throughput.
Not sure how good JDBC drivers for PostgreSQL are. I'm doing my best to go fast with libpq.
In any case, thanks for your input. But it's not really a question of "if", but of "how".
Putting thousands of large blobs in the file system is a no-go. That assumes the clients can even see the file system the server sees. This is a 2-tier system; there's no mid-tier that would somehow magically handle proper security and lifetime management of these blobs.
Thanks, --DD
Ok, just something to think about: will your database grow beyond 10TB with blobs? If so, try to calculate how long it takes to restore, and comply with your SLA, and how long it would have taken to restore without the blobs.
PS: Our blobstore is not “the file system”, but SeaweedFS.
On Wed, Oct 19, 2022 at 1:00 PM Andreas Joseph Krogh <andreas@visena.com> wrote:
Ok, just something to think about;
Thank you. I do appreciate the feedback.
Will your database grow beyond 10TB with blobs?
The largest internal store I've seen (for the subset of data that goes
in the DB) is shy of 3TB.
But we are an ISV, not one of our clients, which have truly massive
scale for data.
And they don't share the exact scale of their proprietary data with me...
If so try to calculate how long it takes to restore, and comply with SLA,
and how long it would have taken to restore without the blobs.
Something I don't quite get is why backup is somehow no longer needed if the large blobs are external?
I.e. are you saying backups are so much worse in PostgreSQL than with the FS? I'm curious now.
Also, managing the PostgreSQL server will mostly be the client's own concern. We are not into SaaS here.
As hinted above, the truly massive data is already not in the DB; it is used by different systems, and processed down to the GB-sized inputs from which all the data put in the DB is generated. It's a scientific-data-heavy environment. And one where security of the data is paramount, for contractual and legal reasons. Files make that harder IMHO.
Anyways, this is straying from the main theme of this post I'm afraid.
Hopefully we can come back on the main one too. --DD
On Wednesday 19 October 2022 at 13:21:38, Dominique Devienne <ddevienne@gmail.com> wrote:
Something I don't quite get is why backup is somehow no longer needed if the large blobs are external?
I.e. are you saying backups are so much worse in PostgreSQL than with the FS? I'm curious now.
I'm not saying you don't need backup (or redundancy) of other systems holding blobs, but moving them out of the RDBMS lets you restore the DB to a consistent state, and be able to serve clients, faster. In my experience it's quite unlikely that your (redundant) blob store needs crash recovery at the same time your DB does. The same goes for PITR, needed because of some logical error (like a client deleting data they shouldn't have), which is much faster without blobs in the DB and doesn't affect the blobstore at all (if you have a smart insert/update/delete policy there).
There's a reason “everybody” advises moving blobs out of the DB, I've learned.
On Wed, Oct 19, 2022 at 1:38 PM Andreas Joseph Krogh <andreas@visena.com>
wrote:
There's a reason “everybody” advises moving blobs out of the DB, I've learned.
I get that. I really do. But the alternative has some real downsides too. Especially around security, as I already mentioned. That's why I'd like, if possible, to get input on the technical questions of my initial post.
That's not to say we wouldn't ultimately move the big blobs outside the DB. But given how much that would complicate the project, I do believe it is better to do it as a second step, once the full system is up and running and testing at scale has actually been performed.
We've already moved another kind of data to PostgreSQL, from SQLite DBs (thousands of them) that time, and ported "as-is" the sharding done on the SQLite side to PostgreSQL (despite TOAST). And so far, so good. With good ingestion rates. And decent runtime access to data too, in the albeit limited testing we've had so far.
Now we need to move this other kind of data, from proprietary DB-like files this time (thousands too), to finish our system, and be able to finally test the whole system in earnest, and at (our limited internal) scale.
So you see, I'm not completely ignoring your advice. But for now, I'm inquiring as to the *best* way to put that data *in* PostgreSQL, with the requirements / constraints I've listed in the first post.
It may indeed be a bad idea long term. But let's make the most of it for now.
Makes sense? Am I being unreasonable here? --DD
We had the same thought of storing the blobs inside LOs as well, many years ago, but ultimately chose cloud storage and stored a pointer in the database instead.
Now that we are approaching a terabyte of just normal data, I don't regret this decision one bit. Just handling backups and storage is already a chore. Data in S3-compatible storage is very easy to protect in numerous ways.
We have one set of code responsible for uploading, downloading, and deleting the files themselves.
One downside? Occasionally an S3 delete fails, and now and again a file or two gets orphaned. But we've never failed to find a file pointed to from our attachments table in 11 years.
We also only store pathnames/base names, so we can easily move storage providers if we decide to go on-prem.
There is absolutely no upside to storing files in the DB if you anticipate any kind of growth or significant volume.
Ericson Smith
CTO
Travel Agency Tribes
Dominique Devienne wrote:
the fact the lo table is unique for the whole database would allow
users to see blobs from any schema, as I understand it.
Direct access to pg_largeobject is only possible for superusers.
If lo_compat_privileges is on, any user can read any large object with the lo* functions.
If it's off, they can read a large object only if they're the owner, or if they have been granted permissions with:
GRANT { { SELECT | UPDATE } [, ...] | ALL [ PRIVILEGES ] }
    ON LARGE OBJECT loid [, ...]
    TO role_specification [, ...] [ WITH GRANT OPTION ]
    [ GRANTED BY role_specification ]
Each large object has its own set of permissions. This is a significant difference from bytea, since every creation of a new large object may need to be followed by GRANT statements.
Also, if the roles and the access policies are changed during the lifetime of the app, that might imply massive REVOKE/GRANT statements applied to existing objects.
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
On 2022-Oct-19, Dominique Devienne wrote:
Upfront, I have to state that I'm not keen on lo, because of security
considerations. We store blobs in many different schemas, and users
can access some schemas, and not others. So the fact the lo table is
unique for the whole database would allow users to see blobs from any
schema, as I understand it. Right? OTOH, lo has random access, which I
also need...
Generally speaking, bytea sucks for random access, because if a TOAST item is compressed, it always has to be read from the beginning in order to decompress correctly. However, if you set
ALTER TABLE tab ALTER COLUMN byt SET STORAGE EXTERNAL
then compression is not used, and random access becomes fast.
https://www.postgresql.org/docs/15/sql-altertable.html
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"No hay hombre que no aspire a la plenitud, es decir,
la suma de experiencias de que un hombre es capaz"
On Wed, Oct 19, 2022 at 3:01 PM Daniel Verite <daniel@manitou-mail.org> wrote:
Dominique Devienne wrote:
the fact the lo table is unique for the whole database would allow
users to see blobs from any schema, as I understand it.
Each large object has its own set of permissions. This is a significant
difference with bytea, since every creation of a new large object
may need to be followed by GRANT statements.
Also if the roles and the access policies are changed in the
lifetime of the app, that might imply massive REVOKE/GRANT
statements to apply to existing objects.
Thank you Daniel. Very interesting, and something I definitely didn't know.
I believe that's doable, given our design of ROLEs, but it would for sure be both a PITA and additional management / code to deal with. At least GRANTs are transactional, like the new LO oids themselves, I think, so now I know it would be possible to properly secure the LOs.
This insight is greatly appreciated. --DD
On Wed, Oct 19, 2022 at 3:05 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2022-Oct-19, Dominique Devienne wrote:
OTOH, lo has random access, which I also need...
Generally speaking, bytea sucks for random access, because if a TOAST
item is compressed, it has to be always read from the beginning in order
to decompress correctly. However, if you set
ALTER TABLE tab ALTER COLUMN byt SET STORAGE EXTERNAL
then compression is not used, and random access becomes fast.
Thank you Álvaro. Quite insightful as well. I was aware of TOAST compression, but didn't make the connection to the side effects on random access.
But now that TOAST has LZ4 support, which decompresses extremely fast compared to zlib (I have experience with LZ4 for a WebSocket-based server's messages), choosing an appropriately small shard/chunk size might mitigate that somewhat. Would need testing / benchmarking to compare uncompressed vs LZ4, at various chunk and subset/offset sizes, of course.
Does anybody have an answer to my question regarding how substr() works on bytea values? I.e. is it "pushed down" / optimized enough that it avoids reading the whole N-byte value, only to then pass it to substr(), which returns an M-byte value (where M < N)?
If TOAST stores 2,000-byte chunks, and those chunks' PKs are numbers, simple arithmetic should be able to select only the chunks of interest, thus incurring only the IO necessary for the selected range, no?
Or does the fact that subsetting a bytea currently requires substr() prevent such a scenario?
And if so, why not support a native subsetting notation that would support that scenario, like the obvious bytea_col[offset, count] or bytea_col[start:end]?
It seems to me efficient native subsetting of varlength values would be quite valuable.
Dominique Devienne wrote:
PostgreSQL bytea is much better and simpler, except limited to 1GB...
Bytea also has no direct random access, except via substr[ing], but
how efficient and "random access" is that?
Bytea contents are compressed before being sliced (into chunks of TOAST_MAX_CHUNK_SIZE bytes; typically that's 2000, IIRC), so it's not possible to access a piece of data without decompressing the contents before it.
By contrast, large objects are sliced before compression, so the performance of random access is likely to be completely different.
[snip]
In your case I would probably opt for bytea (as opposed to large objects), and slice the blobs in the application into chunks of a fixed size much larger than what TOAST uses (for instance, somewhere between 128 kB and 32 MB).
That is, having a table like:
create table blobs (
  blob_id some_type,
  chunk_no int, /* 0->N */
  chunk bytea
);
It's not as simple as using a single bytea field or large objects, but overall it avoids the management difficulties of both large objects and very large contents in bytea columns.
Random access is achieved by skipping the chunks before the requested piece of data.
If the app is able to read/write the chunks in binary mode, its queries should perform as well as the large object functions.
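A possible read path for that layout, sketched in Python: compute which rows cover the requested byte range, fetch only those, and trim both ends. The query text, the fixed chunk size, and the helper names here are illustrative assumptions on top of the table sketch above, not a tested implementation:

```python
# Random access over the chunked-bytea table: fetch only the covering rows.
CHUNK_SIZE = 1024 * 1024  # must match the fixed size used when writing chunks

FETCH_SQL = (
    "SELECT chunk_no, chunk FROM blobs"
    " WHERE blob_id = $1 AND chunk_no BETWEEN $2 AND $3"
    " ORDER BY chunk_no"
)

def fetch_params(blob_id, offset: int, length: int, chunk_size: int = CHUNK_SIZE):
    """Bind values for FETCH_SQL covering bytes [offset, offset+length)."""
    return blob_id, offset // chunk_size, (offset + length - 1) // chunk_size

def trim(rows, offset: int, length: int, chunk_size: int = CHUNK_SIZE) -> bytes:
    """rows: ordered (chunk_no, chunk) pairs returned by FETCH_SQL."""
    first_chunk_no = rows[0][0]
    data = b"".join(chunk for _, chunk in rows)
    skip = offset - first_chunk_no * chunk_size
    return data[skip:skip + length]
```

With binary binds/results (e.g. via libpq), only the covering chunks cross the wire, which is what makes this competitive with the lo_* read functions.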
On 10/19/22 04:47, Dominique Devienne wrote:
[snip]
PS: Another drawback of lo is that because it's a single table, it
still subject to the 32 TB limit on a relation.
The 4TB limit per lo is way more than we need, but the 32 TB limit
may actually be more of an issue for
our larger clients, which have thousands of projects, each with
upwards of a few millions of those blobs.
Partition the table on the first segment of the Primary Key. From
experience, anything else can lead to serious query degradation.
--
Angular momentum makes the world go 'round.
On 10/19/22 06:38, Andreas Joseph Krogh wrote:
[snip]
I'm not saying you don't need backup (or redundancy) of other systems holding blobs, but moving them out of the RDBMS makes you restore the DB to a consistent state, and able to serve clients, faster.
This is nothing to sneeze at. Backing up a 30TB database takes a *long* time.
We deal with an ISV maintaining a banking application. It stores scanned images of checks as bytea fields in a PostgreSQL 9.6 database. The next version will store the images outside of the database.
On Wed, 2022-10-19 at 12:48 +0200, Dominique Devienne wrote:
On Wed, Oct 19, 2022 at 12:17 PM Andreas Joseph Krogh <andreas@visena.com> wrote:
First advice, don't do it. We started off storing blobs in DB for “TX safety”
Not really an option, I'm afraid.
You should reconsider. Ruling out that option now might get you into trouble
later. Large Objects mean trouble.
Yours,
Laurenz Albe
On 2022-Oct-19, Dominique Devienne wrote:
Does anybody have an answer to my question regarding how substr() works on bytea values? I.e. is it "pushed down" / optimized enough that it avoids reading the whole N-byte value, only to then pass it to substr(), which returns an M-byte value (where M < N)?
If TOAST stores 2,000-byte chunks, and those chunks' PKs are numbers, simple arithmetic should be able to select only the chunks of interest, thus incurring only the IO necessary for the selected range, no?
That's exactly what I was trying to say. If there's no compression, we
don't read prior chunks. (This is valid for bytea, at least; for
textual types we have to worry about multibyte characters, which are
again a potential source of confusion regarding the exact location you
want to seek.)
This can be seen in detoast_attr_slice() in
src/backend/access/common/detoast.c, though there are way too many^W^W^W
multiple layers of indirection if you start from bytea_substr() in
varlena.c.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
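[Editor's note] To make the chunk arithmetic discussed above concrete, here is a small illustrative sketch (not PostgreSQL code). It assumes uncompressed out-of-line storage, where TOAST keeps a value as consecutive chunks keyed by sequence number; the 1996-byte chunk size used here is the common value on default 8 KB pages, but it is a build-time detail, not a guarantee:

```python
# Sketch: which TOAST chunk_seq values a byte-range read must touch,
# assuming an uncompressed out-of-line value stored in fixed-size chunks.
# CHUNK = 1996 is typical for 8 KB pages (an assumption, not a constant API).

CHUNK = 1996

def chunks_for_range(offset: int, length: int) -> range:
    """Return the chunk_seq values covering bytes [offset, offset+length)."""
    if length <= 0:
        return range(0, 0)
    first = offset // CHUNK
    last = (offset + length - 1) // CHUNK
    return range(first, last + 1)

# Reading 1 KB at a 100 MiB offset touches only 2 chunks,
# not the ~52,000 chunks that precede it.
wanted = chunks_for_range(100 * 1024 * 1024, 1024)
```

This is why, absent compression, substring() on bytea can skip prior chunks entirely: the chunk index is computable, so the IO is proportional to the slice, not to the whole value.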
On Wed, Oct 19, 2022 at 4:29 PM Daniel Verite <daniel@manitou-mail.org> wrote:
In your case I would probably opt for bytea (as opposed to large
objects), and slicing the blobs in the application in chunks
Thanks for the advice, and the valuable info on LO permissions. --DD
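[Editor's note] Daniel's suggestion of slicing blobs in the application could look like the following sketch. It models a side table of (seq, chunk) rows per blob, so no single bytea value approaches the 1 GB limit and a range read only needs the covering rows; the 8 MiB chunk size and the in-memory dict standing in for the table are illustrative assumptions, not recommendations:

```python
# Sketch: application-level chunking of a blob into fixed-size rows,
# with random-access reassembly of an arbitrary byte range.
# CHUNK_SIZE is an assumed tuning knob, not a PostgreSQL constant.

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per row (assumption)

def split_blob(data: bytes):
    """Yield (seq, chunk) pairs, as they would be INSERTed into a chunk table."""
    n_chunks = (len(data) + CHUNK_SIZE - 1) // CHUNK_SIZE
    for seq in range(n_chunks):
        yield seq, data[seq * CHUNK_SIZE:(seq + 1) * CHUNK_SIZE]

def read_range(rows: dict, offset: int, length: int) -> bytes:
    """Reassemble bytes [offset, offset+length) from stored chunk rows.
    A real backend would fetch only WHERE seq BETWEEN first AND last."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    buf = b"".join(rows[s] for s in range(first, last + 1))
    start = offset - first * CHUNK_SIZE
    return buf[start:start + length]
```

The appeal of this scheme is that both writes and reads become plain indexed bytea operations, at the cost of the application owning the chunking logic.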
On Wed, Oct 19, 2022 at 5:30 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
That's exactly what I was trying to say.
If there's no compression, we don't read prior chunks.
Great to read that. I'll probably try to benchmark w/ and w/o
compression eventually.
Need to deal with other issues first, will take a while to report back on this.
This can be seen in detoast_attr_slice()
Thank you for the pointer. I'll be sure to have a look at that code.
And thanks again for chiming in, with very useful info Alvaro.
On Wed, Oct 19, 2022 at 5:05 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Wed, 2022-10-19 at 12:48 +0200, Dominique Devienne wrote:
On Wed, Oct 19, 2022 at 12:17 PM Andreas Joseph Krogh <andreas@visena.com> wrote:
First advice, don't do it. We started off storing blobs in DB for “TX safety”
Not really an option, I'm afraid.
You should reconsider. Ruling out that option now might get you into trouble
later. Large Objects mean trouble.
Andreas, Ericson, Laurenz, thanks for the advice.
I'll be sure to discuss these concerns with the team.
We have other (bigger) data in the file system, albeit of a more
read-only nature, perhaps.
And I'm not familiar with how security is handled in that area, so
I'll investigate whether a path
forward to externalize the largish blobs (currently destined to live
in the DB) is possible.
So I hope you can see I'm not dismissing what you guys are saying.
But before I finish this thread for now, I'd like to add that I
consider it unfortunate that
NOT putting the data in the DB is the mostly agreed-upon advice. It
IMHO points to a weak point of
PostgreSQL, which does not invest in those use cases with large data,
perhaps with more file-system-like
techniques. Probably because most of the large users of
PostgreSQL are more on the "business"
side (numerous data, but of smaller sizes) than the "scientific"
side, which (too often) uses files and
files-in-a-file formats like HDF5.
FWIW, when Oracle introduced SecureFile blobs years ago in v11, it
represented a leap forward in
performance; back then we were seeing them being 3x faster than LO
at GB sizes, if I recall correctly,
with throughput that challenged regular networked file systems like
NFS. That was over 10 years ago,
so who knows where we are now. And from the posts here, the issues
with large blobs may be more
related to backup/restore, perhaps, than to runtime performance.
Having all the data in the DB, under a single security model, is a big
win for consistency and simplicity.
And the fact it's not really possible now is a pity, in my mind. My
(probably uninformed) opinion on this
is that large blobs are handled just like other relational data, in
paged storage designed for smaller data.
I.e. file-like blobs are shoehorned into structures which are
inappropriate for them, and a rethink
and redesign is necessary specifically for them, similar to Oracle's
SecureFile of old.
I have similar gripes with SQLite, which is otherwise a fantastic
embedded DB. Just see how the
SQLite-based Fossil-SCM fails to scale for very large repos with big
(e.g. game) assets, and how DB-backed storage
similarly failed to scale in SVN a long time ago, to be replaced by a
forest-of-files (which Git also uses).
DBs like PostgreSQL and SQLite should be better at this. And I hope
they get there eventually.
Sorry to turn a bit philosophical with this. It's not a critique per se.
More the personal musings of a
dev who's been in this space for a long time. FWIW. Thanks, --DD