Question about querying distributed databases
Hello
Consider this scenario:
- 3 servers, 3 databases, each on a separate server:
- *Products database*: Contains the *Products* table (with over 100,000 records).
- *Taxonomy database*: Contains the *Categories* and *ItemCategories (EAV)* tables.
- *Attributes database*: Contains the *Attributes* and *ItemAttributes (EAV)* tables.
How do you find products based on the following criteria?
1. A search in the title (e.g., "awesome shirts").
2. Selected categories (e.g., "casual" and "sports").
3. Selected attributes (e.g., "color: blue" and "size: large")
Regards
Saeed
On 3/4/25 20:40, me nefcanto wrote:
How do you find products based on the following criteria? [...]
https://www.postgresql.org/docs/current/postgres-fdw.html
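For reference, a minimal postgres_fdw setup looks roughly like this; the server, schema, and credential names below are hypothetical:

-- on the database that runs the product queries (all names hypothetical)
create extension if not exists postgres_fdw;

-- one foreign server per remote database
create server taxonomy_srv
  foreign data wrapper postgres_fdw
  options (host 'taxonomy-host', port '5432', dbname 'taxonomy');

create user mapping for current_user
  server taxonomy_srv
  options (user 'app_user', password 'app_password');

-- local schema that will hold the foreign table definitions
create schema if not exists taxonomy;

import foreign schema public
  limit to (categories, item_categories)
  from server taxonomy_srv
  into taxonomy;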
--
Adrian Klaver
adrian.klaver@aklaver.com
Adrian Klaver, thank you for the link. I asked the AI to create a query for
me using FDW.
This is the sample query:
with filtered_products as (
    select p.product_id
    from products.product p
    where p.title ilike '%search_term%'
), category_filtered as (
    select ic.product_id
    from taxonomy.item_categories ic
    where ic.category_id = any(array['category_id_1', 'category_id_2'])
), attribute_filtered as (
    select ia.product_id
    from attributes.item_attributes ia
    where ia.attribute_id = any(array['attribute_id_1', 'attribute_id_2'])
), final_products as (
    select f.product_id
    from filtered_products f
    join category_filtered c on f.product_id = c.product_id
    join attribute_filtered a on f.product_id = a.product_id
    order by f.product_id -- replace with a relevant sorting column
    limit 50 offset 0
)
select p.*
from products.product p
join final_products fp on p.product_id = fp.product_id;
The problem here is that it collects all of the product_id values from the
ItemCategories table. Let's say each product is put in one category only.
This means that we have 100 thousand records in the ItemCategories table.
Thus, to show a list of 20 products on the website, this query first
fetches 100 thousand product_id values from the remote server.
That's not scalable. Is there a workaround for this?
Thank you
Saeed
On Wed, 2025-03-05 at 10:12 +0330, me nefcanto wrote:
Adrian Klaver, thank you for the link. I asked the AI to create a query for me using FDW.
The problem here is that it collects all of the product_id values from the ItemCategories table [...]
That's not scalable. Is there a workaround for this?
Without having scrutinized the case in detail: if your data are organized in an
entity-attribute-value design and distributed across three databases, you cannot
expect to end up with efficient queries.
Perhaps you can extract the data and load them into a reasonably organized
single database. Such an ETL process might make the task much easier.
Yours,
Laurenz Albe
Laurenz Albe, thanks for your answer.
Right now this data is in MariaDB, on separate databases (schema) but on
one server. The solution in this situation is to have a cross-database
query. (This is the status quo of our application.)
Now our team has decided to migrate to Postgres. However, we realized that
Postgres does not support cross-database queries. And if we want to do so,
we should use FDW. So, we thought we might as well put databases on
separate servers for scalability if we have to write more code. That's the
reason behind this question.
But we're stuck at performance. In SQL Server and MariaDB, cross-database
queries allow for neat separation of data while delivering good performance
in the orchestration layer. You have separate databases, which allows for
fine-grained management (different backup schedules, index recalculation,
deployment, etc.) but at the same time you can write a query in your app,
or in an orchestrator database (let's call it All) that is fast enough for
millions of records.
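For example, on SQL Server a cross-database join on one server is just a matter of three-part names. A sketch, with illustrative table and column names:

-- SQL Server: one server, two databases, joined via three-part names
select p.*
from Products.dbo.Products as p
join Taxonomy.dbo.ItemCategories as ic
  on ic.ProductId = p.ProductId
where ic.CategoryId in (1, 2);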
However, we're stuck on this in Postgres. What solutions exist for this
problem?
Regards
Saeed
On Wed, 2025-03-05 at 12:57 +0330, me nefcanto wrote:
Right now this data is in MariaDB, on separate databases (schema) but on one
server. The solution in this situation is to have a cross-database query.
(This is the status quo of our application.)
Now our team has decided to migrate to Postgres. However, we realized that
Postgres does not support cross-database queries. And if we want to do so,
we should use FDW. So, we thought we might as well put databases on separate
servers for scalability if we have to write more code. That's the reason
behind this question.
In MySQL, the terms "database" and "schema" are used for the same thing.
Not so in PostgreSQL. I think you should migrate the data into different
schemas in a single database, pretty much like you had it in MySQL.
Then you don't need a foreign data wrapper, and I bet the query can
perform as well as it did on MySQL.
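A minimal sketch of that layout, with schema and column names assumed from this thread:

-- one database, one schema per bounded context: plain joins, no FDW
select p.*
from products.product p
join taxonomy.item_categories ic on ic.product_id = p.product_id
where p.title ilike '%awesome shirts%'
  and ic.category_id = any(array['category_id_1', 'category_id_2'])
limit 20;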
Yours,
Laurenz Albe
Dear Laurenz,
That means a solid monolith database. We lose many goodies with that. As a
real-world example, right now we can import a single database from the
production to the development to test and troubleshoot data.
What if we host all databases on the same server and use FDW? What happens
in that case? Does it return 100 thousand records and join in memory?
Because in SQL Server, when you perform a cross-database query (not
cross-server) the performance is extremely good, proving that it does not
return 100 thousand ItemId from Taxonomy.ItemCategories to join with
ProductId.
Is that the same case with Postgres too, if databases are located on one
server?
Regards
Saeed
On Wed, 2025-03-05 at 14:18 +0330, me nefcanto wrote:
That means a solid monolith database. We lose many goodies with that.
As a real-world example, right now we can import a single database
from the production to the development to test and troubleshoot data.
Well, can't you import a single schema then?
What if we host all databases on the same server and use FDW? What
happens in that case? Does it return 100 thousand records and join
in memory?
It will do just the same thing. The performance could be better
because of the reduced latency.
Because in SQL Server, when you perform a cross-database query
(not cross-server) the performance is extremely good, proving that
it does not return 100 thousand ItemId from Taxonomy.ItemCategories
to join with ProductId.
Is that the same case with Postgres too, if databases are located
on one server?
No, you cannot perform cross-database queries without a foreign
data wrapper. I don't see a reason why the statement shouldn't
perform as well as in SQL Server if you use schemas instead of
databases.
Yours,
Laurenz Albe
Dear Laurenz, the point is that I think if we put all databases into one
database, then we have blocked our growth in the future.
A monolith database can be scaled only vertically. We have had huge
headaches in the past with SQL Server on Windows and a single database.
But when you divide bounded contexts into different databases, then you
have the chance to deploy each database on a separate physical machine.
That means a lot in terms of performance. Please correct me if I am wrong.
Let's put this physical restriction on ourselves that we have different
databases. What options do we have? One option that comes to my mind is to
store the IDs of the categories in the Products table. This means that I
don't need FDW anymore. And databases can be on separate machines. I query
the categories database first, get the category IDs, and then add a where
clause to limit the product search. That could be an option. The array
data type in Postgres is something that I think other RDBMSs do not have.
Will that work? And how about attributes? Because attributes are more than
a single ID. I should store the attribute key alongside its value. It's a
key-value pair. What can I do for that?
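Something like this sketch is what I have in mind; the column names are illustrative, and using jsonb for the key-value attributes is just my assumption:

-- categories denormalized into Products as an array, attributes as jsonb;
-- GIN indexes make both searchable
alter table products
  add column category_ids bigint[] not null default '{}',
  add column attributes jsonb not null default '{}';

create index on products using gin (category_ids);
create index on products using gin (attributes);

-- products in any of the selected categories, colored blue and sized large
select *
from products
where category_ids && array[3, 7]::bigint[]
  and attributes @> '{"color": "blue", "size": "large"}';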
Thank you for sharing your time. I really appreciate it.
Saeed
On Wed, 2025-03-05 at 15:45 +0330, me nefcanto wrote:
Dear Laurenz, the point is that I think if we put all databases into one database,
then we have blocked our growth in the future.
Hard to say.
If you want to shard for horizontal scaling, that usually only works well
if there are few interconnections between the different shards. If you end
up joining data from different shards, you usually lose.
A monolith database can be scaled only vertically. We have had huge headaches
in the past with SQL Server on Windows and a single database.
Without knowing what the headaches were, it is hard to answer something here.
But when you divide bounded contexts into different databases, then you have
the chance to deploy each database on a separate physical machine. That means
a lot in terms of performance. Please correct me if I am wrong.
I don't know if you are wrong. But it seems like it is not working well,
is it? Perhaps you can explain how splitting up the data might result in
better performance. Is that just a guess or do you have a reason to think so?
Let's put this physical restriction on ourselves that we have different
databases. What options do we have?
I don't know. Based on what you showed: perhaps a complete re-design?
Yours,
Laurenz Albe
On Wed, Mar 5, 2025 at 7:15 AM me nefcanto <sn.1361@gmail.com> wrote:
I think if we put all databases into one database, then we have blocked
our growth in the future.
I think this is premature optimization. Your products table has 100,000
rows. That's very tiny for the year 2025. Try putting everything on one box
with good indexes and you might be surprised at the performance.
A monolith database can be scaled only vertically.
Postgres scales well vertically. Plus, you can have streaming replicas to
distribute the read queries (like the one given here) across many machines.
We have had huge headaches in the past with SQL Server on Windows and a
single database.
But when you divide bounded contexts into different databases, then you
have the chance to deploy each database on a separate physical machine.
That means a lot in terms of performance.
I get your concern, but if the data is inter-related, it really is best to
have them on the same server (and same database, and same schema). Then
Postgres can devise a really efficient plan. You can also use Citus to
start sharding things across multiple physical servers if your database
gets very large.
Let's put this physical restriction on ourselves that we have different
databases. What options do we have?
Your main option is FDW, which will never perform as well as a single
server. Plus, you have the additional headache of trying to coordinate data
updates atomically across different servers. The other option is to have
the application do the work, e.g. pull a list of things from one server,
use that to build a query against another one, etc. Definitely not ideal.
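A sketch of that two-step approach; the $1 placeholder stands for the id list the application carries between the two servers:

-- step 1, on the taxonomy server: collect the matching product ids
select product_id
from item_categories
where category_id in (1, 2);

-- step 2, on the products server: pass those ids back in as a parameter
select *
from products
where product_id = any($1)  -- $1 = the ids collected in step 1
  and title ilike '%awesome shirts%'
limit 20;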
Cheers,
Greg
--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support
On 3/5/25 04:15, me nefcanto wrote:
Dear Laurenz, the point is that I think if we put all databases into one
database, then we have blocked our growth in the future.
How?
A monolith database can be scaled only vertically. We have had huge
headaches in the past with SQL Server on Windows and a single database.
But when you divide bounded contexts into different databases, then you
have the chance to deploy each database on a separate physical machine.
That means a lot in terms of performance. Please correct me if I am wrong.
And you add the complexity of talking across machines, as well as
maintaining separate machines.
Let's put this physical restriction on ourselves that we have different
databases. What options do we have? One option that comes to my mind, is
to store the ID of the categories in the Products table. This means that
I don't need FDW anymore. And databases can be on separate machines. I
query the categories database first, get the category IDs, and
then add a where clause to limit the product search. That could be an
option. Array data type in Postgres is something that I think other
RDBMSs do not have. Will that work? And how about attributes? Because
attributes are more than a single ID. I should store the attribute key,
alongside its value. It's a key-value pair. What can I do for that?
You seem to be going out of the way to make your life more complicated.
The only way you are going to find an answer is set up test cases and
experiment. My bet is a single server with a single database and
multiple schemas is where you end up, after all that is where you are
starting from.
--
Adrian Klaver
adrian.klaver@aklaver.com
I once worked with a monolithic SQL Server database with more than 10
billion records and about 8 Terabytes of data. A single backup took us more
than 21 days. It was a nightmare. Almost everybody knows that scaling up
has a ceiling, but scaling out has no boundaries.
Therefore I will never choose a monolithic database design unless it's a
small project. But my examples are just examples. We predict 100 million
records per year. So we have to design accordingly. And it's not just sales
records. Many applications have requirements that are cheap data but vast
in multitude. Consider a language-learning app that wants to store the
known words of any learner. 10 thousand learners each knowing 2 thousand
words means 20 million records. Convert that to 100 thousand learners each
knowing 7 thousand words and now you almost have a billion records. Cheap,
but necessary. Let's not dive into telemetry or time-series data.
We initially chose to break the database into smaller databases, because it
seemed natural for our modularized monolith architecture. And it worked
great for SQL Server. If you're small, we host them all on one server. If
you get bigger, we can put heavy databases on separate machines.
However, I don't have experience working with other types of database
scaling. I have used table partitioning, but I have never used sharding.
Anyway, that's why I asked you guys. However, encouraging me to go back to
monolith without giving solutions on how to scale, is not helping. To be
honest, I'm somehow disappointed by how the most advanced open source
database does not support cross-database querying just like how SQL Server
does. But if it doesn't, it doesn't. Our team should either drop it as a
choice or find a way (by asking the experts who built it or use it) how to
design based on its features. That's why I'm asking.
One thing that comes to my mind is to use custom types. Instead of storing
data in ItemCategories and ItemAttributes, store them as arrays in the
relevant tables in the same database. But then it seems to me that in this
case, Mongo would become a better choice because I lose the relational
nature and normalization somehow. What drawbacks have you experienced in
that sense?
Regards
Saeed
Hi,
On Wed, Mar 5, 2025, 8:44 PM me nefcanto <sn.1361@gmail.com> wrote:
I once worked with a monolithic SQL Server database with more than 10
billion records and about 8 Terabytes of data. A single backup took us more
than 21 days. It was a nightmare. Almost everybody knows that scaling up
has a ceiling, but scaling out has no boundaries.
But then you did the backup incrementally, correct?
That should not take the same amount of time...
Therefore I will never choose a monolithic database design unless it's a
small project. But my examples are just examples. We predict 100 million
records per year. So we have to design accordingly. And it's not just sales
records. Many applications have requirements that are cheap data but vast
in multitude. Consider a language-learning app that wants to store the
known words of any learner. 10 thousand learners each knowing 2 thousand
words means 20 million records. Convert that to 100 thousand learners each
knowing 7 thousand words and now you almost have a billion records. Cheap,
but necessary. Let's not dive into telemetry or time-series data.
Can you try and see if 1 server with 3 different databases will do?
Having 1 table per database per server is too ugly.
Also please understand - every database is different, and so it works and
operates differently. What works well in one may not work well in another...
Thank you.
On Mar 5, 2025, at 8:03 PM, Igor Korot wrote:
[...] it worked great for SQL Server. If you're small, we host them all on one server. If you get bigger, we can put heavy databases on separate machines.
However, I don't have experience working with other types of database scaling. I have used table partitioning, but I have never used sharding.
Anyway, that's why I asked you guys. However, encouraging me to go back to monolith without giving solutions on how to scale, is not helping. To be honest, I'm somehow disappointed by how the most advanced open source database does not support cross-database querying just like how SQL Server does. But if it doesn't, it doesn't. Our team should either drop it as a choice or find a way (by asking the experts who built it or use it) how to design based on its features. That's why I'm asking.
Cross-database on MSSQL is identical to cross schema on postgres. If you truly need cross server support (versus say beefier hardware) how did you come to choose postgres? The numbers you present are impressive but not unheard of on this list.
On Wed, Mar 5, 2025 at 9:44 PM me nefcanto <sn.1361@gmail.com> wrote:
I once worked with a monolithic SQL Server database with more than 10
billion records and about 8 Terabytes of data. A single backup took us more
than 21 days. It was a nightmare.
25 years ago (meaning *much* slower hardware), I managed a 1TB database.
Backups took about 4 hours. Could have gotten it down to two hours if I'd
wanted to use more tape drives.
Right now, I manage a 5TB database. Backups take 110 minutes, and that's
when using one channel for all IO, writing to not the fastest NAS, and
other 3+TB databases backing up to it at the same time.
Almost everybody knows that scaling up has a ceiling
And that ceiling is much, much higher than you think it is.
, but scaling out has no boundaries.
Except for complexity and fragility. I bet I could get good scaled up
performance out of the amount of hardware you're using to scale out.
--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!
I appreciate your time, guys. Thank you very much.
Having 1 table per database per server is too ugly.
Our databases are not one table per database. They are mapped to DDD's
bounded contexts, usually with one table per domain entity.
For example, we have these databases:
- Contacts
- Courses
- Seo
- Payment
- Forms
- Geo
- Sales
- Media
- Taxonomy
- ...
These are the tables we have in the Contacts database:
- Addresses
- AddressTypes
- Attributes
- BankAccounts
- ContactContents
- Contacts
- Emails
- Genders
- JobTitles
- JuridicalPersons
- NaturalPersonRelations
- NaturalPersons
- Persons
- Phones
- PhoneTypes
- Relations
- RelationTypes
- SocialNetworks
- SocialProfiles
- Titles
And, these are the tables we have in the Geo database:
- AdministrativeDivisions
- AdministrativeDivisionTypes
- Cities
- CityDivisions
- Countries
- Locations
- SpatialDataItems
- TelephonePrefixes
- TimeZones
But we also do have databases that only have one table in them. The number
of tables is not our criterion for splitting them; the business semantics
is.
Cross-database on MSSQL is identical to the cross schema on Postgres.
Cross-database query in SQL Server is not equivalent to cross-schema
queries in Postgres. Because SQL Server also has the concept of schemas. In
other words, both SQL Server and Postgres let you create databases, create
schemas inside them, and create tables inside schemas. So SQL Server's
cross-schema query equals Postgres's cross-schema query.
If you truly need cross server support (versus say beefier hardware) how
did you come to choose postgres?
We chose Postgres for these reasons that we did R&D about:
- Native array per column support
- Not having multiple storage engines like MariaDB to be confused about
- Supporting expressions in unique constraints (see the sketch after this list)
- It's usually considered one of the best when it comes to performance,
especially in GIS, which we intend to build upon more
- As it claims on its website, it's the most advanced open-source
database engine (but to be honest, we saw many serious drawbacks to that
statement)
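As a sketch of the unique-constraint point above, with a hypothetical table:

-- unique on an expression: at most one account per email, case-insensitively
create table users (email text not null);
create unique index users_email_lower_key on users (lower(email));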
But here's the deal. We don't have one project only. We don't need
*cross-server queries* for all of our projects. But we tend to keep our
architecture the same across projects as much as we can. We chose Postgres
because we had experience with SQL Server and MariaDB and assumed that
cross-database query on the same server is something natural. Both of them
support that. And both are very performant on that. On MariaDB all you have
to do is to use `db_name.table_name` and on SQL Server all you have to do
is to use `database_name.schema_name.table_name`. So we thought, for
projects that do not need more than one server, we keep databases on the
same server. When a project needs more resources, we start by taking heavy
databases onto their own servers and implementing table partitioning on
them.
But we have experienced some amazing improvements too in our initial tests.
For example, creating all databases and tables and database objects on
MariaDB takes more than 400 seconds, while the same took 80 seconds on
Postgres. So amazing performance on DDL.
Also, bulk-inserting 1 million records takes almost one-sixth to
one-fourth of the time it takes on MariaDB. These are valuable numbers. They warmed
our hearts to keep digging as much as we can to see if we can perform this
migration.
Regards
Saeed
On Thu, 2025-03-06 at 06:13 +0330, me nefcanto wrote:
I once worked with a monolithic SQL Server database with more than 10 billion
records and about 8 Terabytes of data. A single backup took us more than 21 days.
It was a nightmare. Almost everybody knows that scaling up has a ceiling, but
scaling out has no boundaries.
I hear you, and I agree with that.
We initially chose to break the database into smaller databases, because it
seemed natural for our modularized monolith architecture. And it worked great
for SQL Server. If you're small, we host them all on one server. If you get
bigger, we can put heavy databases on separate machines.
So you mean that you had those databases on different servers?
How would a cross-database query work in that case? It must be something
akin to foreign data in PostgreSQL.
If that worked well, then it should also work well with PostgreSQL and
foreign data wrappers. Look at the execution plan you got on SQL Server
and see where PostgreSQL chooses a different plan. Then try to improve that.
We can try to help if we see actual plans.
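For example, a sketch using the foreign tables from the sample query; the "Remote SQL:" lines in the output show exactly what is shipped to each foreign server and whether the filters were pushed down:

explain (verbose, analyze)
select ic.product_id
from taxonomy.item_categories ic
where ic.category_id = any(array['category_id_1', 'category_id_2']);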
However, I don't have experience working with other types of database
scaling. I have used table partitioning, but I have never used sharding.
Well, if you split the data into several databases, that *was* sharding.
Anyway, that's why I asked you guys. However, encouraging me to go back to
monolith without giving solutions on how to scale, is not helping. To be
honest, I'm somehow disappointed by how the most advanced open source
database does not support cross-database querying just like how SQL Server
does. But if it doesn't, it doesn't. Our team should either drop it as a
choice or find a way (by asking the experts who built it or use it) how
to design based on its features. That's why I'm asking.
Excluding options from the start is limiting yourself. Consider using
other, better databases than PostgreSQL (if you can find them).
It is difficult to come up with a concrete design based on the information
you provided. Perhaps you should get a consultant; the mailing list does
not seem to be the right format for that request.
Typically, you split the data in a way that leaves few interconnections,
for example per customer, so that you don't regularly end up joining data
from different databases (shards).
One thing that comes to my mind is to use custom types. Instead of storing
data in ItemCategories and ItemAttributes, store them as arrays in the
relevant tables in the same database.
Don't ever store arrays in the database. It will be a nightmare.
You seem to be drawn to questionable data design...
Yours,
Laurenz Albe
On 3/5/25 11:55, Laurenz Albe wrote:
In MySQL, the terms "database" and "schema" are used for the same thing.
Not so in PostgreSQL. [...]
Same thing in MS SQL, as of version 2017 or so.
Dear Laurenz
I hear you, and I agree with that.
Thank you. Such a relief.
If that worked well, then it should also work well with PostgreSQL and
foreign data wrappers.
You're right. We had problems with cross-server queries on SQL Server and
MariaDB too. It seems that cross-server queries are not solved by any
engine. But we had no problem with cross-database queries. That's where it
worked well both on SQL Server and MariaDB. It seems that for
cross-database queries, Postgres returns the entire result set from the
other database to this database and then performs joins locally. It also
seems that for Postgres it makes no difference whether the foreign database
is on the same machine or on another machine. I say this only from reading
the queries and asking questions about them; I have not performed a test yet.
Well, if you split the data into several databases, that *was* sharding.
The way I understood it, sharding is when you split the database by rows,
not by tables. Examples choose a column like Tenant or User or Date as the
base of sharding. Never have I seen an example that stores Orders in one
database and Customers in another database and calls it sharding. I don't
know, but we might call it distributed databases.
Consider using other, better databases than PostgreSQL (if you can find
them).
That's the point here. If we can't come up with a good design on Postgres,
then we go back to MariaDB. That's why we're researching and testing. As I
mentioned above, Postgres is amazing at some points but lacks some simple
things that other engines expose out of the box.
Perhaps you should get a consultant; the mailing list does not seem to be
the right format for that request.
We have done that over the last decade. For SQL Server and then for
MariaDB. We have come up with some very practical and useful designs.
Separating CLOBs from main tables, storing UUID only as the name of files
to match the cloud storage, storing date-times as UTC, using bigint
everywhere even for small tables for consistency, denormalizing enum
storage (storing text instead of numeric value) even in large tables, etc.
etc.
But to choose a technology, we do have enough literacy and experience. It's
just some simple questions and answers. If I know that FDW works
differently for same-server databases, then I know that we will migrate.
Don't ever store arrays in the database. It will be a nightmare.
This is a very interesting claim. May I ask you to share its problems and
your experience?