Clustered index to preserve data locality in a multitenant application?

Started by Nicolas Grilly · over 9 years ago · 20 messages · general
#1 Nicolas Grilly
nicolas@vocationcity.com

Hello,

We are developing a multitenant application which is currently based on
MySQL, but we're thinking of migrating to PostgreSQL.

We rely on clustered indexes to preserve data locality for each tenant.
Primary keys start with the tenant ID. This way, rows belonging to the same
tenant are stored next to each other. Because all requests hit only one
tenant, this is a great performance improvement.

PostgreSQL doesn't have clustered indexes — I'm aware of the CLUSTER
command but it's a one-time operation — and I'm wondering if this can be a
problem or not.

Let's say we have a table containing data for 10,000 tenants and 10,000
rows per tenant, for a total of 100,000,000 rows. Let's say each 8 KB block
contains ~10 rows. Let's say we want to compute the sum of an integer
column for all rows belonging to a given tenant ID.

In MySQL/InnoDB, rows are stored in the leaf nodes of a B-tree. To compute
the sum, MySQL has to read at least 1,000 blocks (each block containing ~10
rows). I deliberately neglect the cost of walking the B-tree intermediate
nodes.

By comparison, PostgreSQL has to read at least 10,000 blocks (each block
containing ~10 rows, but most of the time, only one row will match the
tenant ID, other rows belonging to other tenants).
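For concreteness, here is the kind of schema and query I have in mind (all names are hypothetical):

```sql
-- Hypothetical schema matching the scenario above: the primary key
-- starts with the tenant ID, which is what makes InnoDB cluster each
-- tenant's rows together.
CREATE TABLE tenant_data (
    tenant_id integer NOT NULL,
    row_id    bigint  NOT NULL,
    amount    integer NOT NULL,
    PRIMARY KEY (tenant_id, row_id)
);

-- The per-tenant aggregate whose I/O cost is compared above:
-- roughly 1,000 block reads with clustering, 10,000 without.
SELECT sum(amount) FROM tenant_data WHERE tenant_id = 1234;
```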

A few questions:

- Am I missing something?
- Am I overestimating the benefit of a clustered index in our case, and the
cost of not having one in PostgreSQL?
- Is there another technical solution to this problem?

Thanks,

*Nicolas Grilly*
*Garden / Vocation City* - Web Recruitment Software
Managing Partner
+33 6 03 00 25 34
www.vocationcity.com

#2 Vick Khera
vivek@khera.org
In reply to: Nicolas Grilly (#1)
Re: Clustered index to preserve data locality in a multitenant application?

On Tue, Aug 30, 2016 at 7:10 AM, Nicolas Grilly
<nicolas@vocationcity.com> wrote:

Let's say we have a table containing data for 10,000 tenants and 10,000 rows
per tenant, for a total of 100,000,000 rows. Let's say each 8 KB block
contains ~10 rows. Let's say we want to compute the sum of an integer column
for all rows belonging to a given tenant ID.

I'll assume you have an index on the tenant ID. In that case, your
queries will be pretty fast.

On some instances, we have multi-column indexes starting with the
tenant ID, and those are used very effectively as well.

I never worry about data locality.

Depending on your data distribution, you may want to consider table
partitions based on the tenant id. I personally never bother with
that, but split based on some other key in the data.
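As a sketch, with hypothetical table and column names:

```sql
-- Hypothetical multi-column index leading with the tenant ID; a query
-- filtered on tenant_id (optionally plus a range on created_at) scans
-- one contiguous subtree of the index.
CREATE INDEX tenant_data_tenant_created_idx
    ON tenant_data (tenant_id, created_at);
```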

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3 Nicolas Grilly
nicolas@gardentechno.com
In reply to: Vick Khera (#2)
Re: Clustered index to preserve data locality in a multitenant application?

On Tue, Aug 30, 2016 at 7:26 PM, Vick Khera <vivek@khera.org> wrote:

I'll assume you have an index on the tenant ID. In that case, your
queries will be pretty fast.

On some instances, we have multi-column indexes starting with the
tenant ID, and those are used very effectively as well.

I never worry about data locality.

Yes, we have an index starting with the tenant ID, and the query uses the
index.

But I'm still worried about PostgreSQL having to fetch 10 times more pages
from the disk than MySQL. If each 8K page contains approximately 10 rows,
fetching one page in MySQL will return 10 "useful" rows belonging to the
tenant. By comparison, fetching one page in PostgreSQL will probably return
only 1 "useful" row belonging to the tenant. In terms of IO, it's a big
difference.

Depending on your data distribution, you may want to consider table
partitions based on the tenant id. I personally never bother with
that, but split based on some other key in the data.

You're right. I don't really need a clustered index (like InnoDB). What I
need is to store rows belonging to the same tenant close to each other.
Partitioning can help with that. But the lack of declarative partitioning
makes it cumbersome (I've read this is being worked on).
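For reference, the pre-declarative style looks something like this (a sketch with hypothetical names, in the PostgreSQL 9.6 era):

```sql
-- Sketch of inheritance-based partitioning by tenant range
-- (hypothetical names; this is the cumbersome pre-declarative style).
CREATE TABLE tenant_data_0001 (
    CHECK (tenant_id BETWEEN 1 AND 1000)
) INHERITS (tenant_data);
-- ...one child table per range, plus an INSERT-routing trigger on the
-- parent, with constraint_exclusion letting scans skip other children.
```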

#4 Kenneth Marshall
ktm@rice.edu
In reply to: Nicolas Grilly (#3)
Re: Clustered index to preserve data locality in a multitenant application?

On Tue, Aug 30, 2016 at 07:51:58PM +0200, Nicolas Grilly wrote:

On Tue, Aug 30, 2016 at 7:26 PM, Vick Khera <vivek@khera.org> wrote:

I'll assume you have an index on the tenant ID. In that case, your
queries will be pretty fast.

On some instances, we have multi-column indexes starting with the
tenant ID, and those are used very effectively as well.

I never worry about data locality.

Yes, we have an index starting with the tenant ID, and the query uses the
index.

But I'm still worried about PostgreSQL having to fetch 10 times more pages
from the disk than MySQL. If each 8K page contains approximately 10 rows,
fetching one page in MySQL will return 10 "useful" rows belonging to the
tenant. By comparison, fetching one page in PostgreSQL will probably return
only 1 "useful" row belonging to the tenant. In terms of IO, it's a big
difference.

Depending on your data distribution, you may want to consider table
partitions based on the tenant id. I personally never bother with
that, but split based on some other key in the data.

You're right. I don't really need a clustered index (like InnoDB). What I
need is to store rows belonging to the same tenant close to each other.
Partitioning can help with that. But the lack of declarative partitioning
makes it cumbersome (I've read this is being worked on).

Hi,

We have been using the extension pg_repack to keep a table groomed into
cluster order. With an appropriate FILLFACTOR to keep updates on the same
page, it works well. The issue is that it needs space to rebuild the new
index/table. If you have that, it works well.
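A sketch of that setup (hypothetical table name; pg_repack's --order-by option does the CLUSTER-like reorganization online):

```sql
-- Leave free space on each page so updates can stay on the same page:
ALTER TABLE tenant_data SET (fillfactor = 80);
-- Then, from the shell, reorganize online in clustering order, e.g.:
--   pg_repack --table=tenant_data --order-by=tenant_id mydb
```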

Regards,
Ken


#5 Nicolas Grilly
nicolas@gardentechno.com
In reply to: Kenneth Marshall (#4)
Re: Clustered index to preserve data locality in a multitenant application?

On Tue, Aug 30, 2016 at 8:17 PM, Kenneth Marshall <ktm@rice.edu> wrote:

We have been using the extension pg_repack to keep a table groomed into
cluster order. With an appropriate FILLFACTOR to keep updates on the same
page, it works well. The issue is that it needs space to rebuild the new
index/table. If you have that, it works well.

Interesting!
Do you run pg_repack on a regular schedule using something like cron, or
does it run automatically in the background?
Is it compatible with PostgreSQL 9.6?

#6 Nicolas Grilly
nicolas@vocationcity.com
In reply to: Nicolas Grilly (#1)
Re: Clustered index to preserve data locality in a multitenant application?

Eduardo Morras wrote:

Check BRIN indexs, they are "designed for handling very large tables in
which certain columns have some natural correlation with their physical
location within the table", I think they fit your needs.

Yes, a BRIN index on the tenant ID would be very useful if the rows in the
heap were naturally sorted by the tenant ID, but they are not. They are
naturally sorted by their order of insertion, which is completely
unrelated. The first step in solving this is to find a way to keep rows
belonging to the same tenant close to each other. The second step could be
to use a BRIN index.
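If that first step is solved (e.g. with periodic repacking), the second step is cheap to try (hypothetical names):

```sql
-- BRIN stores one summary per range of heap pages, so it is tiny
-- compared to a btree; it only pays off once the heap is kept
-- roughly in tenant order.
CREATE INDEX tenant_data_tenant_brin
    ON tenant_data USING brin (tenant_id);
```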

#7 Nicolas Grilly
nicolas@gardentechno.com
In reply to: Nicolas Grilly (#1)
Re: Clustered index to preserve data locality in a multitenant application?

Mike Sofen wrote:

For Nicolas’s situation, that would require 10,000 partitions – not very
useful, and each partition would be very small.

This is exactly my conclusion about using partitions in my situation.

In Postgres, as you mentioned, clustering is a “one time” operation but
only in the sense that after you add more rows, you’ll need to re-cluster
the table. Depending on the activity model for that table, that may be
feasible/ok. For example, if you load it via regular batch scripts, then
the clustering could be done after those loads. If you add rows only
rarely but then do lots of updates, then the clustering would work great.
If this is an active real time data table, then clustering would not be
viable.

The application is very interactive and new rows are inserted all the time
in my use case.

Thanks for your time,

Nicolas

#8 Kenneth Marshall
ktm@rice.edu
In reply to: Nicolas Grilly (#5)
Re: Clustered index to preserve data locality in a multitenant application?

On Wed, Aug 31, 2016 at 05:23:50PM +0200, Nicolas Grilly wrote:

On Tue, Aug 30, 2016 at 8:17 PM, Kenneth Marshall <ktm@rice.edu> wrote:

We have been using the extension pg_repack to keep a table groomed into
cluster order. With an appropriate FILLFACTOR to keep updates on the same
page, it works well. The issue is that it needs space to rebuild the new
index/table. If you have that, it works well.

Interesting!
Do you run pg_repack on a regular schedule using something like cron, or
does it run automatically in the background?
Is it compatible with PostgreSQL 9.6?

Hi,

We just run it via cron. In our case, we run it once a day, but depending on
your churn, it could be run once a week or more.

Regards,
Ken


#9 Nicolas Grilly
nicolas@gardentechno.com
In reply to: Kenneth Marshall (#8)
Re: Clustered index to preserve data locality in a multitenant application?

On Wed, Aug 31, 2016 at 6:05 PM, Kenneth Marshall <ktm@rice.edu> wrote:

We just run it via cron. In our case, we run it once a day, but depending
on
your churn, it could be run once a week or more.

Could you provide some numbers: what is the size of the tables or tables
that are repacked? how long does it take? is there a performance impact
during the repack?

#10 Kenneth Marshall
ktm@rice.edu
In reply to: Nicolas Grilly (#9)
Re: Clustered index to preserve data locality in a multitenant application?

On Wed, Aug 31, 2016 at 06:06:54PM +0200, Nicolas Grilly wrote:

On Wed, Aug 31, 2016 at 6:05 PM, Kenneth Marshall <ktm@rice.edu> wrote:

We just run it via cron. In our case, we run it once a day, but depending
on
your churn, it could be run once a week or more.

Could you provide some numbers: what is the size of the tables or tables
that are repacked? how long does it take? is there a performance impact
during the repack?

Hi,

The key is that there must be enough I/O capacity to handle your routine
DB access needs plus the additional usage by the repack operation. Running
the repack during a slow period does not tangibly impact DB performance and
completes in a reasonable period of time. Your best bet would be to set
up a test environment. By way of example, running our repack during peak
usage can take over 5 hours, versus under 30 minutes outside of peak when
there is excess I/O capacity. You should be able to use your I/O
monitoring system to get an idea of your I/O headroom.

Regards,
Ken


#11 Nicolas Grilly
nicolas@gardentechno.com
In reply to: Kenneth Marshall (#4)
Re: Clustered index to preserve data locality in a multitenant application?

On Tue, Aug 30, 2016 at 8:17 PM, Kenneth Marshall <ktm@rice.edu> wrote:

We have been using the extension pg_repack to keep a table groomed into
cluster order. With an appropriate FILLFACTOR to keep updates on the same
page, it works well. The issue is that it needs space to rebuild the new
index/table. If you have that, it works well.

It looks like Instagram has been using pg_reorg (the ancestor of pg_repack)
to keep all likes from the same user contiguous on disk, in order to
minimize disk seeks.

http://instagram-engineering.tumblr.com/post/40781627982/handling-growth-with-postgres-5-tips-from

This is very similar to what I'm trying to achieve.

The article is 3 years old. I'd be curious to know if they still do that.

Nicolas

#12 Ben Chobot
bench@silentmedia.com
In reply to: Nicolas Grilly (#11)
Re: Clustered index to preserve data locality in a multitenant application?

On Aug 31, 2016, at 2:55 PM, Nicolas Grilly <nicolas@gardentechno.com> wrote:

It looks like Instagram has been using pg_reorg (the ancestor of pg_repack) to keep all likes from the same user contiguous on disk, in order to minimize disk seeks.

http://instagram-engineering.tumblr.com/post/40781627982/handling-growth-with-postgres-5-tips-from

This is very similar to what I'm trying to achieve.

The article is 3 years old. I'd be curious to know if they still do that.

If what they did 3 years ago is similar to what you are trying to do today, who cares what they are doing today? (Besides using pg_repack instead of pg_reorg, of course.)

For what it's worth, we use pg_repack on a regular basis and it works exactly as advertised.

#13 Nicolas Grilly
nicolas@gardentechno.com
In reply to: Ben (#12)
Re: Clustered index to preserve data locality in a multitenant application?

On Thu, Sep 1, 2016 at 12:05 AM, Ben Chobot <bench@silentmedia.com> wrote:

If what they did 3 years ago is similar to what you are trying to do
today, who cares what they are doing today? (Besides using pg_repack
instead of pg_reorg, of course.)

I'm curious because, in the meantime, Instagram could have stopped doing
this because 1) it didn't work as intended or 2) they found a better
solution.

For what it's worth, we use pg_repack on a regular basis and it works
exactly as advertised.

Thanks.

#14 Nicolas Grilly
nicolas@gardentechno.com
In reply to: Kenneth Marshall (#4)
Re: Clustered index to preserve data locality in a multitenant application?

On Tue, Aug 30, 2016 at 8:17 PM, Kenneth Marshall <ktm@rice.edu> wrote:

We have been using the extension pg_repack to keep a table groomed into
cluster order. With an appropriate FILLFACTOR to keep updates on the same
page, it works well. The issue is that it needs space to rebuild the new
index/table. If you have that, it works well.

In DB2, it seems possible to define a "clustering index" that determines
how rows are physically ordered in the "table space" (the heap).

The documentation says: "When a table has a clustering index, an INSERT
statement causes DB2 to insert the records as nearly as possible in the
order of their index values."
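In DB2's DDL, that is just a keyword on the index (a sketch with hypothetical names; this is DB2 syntax, not PostgreSQL):

```sql
-- DB2, not PostgreSQL: CLUSTER marks this as the clustering index
-- whose order INSERT tries to honor.
CREATE INDEX tenant_data_cluster_idx
    ON tenant_data (tenant_id, row_id) CLUSTER;
```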

It looks like a kind of "continuous CLUSTER/pg_repack". Is there something
similar available or planned for PostgreSQL?

#15 Nicolas Grilly
nicolas@gardentechno.com
In reply to: Nicolas Grilly (#14)
Re: Clustered index to preserve data locality in a multitenant application?

On Thu, Sep 1, 2016 at 12:31 AM, Nicolas Grilly <nicolas@gardentechno.com>
wrote:

In DB2, it seems possible to define a "clustering index" that determines
how rows are physically ordered in the "table space" (the heap).

The documentation says: "When a table has a clustering index, an INSERT
statement causes DB2 to insert the records as nearly as possible in the
order of their index values."

It looks like a kind of "continuous CLUSTER/pg_repack". Is there something
similar available or planned for PostgreSQL?

I forgot the links to DB2 documentation about clustering index:

http://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/intro/src/tpc/db2z_clusteringindexes.html

http://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.admin.dbobj.doc/doc/c0061098.html

#16 Igor Neyman
ineyman@perceptron.com
In reply to: Nicolas Grilly (#14)
Re: Clustered index to preserve data locality in a multitenant application?

From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Nicolas Grilly
Sent: Wednesday, August 31, 2016 6:32 PM
To: Kenneth Marshall <ktm@rice.edu>
Cc: Vick Khera <vivek@khera.org>; pgsql-general <pgsql-general@postgresql.org>
Subject: Re: [GENERAL] Clustered index to preserve data locality in a multitenant application?

On Tue, Aug 30, 2016 at 8:17 PM, Kenneth Marshall <ktm@rice.edu<mailto:ktm@rice.edu>> wrote:
We have been using the extension pg_repack to keep a table groomed into
cluster order. With an appropriate FILLFACTOR to keep updates on the same
page, it works well. The issue is that it needs space to rebuild the new
index/table. If you have that, it works well.

In DB2, it seems possible to define a "clustering index" that determines how rows are physically ordered in the "table space" (the heap).

The documentation says: "When a table has a clustering index, an INSERT statement causes DB2 to insert the records as nearly as possible in the order of their index values."

It looks like a kind of "continuous CLUSTER/pg_repack". Is there something similar available or planned for PostgreSQL?

Don’t know about plans to implement clustered indexes in PostgreSQL.

Not sure if this was mentioned, but MS SQL Server has clustered indexes, where the heap row is stored at the leaf level of the index.
Oracle also has a similar feature: IOT, Index Organized Table.

It seems to me (maybe I’m wrong) that in PostgreSQL it would be much harder to implement a clustered index (with the heap row stored in the index leaf) because of the way MVCC is implemented: multiple row versions are stored in the table itself (e.g. Oracle, for that purpose, keeps the table “clean” and stores multiple row versions in its UNDO tablespace/segment).

Regards,
Igor Neyman

#17 Eduardo Morras
emorrasg@yahoo.es
In reply to: Nicolas Grilly (#6)
Re: Clustered index to preserve data locality in a multitenant application?

On Wed, 31 Aug 2016 17:33:18 +0200
Nicolas Grilly <nicolas@vocationcity.com> wrote:

Eduardo Morras wrote:

Check BRIN indexs, they are "designed for handling very large
tables in which certain columns have some natural correlation with
their physical location within the table", I think they fit your
needs.

Yes, a BRIN index on the tenant ID would be very useful if the rows
in the heap were naturally sorted by the tenant ID, but they are not.
They are naturally sorted by their order of insertion, which is
completely unrelated. The first step in solving this is to find a way
to keep rows belonging to the same tenant close to each other. The
second step could be to use a BRIN index.

Then you can make multi-column partial indexes:

CREATE INDEX CONCURRENTLY tenant_01_idx ON big_tenant_table (the_columns_with_data_you_need, tenant_id) WHERE tenant_id = 1;
CREATE INDEX CONCURRENTLY tenant_02_idx ON big_tenant_table (the_columns_with_data_you_need, tenant_id) WHERE tenant_id = 2;

This way each index holds the data for one tenant, is updated only when that tenant's data changes, and each index has its own files, so you can REINDEX to clean out index content and debloat.

REINDEX INDEX tenant_01_idx;

Or group them if there are too many indexes:
CREATE INDEX CONCURRENTLY tenant_01_idx ON big_tenant_table (the_columns_with_data_you_need, tenant_id) WHERE tenant_id <= 300;
CREATE INDEX CONCURRENTLY tenant_02_idx ON big_tenant_table (the_columns_with_data_you_need, tenant_id) WHERE tenant_id > 300 AND tenant_id <= 600;

---   ---
Eduardo Morras <emorrasg@yahoo.es>


#18 Nicolas Grilly
nicolas@gardentechno.com
In reply to: Igor Neyman (#16)
Re: Clustered index to preserve data locality in a multitenant application?

On Thu, Sep 1, 2016 at 3:08 PM, Igor Neyman <ineyman@perceptron.com> wrote:

Don’t know about plans to implement clustered indexes in PostgreSQL.

It was discussed on the mailing list in the past.

I found an interesting thread dated from 2012 about integrating pg_reorg
(the ancestor of pg_repack) in PostgreSQL core:

/messages/by-id/CAB7nPqTGmNUFi+W6F1iwmf7J-o6sY+xxo6Yb=mkUVYT-CG-B5A@mail.gmail.com

There is also an item titled "Automatically maintain clustering on a table"
in the TODO list:

https://wiki.postgresql.org/wiki/Todo#CLUSTER

Not sure if this was mentioned, but MS SQL Server has clustered indexes,
where the heap row is stored at the leaf level of the index.
Oracle also has a similar feature: IOT, Index Organized Table.

It seems to me (maybe I’m wrong) that in PostgreSQL it would be much
harder to implement a clustered index (with the heap row stored in the
index leaf) because of the way MVCC is implemented: multiple row versions
are stored in the table itself (e.g. Oracle, for that purpose, keeps the
table “clean” and stores multiple row versions in its UNDO
tablespace/segment).

DB2, like PostgreSQL, stores rows in a heap, and not in the leaves of a
B-tree. But it's possible to define a "clustering" key for a table. When it
is defined, DB2 tries to keep the rows in the heap ordered according to the
clustering key. If DB2 can’t find space on the page where the row should
go, then it looks a few pages before and after and puts it there, and if it
still can’t find space then puts it at the end. There is also a feature
called "multidimensional clustering" which is even more sophisticated.
There is also a command REORG, which would be the equivalent of a
non-blocking CLUSTER in PostgreSQL.

I think DB2's approach is interesting because it shows that maintaining
spatial coherency is possible with a heap, without having to store rows in
a Btree (like InnoDB).

#19 Nicolas Grilly
nicolas@vocationcity.com
In reply to: Nicolas Grilly (#1)
Re: Clustered index to preserve data locality in a multitenant application?

On Tue, Aug 30, 2016 at 1:10 PM, Nicolas Grilly <nicolas@vocationcity.com>
wrote:

We are developing a multitenant application which is currently based on
MySQL, but we're thinking of migrating to PostgreSQL.

We rely on clustered indexes to preserve data locality for each tenant.
Primary keys start with the tenant ID. This way, rows belonging to the same
tenant are stored next to each other. Because all requests hit only one
tenant, this is a great performance improvement.

PostgreSQL doesn't have clustered indexes — I'm aware of the CLUSTER
command but it's a one-time operation — and I'm wondering if this can be a
problem or not.

Let's say we have a table containing data for 10,000 tenants and 10,000
rows per tenant, for a total of 100,000,000 rows. Let's say each 8 KB block
contains ~10 rows. Let's say we want to compute the sum of an integer
column for all rows belonging to a given tenant ID.

In MySQL/InnoDB, rows are stored in the leaf nodes of a B-tree. To compute
the sum, MySQL has to read at least 1,000 blocks (each block containing ~10
rows). I deliberately neglect the cost of walking the B-tree intermediate
nodes.

By comparison, PostgreSQL has to read at least 10,000 blocks (each block
containing ~10 rows, but most of the time, only one row will match the
tenant ID, other rows belonging to other tenants).

A few questions:

- Am I missing something?
- Am I overestimating the benefit of a clustered index in our case, and
the cost of not having one in PostgreSQL?
- Is there another technical solution to this problem?

After a few days thinking and researching about this, I've come to the
following conclusion:

1) Trying to preserve data locality in a multi-tenant application is a
legitimate requirement (because it can significantly improve performance).
2) But index organized tables are not the right answer.
3) Partitioning by tenant_id, sorting heap rows by tenant_id, and/or
creating covering indexes (for index-only scans) are better solutions.

I found at least two examples of well-known companies using PostgreSQL at
scale for a multi-tenant application, and taking specific steps to preserve
data locality:

- Heap shards its data by customer_id (all rows in a logical shard belong
to the same customer — except for small customers, but it's easy to make
their queries fast anyway) [1].
- Instagram uses pg_reorg (the ancestor of pg_repack) to keep likes created
by the same user contiguous on disk [2].

At first, I thought that index organized tables (aka clustered indexes)
were the solution, and that missing them in PostgreSQL could be an issue,
but I was wrong. They are not the right tool for the job.

Index organized tables are a powerful tool for an application that needs
very fast lookups and range scans on the primary key. But they also have
significant drawbacks [3]:

- Lookups on secondary keys are slower (because they need one more
indirection).
- The index is bigger (because rows are stored directly in the index,
instead of a heap).
- InnoDB can only use the primary key as clustering key, which is very
restrictive (for example, in PostgreSQL, most recently inserted/updated
rows are naturally clustered together, independently of the chosen primary
key).

So, PostgreSQL uses heap organized tables instead of index organized
tables, and this is a good thing, at least for the kind of queries my
application needs. But, at scale, I still need to find a way to preserve
data locality for each tenant.

The solutions I've identified are:

- Partition by tenant_id as suggested by Thomas Kellerer earlier in this
thread. Declarative partitioning will make this easier in a future version
of PostgreSQL. It's also possible to "shard" using Citus Data (like Heap or
CloudFlare).
- Periodically sort rows by tenant_id in the heap, using something like
pg_repack, as suggested by Kenneth Marshall and Ben Chobot.
- Create covering indexes, which let PostgreSQL do index-only scans
(exactly like an index organized table), as suggested by Eduardo Morras and
engineers at Heap [4].

It looks like I can move forward with our migration from MySQL to
PostgreSQL, without worrying about the lack of clustered indexes, because
there are better solutions to keep tenant data contiguous!

Thanks for all your comments.

Nicolas Grilly

[1]: https://news.ycombinator.com/item?id=12412216
[2]: http://instagram-engineering.tumblr.com/post/40781627982/handling-growth-with-postgres-5-tips-from
[3]: https://dzone.com/articles/unreasonable-defaults-primary
[4]: "Powering Heap", https://www.youtube.com/watch?v=NVl9_6J1G60

#20 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Nicolas Grilly (#19)
Re: Clustered index to preserve data locality in a multitenant application?

On 9/6/16 11:21 AM, Nicolas Grilly wrote:

It looks like I can move forward with our migration from MySQL to
PostgreSQL, without worrying about the lack of clustered indexes,
because there are better solutions to keep tenant data contiguous!

First rule of performance tuning: don't. :) There are lots of areas
where Postgres can be expected to perform better than MySQL, so without
testing your app you really don't know how it's going to fare.

There's also another option: use a logical replication system (such as
pg_logical, BDR, londiste or Slony) to maintain at least one replica.
You can take that replica down to perform maintenance (such as a
database-wide CLUSTER) as needed, and let replication catch up once you
bring it back online. That, combined with scripted failover makes a lot
of database maintenance items far easier, at the cost of having to
maintain the replication. Depending on your needs, a major benefit to
this method is it makes major version upgrades very simple: you just
stand up a new replica on the new version and then failover to it. If
anything goes wrong, you can fail back to the old version without losing
any data.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461
