Horizontal scalability/sharding
I have recently increased my public statements about the idea of adding
horizontal scaling/sharding to Postgres. I wanted to share with hackers
a timeline of how we got here, and where I think we are going in the
short term:
2012-2013: As part of writing my scaling talk
(http://momjian.us/main/presentations/overview.html#scaling), studying
Oracle RAC, and talking to users, it became clear that an XC-like
architecture (sharding) was the only architecture that was going to allow
for write scaling.
Users and conference attendees I talked to were increasingly concerned
about the ability of Postgres to scale for high write volumes. They didn't
necessarily need that scale now, but they needed to know they could get
it if they wanted it, and wouldn't need to switch to a new database in
the future. This is similar to wanting a car that can get you on a highway
on-ramp fast --- even if you don't need it, you want to know it is there.
2014: I started to shop around the idea that we could use FDWs,
parallelism, and a transaction/snapshot manager to get XC features
as built-in to Postgres. (I don't remember where the original idea
came from.) It was clear that having separate forks of the source code
in XC and XL was never going to achieve critical mass --- there just
aren't enough people who need high write scale right now, and the fork
maintenance overhead is a huge burden.
I realized that we would never get community acceptance to dump the XC
(or XL) code needed for sharding into community Postgres, but with FDWs,
we could add the features as _part_ of improving FDWs, which would benefit
FDWs _and_ would be useful for sharding. (We already see some of those
FDW features in 9.5.)
October, 2014: EDB and NTT started working together in the community
to start improving FDWs as a basis for an FDW-based sharding solution.
Many of the 9.5 FDW improvements that also benefit sharding were developed
by a combined EDB/NTT team. The features improved FDWs independent of
sharding, so they didn't need community buy-in on sharding to get them
accepted.
June, 2015: I attended the PGCon sharding unconference session and
there was a huge discussion about where we should go with sharding.
I think the big take-away was that most people liked the FDW approach,
but had business/customer reasons for wanting to work on XC or XL because
those would be production-ready faster.
July, 2015: Oleg Bartunov and his new company Postgres Professional (PP)
started to think about joining the FDW approach, rather than working on
XL, as they had stated at PGCon in June. A joint NTT/EDB/PP phone-in
meeting is scheduled for September 1.
August, 2015: While speaking at SFPUG, Citus Data approached me about
joining the FDW sharding team. They have been invited to the September
1 meeting, as have the XC and XL people.
October, 2015: EDB is sponsoring a free 3-hour summit about FDW sharding
at the PG-EU conference in Vienna. Everyone is invited, but it is hoped
most of the September 1 folks can attend.
February, 2016: Oleg is planning a similar meeting at their February
Moscow conference.
Anyway, I wanted to explain the work that has been happening around
sharding. As things move forward, I am increasingly convinced that write
scaling will be needed soon, that the XC approach is the only reasonable
way to do it, and that FDWs are the cleanest way to get it into community
Postgres.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 30 August 2015 at 03:17, Bruce Momjian <bruce@momjian.us> wrote:
I have recently increased my public statements about the idea of adding
horizontal scaling/sharding to Postgres.
Glad to see it. Many people have been pushing such things for years, so it
is good to finally see some debate about this on Hackers.
I wanted to share with hackers
a timeline of how we got here, and where I think we are going in the
short term:
2012-2013: As part of writing my scaling talk
(http://momjian.us/main/presentations/overview.html#scaling), studying
Oracle RAC, and talking to users, it became clear that an XC-like
architecture (sharding) was the only architecture that was going to allow
for write scaling.
What other architectures were discussed? Where was that discussion?
Users and conference attendees I talked to were increasingly concerned
about the ability of Postgres to scale for high write volumes. They didn't
necessarily need that scale now, but they needed to know they could get
it if they wanted it, and wouldn't need to switch to a new database in
the future. This is similar to wanting a car that can get you on a highway
on-ramp fast --- even if you don't need it, you want to know it is there.
+1
2014: I started to shop around the idea that we could use FDWs,
parallelism, and a transaction/snapshot manager to get XC features
as built-in to Postgres. (I don't remember where the original idea
came from.) It was clear that having separate forks of the source code
in XC and XL was never going to achieve critical mass --- there just
aren't enough people who need high write scale right now, and the fork
maintenance overhead is a huge burden.
I personally support the view that we should put scalability features into
Postgres core, rather than run separate forks.
I realized that we would never get community acceptance to dump the XC
(or XL) code needed for sharding into community Postgres
How or why did you realize that? There has never been any such discussion,
AFAIK. Surely it can be possible to move required subsystems across?
, but with FDWs,
we could add the features as _part_ of improving FDWs, which would benefit
FDWs _and_ would be useful for sharding. (We already see some of those
FDW features in 9.5.)
That is a huge presumption. Not discussed or technically analyzed in any
way with the community.
October, 2014: EDB and NTT started working together in the community
to start improving FDWs as a basis for an FDW-based sharding solution.
Many of the 9.5 FDW improvements that also benefit sharding were developed
by a combined EDB/NTT team. The features improved FDWs independent of
sharding, so they didn't need community buy-in on sharding to get them
accepted.
June, 2015: I attended the PGCon sharding unconference session and
there was a huge discussion about where we should go with sharding.
I think the big take-away was that most people liked the FDW approach,
but had business/customer reasons for wanting to work on XC or XL because
those would be production-ready faster.
Cough, cough. You must surely be joking that "most people liked the FDW
approach"? How did we measure the acceptance of this approach?
What actually is the FDW approach? Since it's not been written down
anywhere, or even explained verbally, how can anyone actually agree to it?
July, 2015: Oleg Bartunov and his new company Postgres Professional (PP)
started to think about joining the FDW approach, rather than working on
XL, as they had stated at PGCon in June. A joint NTT/EDB/PP phone-in
meeting is scheduled for September 1.
August, 2015: While speaking at SFPUG, Citus Data approached me about
joining the FDW sharding team. They have been invited to the September
1 meeting, as have the XC and XL people.
2ndQuadrant is working in this area, specifically bringing XL 9.5 forwards.
Please can invites be posted to myself, Pavan Deolasee and Petr Jelinek
also? I'll pass on to others also.
Koichi Suzuki is arranging a meeting in Hong Kong for XC/XL discussions.
Presumably EDB is invited also? If Koichi is a leading organizer of this,
why are there two meetings?
October, 2015: EDB is sponsoring a free 3-hour summit about FDW sharding
at the PG-EU conference in Vienna. Everyone is invited, but it is hoped
most of the September 1 folks can attend.
February, 2016: Oleg is planning a similar meeting at their February
Moscow conference.
Anyway, I wanted to explain the work that has been happening around
sharding.
Thanks
As things move forward, I am increasingly convinced that write
scaling will be needed soon,
+1
that the XC approach is the only reasonable way to do it,
and that FDWs are the cleanest way to get it into community
Postgres.
Those two things aren't at all obvious to me.
Please don't presume my opposition. If the technical information were made
public, I might understand and agree with "the FDW approach", perhaps
others also. 2ndQuadrant is certainly happy to become involved in any team
aiming to add features to Postgres core, as long as that makes sense. There
may be areas we can all agree upon even if the full architecture remains in
doubt.
Before the community commits to a long term venture together we should see
the plan. Like all IT projects, expensive failure is possible and the lack
of a design is a huge flashing red warning light for me at present. If that
requires a meeting of all Developers, why are the meetings for this
specifically not happening at the agreed Developer meetings?
--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Aug 30, 2015 at 5:31 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 30 August 2015 at 03:17, Bruce Momjian <bruce@momjian.us> wrote:
I have recently increased my public statements about the idea of adding
horizontal scaling/sharding to Postgres.
Glad to see it. Many people have been pushing such things for years, so it
is good to finally see some debate about this on Hackers.
I wanted to share with hackers
a timeline of how we got here, and where I think we are going in the
short term:
2012-2013: As part of writing my scaling talk
(http://momjian.us/main/presentations/overview.html#scaling), studying
Oracle RAC, and talking to users, it became clear that an XC-like
architecture (sharding) was the only architecture that was going to allow
for write scaling.
What other architectures were discussed? Where was that discussion?
Users and conference attendees I talked to were increasingly concerned
about the ability of Postgres to scale for high write volumes. They
didn't
necessarily need that scale now, but they needed to know they could get
it if they wanted it, and wouldn't need to switch to a new database in
the future. This is similar to wanting a car that can get you on a
highway
on-ramp fast --- even if you don't need it, you want to know it is there.
+1
2014: I started to shop around the idea that we could use FDWs,
parallelism, and a transaction/snapshot manager to get XC features
as built-in to Postgres. (I don't remember where the original idea
came from.) It was clear that having separate forks of the source code
in XC and XL was never going to achieve critical mass --- there just
aren't enough people who need high write scale right now, and the fork
maintenance overhead is a huge burden.
I personally support the view that we should put scalability features into
Postgres core, rather than run separate forks.
I realized that we would never get community acceptance to dump the XC
(or XL) code needed for sharding into community Postgres
How or why did you realize that? There has never been any such discussion,
AFAIK. Surely it can be possible to move required subsystems across?
, but with FDWs,
we could add the features as _part_ of improving FDWs, which would benefit
FDWs _and_ would be useful for sharding. (We already see some of those
FDW features in 9.5.)
That is a huge presumption. Not discussed or technically analyzed in any
way with the community.
October, 2014: EDB and NTT started working together in the community
to start improving FDWs as a basis for an FDW-based sharding solution.
Many of the 9.5 FDW improvements that also benefit sharding were developed
by a combined EDB/NTT team. The features improved FDWs independent of
sharding, so they didn't need community buy-in on sharding to get them
accepted.
June, 2015: I attended the PGCon sharding unconference session and
there was a huge discussion about where we should go with sharding.
I think the big take-away was that most people liked the FDW approach,
but had business/customer reasons for wanting to work on XC or XL because
those would be production-ready faster.
Cough, cough. You must surely be joking that "most people liked the FDW
approach"? How did we measure the acceptance of this approach?
What actually is the FDW approach? Since it's not been written down
anywhere, or even explained verbally, how can anyone actually agree to it?
July, 2015: Oleg Bartunov and his new company Postgres Professional (PP)
started to think about joining the FDW approach, rather than working on
XL, as they had stated at PGCon in June. A joint NTT/EDB/PP phone-in
meeting is scheduled for September 1.
A little correction about Postgres Professional. We are concentrating on
the idea of having one distributed transaction manager, originally DTM; now we
have a better name, XTM, which is neutral to the actual cluster implementation.
For example, we are testing it with XL, ported to 9.4, but we were planning to
extend the tests to pg_shard and postgres_fdw. My idea was to have at least XTM
committed to 9.6, so all parties could work on their implementations much
more easily.
August, 2015: While speaking at SFPUG, Citus Data approached me about
joining the FDW sharding team. They have been invited to the September
1 meeting, as have the XC and XL people.
2ndQuadrant is working in this area, specifically bringing XL 9.5
forwards. Please can invites be posted to myself, Pavan Deolasee and Petr
Jelinek also? I'll pass on to others also.
Koichi Suzuki is arranging a meeting in Hong Kong for XC/XL discussions.
Presumably EDB is invited also? If Koichi is a leading organizer of this,
why are there two meetings?
October, 2015: EDB is sponsoring a free 3-hour summit about FDW sharding
at the PG-EU conference in Vienna. Everyone is invited, but it is hoped
most of the September 1 folks can attend.
February, 2016: Oleg is planning a similar meeting at their February
Moscow conference.
Anyway, I wanted to explain the work that has been happening around
sharding.
Thanks
As things move forward, I am increasingly convinced that write
scaling will be needed soon,
+1
that the XC approach is the only reasonable way to do it,
and that FDWs are the cleanest way to get it into community
Postgres.
Those two things aren't at all obvious to me.
Please don't presume my opposition. If the technical information were made
public, I might understand and agree with "the FDW approach", perhaps
others also. 2ndQuadrant is certainly happy to become involved in any team
aiming to add features to Postgres core, as long as that makes sense. There
may be areas we can all agree upon even if the full architecture remains in
doubt.
Before the community commits to a long term venture together we should see
the plan. Like all IT projects, expensive failure is possible and the lack
of a design is a huge flashing red warning light for me at present. If that
requires a meeting of all Developers, why are the meetings for this
specifically not happening at the agreed Developer meetings?
At PGCon we agreed to have such a meeting in Vienna at least. But I think we
should be prepared and try to clear all our issues before then. It looks like we
are already out of time, but probably we could meet in Hong Kong?
Honestly, I still don't know which approach is better. We already played
with XL (ported to 9.4) and identified some very serious issues with
inconsistency, which scared us, especially taking into account how easily we
found them. XC people have fixed them, but I'm not sure if they were
fundamental and if we could construct more sophisticated tests and find
more issues in XC/XL. We are also a bit disappointed by Huawei's position on the
CSN patch, which we hoped to use for our XTM. The FDW approach has been actively
criticized by the pg_shard people and that also made me a bit suspicious. It
looks like we are doomed to continue several development forks, so we
decided to work on a very important common project, XTM, which we hoped could
be accepted by all parties and eventually committed to 9.6. Now I see we
were right, unfortunately.
Again, could we organize a meeting somewhere in September? The US is not good
for us, but other places should be ok. I want to have an agreement at
least on XTM. We still are testing various approaches, though. We could
present results of our experiments and are open to discussion. It's not an
easy project, but it's something we could do for 9.6.
I'm very glad Bruce started this discussion in -hackers, since it's silly
to me to participate in both threads :) Let's meet in September !
On Sun, Aug 30, 2015 at 03:31:10PM +0100, Simon Riggs wrote:
On 30 August 2015 at 03:17, Bruce Momjian <bruce@momjian.us> wrote:
I have recently increased my public statements about the idea of adding
horizontal scaling/sharding to Postgres.
Glad to see it. Many people have been pushing such things for years, so it is
good to finally see some debate about this on Hackers.
Agreed. Right now, in our community, we are only seeing users who are
happy with what Postgres offers but think they might need massive
horizontal scalability in the future. I think there is a larger group
that cares about massive horizontal scalability, but those people are
using other software right now, so we don't see them yet.
Without a roadmap for built-in massive horizontal scalability, I think
Postgres adoption will eventually suffer.
I wanted to share with hackers
a timeline of how we got here, and where I think we are going in the
short term:
2012-2013: As part of writing my scaling talk
(http://momjian.us/main/presentations/overview.html#scaling), studying
Oracle RAC, and talking to users, it became clear that an XC-like
architecture (sharding) was the only architecture that was going to allow
for write scaling.
What other architectures were discussed? Where was that discussion?
That was mostly my conclusion. I explained it to small groups at
conferences and Postgres user groups. No one said I was wrong, but that
is about the level of debate I had.
2014: I started to shop around the idea that we could use FDWs,
parallelism, and a transaction/snapshot manager to get XC features
as built-in to Postgres. (I don't remember where the original idea
came from.) It was clear that having separate forks of the source code
in XC and XL was never going to achieve critical mass --- there just
aren't enough people who need high write scale right now, and the fork
maintenance overhead is a huge burden.
I personally support the view that we should put scalability features into
Postgres core, rather than run separate forks.
Good, I do think it is time, but as I stated above, there is limited
interest in our current community, so the tolerance for additional
community code to accomplish this is also limited. This is the big
thing that had me excited about using FDWs --- FDW improvements can get
us closer to sharding without requiring community acceptance of
sharding-only features.
I realized that we would never get community acceptance to dump the XC
(or XL) code needed for sharding into community Postgres
How or why did you realize that? There has never been any such discussion,
AFAIK. Surely it can be possible to move required subsystems across?
Well, I have had many such discussions with XC/XL folks, and that was my
opinion. I have seen almost no public discussion about this because the
idea had almost no chance of success. If it was possible, someone would
have already suggested it on this list.
, but with FDWs,
we could add the features as _part_ of improving FDWs, which would benefit
FDWs _and_ would be useful for sharding. (We already see some of those
FDW features in 9.5.)
That is a huge presumption. Not discussed or technically analyzed in any way
with the community.
True. It seemed pretty obvious to me.
October, 2014: EDB and NTT started working together in the community
to start improving FDWs as a basis for an FDW-based sharding solution.
Many of the 9.5 FDW improvements that also benefit sharding were developed
by a combined EDB/NTT team. The features improved FDWs independent of
sharding, so they didn't need community buy-in on sharding to get them
accepted.
June, 2015: I attended the PGCon sharding unconference session and
there was a huge discussion about where we should go with sharding.
I think the big take-away was that most people liked the FDW approach,
but had business/customer reasons for wanting to work on XC or XL because
those would be production-ready faster.
Cough, cough. You must surely be joking that "most people liked the FDW
approach"? How did we measure the acceptance of this approach?
Well, I didn't have my audience-meter with me at the time. ;-)
The discussion was mostly in the hallway after the unconference session,
"Future of PostgreSQL shared-nothing cluster" by Konstantin Knizhnik,
Alexander Korotkov, and Oleg Bartunov. Again, when I explained the
ability to use FDWs to get sharding into Postgres with minimal
additional code, no one said the idea was crazy, which I took as a big
thumbs-up! When I asked why to continue with XC/XL, I was told those
were more mature and more customer-ready, which is true. I will not
quote people from the hallway discussion for privacy reasons.
What actually is the FDW approach? Since it's not been written down anywhere, or
even explained verbally, how can anyone actually agree to it?
Well, my sharding talk just has the outlines of an approach. I think
there are five broad segments:
* FDW push-down of joins, sorts, aggregates
* ability to send FDW requests in parallel
* transaction/snapshot manager to allow ACID transactions on shards
* simpler user partitioning API
* infrastructure to manage shards, including replicated tables used for joins
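[Editor's illustration] To make the push-down items concrete, here is a hedged sketch of how the user-facing side might look today with plain postgres_fdw DDL. The server and table names are invented, user mappings are omitted, and the view stands in for the simpler partitioning API, which does not exist yet.

```sql
-- Illustrative only: a coordinator pointing at two shard servers.
CREATE EXTENSION postgres_fdw;

CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1.example.com', dbname 'app');
CREATE SERVER shard2 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard2.example.com', dbname 'app');

CREATE FOREIGN TABLE orders_s1 (id int, total numeric)
    SERVER shard1 OPTIONS (table_name 'orders');
CREATE FOREIGN TABLE orders_s2 (id int, total numeric)
    SERVER shard2 OPTIONS (table_name 'orders');

-- Coordinator-side "sharded table"; with aggregate push-down the
-- per-shard sums would run remotely, ideally in parallel.
CREATE VIEW orders AS
    SELECT * FROM orders_s1 UNION ALL SELECT * FROM orders_s2;

SELECT sum(total) FROM orders;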
July, 2015: Oleg Bartunov and his new company Postgres Professional (PP)
started to think about joining the FDW approach, rather than working on
XL, as they had stated at PGCon in June. A joint NTT/EDB/PP phone-in
meeting is scheduled for September 1.
August, 2015: While speaking at SFPUG, Citus Data approached me about
joining the FDW sharding team. They have been invited to the September
1 meeting, as have the XC and XL people.
2ndQuadrant is working in this area, specifically bringing XL 9.5 forwards.
Yes, I saw the blog post about that:
http://blog.2ndquadrant.com/working-towards-postgres-xl-9-5/
Please can invites be posted to myself, Pavan Deolasee and Petr Jelinek also?
I'll pass on to others also.
OK, I will send you a separate email and you can then supply their email
addresses.
Koichi Suzuki is arranging a meeting in Hong Kong for XC/XL discussions.
Presumably EDB is invited also? If Koichi is a leading organizer of this, why
are there two meetings?
I certainly have heard nothing about it, except third-hand people
telling me a meeting is happening. I assumed those meetings were
XC/XL-specific.
that the XC approach is the only reasonable way to do it,
and that FDWs are the cleanest way to get it into community
Postgres.
Those two things aren't at all obvious to me.
Please don't presume my opposition. If the technical information were made
public, I might understand and agree with "the FDW approach", perhaps others
also.
Well, the beauty of my approach is that we didn't need any technical
direction or buy-in on sharding from the community to improve FDWs. I
think now is the right time to try to get that buy-in, or adjust our
approach.
There isn't really much more to my _analysis_ than I presented. There
is certainly a lot more work to do to even decide this is the right
approach. Some of the groups already involved have more experience in
trying this, e.g. Citus Data.
2ndQuadrant is certainly happy to become involved in any team aiming to
add features to Postgres core, as long as that makes sense. There may be areas
we can all agree upon even if the full architecture remains in doubt.
Right.
Before the community commits to a long term venture together we should see the
plan. Like all IT projects, expensive failure is possible and the lack of a
design is a huge flashing red warning light for me at present. If that requires
a meeting of all Developers, why are the meetings for this specifically not
happening at the agreed Developer meetings?
Well, what meetings should it be at? I don't think there was clear
enough direction for the June 2015 PGCon meeting. Is there an
unconference in Vienna? One thing I saw at the last PGCon is that this
is a big topic, so I think having a dedicated room and 3-hour slot for
it is nice.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Sun, Aug 30, 2015 at 10:36:23PM +0300, Oleg Bartunov wrote:
Honestly, I still don't know which approach is better. We already played with
XL (ported to 9.4) and identified some very serious issues with inconsistency,
which scared us, especially taking into account how easily we found them. XC
people have fixed them, but I'm not sure if they were fundamental and if we
could construct more sophisticated tests and find more issues in XC/XL. We are also
a bit disappointed by Huawei's position on the CSN patch, which we hoped to use for
our XTM. The FDW approach has been actively criticized by the pg_shard people and
that also made me a bit suspicious.
Yep, that has me concerned too. The pg_shard people will be on the
September 1 call and are working on a Google document to explain their
concerns about FDWs for sharding.
It looks like we are doomed to continue
several development forks, so we decided to work on a very important common
project, XTM, which we hoped could be accepted by all parties and eventually
committed to 9.6. Now I see we were right, unfortunately.
Yes, the ability to add independent parts that can eventually be used
for sharding is a strong indication that doing this incrementally is a
good approach.
Again, could we organize a meeting somewhere in September? The US is not good for
us, but other places should be ok. I want to have an agreement at least on
XTM. We still are testing various approaches, though. We could present results
of our experiments and are open to discussion. It's not an easy project, but it's
something we could do for 9.6.
Good. XTM is a must-have for several use-cases, including sharding.
I'm very glad Bruce started this discussion in -hackers, since it's silly to me
to participate in both threads :) Let's meet in September!
In summary, I think we need to start working on built-in sharding, and
FDWs are the only way I can see to do it with minimal code changes,
which I think might be a community requirement. It might not work, but
right now, it is the only possible approach I can see.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Mon, Aug 31, 2015 at 7:29 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Sun, Aug 30, 2015 at 03:31:10PM +0100, Simon Riggs wrote:
I realized that we would never get community acceptance to dump the
XC
(or XL) code needed for sharding into community Postgres
How or why did you realize that? There has never been any such
discussion,
AFAIK. Surely it can be possible to move required subsystems across?
Well, I have had many such discussions with XC/XL folks, and that was my
opinion. I have seen almost no public discussion about this because the
idea had almost no chance of success. If it was possible, someone would
have already suggested it on this list.
Or perhaps people invested in this area had other obligations or lacked
motivation and/or time to work to push up for things in core. That's not
possible to know, and what is done is done.
July, 2015: Oleg Bartunov and his new company Postgres Professional
(PP)
started to think about joining the FDW approach, rather than working
on
XL, as they had stated at PGCon in June. A joint NTT/EDB/PP phone-in
meeting is scheduled for September 1.
August, 2015: While speaking at SFPUG, Citus Data approached me
about
joining the FDW sharding team. They have been invited to the
September
1 meeting, as have the XC and XL people.
2ndQuadrant is working in this area, specifically bringing XL 9.5
forwards.
Yes, I saw the blog post about that:
http://blog.2ndquadrant.com/working-towards-postgres-xl-9-5/
Please can invites be posted to myself, Pavan Deolasee and Petr Jelinek
also?
I'll pass on to others also.
OK, I will send you a separate email and you can then supply their email
addresses.
FWIW, I would be interested in that as well. I worked in this area of
things for a couple of years as well FWIW.
Koichi Suzuki is arranging a meeting in Hong Kong for XC/XL discussions.
Presumably EDB is invited also? If Koichi is a leading organizer of this, why
are there two meetings?
I certainly have heard nothing about it, except third-hand people
telling me a meeting is happening. I assumed those meetings were
XC/XL-specific.
Yep, that's my understanding as well, and as far as I know things have been
carried this way until now, i.e. XC/XL and Postgres core are meant to live
as separate communities.
--
Michael
On Mon, Aug 31, 2015 at 09:53:57AM +0900, Michael Paquier wrote:
Well, I have had many such discussions with XC/XL folks, and that was my
opinion. I have seen almost no public discussion about this because the
idea had almost no chance of success. If it was possible, someone would
have already suggested it on this list.
Or perhaps people invested in this area had other obligations or lacked
motivation and/or time to work to push up for things in core. That's not
possible to know, and what is done is done.
Well, I have talked to everyone privately about this, and concluded that
while horizontal scalability/sharding is useful, it is unlikely that the
code volume of something like XC or XL would be accepted into the
community, and frankly, now that we have FDWs, it is hard to imagine why
we would _not_ go in the FDW direction.
Of course, people have concerns, and FDWs might need to be improved, but
it is something worth researching. We might find out FDWs can't be used
at all, and that we have to either add much more code to Postgres to do
sharding, do something like pg_shard, or not implement built-in sharding
at all, but at least it is time to research this.
OK, I will send you a separate email and you can then supply their email
addresses.
FWIW, I would be interested in that as well. I worked in this area of things
for a couple of years as well FWIW.
OK, I will send you an email.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Sun, Aug 30, 2015 at 10:08:06PM -0400, Bruce Momjian wrote:
On Mon, Aug 31, 2015 at 09:53:57AM +0900, Michael Paquier wrote:
Well, I have had many such discussions with XC/XL folks, and that was my
opinion. I have seen almost no public discussion about this because the
idea had almost no chance of success. If it was possible, someone would
have already suggested it on this list.
Or perhaps people invested in this area had other obligations or lacked
motivation and/or time to work to push up for things in core. That's not
possible to know, and what is done is done.
Well, I have talked to everyone privately about this, and concluded that
while horizontal scalability/sharding is useful, it is unlikely that the
code volume of something like XC or XL would be accepted into the
community, and frankly, now that we have FDWs, it is hard to imagine why
we would _not_ go in the FDW direction.
Actually, there was hope that XC or XL would get popular enough that it
would justify adding their code into community Postgres, but that never
happened.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Mon, Aug 31, 2015 at 11:08 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Aug 31, 2015 at 09:53:57AM +0900, Michael Paquier wrote:
Well, I have had many such discussions with XC/XL folks, and that was my
opinion. I have seen almost no public discussion about this because the
idea had almost no chance of success. If it was possible, someone would
have already suggested it on this list.

Or perhaps people invested in this area had other obligations or lacked
motivation and/or time to push for things in core. That's not
possible to know, and what is done is done.

Well, I have talked to everyone privately about this, and concluded that
while horizontal scalability/sharding is useful, it is unlikely that the
code volume of something like XC or XL would be accepted into the
community, and frankly, now that we have FDWs, it is hard to imagine why
we would _not_ go in the FDW direction.
If I recall correctly in terms of numbers, that's indeed about 40k lines of code,
the main areas of XC code being the GTM, the planner changes for
expression and join push down, and the connection pooler for parallel
query execution.
ISTM that FDW is a portion of the puzzle, there are other pieces that
could be used toward an in-core integration, like the parallel stuff
Amit Kapila is working on to allow remote query execution in parallel
of local scans. Also, XC/XL were performing well on OLTP thanks to the
connection pooler: this should indeed be part of the FDW portion
managing the foreign scans. This may sound like a minor issue compared
to the others, but already established connections help a lot when
scaling out with foreign servers.
Of course, people have concerns, and FDWs might need to be improved, but
it is something worth researching. We might find out FDWs can't be used
at all, and that we have to either add much more code to Postgres to do
sharding, do something like pg_shard, or not implement built-in sharding
at all, but at least it is time to research this.
I am really looking forward to hearing the arguments of the authors of
pg_shard on the matter.
OK, I will send you a separate email and you can then supply their email
addresses.

FWIW, I would be interested in that as well; I worked in this area
for a couple of years.

OK, I will send you an email.
Thanks.
--
Michael
On Mon, Aug 31, 2015 at 11:48 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Sun, Aug 30, 2015 at 10:08:06PM -0400, Bruce Momjian wrote:
On Mon, Aug 31, 2015 at 09:53:57AM +0900, Michael Paquier wrote:
Well, I have had many such discussions with XC/XL folks, and that was my
opinion. I have seen almost no public discussion about this because the
idea had almost no chance of success. If it was possible, someone would
have already suggested it on this list.

Or perhaps people invested in this area had other obligations or lacked
motivation and/or time to push for things in core. That's not
possible to know, and what is done is done.

Well, I have talked to everyone privately about this, and concluded that
while horizontal scalability/sharding is useful, it is unlikely that the
code volume of something like XC or XL would be accepted into the
community, and frankly, now that we have FDWs, it is hard to imagine why
we would _not_ go in the FDW direction.

Actually, there was hope that XC or XL would get popular enough that it
would justify adding their code into community Postgres, but that never
happened.
Forks are doomed to die without proper maintenance resources. Still,
for XC/XL, what does not help is the complexity of the architecture
and SPOF management, particularly the GTM, which was
something completely new and not well understood (there is a GTM
standby, but this model is weak IMO, does not scale the way
regular standbys do, and impacts the overall performance of the
cluster).
--
Michael
On Sun, Aug 30, 2015 at 7:47 AM, Bruce Momjian <bruce@momjian.us> wrote:
I have recently increased my public statements about the idea of adding
horizontal scaling/sharding to Postgres. I wanted to share with hackers
a timeline of how we got here, and where I think we are going in the
short term:

2012-2013: As part of writing my scaling talk
(http://momjian.us/main/presentations/overview.html#scaling), studying
Oracle RAC, and talking to users, it became clear that an XC-like
architecture (sharding) was the only architecture that was going to allow
for write scaling.
I think a sharding-like architecture is quite useful for certain kinds of
workloads, where users can arrange their queries and data layout in an
optimized way, and which I hope users would agree to change if required.
One thing to consider here is what kind of scaling we are expecting from
such a system, and whether it is sufficient, considering we will stay
focused on this architecture for horizontal scalability.

Generally speaking, the scaling in such systems is limited by the number of
profitable partitions the user can create based on the data, and then
cross-partition transactions hurt the performance/scalability of such
systems. I understand that there is definitely a benefit in proceeding with
a sharding-like architecture, as there are already some PostgreSQL-based
forks which use such an architecture, so if we follow the same path we can
save some effort rather than inventing or following some other
architecture. However, there is no harm in discussing the pros and cons of
other architectures like Oracle RAC, Google F1, or others.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
At PGCon we agreed to have such a meeting in Vienna at least. But I think we
should be prepared and try to resolve all our issues before then. It looks
like we are already out of time, but perhaps we could meet in Hong Kong?

Honestly, I still don't know which approach is better. We already played
with XL (ported to 9.4) and identified some very strong issues with
inconsistency, which scared us, especially taking into account how easily we
found them. The XC people have fixed them, but I'm not sure whether they were
fundamental, and whether we could construct more sophisticated tests and find
more issues in XC/XL. We were also a bit disappointed by Huawei's position on
the CSN patch, which we hoped to use for our XTM. The FDW approach has been
actively criticized by the pg_shard people, and that also made me a bit
suspicious. It looks like we are doomed to continue several development
forks, so we decided to work on a very important common project, XTM, which
we hoped could be accepted by all parties and eventually committed to 9.6.
Now I see we were right, unfortunately.
A distributed transaction manager should support at least three things:
1. Atomic commit
2. Atomic visibility
3. Consistent snapshots (e.g. required for repeatable reads and higher
isolation levels).
I have submitted a patch implementing the first of these for FDWs. The patch
adds infrastructure to be used by all FDWs, including postgres_fdw. It also
adds postgres_fdw code to use this infrastructure. The same can be used to
achieve atomic commit in postgres_fdw-based sharding. Please see if XTM can
benefit from it. If there are things that are required by XTM, please post
the requirements on that thread and I will work on those. You can find the
latest patch at
/messages/by-id/CAFjFpRfANWL53+x2HdM9TCNe5pup=oPkQSSJ-KGfr-d2efj+CQ@mail.gmail.com
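For reference, at the SQL level the atomic-commit piece boils down to
two-phase commit driven by a coordinator. A minimal manual sketch against two
shard servers (the transaction identifiers are hypothetical, each server needs
max_prepared_transactions > 0, and a real implementation must also resolve
in-doubt prepared transactions after a crash):

```sql
-- On shard server A, in the coordinator-managed connection:
BEGIN;
INSERT INTO orders_shard VALUES (1, 'widget');
PREPARE TRANSACTION 'dist_txn_42_shard_a';   -- phase 1: prepare

-- On shard server B:
BEGIN;
INSERT INTO orders_shard VALUES (2, 'gadget');
PREPARE TRANSACTION 'dist_txn_42_shard_b';   -- phase 1: prepare

-- Only once every shard has prepared successfully, phase 2 on each server:
COMMIT PREPARED 'dist_txn_42_shard_a';
COMMIT PREPARED 'dist_txn_42_shard_b';
-- If any prepare failed, issue ROLLBACK PREPARED on the servers that
-- did prepare, so no shard commits.
```

Note this only gives atomic commit; atomic visibility and consistent
snapshots need the snapshot-manager work discussed above.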
Again, could we organize a meeting somewhere in September? The US is not good
for us, but other places should be OK. I want to have an agreement at
least on XTM. We are still testing various approaches, though. We could
present the results of our experiments and are open to discussion. It's not
an easy project, but it's something we could do for 9.6.

I'm very glad Bruce started this discussion on -hackers, since it's silly
to me to participate in both threads :) Let's meet in September!

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Mon, Aug 31, 2015 at 5:48 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Sun, Aug 30, 2015 at 10:08:06PM -0400, Bruce Momjian wrote:
On Mon, Aug 31, 2015 at 09:53:57AM +0900, Michael Paquier wrote:
Well, I have had many such discussions with XC/XL folks, and that
was my
opinion. I have seen almost no public discussion about this
because the
idea had almost no chance of success. If it was possible, someone
would
have already suggested it on this list.
Or perhaps people invested in this area had other obligations or lacked
motivation and/or time to push for things in core. That's not
possible to know, and what is done is done.
Well, I have talked to everyone privately about this, and concluded that
while horizontal scalability/sharding is useful, it is unlikely that the
code volume of something like XC or XL would be accepted into the
community, and frankly, now that we have FDWs, it is hard to imagine why
we would _not_ go in the FDW direction.

Actually, there was hope that XC or XL would get popular enough that it
would justify adding their code into community Postgres, but that never
happened.
AFAIK, XC/XL already has some customers, and that is additional pressure
on their development team, which is now called X2. I don't know exactly how
Huawei's internal MPPDB is connected to XC/XL.

We need a community test suite for clusters, and our company is working on
this. It's non-trivial work, but the community will never accept any cluster
solution without thorough testing of functionality and performance. Our
XC/XL experience was not good.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
We were also a bit disappointed by Huawei's position on the CSN patch, which
we hoped to use for our XTM.
Disappointed in what way? Moving to some sort of CSN approach seems to open
things up for different future ideas. In the short term, it would mean
replacing potentially large snapshots and longer visibility checks. In the
long term, perhaps CSN could help simplify the design of multi-master
replication schemes.
FDW approach has been actively criticized by pg_shard people and that's
also made me a bit suspicious. It looks like we are doomed to continue
several development forks, so we decided to work on very important common
project, XTM, which we hoped could be accepted by all parties and
eventually committed to 9.6. Now I see we were right, unfortunately.
I think the original XC project probably would have taken the FDW approach
as a basis if it had existed, with focus on push-down optimizations.
I assume that future work around PG sharding probably would be more likely
to be accepted with the FDW approach. One could perhaps work on pushing
down joins, aggregates and order by, then look at any optimizations gained
if code is moved outside of FDW. It would make sense if some kind of
generic optimization for foreign tables for SQL-based sources could be
leveraged across all databases, rather than having to re-implement for each
FDW.
There are different approaches and related features that may need to be
improved.
Do we want multiple copies of shards, like the pg_shard approach? Or keep
things simpler and leave it up to the DBA to add standbys?
Do we want to leverage table inheritance? If so, we may want to spend time
improving performance for when the number of shards becomes large with what
currently exists. If using table inheritance, we could add the ability to
specify what node (er, foreign server) the subtable lives on. We could
create top level sharding expressions that allow these to be implicitly
created.
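To illustrate the inheritance idea, here is a rough sketch of what a shard on
a foreign server can already look like with existing DDL (the server, host,
credentials, and table names are all hypothetical, and nothing here is
implicitly created today):

```sql
CREATE EXTENSION postgres_fdw;

-- One node of the cluster, i.e. a foreign server holding one shard.
CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'node1.example.com', dbname 'app');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
    OPTIONS (user 'app', password 'secret');

-- Parent table that queries are written against.
CREATE TABLE measurements (id bigint, created date, reading numeric);

-- One shard, placed on a specific foreign server, with a CHECK
-- constraint the planner can use for constraint exclusion.
CREATE FOREIGN TABLE measurements_2015 (
    CHECK (created >= DATE '2015-01-01' AND created < DATE '2016-01-01')
) INHERITS (measurements)
  SERVER shard1
  OPTIONS (table_name 'measurements_2015');
```

A top-level sharding expression would essentially be syntax that generates
the child tables, constraints, and server placements above automatically.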
Should we allow arbitrary expressions for shards, not just range, list and
hash?
Maybe the most community-acceptable approach would look something like
- Use FDWs, and continue to optimize push-down operations, also for
non-PostgreSQL databases.
- Use table inheritance for defining the shards. Ideally allow for
specifying that some shards may be replicated to other foreign servers (and
itself) (for pushing down joins with lookup/static tables; at this point it
should be decent for star schema based data warehouses).
- XTM/GTM hooks. Preferably we move to CSN for snapshots in core PostgreSQL
though.
Longer term, efficient internode joins would require a lot more work.
The devil is in the details. There are things that have to be addressed,
for example, if using global XIDs via GTM, not every transaction is on
every node, so we need to make sure that new clog pages get added
properly. There is also the potential to require a lot more code to be
added, like for cursor handling and stored functions. Perhaps some
limitations when using shards on foreign servers are acceptable if it is
desired to minimize code changes. The XC and XL code can help here.
Regards,
Mason
On Mon, Aug 31, 2015 at 02:48:31PM -0400, Mason S wrote:
I assume that future work around PG sharding probably would be more likely to
be accepted with the FDW approach. One could perhaps work on pushing down
joins, aggregates and order by, then look at any optimizations gained if code
is moved outside of FDW. It would make sense if some kind of generic
optimization for foreign tables for SQL-based sources could be leveraged across
all databases, rather than having to re-implement for each FDW.

There are different approaches and related features that may need to be
improved.

Do we want multiple copies of shards, like the pg_shard approach? Or keep
things simpler and leave it up to the DBA to add standbys?

I agree with all of the above.

Do we want to leverage table inheritance? If so, we may want to spend time
improving performance for when the number of shards becomes large with what
currently exists. If using table inheritance, we could add the ability to
specify what node (er, foreign server) the subtable lives on. We could create
top level sharding expressions that allow these to be implicitly created.

Should we allow arbitrary expressions for shards, not just range, list and
hash?

Maybe the most community-acceptable approach would look something like
I think everyone agrees that our current partitioning setup is just too
verbose and error-prone for users, and needs a simpler interface, and
one that can be better optimized internally. I assume FDW-based
sharding will benefit from that work as well.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Mon, Aug 31, 2015 at 9:48 PM, Mason S <masonlists@gmail.com> wrote:
We were also a bit disappointed by Huawei's position on the CSN patch, which
we hoped to use for our XTM.

Disappointed in what way? Moving to some sort of CSN approach seems to
open things up for different future ideas. In the short term, it would mean
replacing potentially large snapshots and longer visibility checks. In the
long term, perhaps CSN could help simplify the design of multi-master
replication schemes.
We are disappointed because at their PGCon talk Huawei announced the
publication of their CSN patch and further work in this direction together
with the community. However, it has not even been published yet, despite all
the promises. Nobody from Huawei answers the CSN thread on -hackers.

So, I think we got nothing from Huawei except teasing, and should rely only
on ourselves. That is disappointing.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Aug 31, 2015 at 2:12 AM, Oleg Bartunov <obartunov@gmail.com> wrote:
AFAIK, XC/XL has already some customers and that is an additional pressure
on their development team, which is now called X2. I don't exactly know how
internal Huawei's MPPDB is connected to XC/XL.
Huawei's MPPDB is based on PG-XC, tailored more toward OLAP scenarios.
The basic idea is that OLAP needs a shared-nothing scale-out
architecture for both reads and writes. It needs acceptable TP performance
and only a restricted set of functionality, and it thus avoids some problems
like the GTM being a central scaling bottleneck.

I advocate merging scale-out features into core PostgreSQL, if we are
ready to face some long-lived functional discrepancies between the two
deployments.
Regards,
Qingqing
All, Bruce:
First, let me put out there that I think the horizontal scaling project
which has buy-in from the community and we're working on is infinitely
better than the one we're not working on or is an underresourced fork.
So we're in agreement on that. However, I think there's a lot of room
for discussion; I feel like the FDW approach was decided in exclusive
meetings involving a very small number of people. The FDW approach
*may* be the right approach, but I'd like to see some rigorous
questioning of that before it's final.
Particularly, I'm concerned that we already have two projects in process
aimed at horizontal scalability, and it seems like we could bring either
(or both) projects to production quality MUCH faster than we could make
an FDW-based solution work. These are:
* pg_shard
* BDR
It seems worthwhile, just as a thought experiment, to ask whether we can get
where we want using those faster, or by combining those with new FDW features.
It's also important to recognize that there are three major use-cases
for write-scalable clustering:
* OLTP: small-medium cluster, absolute ACID consistency,
bottlenecked on small writes per second
* DW: small-large cluster, ACID optional,
bottlenecked on bulk reads/writes
* Web: medium to very large cluster, ACID optional,
bottlenecked on # of connections
We cannot possibly solve all of the above at once, but to the extent
that we recognize all 3 use cases, we can build core features which can
be adapted to all of them.
I'm also going to pontificate that, for a future solution, we should not
focus on write *IO*, but rather on CPU and RAM. The reason for this
thinking is that, with the latest improvements in hardware and 9.5
improvements, it's increasingly rare for machines to be bottlenecked on
writes to the transaction log (or the heap). This has some implications
for system design. For example, solutions which require all connections
to go through a single master node do not scale sufficiently to be worth
bothering with.
On some other questions from Mason:
Do we want multiple copies of shards, like the pg_shard approach? Or
keep things simpler and leave it up to the DBA to add standbys?
We want multiple copies of shards created by the sharding system itself.
Having a redundancy system separate from, and completely orthogonal to, the
sharding system is overly burdensome on the DBA and makes low-data-loss
HA impossible.
Do we want to leverage table inheritance? If so, we may want to spend
time improving performance for when the number of shards becomes large
with what currently exists. If using table inheritance, we could add the
ability to specify what node (er, foreign server) the subtable lives on.
We could create top level sharding expressions that allow these to be
implicitly created.
IMHO, given that we're looking at replacing inheritance because of its
many documented limitations, building sharding on top of inheritance
seems unwise. For example, many sharding systems are hash-based; how
would an inheritance system transparently use hash keys?
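For what it's worth, the closest today's inheritance machinery comes to hash
shards is expression-based CHECK constraints. A hypothetical sketch, using the
internal (undocumented, version-dependent) hashtext function:

```sql
-- Hypothetical: four hash shards of a customers table, expressed as
-- children whose CHECK constraints partition on a hash of the key.
-- (& 3 rather than % 4 so negative hash values still land in 0..3.)
CREATE TABLE customers (id bigint, name text);

CREATE TABLE customers_h0 (CHECK (hashtext(id::text) & 3 = 0)) INHERITS (customers);
CREATE TABLE customers_h1 (CHECK (hashtext(id::text) & 3 = 1)) INHERITS (customers);
CREATE TABLE customers_h2 (CHECK (hashtext(id::text) & 3 = 2)) INHERITS (customers);
CREATE TABLE customers_h3 (CHECK (hashtext(id::text) & 3 = 3)) INHERITS (customers);
```

Even then, constraint exclusion only kicks in when a query's WHERE clause
repeats the same hash expression against a constant, which is exactly the
transparency problem raised above.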
Should we allow arbitrary expressions for shards, not just range, list
and hash?
That seems like a 2.0 feature. It also doesn't seem necessary to
support it for the moderately skilled user; that is, requiring a special
C sharding function for this seems fine to me.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 08/31/2015 01:16 PM, Josh Berkus wrote:
All, Bruce:
I'm also going to pontificate that, for a future solution, we should not
focus on write *IO*, but rather on CPU and RAM. The reason for this
thinking is that, with the latest improvements in hardware and 9.5
improvements, it's increasingly rare for machines to be bottlenecked on
writes to the transaction log (or the heap). This has some implications
for system design. For example, solutions which require all connections
to go through a single master node do not scale sufficiently to be worth
bothering with.
We see this already, under very high concurrency (lots of connections,
many cores) we often see a significant drop in performance that is not
related to IO in any meaningful way.
JD
--
Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.
On Mon, Aug 31, 2015 at 4:16 PM, Josh Berkus <josh@agliodbs.com> wrote:
First, let me put out there that I think the horizontal scaling project
which has buy-in from the community and we're working on is infinitely
better than the one we're not working on or is an underresourced fork.
So we're in agreement on that. However, I think there's a lot of room
for discussion; I feel like the FDW approach was decided in exclusive
meetings involving a very small number of people. The FDW approach
*may* be the right approach, but I'd like to see some rigorous
questioning of that before it's final.
It seems to me that sharding consists of (1) breaking your data set up
into shards, (2) possibly replicating some of those shards onto
multiple machines, and then (3) being able to access the remote data
from local queries. As far as (1) is concerned, we need declarative
partitioning, which is being worked on by Amit Langote. As far as (2)
is concerned, I hope and expect BDR, or technology derived therefrom,
to eventually fill that need. As far as (3) is concerned, why
wouldn't we use the foreign data wrapper interface, and specifically
postgres_fdw? That interface was designed for the explicit purpose of
allowing access to remote data sources, and a lot of work has been put
into it, so it would be highly surprising if we decided to throw that
away and develop something completely new from the ground up.
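As a sketch of what (3) looks like today with postgres_fdw (server, host, and
table names are hypothetical):

```sql
CREATE EXTENSION postgres_fdw;

CREATE SERVER shard_b FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard-b.example.com', dbname 'app');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard_b
    OPTIONS (user 'app', password 'secret');

-- Make the remote table visible locally...
IMPORT FOREIGN SCHEMA public LIMIT TO (orders)
    FROM SERVER shard_b INTO public;

-- ...and local queries now read remote data, with simple quals
-- shipped to the remote side.
EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM orders WHERE region = 'emea';
-- The plan shows a Foreign Scan whose Remote SQL includes the
-- pushed-down "region = 'emea'" qual.
```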
It's true that postgres_fdw doesn't do everything we need yet. The
new join pushdown hooks aren't used by postgres_fdw yet, and the API
itself has some bugs with EvalPlanQual handling. Aggregate pushdown
is waiting on upper planner path-ification. DML pushdown doesn't
exist yet, and the hooks that would enable pushdown of ORDER BY
clauses to the remote side aren't being used by postgres_fdw. But all
of these things have been worked on. Patches for many of them have
already been posted. They have suffered from a certain amount of
neglect by senior hackers, and perhaps also from a shortage of time on
the part of the authors. But an awful lot of the work that is needed
here has already been done, if only we could get it committed.
Aggregate pushdown is a notable exception, but abandoning the foreign
data wrapper approach in favor of something else won't fix that.
Postgres-XC developed a purpose-built system for talking to other
nodes instead of using the FDW interface, for the very good reason
that the FDW interface did not yet exist at the time that Postgres-XC
was created. But several people associated with the XC project have
said, including one on this thread, that if it had existed, they
probably would have used it. And it's hard to see why you wouldn't:
with XC's approach, the remote data source is presumed to be
PostgreSQL (or Postgres-XC/XL/X2/whatever); and you can only use the
facility as part of a sharding solution. The FDW interface can talk
to anything, and it can be used for stuff other than sharding, like
making one remote table appear local because you just happen to want
that for some reason. This makes the XC approach look rather brittle
by comparison. I don't blame the XC folks for taking the shortest
path between two points, but FDWs are better, and we ought to try to
leverage that.
Particularly, I'm concerned that we already have two projects in process
aimed at horizontal scalability, and it seems like we could bring either
(or both) projects to production quality MUCH faster than we could make
an FDW-based solution work. These are:

* pg_shard
* BDR

It seems worthwhile, just as a thought experiment, to ask whether we can get
where we want using those faster, or by combining those with new FDW features.
I think it's abundantly clear that we need a logical replication
solution as part of any horizontal scalability story. People will
want to do things like have 10 machines with each piece of data on 3
of them, and there won't be any reasonable way of doing that without
logical replication. I assume that BDR, or some technology derived
from it, will end up in core and solve that problem. I had actually
hoped we were going to get that in 9.5, but it didn't happen that way.
Still, I think that getting first single-master, and then eventually
multi-master, logical replication in core is absolutely critical. And
not just for sharding specifically: replicating your whole database to
several nodes and load-balancing your clients across them isn't
sharding, but it does give you read scalability and is a good fit for
people with geographically dispersed data with good geographical
locality. I think a lot of people will want that.
I'm not quite sure yet how we can marry declarative partitioning and
better FDW-pushdown and logical replication into one seamless, easy to
deploy solution that requires very low administrator effort. But I am
sure that each of those things, taken individually, is very useful,
and that being able to construct a solution from those building blocks
would be a big improvement over what we have today. I can't imagine
that trying to do one monolithic project that provides all of those
things, but only if you combine them in the specific way that the
designer had in mind, is ever going to be successful. People _will_
want access to each of those features in an unbundled fashion. And,
trying to do them altogether leads to trying to solve too many
problems at once. I think the history of Postgres-XC is a cautionary
tale there.
I don't really understand how pg_shard fits into this equation. It
looks to me like it does some interesting things but, for example, it
doesn't support JOIN pushdown, and suggests that you use the
proprietary CitusDB engine if you need that. But I think JOIN
pushdown is something we want to have in core, not something where we
want to point people to proprietary alternatives. And it has some
restrictions on INSERT statements - they have to contain only values
which are constants or which can be folded to constants. I'm just
guessing, but I bet that's probably due to some limitation which
pg_shard, being out of core, has difficulty overcoming, but we can do
better in core. Basically I guess I expect much of what pg_shard does
to be subsumed as we improve FDWs, but maybe not all of it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Bruce,
Sumedh from Citus Data here.
August, 2015: While speaking at SFPUG, Citus Data approached me about joining
the FDW sharding team. They have been invited to the September 1 meeting,
as have the XC and XL people.
I'd like to add a clarification. We already tried the FDW APIs for pg_shard
two years ago and failed. We figured sharing our learnings could contribute
to the technical discussion and that's why we wanted to be in the call.
Ozgun summarized our technical learnings in this design document:
https://goo.gl/vJWF85
In the document, we focused on one of the four learnings we had with FDW
APIs. For us, we switched to the hook API based approach, and things went
smoothly from there.
Best,
Sumedh
On Sat, Aug 29, 2015 at 7:17 PM, Bruce Momjian <bruce@momjian.us> wrote:
I have recently increased my public statements about the idea of adding
horizontal scaling/sharding to Postgres. I wanted to share with hackers
a timeline of how we got here, and where I think we are going in the
short term:2012-2013: As part of writing my scaling talk
(http://momjian.us/main/presentations/overview.html#scaling), studying
Oracle RAC, and talking to users, it became clear that an XC-like
architecture (sharding) was the only architecture that was going to allow
for write scaling.Users and conference attendees I talked to were increasingly concerned
about the ability of Postgres to scale for high write volumes. They didn't
necessarily need that scale now, but they needed to know they could get
it if they wanted it, and wouldn't need to switch to a new database in
the future. This is similar to wanting a car that can get you on a highway
on-ramp fast --- even if you don't need it, you want to know it is there.2014: I started to shop around the idea that we could use FDWs,
parallelism, and a transaction/snapshot manager to get XC features
as built-in to Postgres. (I don't remember where the original idea
came from.) It was clear that having separate forks of the source code
in XC and XL was never going to achieve critical mass --- there just
aren't enough people who need high right scale right now, and the fork
maintenance overhead is a huge burden.I realized that we would never get community acceptance to dump the XC
(or XL) code needed for sharding into community Postgres, but with FDWs,
we could add the features as _part_ of improving FDWs, which would benefit
FDWs _and_ would be useful for sharding. (We already see some of those
FDW features in 9.5.)October, 2014: EDB and NTT started working together in the community
to start improving FDWs as a basis for an FDW-based sharding solution.
Many of the 9.5 FDW improvements that also benefit sharding were developed
by a combined EDB/NTT team. The features improved FDWs independent of
sharding, so they didn't need community buy-in on sharding to get them
accepted.June, 2015: I attended the PGCon sharding unconference session and
there was a huge discussion about where we should go with sharding.
I think the big take-away was that most people liked the FDW approach,
but had business/customer reasons for wanting to work on XC or XL because
those would be production-ready faster.

July, 2015: Oleg Bartunov and his new company Postgres Professional (PP)
started to think about joining the FDW approach, rather than working on
XL, as they had stated at PGCon in June. A joint NTT/EDB/PP phone-in
meeting is scheduled for September 1.

August, 2015: While speaking at SFPUG, Citus Data approached me about
joining the FDW sharding team. They have been invited to the September
1 meeting, as have the XC and XL people.

October, 2015: EDB is sponsoring a free 3-hour summit about FDW sharding
at the PG-EU conference in Vienna. Everyone is invited, but it is hoped
most of the September 1 folks can attend.

February, 2016: Oleg is planning a similar meeting at their February
Moscow conference.

Anyway, I wanted to explain the work that has been happening around
sharding. As things move forward, I am increasingly convinced that write
scaling will be needed soon, that the XC approach is the only reasonable
way to do it, and that FDWs are the cleanest way to get it into community
Postgres.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
--
*Sumedh Pathak*
Citus Data
650.422.9797
sumedh@citusdata.com
On 08/31/2015 02:47 PM, Robert Haas wrote:
On Mon, Aug 31, 2015 at 4:16 PM, Josh Berkus <josh@agliodbs.com> wrote:
First, let me put out there that I think the horizontal scaling project
which has buy-in from the community and we're working on is infinitely
better than the one we're not working on or is an underresourced fork.
So we're in agreement on that. However, I think there's a lot of room
for discussion; I feel like the FDW approach was decided in exclusive
meetings involving a very small number of people. The FDW approach
*may* be the right approach, but I'd like to see some rigorous
questioning of that before it's final.

It seems to me that sharding consists of (1) breaking your data set up
into shards, (2) possibly replicating some of those shards onto
multiple machines, and then (3) being able to access the remote data
from local queries. As far as (1) is concerned, we need declarative
partitioning, which is being worked on by Amit Langote. As far as (2)
is concerned, I hope and expect BDR, or technology derived therefrom,
to eventually fill that need.
Well, maybe. If you look at pg_shard, you'll see that it works by
multiplexing writes to all copies. There's a good reason to do that; it
allows you to have a tight feedback loop between the success of writes
and the availability of "good" nodes. If you're depending on a separate
replication system to handle getting row copies from one shard to
another, then you need a different way to deal with bad nodes and with
inconsistency between copies of shards. That's why the existing
multinode non-relational databases don't separate replication from
writes, either.
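The write-multiplexing pattern described above can be sketched in a few
lines. To be clear, this is a toy illustration of the tight feedback loop
between write success and node health, not pg_shard's actual code; the
class and names are hypothetical:

```python
# Toy sketch of write multiplexing with health feedback: each write goes
# to every healthy replica of a shard, and a failed write immediately
# demotes that replica, so node state is driven by write outcomes rather
# than by a separate replication stream.

class ShardWriter:
    def __init__(self, replicas):
        # replicas: list of (node_name, send_fn); send_fn raises on failure
        self.replicas = {name: fn for name, fn in replicas}
        self.healthy = set(self.replicas)

    def write(self, row):
        acked = 0
        for name in list(self.healthy):
            try:
                self.replicas[name](row)
                acked += 1
            except Exception:
                # tight feedback loop: a failed write marks the node bad
                self.healthy.discard(name)
        if acked == 0:
            raise RuntimeError("no healthy replica accepted the write")
        return acked
```

A separate asynchronous replication system, by contrast, would have to
detect divergence between shard copies after the fact, which is the
difficulty raised above.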
For that matter, if what you want is transactional fully ACID sharding,
I really don't see a way to do it via BDR, since BDR is purely
asynchronous replication, as far as I know.
Also, if we want BDR to do this, that's pretty far afield of what BDR is
currently capable of, so someone will need to put serious work into it
rather than just assuming functionality will show up.
As far as (3) is concerned, why
wouldn't we use the foreign data wrapper interface, and specifically
postgres_fdw? That interface was designed for the explicit purpose of
allowing access to remote data sources, and a lot of work has been put
into it, so it would be highly surprising if we decided to throw that
away and develop something completely new from the ground up.
Well, query hooks are also a capability which we already have, and one
that is mature. Citus has already posted about why they chose to use
them instead.
As long as you recognize that the FDW API (not just the existing fdws)
will need to expand to make this work, it's a viable path.
Also consider that (3) includes both reads and writes.
I think it's abundantly clear that we need a logical replication
solution as part of any horizontal scalability story. People will
want to do things like have 10 machines with each piece of data on 3
of them, and there won't be any reasonable way of doing that without
logical replication. I assume that BDR, or some technology derived
from it, will end up in core and solve that problem. I had actually
hoped we were going to get that in 9.5, but it didn't happen that way.
Still, I think that getting first single-master, and then eventually
multi-master, logical replication in core is absolutely critical. And
not just for sharding specifically: replicating your whole database to
several nodes and load-balancing your clients across them isn't
sharding, but it does give you read scalability and is a good fit for
people with geographically dispersed data with good geographical
locality. I think a lot of people will want that.
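The "10 machines with each piece of data on 3 of them" layout above can
be sketched with a simple round-robin placement. This is purely
illustrative (the function name and scheme are hypothetical, not how any
existing system assigns shards):

```python
# Sketch of shard placement: spread S shards across N nodes with
# replication factor R, round-robin, so load stays even and every shard
# has R distinct copies.

def placement(shards, nodes, rf):
    """Return {shard: [node, ...]} with rf distinct nodes per shard."""
    assert rf <= nodes
    return {s: [(s + r) % nodes for r in range(rf)] for s in range(shards)}
```

With placement(20, 10, 3), every shard lands on 3 distinct nodes and each
node carries 6 shard copies; getting the rows to those copies is exactly
the logical-replication gap discussed above.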
Well, the latter thing is something which BDR is designed for, so all
that needs to happen with that is getting the rest of the plumbing into
core. Also documentation, packaging, productization, etc. But the
heavy lifting has already been done.
However, integrating BDR with sharding has some major design issues
which aren't trivial and may be unresolvable, per above.
I'm not quite sure yet how we can marry declarative partitioning and
better FDW-pushdown and logical replication into one seamless, easy to
deploy solution that requires very low administrator effort. But I am
sure that each of those things, taken individually, is very useful,
and that being able to construct a solution from those building blocks
would be a big improvement over what we have today. I can't imagine
that trying to do one monolithic project that provides all of those
things, but only if you combine them in the specific way that the
designer had in mind, is ever going to be successful. People _will_
want access to each of those features in an unbundled fashion. And,
trying to do them all together leads to trying to solve too many
problems at once. I think the history of Postgres-XC is a cautionary
tale there.
Yes. It's also a cautionary tale about not skipping over major design
elements (like HA and DR) until after version 1.0, which is one of the
reasons I'm harping on certain things here. I don't want us to repeat
those mistakes.
I don't really understand how pg_shard fits into this equation. It
looks to me like it does some interesting things but, for example, it
doesn't support JOIN pushdown, and suggests that you use the
proprietary CitusDB engine if you need that. But I think JOIN
pushdown is something we want to have in core, not something where we
want to point people to proprietary alternatives. And it has some
restrictions on INSERT statements - they have to contain only values
which are constants or which can be folded to constants. I'm just
guessing, but I bet that's probably due to some limitation which
pg_shard, being out of core, has difficulty overcoming, but we can do
better in core. Basically I guess I expect much of what pg_shard does
to be subsumed as we improve FDWs, but maybe not all of it.
pg_shard provides an alternate implementation based on planner hooks
instead of FDWs. Even if you pursue an FDW-based design, you should
look at (a) why the Citus team found FDWs to be unworkable and (b) what
got implemented in planner hooks. Otherwise you're liable to repeat the
exact same "learning experience".
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Mon, 2015-08-31 at 22:21 +0000, Robert Haas wrote:
It seems to me that sharding consists of (1) breaking your data set up
into shards, (2) possibly replicating some of those shards onto
multiple machines, and then (3) being able to access the remote data
from local queries. [...]
I believe there is another aspect to sharding that I have not yet seen
mentioned, which is one of connection routing.
One use case that I have been involved in is to simply partition the
application into entirely, or almost entirely, separate datasets running
on separate databases with little or no need for queries to access
remote data.
This allows each database to deal only with connections from clients
that actually want its local data, greatly reducing the number of
connections on any individual database. If this works for your
application, your ability to scale is great.
The pain point comes from trying to route queries to the correct
database. Inevitably, everyone taking this route builds custom
connection-selection layers into their apps.
It seems to me that even with the more sophisticated types of sharding
being discussed here, the ability to conditionally route a
query/connection to a suitable starting database could be quite
beneficial.
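The custom connection-selection layer described above usually boils down
to something like the following sketch. The DSNs and function are
placeholders; a real deployment would put this in pgbouncer/pgpool or a
client library rather than every application:

```python
# Hypothetical connection-routing layer: the application supplies a
# shard key, and the router deterministically picks the database that
# owns that key, so each backend only sees connections for its own data.

import hashlib

def route(shard_key, dsns):
    """Map a shard key to one DSN deterministically."""
    # Stable hash (md5 of the key text) so routing survives process
    # restarts, unlike Python's randomized built-in hash().
    h = int(hashlib.md5(str(shard_key).encode()).hexdigest(), 16)
    return dsns[h % len(dsns)]

dsns = ["host=db0", "host=db1", "host=db2"]  # placeholder DSNs
```

The pain point is that every application reinvents this; a shared routing
layer in the pooler would remove that duplication.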
Although this is probably a job for the pgbouncer/pgpool developers
rather than the hackers on this list, this thread seems to be a good
place to mention it.
__
Marc
On Mon, Aug 31, 2015 at 05:10:11PM -0700, Josh Berkus wrote:
As far as (3) is concerned, why
wouldn't we use the foreign data wrapper interface, and specifically
postgres_fdw? That interface was designed for the explicit purpose of
allowing access to remote data sources, and a lot of work has been put
into it, so it would be highly surprising if we decided to throw that
away and develop something completely new from the ground up.

Well, query hooks are also a capability which we already have, and one
that is mature. Citus has already posted about why they chose to use
them instead.

As long as you recognize that the FDW API (not just the existing fdws)
will need to expand to make this work, it's a viable path.
Uh, we already have a list of things we need to add to FDWs to make them
work, and Citus Data has provided a document of more things that are
needed, https://goo.gl/vJWF85. I am not sure how much bigger a red flag
you want as confirmation that everyone agrees that major FDW
improvements are a requirement for this.
My hope is that many FDW improvements will benefit sharding and
non-sharding workloads, but I bet some improvements are going to be
sharding-specific. I would say we are still in the exploratory stage,
but based on the number of people who care about this feature and want
to be involved, I think we are off to a very good start. :-)
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
Bruce Momjian wrote:
My hope is that many FDW improvements will benefit sharding and
non-sharding workloads, but I bet some improvements are going to be
sharding-specific. I would say we are still in the exploratory stage,
but based on the number of people who care about this feature and want
to be involved, I think we are off to a very good start. :-)
Having lots of interested people doesn't help with some problems,
though. The Citus document says:
And the issue with these four limitations wasn't with foreign
data wrappers. We wrote mongo_fdw and cstore_fdw, and we're
quite happy with the contract FDWs provide. The problem was that
we were trying to retrofit an API for something that it was
fundamentally not designed to do.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015/09/01 9:54, Bruce Momjian wrote:
On Mon, Aug 31, 2015 at 05:10:11PM -0700, Josh Berkus wrote:
As far as (3) is concerned, why
wouldn't we use the foreign data wrapper interface, and specifically
postgres_fdw? That interface was designed for the explicit purpose of
allowing access to remote data sources, and a lot of work has been put
into it, so it would be highly surprising if we decided to throw that
away and develop something completely new from the ground up.

Well, query hooks are also a capability which we already have, and one
that is mature. Citus has already posted about why they chose to use
them instead.

As long as you recognize that the FDW API (not just the existing fdws)
will need to expand to make this work, it's a viable path.

Uh, we already have a list of things we need to add to FDWs to make them
work, and Citus Data has provided a document of more things that are
needed, https://goo.gl/vJWF85.
My top priority is postgres_fdw join pushdown, but I also plan to work
on update pushdown [1] for 9.6, which couldn't make it into 9.5. I
think that would resolve the following issue mentioned in the document:
UPDATE and DELETE operations are performed by first fetching
records from the table scanning functions, and then going over
the fetched records. If the user wanted to update a single row,
this involved first pulling rows and then updating related
records.
Best regards,
Etsuro Fujita
[1]: https://commitfest.postgresql.org/4/162/
On Tue, Sep 1, 2015 at 3:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
It seems to me that sharding consists of (1) breaking your data set up
into shards, (2) possibly replicating some of those shards onto
multiple machines, and then (3) being able to access the remote data
from local queries. As far as (1) is concerned, we need declarative
partitioning, which is being worked on by Amit Langote. As far as (2)
is concerned, I hope and expect BDR, or technology derived therefrom,
to eventually fill that need. As far as (3) is concerned, why
wouldn't we use the foreign data wrapper interface, and specifically
postgres_fdw? That interface was designed for the explicit purpose of
allowing access to remote data sources, and a lot of work has been put
into it, so it would be highly surprising if we decided to throw that
away and develop something completely new from the ground up.

It's true that postgres_fdw doesn't do everything we need yet. The
new join pushdown hooks aren't used by postgres_fdw yet, and the API
itself has some bugs with EvalPlanQual handling. Aggregate pushdown
is waiting on upper planner path-ification. DML pushdown doesn't
exist yet, and the hooks that would enable pushdown of ORDER BY
clauses to the remote side aren't being used by postgres_fdw. But all
of these things have been worked on. Patches for many of them have
already been posted. They have suffered from a certain amount of
neglect by senior hackers, and perhaps also from a shortage of time on
the part of the authors. But an awful lot of the work that is needed
here has already been done, if only we could get it committed.
Aggregate pushdown is a notable exception, but abandoning the foreign
data wrapper approach in favor of something else won't fix that.

Postgres-XC developed a purpose-built system for talking to other
nodes instead of using the FDW interface, for the very good reason
that the FDW interface did not yet exist at the time that Postgres-XC
was created. But several people associated with the XC project have
said, including one on this thread, that if it had existed, they
probably would have used it. And it's hard to see why you wouldn't:
with XC's approach, the remote data source is presumed to be
PostgreSQL (or Postgres-XC/XL/X2/whatever); and you can only use the
facility as part of a sharding solution. The FDW interface can talk
to anything, and it can be used for stuff other than sharding, like
making one remote table appear local because you just happen to want
that for some reason. This makes the XC approach look rather brittle
by comparison. I don't blame the XC folks for taking the shortest
path between two points, but FDWs are better, and we ought to try to
leverage that.
In my discussions on this topic with various folks including Robert, I've
conceded that if FDWs had been available when XC was first written, in
all likelihood we would have used and extended that interface. But that
wasn't
the case and we did what we thought was the best solution at that time,
given the resources and the schedule. To be honest, when XC project was
started, I was quite skeptical about the whole thing given the goal was to
built something which can replace Oracle RAC with may be less than 1%
resources of what Oracle must have invested in building RAC. The lack of
resources at the start of the project keeps showing up in the quality
issues that users report from time to time. Having said that, I am quite
satisfied with what we have been able to build within the constraints.
But FDW is just one part of the story. There is this entire global
consistency problem that would require something like GTM to give out XIDs
and snapshots, atomicity which would require managing transactions across
multiple shards, join pushdowns when all data is not available locally,
something that XL is attempting to solve with datanode-datanode exchange of
information, other global states such as sequences, replicating some part
of the data to multiple shards for efficient operations, ability to
add/remove shards with least disruption, globally consistent
backups/restore. XC/XL has attempted to solve each of them to some extent.
I don't claim that they are completely solved and there are no corner cases
left, but we have made fairly good progress on each of them.
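The GTM role mentioned above, handing out XIDs and snapshots so every
shard applies the same visibility rules, can be modeled in miniature.
This is a toy with illustrative names, not the actual XC/XL GTM API:

```python
# Toy global transaction manager: a single authority assigns transaction
# IDs and snapshots (xmax plus the set of still-running XIDs), so all
# shards agree on which transactions a given snapshot can see.

class ToyGTM:
    def __init__(self):
        self.next_xid = 1
        self.running = set()

    def begin(self):
        xid = self.next_xid
        self.next_xid += 1
        self.running.add(xid)
        return xid

    def commit(self, xid):
        self.running.discard(xid)

    def snapshot(self):
        # A transaction is visible if it started before this snapshot
        # (xid < xmax) and had committed by snapshot time.
        return {"xmax": self.next_xid, "running": frozenset(self.running)}

def visible(xid, snap):
    return xid < snap["xmax"] and xid not in snap["running"]
```

The hard parts Pavan lists (atomic commit across shards, pushdown,
node membership changes, consistent backups) all layer on top of this
basic agreement about visibility.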
My worry is that if we start implementing them again from scratch, it will
take a few years before we get them in a usable state. What XC/XL lacked is
probably a Robert Haas or a Tom Lane who could look at the work and suggest
major edits. If that had happened, the quality of the product could have
been much better today. I don't mean to denigrate the developers who worked on
XC/XL, but there is no harm in accepting that if someone with a much better
understanding of the whole system was part of the team, that would have
positively impacted the project. Is that an angle worth exploring? Does it
make sense to commit some more resources to say XC or XL and try to improve
the quality of the product even further? To be honest, XL is in far far
better shape (haven't really tried XC in a while) and some more
QA/polishing can make it production ready much sooner.
Yet another possibility is to rework the design such that only the
coordinator needs to be a fork of PostgreSQL but the shards are all
PostgreSQL
instances, queried using standard client APIs. That would reduce the code
that needs to go in the core to build the entire scalable system and also
shorten the timeline considerably.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote:
Uh, we already have a list of things we need to add to FDWs to make them
work, and Citus Data has provided a document of more things that are
needed, https://goo.gl/vJWF85. I am not sure how much bigger a red flag
you want to confirm that everyone agrees that major FDW improvements are
a requirement for this.
Several people saying that the FDW infrastructure isn't sufficient right
now is pretty far from implying that all of them agree that the FDW API
is the way to go.
I'm not sure myself. If it works out it's going to save us some work and
make it more realistic to get there sometime not too far off. But I'm
afraid that the resulting system will feel like our current partitioning
implementation. Yes, it kinda works, but it's hard to get started, it
doesn't support too many features, and you're kind of afraid your
relatives will see what you've done.
On Mon, Aug 31, 2015 at 01:16:21PM -0700, Josh Berkus wrote:
I'm also going to pontificate that, for a future solution, we should not
focus on write *IO*, but rather on CPU and RAM. The reason for this
thinking is that, with the latest improvements in hardware and 9.5
improvements, it's increasingly rare for machines to be bottlenecked on
writes to the transaction log (or the heap). This has some implications
for system design. For example, solutions which require all connections
to go through a single master node do not scale sufficiently to be worth
bothering with.
Well, I highlighted write IO for sharding because sharding is the only
solution that allows write scaling. If we want to scale CPU, we are
better off using server parallelism, and to scale CPU and RAM, a
multi-master/BDR solution seems best. (Multi-master doesn't do write
scaling because you eventually have to write all the data to each node.)
For read-scaling, having the read queries go to a streaming slave seems
best.
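The multi-master point above reduces to simple arithmetic, sketched here
with made-up numbers purely for illustration:

```python
# Back-of-envelope version of why multi-master doesn't write-scale:
# with N nodes and W total writes, multi-master replication eventually
# applies all W writes on every node, while sharding spreads them at
# roughly W/N per node (ignoring shard replication, which multiplies
# the sharded figure).

def writes_per_node(total_writes, nodes, sharded):
    return total_writes // nodes if sharded else total_writes

W, N = 90_000, 3
per_node_sharded = writes_per_node(W, N, sharded=True)       # 30000
per_node_multimaster = writes_per_node(W, N, sharded=False)  # 90000
```

Adding nodes to a multi-master cluster leaves per-node write volume
unchanged, which is why sharding is the only route to write scaling.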
Only sharding allows for write scaling, but only in very limited
use-cases, where the queries use multiple shards and the
sharding/communication overhead doesn't overwhelm the benefits.
For these reasons, I think sharding has a limited use, and hence, I
don't think the community will be willing to add a lot of code just to
enable auto-sharding. I think it has to be done in a way that adding
sharding also gives other benefits, like better FDWs and cross-node ACID
control.
In summary, I don't think adding a ton of code just to do sharding will
be acceptable. A corollary of that is that if FDWs are unable to
provide useful sharding, I don't see an acceptable way of adding
built-in sharding to Postgres.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Tue, Sep 1, 2015 at 10:15:27AM +0200, Andres Freund wrote:
On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote:
Uh, we already have a list of things we need to add to FDWs to make them
work, and Citus Data has provided a document of more things that are
needed, https://goo.gl/vJWF85. I am not sure how much bigger a red flag
you want as confirmation that everyone agrees that major FDW
improvements are a requirement for this.

Several people saying that the FDW infrastructure isn't sufficient right
now is pretty far from implying that all of them agree that the FDW API
is the way to go.

I'm not sure myself. If it works out it's going to save us some work and
make it more realistic to get there sometime not too far off. But I'm
afraid that the resulting system will feel like our current partitioning
implementation. Yes, it kinda works, but it's hard to get started, it
doesn't support too many features, and you're kind of afraid your
relatives will see what you've done.
Well, reworking our partitioning system is one of the things required
for sharding, so at least we will clean up one mess while we create
another. ;-)
See my post to Josh Berkus just now --- I think that if we don't use
FDWs, sharding is such a limited use-case that we will not implement it
inside of Postgres.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Tue, Sep 1, 2015 at 09:30:41AM +0530, Pavan Deolasee wrote:
My worry is that if we start implementing them again from scratch, it will take
a few years before we get them in a usable state. What XC/XL lacked is probably
a Robert Haas or a Tom Lane who could look at the work and suggest major edits.
If that had happened, the quality of the product could have been much better
today. I don't mean to denigrate the developers who worked on XC/XL, but there is
no harm in accepting that if someone with a much better understanding of the
whole system was part of the team, that would have positively impacted the
project. Is that an angle worth exploring? Does it make sense to commit some
more resources to say XC or XL and try to improve the quality of the product
even further? To be honest, XL is in far far better shape (haven't really tried
XC in a while) and some more QA/polishing can make it production ready much
sooner.
There is no question that using XC/XL will get us to a usable solution
faster, but see my recent post to Josh Berkus --- the additional code
will be so burdensome that I doubt it would be accepted. If it was, I
bet we would have considered it long ago.
I think the only way we are going to get sharding into Postgres is to do
it in a way that enhances existing Postgres capabilities.
(I think we got a little too cute in enhancing existing Postgres
capabilities to add partitioning, and I think we need to fix that.
Hopefully we don't do the same thing with sharding.)
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 01/09/15 21:41, Bruce Momjian wrote:
Well, reworking our partitioning system is one of the things required
for sharding, so at least we will clean up one mess while we create
another. ;-)

See my post to Josh Berkus just now --- I think that if we don't use
FDWs, sharding is such a limited use-case that we will not implement it
inside of Postgres.
I'm thinking that partitioning and sharding are two different things:
Partitioning is about reducing the amount of table data accessed and
also perhaps easing admin activities (archiving/removing old stuff etc).
Sharding is about parallelism and redundancy...copies of stuff in
different places and concurrent access by virtue of it being on
different nodes!
Now *maybe* FDW is a good way to approach this, but it really would be
nice to see a more rigorous analysis (I note that, like XC and XL,
Greenplum looked at the existing mechanisms around at the time and ended
up writing their own). Now I'm aware that things have moved on --- but I
think distributed data/sharding etc should provide *before* grabbing
hold of FDW as *the answer*!
Regards
Mark
On Mon, Aug 31, 2015 at 3:08 PM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:
On Mon, Aug 31, 2015 at 9:48 PM, Mason S <masonlists@gmail.com> wrote:
We are also a bit disappointed by Huawei's position on the CSN patch,
which we hoped to use for our XTM.

Disappointed in what way? Moving to some sort of CSN approach seems to
open things up for different future ideas. In the short term, it would mean
replacing potentially large snapshots and longer visibility checks. In the
long term, perhaps CSN could help simplify the design of multi-master
replication schemes.

We are disappointed because at their PGCon talk Huawei announced the
publishing of their CSN patch and further work in this direction
together with the community. However, it's not even published yet,
despite all the promises. Nobody from Huawei answers the CSN thread on
hackers.
So, I think we got nothing from Huawei except teasing and should rely only
on ourselves. That is disappointing.
Oh, I see. I was in contact with some of the folks involved a few weeks
ago. The intention is to do a bit more work on the patch and then post it.
This should be happening "soon".
Thanks,
Mason
On Mon, Aug 31, 2015 at 11:23:58PM -0300, Alvaro Herrera wrote:
Bruce Momjian wrote:
My hope is that many FDW improvements will benefit sharding and
non-sharding workloads, but I bet some improvements are going to be
sharding-specific. I would say we are still in the exploratory stage,
but based on the number of people who care about this feature and want
to be involved, I think we are off to a very good start. :-)

Having lots of interested people doesn't help with some problems,
though. The Citus document says:

And the issue with these four limitations wasn't with foreign
data wrappers. We wrote mongo_fdw and cstore_fdw, and we're
quite happy with the contract FDWs provide. The problem was that
we were trying to retrofit an API for something that it was
fundamentally not designed to do.
I had a chance to review the Citus Data document just now.
In particular, it links to this document, which is clearer about the
issues they are trying to solve:
https://www.citusdata.com/blog/114-how-to-build-your-distributed-database
The document opens a big question --- when queries can't be processed in
a traditional top/down fashion, Citus has the goal of sending groups of
results up to the coordinator, reordering them, then sending them back
to the shards for further processing, basically using the shards as
compute engines because the shards are no longer using local data to do
their computations. The two examples they give are COUNT(DISTINCT) and
a join across two sharded tables ("CANADA").
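The COUNT(DISTINCT) case above shows concretely why simple pushdown
fails. A toy illustration with made-up shard contents:

```python
# Why COUNT(DISTINCT) resists naive pushdown: summing per-shard distinct
# counts over-counts any value present on more than one shard, so the
# coordinator needs the distinct value *sets* themselves (or a
# repartitioned stream), not just per-shard counts.

shard_a = ["us", "ca", "de"]   # distinct countries seen on shard A
shard_b = ["ca", "de", "fr"]   # distinct countries seen on shard B

naive = len(set(shard_a)) + len(set(shard_b))   # 6: double-counts ca, de
correct = len(set(shard_a) | set(shard_b))      # 4: union of value sets
```

Shipping those value sets to the coordinator is the "as digested as
possible" path; repartitioning them back across the shards is the "use
shards as compute engines" path Citus describes.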
I assumed these queries were going to be solved by sending data in as
digested a form as possible to the coordinator, and having the
coordinator complete
any remaining processing. I think we are going to need to decide if
such "sending data back to shards" is something we are ever going to
implement. I can see FDWs _not_ working well for that use-case.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Tue, Sep 1, 2015 at 6:55 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Aug 31, 2015 at 11:23:58PM -0300, Alvaro Herrera wrote:
Bruce Momjian wrote:
My hope is that many FDW improvements will benefit sharding and
non-sharding workloads, but I bet some improvements are going to be
sharding-specific. I would say we are still in the exploratory stage,
but based on the number of people who care about this feature and want
to be involved, I think we are off to a very good start. :-)

Having lots of interested people doesn't help with some problems,
though. The Citus document says:

And the issue with these four limitations wasn't with foreign
data wrappers. We wrote mongo_fdw and cstore_fdw, and we're
quite happy with the contract FDWs provide. The problem was that
we were trying to retrofit an API for something that it was
fundamentally not designed to do.

I had a chance to review the Citus Data document just now.
In particular, it links to this document, which is clearer about the
issues they are trying to solve:

https://www.citusdata.com/blog/114-how-to-build-your-distributed-database
The document opens a big question --- when queries can't be processed in
a traditional top/down fashion, Citus has the goal of sending groups of
results up to the coordinator, reordering them, then sending them back
to the shards for further processing, basically using the shards as
compute engines because the shards are no longer using local data to do
their computations. The two examples they give are COUNT(DISTINCT) and
a join across two sharded tables ("CANADA").

I assumed these queries were going to be solved by sending data to the
coordinator in as digested a form as possible, and having the coordinator complete
any remaining processing. I think we are going to need to decide if
such "sending data back to shards" is something we are ever going to
implement. I can see FDWs _not_ working well for that use-case.
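The COUNT(DISTINCT) case is a good concrete example of why per-shard partial results are not always enough. A minimal Python sketch (purely illustrative, not PostgreSQL code):

```python
# Hypothetical illustration: per-shard partial results for COUNT(DISTINCT).
# Summing per-shard distinct counts overcounts values that appear on
# several shards; the coordinator must merge the distinct sets instead.

shards = [
    [1, 2, 2, 3],      # shard A
    [3, 3, 4],         # shard B
    [4, 5],            # shard C
]

# Naive pushdown: each shard reports its local distinct count.
naive = sum(len(set(rows)) for rows in shards)               # 3 + 2 + 2 = 7

# Correct answer: merge the distinct *values*, then count once.
correct = len(set().union(*[set(rows) for rows in shards]))  # {1..5} -> 5

print(naive, correct)
```

Summing the per-shard counts gives 7, but the true distinct count is 5, because values 3 and 4 each appear on two shards; the coordinator needs the values themselves (or a mergeable structure), not just per-shard totals.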
For efficient inter-node joins with row shipping, FDWs may also not be easy
to do. Maybe it is possible if we optionally pass in lists of other nodes
and information about how they are partitioned, so each node knows where
data should be shipped.
A challenge for planning with arbitrary copies of different shards is that
sometimes you may be able to push down joins, sometimes not. Planning and
execution get ugly. Maybe this can be simplified by parent-child tables
following the same partitioning scheme.
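The push-down decision described above can be sketched roughly as follows; the tuple format and function name are invented for illustration only:

```python
# Hypothetical sketch: a join between two sharded tables can be pushed
# down to the shards only when both tables use the same partitioning
# scheme on the join key, so matching rows are guaranteed to be co-located.

def can_push_down_join(part_a, part_b, join_key_a, join_key_b):
    """part_* describe (method, key, num_shards) for each table."""
    method_a, key_a, n_a = part_a
    method_b, key_b, n_b = part_b
    return (method_a == method_b          # same strategy, e.g. hash
            and n_a == n_b                # same shard count
            and key_a == join_key_a       # joining on the partition keys
            and key_b == join_key_b)

print(can_push_down_join(("hash", "customer_id", 8),
                         ("hash", "customer_id", 8),
                         "customer_id", "customer_id"))   # True
print(can_push_down_join(("hash", "customer_id", 8),
                         ("hash", "order_id", 8),
                         "order_id", "order_id"))         # False
```

When the check fails, the join cannot run shard-locally and some form of data movement (or pulling everything to the coordinator) is required, which is exactly where planning and execution get ugly.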
Mason
On 09/01/2015 02:48 AM, Bruce Momjian wrote:
On Tue, Sep 1, 2015 at 09:30:41AM +0530, Pavan Deolasee wrote:
There is no question that using XC/XL will get us to a usable solution
faster, but see my recent post to Josh Berkus --- the additional code
will be so burdensome that I doubt it would be accepted. If it was, I
bet we would have considered it long ago.

I think the only way we are going to get sharding into Postgres is to do
it in a way that enhances existing Postgres capabilities.
So that we have XL again?
This reads like: We don't want to merge a bunch of code that a non .Org
project wrote. Instead we want to write a bunch of code that will all
but duplicate what that other project wrote because.... ?
Don't get me wrong, I am all about this goal and if XL truly doesn't fit
that's cool but I think the idea that our project will somehow reinvent
a better wheel is naive.
(I think we got a little too cute in enhancing existing Postgres
capabilities to add partitioning, and I think we need to fix that.
Hopefully we don't do the same thing with sharding.)
Not cute, lazy. We implemented the barest minimum within our existing
infrastructure and we have been kicking ourselves ever since.
Sincerely,
JD
--
Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.
On 09/01/2015 02:58 AM, Mark Kirkwood wrote:
On 01/09/15 21:41, Bruce Momjian wrote:
I'm thinking that partitioning and sharding are two different things:
Partitioning is about reducing the amount of table data accessed and
also perhaps easing admin activities (archiving/removing old stuff etc).

Sharding is about parallelism and redundancy...copies of stuff in
different places and concurrent access by virtue of it being on
different nodes!
In our world, they are complementary. Consider partitioning that uses
FDW tables with proper plan push down etc....
Now *maybe* FDW is a good way to approach this, but it really would be nice
to see a more rigorous analysis (I note that like XC and XL, Greenplum
looked at the existing mechanisms around at the time and ended up
writing their own). Now I'm aware that things have moved on - but I
think there needs to be a proper discussion about design and what we
think distributed data/sharding etc should provide *before* grabbing
hold of FDW as *the answer*!
Agreed.
Sincerely,
Joshua D. Drake
--
Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.
On Tue, Sep 01, 2015 at 10:15:27AM +0200, Andres Freund wrote:
On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote:
Uh, we already have a list of things we need to add to FDWs to make them
work, and Citus Data has provided a document of more things that are
needed, https://goo.gl/vJWF85. I am not sure how much bigger a red flag
you want to confirm that everyone agrees that major FDW improvements are
a requirement for this.

Several people saying that the FDW infrastructure isn't sufficient right
now is pretty far from implying that all of them agree that the FDW API
is the way to go.

I'm not sure myself. If it works out it's going to save us some work and
make it more realistic to get there sometime not too far off. But I'm
afraid that the resulting system will feel like our current partitioning
implementation. Yes, it kinda works, but it's hard to get started, it
doesn't support too many features and you're kind of afraid your relatives
will see what you've done.
Whatever we decide on, we can only count on built-in multi-node being
adopted if all the needed bits ship with every PostgreSQL
installation. If we require people to Install More Software™ in order
to get a feature, we're going to lose a majority of our potential
base.
If it turns out we need the PostgreSQL FDW, and I believe some
consensus is starting to gel around that, I can see not installing it
in template1 by default. Even that's a questionable decision: merely
having the software in place does not credibly increase the attack
surface, while leaving it out adds a fallible step to the installation
procedure.
Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Tue, Sep 1, 2015 at 12:00 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
My worry is that if we start implementing them again from scratch, it will
take a few years before we get them in a usable state. What XC/XL lacked is
probably a Robert Haas or a Tom Lane who could look at the work and suggest
major edits. If that had happened, the quality of the product could have
been much better today. I don't mean to denigrate the developers who worked on
XC/XL, but there is no harm in accepting that if someone with a much better
understanding of the whole system was part of the team, that would have
positively impacted the project. Is that an angle worth exploring? Does it
make sense to commit some more resources to, say, XC or XL and try to improve
the quality of the product even further? To be honest, XL is in far far
better shape (haven't really tried XC in a while) and some more QA/polishing
can make it production ready much sooner.
From my point of view, and EnterpriseDB's point of view, anything that
doesn't go into the core PostgreSQL distribution isn't really getting
us where we need to be. If there's code in XL that would be valuable
to merge into core PostgreSQL, then let's do it. If the code cannot
be used but there are lessons we can learn that will make what does go
into core PostgreSQL better, let's learn them. However, I don't think
it's serving anybody very well that we have the XC fork, and multiple
forks of the XC fork, floating around out there and people are working
on those instead of working on core PostgreSQL. The reality is that
we don't have enough brainpower to spread it across 2 or 3 or 4 or 5
different projects and have all of them be good. The reality is,
also, that horizontal scalability isn't an optional feature. There
was a point in time at which the PostgreSQL project's official policy
on replication was that it did not belong in core. That was a bad
policy; thankfully, it was reversed, and the result was Hot Standby
and Streaming Replication, incredibly important technologies without
which we would not be where we are today. Horizontal scalability is
just as essential.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Sep 1, 2015 at 4:15 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote:
Uh, we already have a list of things we need to add to FDWs to make them
work, and Citus Data has provided a document of more things that are
needed, https://goo.gl/vJWF85. I am not sure how much bigger a red flag
you want to confirm that everyone agrees that major FDW improvements are
a requirement for this.

Several people saying that the FDW infrastructure isn't sufficient right
now is pretty far from implying that all of them agree that the FDW API
is the way to go.

I'm not sure myself. If it works out it's going to save us some work and
make it more realistic to get there sometime not too far off. But I'm
afraid that the resulting system will feel like our current partitioning
implementation. Yes, it kinda works, but it's hard to get started, it
doesn't support too many features and you're kind of afraid your relatives
will see what you've done.
I'm not averse to making the "connect to the remote nodes" part of
this solution use something other than the FDW infrastructure at some
point in time if somebody's prepared to build something better. On
the other hand, I think it's extremely clear that the FDW
infrastructure has a large amount of potential upon which we have
thoroughly failed to capitalize. Patches have already been written
for UPDATE/DELETE pushdown and for join pushdown. Those patches have
been around for some time, but progress has been slow. Core
infrastructure exists to allow sort pushdown, but nobody's done
anything with it. Aggregate pushdown hasn't happened yet due to the
dependency on upper planner path-ification, but it's not as if some
alternative to the FDW interface is going to dodge that problem.
It would be a bad idea to cling blindly to the FDW infrastructure if
it's fundamentally inadequate to do what we want. On the other hand,
it would also be a bad idea to set about recreating it without a
really good reason, and - just to take one example - the fact that it
doesn't currently push down DML operations to the remote side is not a
really good reason to rewrite the whole thing. On the contrary, it's
a reason to put some energy into the already-written patch which
implements that optimization.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 09/01/2015 09:08 AM, Robert Haas wrote:
On Tue, Sep 1, 2015 at 12:00 AM, Pavan Deolasee
From my point of view, and EnterpriseDB's point of view, anything that
doesn't go into the core PostgreSQL distribution isn't really getting
us where we need to be. If there's code in XL that would be valuable
to merge into core PostgreSQL, then let's do it. If the code cannot
be used but there are lessons we can learn that will make what does go
into core PostgreSQL better, let's learn them.
I think that is pretty much anybody in the industry's point of view and
I agree with it. Although there is definitely an argument to be had
about whether it is a module or it is in core proper, no matter what it
should be a .Org project.
You are also correct on the replication statement. Replication should
have never been outside of core and I say that as the company that wrote
a replication solution that even Fortune 50 companies used.
I think that we should consider that there are very smart, very talented
engineers already devoting time to their respective communities (XL/XC)
that may very much enjoy being part of our community so we don't have to
reinvent the wheel, again.
JD
--
Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.
On Tue, Sep 1, 2015 at 6:55 AM, Bruce Momjian <bruce@momjian.us> wrote:
I assumed these queries were going to be solved by sending data to the
coordinator in as digested a form as possible, and having the coordinator complete
any remaining processing. I think we are going to need to decide if
such "sending data back to shards" is something we are ever going to
implement. I can see FDWs _not_ working well for that use-case.
I do think we are going to want to support that. All the people I've
talked to about parallel and distributed query processing agree that
you need to do that sort of thing to get really good and scalable
performance. I think that we could make a lot of headway as compared
with the status quo just by implementing more pushdown optimizations
than we have today. Right now, SELECT COUNT(*) FROM table will suck
back the whole remote table and count the rows locally, and that's
stupid. We can fix that case with better pushdown logic. We can also
fix the case of N-way join nests where the joins are either on the
partitioning key or to replicated tables. But suppose you have a join
between two tables which are sharded across the cluster but not on the
partitioning key. There's no way to push the join down, so all the
work comes back to the coordinator, which is possibly OK if such
queries are rare, but not so hot if they are frequent.
To leverage the full CPU power of the cluster in such a case, you need
to be able to shuffle data around between the nodes. You pick one of
the two tables being joined, and based on the partitioning key of that
table, each node examines the other table and, for each row, sends it
to the machine where it MIGHT have one or more join partners. Then
each node can join its shard of the first table against the rows from
the second table that were sent to it.
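A toy Python model of the shuffle described above (illustrative only; `node_for` stands in for whatever hash partitioning function the first table uses):

```python
# Illustrative sketch (not PostgreSQL code) of the data shuffle: rows of
# table B are re-routed so each one lands on the node that holds the
# shard of table A it might join with.

NUM_NODES = 3
node_for = lambda key: hash(key) % NUM_NODES   # A's partitioning function

# Table A is already hash-partitioned on its join key across the nodes.
table_a = [("k1", "a1"), ("k2", "a2"), ("k3", "a3")]
shards_a = {n: [] for n in range(NUM_NODES)}
for key, val in table_a:
    shards_a[node_for(key)].append((key, val))

# Shuffle: each node sends every B row to the node where its join
# partners from A must live.
table_b = [("k1", "b1"), ("k3", "b2"), ("k1", "b3")]
inbox_b = {n: [] for n in range(NUM_NODES)}
for key, val in table_b:
    inbox_b[node_for(key)].append((key, val))

# Each node now joins locally against only its own shard of A.
results = []
for n in range(NUM_NODES):
    local_a = dict(shards_a[n])
    for key, bval in inbox_b[n]:
        if key in local_a:
            results.append((key, local_a[key], bval))

print(sorted(results))
```

Because both tables are routed through the same partitioning function, every B row is guaranteed to land on the one node whose shard of A could contain its join partners, so each node's local join misses nothing.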
Now the question is, where should the code that does all of this live?
postgres_fdw? Some new, sharding-specific FDW? In core? I don't
know for sure, but what I do know is that we could make a lot of
progress over where we are today by just improving postgres_fdw, and I
don't think those improvements are even all that difficult. If we
decide we need to implement something new, it's going to be a huge
project that will take years to complete, with uncertain results. I'd
rather have a postgres_fdw-based implementation that is imperfect and
can't handle some kinds of queries in 9.6 than a promise that by 9.9
we'll have something really great that handles MPP perfectly.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Sep 1, 2015 at 7:01 AM, Mason S <masonlists@gmail.com> wrote:
For efficient inter-node joins with row shipping, FDWs may also not be easy
to do. Maybe it is possible if we optionally pass in lists of other nodes
and information about how they are partitioned, so each node knows where
data should be shipped.

A challenge for planning with arbitrary copies of different shards is that
sometimes you may be able to push down joins, sometimes not. Planning and
execution get ugly. Maybe this can be simplified by parent-child tables
following the same partitioning scheme.
This gets at a problem which Ozgun also mentioned in his Google
document, and which I also discussed with Etsuro Fujita at PGCon: good
query planning requires good metadata, and we don't really have that
today.
I think that a big part of the goal of the declarative partitioning
work that Amit Langote has recently been undertaking is to provide a
catalog representation of the partitioning structure that is easy to
work with, as opposed to just having a bunch of CHECK constraints that
you have to try to reason about. That's one part of the solution.
You also might need to know more about the remote table than is
captured by the column and data type list. Most particularly, you
might want to know what indexes exist on the remote side, but
currently, to figure that out, you'd need to send queries to the remote
side to retrieve that information every time you do planning, or maybe
you could contrive a session-lifespan cache.
That's pretty annoying.
One idea for solving this problem is to allow CREATE INDEX on foreign
tables, but I don't like that much. There's no guarantee that the
remote side is a PostgreSQL instance, and if it isn't, the relevant
details about the indexes that exist may not be convenient to
represent in our catalogs. Heck, that can be true even if it is a
PostgreSQL instance, if the remote side relies on an AM or a function
that doesn't exist locally. But even if both sides are PostgreSQL
instances using only btree indexes on raw columns, now you've put the
burden on the DBA to make sure that the index definitions on the local
and remote sides match, and that's a pain in the neck.
What seems better to me is to allow ANALYZE of a foreign table to
record an arbitrary blob of metadata about the remote side that it
can then get access to during planning. Then, you can record details
about indexes, or statistics that don't fit into the mold of
pg_statistic, or really, anything else you're going to need to figure
out the best plan, and if the DBA changes the configuration on the
remote side, they don't need to update the local configuration to
match; a re-ANALYZE will do the trick.
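The shape of that idea, in a hedged Python sketch (all names invented; a real implementation would store the blob in a catalog column and let each FDW define its contents):

```python
# Hypothetical sketch of the idea: ANALYZE on a foreign table stores an
# opaque blob of remote-side metadata that the planner can consult later,
# instead of querying the remote server during every planning cycle.
import json

catalog = {}  # stands in for a per-table catalog column

def analyze_foreign_table(table, fetch_remote_metadata):
    """Run at ANALYZE time: snapshot whatever the FDW finds useful."""
    catalog[table] = json.dumps(fetch_remote_metadata())

def plan_time_metadata(table):
    """Run at planning time: cheap local lookup, no remote round trip."""
    blob = catalog.get(table)
    return json.loads(blob) if blob else None

# The FDW decides what goes in the blob: indexes, extra statistics, etc.
analyze_foreign_table("orders", lambda: {
    "indexes": [{"cols": ["customer_id"], "kind": "btree"}],
    "row_estimate": 1000000,
})

meta = plan_time_metadata("orders")
print(meta["indexes"][0]["cols"])   # ['customer_id']
```

Keeping the blob opaque to the core server is what lets non-PostgreSQL remotes describe indexes or statistics that have no local catalog representation.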
I'm open to other ideas as well. Repartitioning will also arise for
an all-local parallel join, and it would be nice if the planner smarts
could be shared between that case and the remote-table case.
Therefore, while I suspect that some of the logic here will end up
inside one or more FDWs, I'm pretty confident that a significant chunk
of it needs to go into the core optimizer.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 09/01/2015 02:39 AM, Bruce Momjian wrote:
On Mon, Aug 31, 2015 at 01:16:21PM -0700, Josh Berkus wrote:
I'm also going to pontificate that, for a future solution, we should not
focus on write *IO*, but rather on CPU and RAM. The reason for this
thinking is that, with the latest improvements in hardware and 9.5
improvements, it's increasingly rare for machines to be bottlenecked on
writes to the transaction log (or the heap). This has some implications
for system design. For example, solutions which require all connections
to go through a single master node do not scale sufficiently to be worth
bothering with.

Well, I highlighted write IO for sharding because sharding is the only
solution that allows write scaling. If we want to scale CPU, we are
better off using server parallelism, and to scale CPU and RAM, a
multi-master/BDR solution seems best. (Multi-master doesn't do write
scaling because you eventually have to write all the data to each node.)
You're assuming that our primary bottleneck for writes is IO. It's not
at present for most users, and it certainly won't be in the future. You
need to move your thinking on systems resources into the 21st century,
instead of solving the resource problems from 15 years ago.
Currently, CPU resources and locking are the primary bottlenecks on
writing for the vast majority of the hundreds of servers I tune every
year. This even includes AWS, with EBS's horrible latency; even in that
environment, most users can outstrip PostgreSQL's ability to handle
requests by getting 20K PIOPS.
Our real future bottlenecks are:
* ability to handle more than a few hundred connections
* locking limits on the scalability of writes
* ability to manage large RAM and data caches
The only place where IO becomes the bottleneck is for the
batch-processing, high-throughput DW case ... and I would argue that
existing forks already handle that case.
Any sharding solution worth bothering with will solve some or all of the
above by extending our ability to process requests across multiple
nodes. Any solution which does not is merely an academic curiosity.
For these reasons, I think sharding has a limited use, and hence, I
don't think the community will be willing to add a lot of code just to
enable auto-sharding. I think it has to be done in a way that adding
sharding also gives other benefits, like better FDWs and cross-node ACID
control.

In summary, I don't think adding a ton of code just to do sharding will
be acceptable. A corollary of that, is that if FDWs are unable to
provide useful sharding, I don't see an acceptable way of adding
built-in sharding to Postgres.
So, while I fully agree with you that our sharding tools should have
side benefits, I think you're missing the big picture entirely. In
a few years, clustered/sharded PostgreSQL will be the default
installation, or we'll be a legacy database. Single-node and
single-master databases are rapidly becoming history.
From my perspective, we don't need an awkward, limited, bolt-on solution
for write-scaling. We need something which will become core to how
PostgreSQL works. I just don't see us getting there with the described
FDW approach, which is why I keep raising issues with it.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Sep 1, 2015 at 1:06 PM, Josh Berkus <josh@agliodbs.com> wrote:
You're assuming that our primary bottleneck for writes is IO. It's not
at present for most users, and it certainly won't be in the future. You
need to move your thinking on systems resources into the 21st century,
instead of solving the resource problems from 15 years ago.
Your experience doesn't match mine. I find that it's frequently
impossible to get the system to use all of the available CPU capacity,
either because you're bottlenecked on locks or because you are
bottlenecked on the I/O subsystem, and with the locking improvements
in newer versions, the former is becoming less and less common.
Amit's recent work on scalability demonstrates this trend: he goes
looking for lock bottlenecks, and finds problems that only occur at
128+ concurrent connections running full tilt. The patches show
limited benefit - a few percentage points - at lesser concurrency
levels. Either there are other locking bottlenecks that limit
performance at lower client counts but which mysteriously disappear as
concurrency increases, which I would find surprising, or the limit is
somewhere else. I haven't seen any convincing evidence that the I/O
subsystem is the bottleneck, but I'm having a hard time figuring out
what else it could be.
Our real future bottlenecks are:
* ability to handle more than a few hundred connections
* locking limits on the scalability of writes
* ability to manage large RAM and data caches
I do agree that all of those things are problems, FWIW.
Any sharding solution worth bothering with will solve some or all of the
above by extending our ability to process requests across multiple
nodes. Any solution which does not is merely an academic curiosity.
I think the right solution to those problems is to attack them
head-on. Sharding solutions should cater to use cases where using all
the resources of one machine isn't sufficient no matter how
efficiently we do it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 09/01/2015 10:06 AM, Josh Berkus wrote:
On 09/01/2015 02:39 AM, Bruce Momjian wrote:
On Mon, Aug 31, 2015 at 01:16:21PM -0700, Josh Berkus wrote:
Our real future bottlenecks are:
* ability to handle more than a few hundred connections
This, 1000 times this. No, a connection pooler doesn't help; we are using
1000 connections WITH a connection pooler. The amount of data centric,
always connected (even if stateless) apps that are out there is
astounding. Our performance curve is better (especially since Robert's
fantastic work on 9.2) but it isn't where a lot of the demanding users
need to be.
We are also seeing a vast uptick in always connected apps that aren't
web centric (or the client at least isn't). In other words we are seeing
situations where 1000 connections is 1000 users, connected.
JD
--
Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.
On 09/01/2015 10:17 AM, Robert Haas wrote:
On Tue, Sep 1, 2015 at 1:06 PM, Josh Berkus <josh@agliodbs.com> wrote:
Any sharding solution worth bothering with will solve some or all of the
above by extending our ability to process requests across multiple
nodes. Any solution which does not is merely an academic curiosity.

I think the right solution to those problems is to attack them
head-on. Sharding solutions should cater to use cases where using all
the resources of one machine isn't sufficient no matter how
efficiently we do it.
As long as "all the resources" != "just IO", I'm completely on board
with that. The reason I raised this is that the initial FDW-based
proposals pretty much scale IO and nothing else.
pg_shard also currently only scales IO, but they're working on that.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Hi,
On 09/01/2015 07:17 PM, Robert Haas wrote:
On Tue, Sep 1, 2015 at 1:06 PM, Josh Berkus <josh@agliodbs.com> wrote:
You're assuming that our primary bottleneck for writes is IO. It's
not at present for most users, and it certainly won't be in the
future. You need to move your thinking on systems resources into
the 21st century, instead of solving the resource problems from 15
years ago.

Your experience doesn't match mine. I find that it's frequently
impossible to get the system to use all of the available CPU
capacity, either because you're bottlenecked on locks or because you
are bottlenecked on the I/O subsystem, and with the locking
improvements in newer versions, the former is becoming less and less
common. Amit's recent work on scalability demonstrates this trend: he
goes looking for lock bottlenecks, and finds problems that only occur
at 128+ concurrent connections running full tilt. The patches show
limited benefit - a few percentage points - at lesser concurrency
levels. Either there are other locking bottlenecks that limit
performance at lower client counts but which mysteriously disappear
as concurrency increases, which I would find surprising, or the limit
is somewhere else. I haven't seen any convincing evidence that the
I/O subsystem is the bottleneck, but I'm having a hard time figuring
out what else it could be.
Memory bandwidth, for example. It's quite difficult to spot, because the
intuition is that memory is fast, but thanks to improvements in storage
(and stagnation in RAM bandwidth), this is becoming a significant issue.
Process-management overhead is another thing we tend to ignore, but once
you get to many processes all willing to work at the same time, you need
to account for that.
Of course, this applies differently to different sharding use cases. For
example analytics workloads have serious issues with memory bandwidth,
but not so much with process-management overhead (because the number of
connections is usually close to the number of cores). Use cases with many
clients (web-scale use cases) tend to run into both (all the
processes also have to share all the caches, killing them).
I don't know if sharding can help solve (or at least improve) these
issues. And even if sharding in general can, I don't know if that still
holds for an FDW-based solution.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 1, 2015 at 2:04 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
Memory bandwidth, for example. It's quite difficult to spot, because the
intuition is that memory is fast, but thanks to improvements in storage (and
stagnation in RAM bandwidth), this is becoming a significant issue.
I'd appreciate any tips on how to spot problems of this type. But
it's my impression that perf, top, vmstat, and other Linux performance
tools will count time spent waiting for memory as CPU time, not idle
time. If that's correct, that wouldn't explain workloads where CPU
utilization doesn't reach 100%. Rather, it would show up as CPU time
hitting 100% while tps remains low.
Process-management overhead is another thing we tend to ignore, but once you
get to many processes all willing to work at the same time, you need to
account for that.
Any tips on spotting problems in that area?
Thanks,
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-09-01 14:11:21 -0400, Robert Haas wrote:
On Tue, Sep 1, 2015 at 2:04 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Memory bandwidth, for example. It's quite difficult to spot, because the
intuition is that memory is fast, but thanks to improvements in storage (and
stagnation in RAM bandwidth), this is becoming a significant issue.

I'd appreciate any tips on how to spot problems of this type. But
it's my impression that perf, top, vmstat, and other Linux performance
tools will count time spent waiting for memory as CPU time, not idle
time. If that's correct, that wouldn't explain workloads where CPU
utilization doesn't reach 100%. Rather, it would show up as CPU time
hitting 100% while tps remains low.
Yea.
-e bus-cycles is a good start to measure where bus traffic is
relevant. Depending on the individual cpu other events can be helpful.
Process-management overhead is another thing we tend to ignore, but once you
get to many processes all willing to work at the same time, you need to
account for that.

Any tips on spotting problems in that area?
Not perfect, but -e context-switches (general context switches) and -e
syscalls:sys_enter_semop (for postgres enforced context switches) is
rather useful when combined with --call-graph dwarf ('fp' sometimes
doesn't see through libc which is most of the time not compiled with
-fno-omit-frame-pointer).
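For concreteness, the events named above might be combined like this. The PID and measurement window are placeholders, event availability varies by kernel and CPU, and hardware counters generally need root or a relaxed perf_event_paranoid setting:

```shell
# Sample bus cycles and context switches for one backend; pid 12345 and
# the 30-second window are made up for illustration.
perf stat -e bus-cycles,context-switches -p 12345 -- sleep 30

# Record semop syscalls with DWARF-based call graphs, since libc is
# rarely built with -fno-omit-frame-pointer and 'fp' unwinding stops there.
perf record -e syscalls:sys_enter_semop --call-graph dwarf -p 12345 -- sleep 30
perf report
```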
Greetings,
Andres Freund
Hi,
On 08/31/2015 10:16 PM, Josh Berkus wrote:
It's also important to recognize that there are three major use-cases
for write-scalable clustering:
* OLTP: small-medium cluster, absolute ACID consistency,
bottlenecked on small writes per second
* DW: small-large cluster, ACID optional,
bottlenecked on bulk reads/writes
* Web: medium to very large cluster, ACID optional,
bottlenecked on # of connections
We cannot possibly solve all of the above at once, but to the extent
that we recognize all 3 use cases, we can build core features which
can be adapted to all of them.
It would be good to have a discussion about use-cases first - each of us
is mostly concerned about the use cases they're dealing with, with
bottlenecks specific to their environment. These three basic use-cases
seem like a good start, but some of the details certainly don't match my
experience ...
For example I can't see how ACID can be optional for the DWH use-case,
but maybe there's a good explanation - I can imagine sacrificing various
ACID properties at the node level, but I can't really imagine
sacrificing any of the ACID properties for the cluster as a whole. So
this would deserve some explanation.
I also don't share the view that write scalability is the only (or even
main) issue, that we should aim to solve. For the business-intelligence
use cases I've been working on recently, handling complex read-only
ad-hoc queries is often much more important. And in those cases the
bottleneck is often CPU and/or RAM.
I'm also going to pontificate that, for a future solution, we should
not focus on write *IO*, but rather on CPU and RAM. The reason for
this thinking is that, with the latest improvements in hardware and
9.5 improvements, it's increasingly rare for machines to be
bottlenecked on writes to the transaction log (or the heap). This has
some implications for system design. For example, solutions which
require all connections to go through a single master node do not
scale sufficiently to be worth bothering with.
+1
On some other questions from Mason:
Do we want multiple copies of shards, like the pg_shard approach? Or
keep things simpler and leave it up to the DBA to add standbys?
We want multiple copies of shards created by the sharding system itself.
Having a separate, and completely orthogonal, redundancy system to the
sharding system is overly burdensome on the DBA and makes low-data-loss
HA impossible.
IMHO it'd be quite unfortunate if the design would make it impossible to
combine those two features (e.g. creating standbys for shards and
failing over to them).
It's true that solving HA at the sharding level (by keeping multiple
copies of each shard) may be simpler than combining sharding and
standbys, but I don't see why it makes low-data-loss HA impossible.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 09/01/2015 08:22 PM, Andres Freund wrote:
On 2015-09-01 14:11:21 -0400, Robert Haas wrote:
On Tue, Sep 1, 2015 at 2:04 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
Memory bandwidth, for example. It's quite difficult to spot, because the
intuition is that memory is fast, but thanks to improvements in storage (and
stagnation in RAM bandwidth), this is becoming a significant issue.
I'd appreciate any tips on how to spot problems of this type. But
it's my impression that perf, top, vmstat, and other Linux performance
tools will count time spent waiting for memory as CPU time, not idle
time. If that's correct, that wouldn't explain workloads where CPU
utilization doesn't reach 100%. Rather, it would show up as CPU time
hitting 100% while tps remains low.
Yea.
-e bus-cycles is a good start to measure where bus traffic is
relevant. Depending on the individual cpu other events can be helpful.
long-story: https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
It's from 2007 and only explains oprofile (chapter 7), which is mostly
abandoned in favor of perf nowadays. Perf can produce similar stats, so
the discussion is still valid. But it also shows cachegrind (valgrind
module).
perf examples: http://www.brendangregg.com/perf.html
Most of the examples with "CPU" in the comment are relevant. Usually
"perf stat" and "perf stat -d" are good starting points - once you get a
lot of LLC misses or too few instructions per cycle, it's a sign of
memory bandwidth problems.
Sadly, this is partially caused by our volcano-style executor and
sharding alone can do nothing about that.
Process-management overhead is another thing we tend to ignore, but once you
get to many processes all willing to work at the same time, you need to
account for that.
Any tips on spotting problems in that area?
Not perfect, but -e context-switches (general context switches) and -e
syscalls:sys_enter_semop (for postgres enforced context switches) is
rather useful when combined with --call-graph dwarf ('fp' sometimes
doesn't see through libc which is most of the time not compiled with
-fno-omit-frame-pointer).
Right, this is about the best I'm aware of.
The problem often is not in the number of context switches, but in the
fact that all the processes share the same (very limited) L caches on
the CPU. Each process dirties the caches for the other processes,
lowering the hit ratios. Which can be spotted using the commands above.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 09/01/2015 11:36 AM, Tomas Vondra wrote:
We want multiple copies of shards created by the sharding system itself.
Having a separate, and completely orthogonal, redundancy system to the
sharding system is overly burdensome on the DBA and makes low-data-loss
HA impossible.
IMHO it'd be quite unfortunate if the design would make it impossible to
combine those two features (e.g. creating standbys for shards and
failing over to them).It's true that solving HA at the sharding level (by keeping multiple
copies of a each shard) may be simpler than combining sharding and
standbys, but I don't see why it makes low-data-loss HA impossible.
Other way around, that is, having replication standbys as the only
method of redundancy requires either high data loss or high latency for
all writes.
In the case of async rep, every time we fail over a node, the entire
cluster would need to roll back to the last common known-good replay
point, hence high data loss.
In the case of sync rep, we are required to wait for at least double
network lag time in order to do a single write ... making
write-scalability quite difficult.
Further, if using replication the sharding system would have no way to
(a) find out immediately if a copy was bad and (b) fail over quickly to
a copy of the shard if the first requested copy was not responding.
With async replication, we also can't use multiple copies of the same
shard as a way to balance read workloads.
If we write to multiple copies as a part of the sharding feature, then
that can be parallelized, so that we are waiting only as long as the
slowest write (or in failure cases, as long as the shard timeout).
Further, we can check for shard-copy health and update shard
availability data with each user request, so that the ability to see
stale/bad data is minimized.
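As a toy illustration of that parallel fan-out (with `sleep 1` standing in for the network write to one shard copy; no real sharding code is involved), waiting on all background writes costs roughly the slowest one rather than the sum:

```shell
# Write to three shard copies concurrently; total wall time tracks the
# slowest single write (~1s here), not the sum of all three (~3s).
start=$(date +%s)
for copy in copy1 copy2 copy3; do
  ( sleep 1 ) &   # placeholder for "send the write to $copy and wait for ack"
done
wait              # block only until the slowest copy has confirmed
elapsed=$(( $(date +%s) - start ))
echo "all copies acked in ${elapsed}s"
```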
There are obvious problems with multiplexing writes, which you can
figure out if you knock pg_shard around a bit. But I really think that
solving those problems is the only way to go.
Mind you, I see a strong place for binary replication and BDR for
multi-region redundancy; you really don't want that to be part of the
sharding system if you're aiming for write scalability.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Sep 1, 2015 at 3:19 PM, Josh Berkus <josh@agliodbs.com> wrote:
Mind you, I see a strong place for binary replication and BDR for
multi-region redundancy; you really don't want that to be part of the
sharding system if you're aiming for write scalability.
I mostly agree, but keep in mind that you don't have to replicate
everything. A lot of people might find it useful to replicate the
data that is common to all nodes even if they don't replicate the
sharded data.
Admittedly, there are some problems with snapshots here: if you don't
do anything special about snapshots, then what you have here will be
"eventually consistent" behavior. But that might be suitable for some
environments, such as very loosely coupled system where not all nodes
are connected all the time. And, for those environments where you do
need consistent snapshots, we can imagine ways to get that behavior,
like having the GTM consider the transaction uncommitted until it's
been logically replicated to every node.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 09/01/2015 09:19 PM, Josh Berkus wrote:
On 09/01/2015 11:36 AM, Tomas Vondra wrote:
We want multiple copies of shards created by the sharding system
itself. Having a separate, and completely orthogonal, redundancy
system to the sharding system is overly burdensome on the DBA and
makes low-data-loss HA impossible.
IMHO it'd be quite unfortunate if the design would make it
impossible to combine those two features (e.g. creating standbys
for shards and failing over to them).It's true that solving HA at the sharding level (by keeping
multiple copies of a each shard) may be simpler than combining
sharding and standbys, but I don't see why it makes low-data-loss
HA impossible.Other way around, that is, having replication standbys as the only
method of redundancy requires either high data loss or high latency
for all writes.
I haven't said that. I said that we should allow that topology, not that
it should be the only method of redundancy.
In the case of async rep, every time we fail over a node, the entire
cluster would need to roll back to the last common known-good replay
point, hence high data loss.
In the case of sync rep, we are required to wait for at least double
network lag time in order to do a single write ... making
write-scalability quite difficult.
Which assumes that latency (or rather the increase due to syncrep) is a
problem for the use case. Which may be the case for many use cases, but
certainly is not a problem for many BI/DWH use cases performing mostly
large batch loads. In those cases the network bandwidth may be quite
an important resource.
For example assume that there are just two shards in two separate data
centers, connected by a link with limited bandwidth. Now, let's assume
you always keep a local replica for failover. So you have A1+A2 in DC1,
B1+B2 in DC2. If you're in DC1, then writing data to B1 means you also
have to write data to B2 and wait for it. So either you send the data to
each node separately (consuming 2x the bandwidth), or send it to B1 and
let it propagate to B2 e.g. through sync rep.
So while you may be right in single-DC deployments, with multi-DC
deployments the situation is quite different - not only is the network
bandwidth not unlimited, but latencies within a DC may be a
fraction of latencies between the locations (to the extent that the
increase due to syncrep may be just noise). So the local replication may
be actually way faster.
I can imagine forwarding the data between B1 and B2 even with a purely
sharding solution, but at that point you effectively re-implemented syncrep.
IMHO the design has to address the multi-DC setups somehow. I think that
many of the customers who are so concerned about scaling to many shards
are also concerned about availability in case of DC outages, no?
We should also consider support for custom topologies (not just a full
mesh, or whatever we choose as the default/initial topology), which is
somehow related.
Further, if using replication the sharding system would have no way
to (a) find out immediately if a copy was bad and (b) fail over
quickly to a copy of the shard if the first requested copy was not
responding. With async replication, we also can't use multiple copies
of the same shard as a way to balance read workloads.
I don't follow. With sync rep we do know whether the copy is OK or not,
because the node either confirms writes or not. The failover certainly
is more complicated and is not immediate (to the extent of keeping a
copy at the sharding level), but it's a question of trade-offs.
It's true we don't have an auto-failover solution at the moment, but as I
said - I can easily imagine most people using just sharding, while some
deployments use syncrep with manual failover.
If we write to multiple copies as a part of the sharding feature,
then that can be parallelized, so that we are waiting only as long as
the slowest write (or in failure cases, as long as the shard
timeout). Further, we can check for shard-copy health and update
shard availability data with each user request, so that the ability
to see stale/bad data is minimized.
Again, this assumes infinite network bandwidth.
There are obvious problems with multiplexing writes, which you can
figure out if you knock pg_shard around a bit. But I really think
that solving those problems is the only way to go.
Mind you, I see a strong place for binary replication and BDR for
multi-region redundancy; you really don't want that to be part of
the sharding system if you're aiming for write scalability.
I haven't mentioned BDR at all, and given the async nature I don't have
a clear idea of how it fits into the sharding world at this point.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 1, 2015 at 08:18:38AM -0700, Joshua Drake wrote:
On 09/01/2015 02:48 AM, Bruce Momjian wrote:
On Tue, Sep 1, 2015 at 09:30:41AM +0530, Pavan Deolasee wrote:
There is no question that using XC/XL will get us to a usable solution
faster, but see my recent post to Josh Berkus --- the additional code
will be so burdensome that I doubt it would be accepted. If it was, I
bet we would have considered it long ago.
I think the only way we are going to get sharding into Postgres is to do
it in a way that enhances existing Postgres capabilities.
So that we have XL again?
Kind of. If XC/XL used FDWs I think we would try to use their code
first. The issue is that FDWs didn't exist at the time. I would say
our first approach might be doing XC/XL again with FDWs.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Sep 1, 2015, at 1:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Admittedly, there are some problems with snapshots here: if you don't
do anything special about snapshots, then what you have here will be
"eventually consistent" behavior. But that might be suitable for some
environments, such as very loosely coupled system where not all nodes
are connected all the time.
Given that we’re discussing multi-node architectures here, you should expect that not all nodes will be connected at any time. Nodes fail, but the cluster should not.
And, for those environments where you do
need consistent snapshots, we can imagine ways to get that behavior,
like having the GTM consider the transaction uncommitted until it's
been logically replicated to every node.
Again, you need a way to deal with nodes going down. I can envision building a cluster with twelve nodes replicated to each of three geographically-distributed data centers. Each replication/sync model needs to be able to handle nodes going up and down, data centers or racks going up or down, and nodes being added and removed.
But even with smaller clusters, there’s no way around the fact that no system can guarantee that all nodes will be available at all times.
Best,
David
On 09/01/2015 02:29 PM, Tomas Vondra wrote:
Hi,
On 09/01/2015 09:19 PM, Josh Berkus wrote:
Other way around, that is, having replication standbys as the only
method of redundancy requires either high data loss or high latency
for all writes.
I haven't said that. I said that we should allow that topology, not that
it should be the only method of redundancy.
Ah, OK, I didn't understand you. Of course I'm in favor of supporting
both methods of redundancy if we can.
In the case of sync rep, we are required to wait for at least double
network lag time in order to do a single write ... making
write-scalability quite difficult.
Which assumes that latency (or rather the increase due to syncrep) is a
problem for the use case. Which may be the case for many use cases, but
certainly is not a problem for many BI/DWH use cases performing mostly
large batch loads. In those cases the network bandwidth may be quite
an important resource.
I'll argue that BI/DW is the least interesting use case for mainstream
PostgreSQL because there are production-quality forks which do this
(mostly proprietary, but we can work on that). We really need a solution
which works for OLTP.
For example assume that there are just two shards in two separate data
centers, connected by a link with limited bandwidth. Now, let's assume
you always keep a local replica for failover. So you have A1+A2 in DC1,
B1+B2 in DC2. If you're in DC1, then writing data to B1 means you also
have to write data to B2 and wait for it. So either you send the data to
each node separately (consuming 2x the bandwidth), or send it to B1 and
let it propagate to B2 e.g. through sync rep.
So while you may be right in single-DC deployments, with multi-DC
deployments the situation is quite different - not only that the network
bandwidth is not unlimited, but because latencies within DC may be a
fraction of latencies between the locations (to the extent that the
increase due to syncrep may be just noise). So the local replication may
be actually way faster.
I'm not seeing how the above is better using syncrep than using shard
copying?
I can imagine forwarding the data between B1 and B2 even with a purely
sharding solution, but at that point you effectively re-implemented
syncrep.
Not really, the mechanism is different and the behavior is different.
One critical deficiency in using binary syncrep is that you can't do
round-robin redundancy at all; every redundant node has to be an exact
mirror of another node. In a good HA distributed system, you want
multiple shards per node, and you want each shard to be replicated to a
different node, so that in the event of node failure you're not dumping
the full load on one other server.
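A toy placement table makes the point; the shard and node counts are invented, and the offset-by-one replica rule is just one simple way to avoid mirroring whole nodes:

```shell
# Place 6 shards on 3 nodes, with each shard's replica on the next node;
# if node0 dies, its primaries' load is picked up by node1 and node2
# rather than by a single dedicated mirror.
nodes=3
for shard in 0 1 2 3 4 5; do
  primary=$((shard % nodes))
  replica=$(( (shard + 1) % nodes ))
  echo "shard$shard: primary=node$primary replica=node$replica"
done
```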
IMHO the design has to address the multi-DC setups somehow. I think that
many of the customers who are so concerned about scaling to many shards
are also concerned about availability in case of DC outages, no?
Certainly. But users located in a single DC shouldn't pay the same
overhead as users who are geographically spread.
I don't follow. With sync rep we do know whether the copy is OK or not,
because the node either confirms writes or not. The failover certainly
is more complicated and is not immediate (to the extent of keeping a
copy at the sharding level), but it's a question of trade-offs.
It's true we don't have an auto-failover solution at the moment, but as I
said - I can easily imagine most people using just sharding, while some
deployments use syncrep with manual failover.
As long as direct shard copying is available, I'm happy. I have no
complaints about additional mechanisms.
I'm bringing this up because the FDW proposal made at pgCon did not
include *any* mechanism for HA/redundancy, just some handwaving about
replication and/or BDR. This was one of the critical design failures of
PostgresXC. A multinode system without automated node failover and
replacement is a low-availability system.
If we write to multiple copies as a part of the sharding feature,
then that can be parallelized, so that we are waiting only as long as
the slowest write (or in failure cases, as long as the shard
timeout). Further, we can check for shard-copy health and update
shard availability data with each user request, so that the ability
to see stale/bad data is minimized.
Again, this assumes infinite network bandwidth.
In what way is the total network bandwidth used in the system different
for shard copying than for sync replication?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Sep 1, 2015 at 12:40:40PM -0400, Robert Haas wrote:
On Tue, Sep 1, 2015 at 6:55 AM, Bruce Momjian <bruce@momjian.us> wrote:
I assumed these queries were going to be solved by sending data to the
coordinator in as digested a form as possible, and having the coordinator complete
any remaining processing. I think we are going to need to decide if
such "sending data back to shards" is something we are ever going to
implement. I can see FDWs _not_ working well for that use-case.
I do think we are going to want to support that. All the people I've
talked to about parallel and distributed query processing agree that
you need to do that sort of thing to get really good and scalable
performance. I think that we could make a lot of headway as compared
with the status quo just by implementing more pushdown optimizations
than we have today. Right now, SELECT COUNT(*) FROM table will suck
back the whole remote table and count the rows locally, and that's
stupid. We can fix that case with better pushdown logic. We can also
fix the case of N-way join nests where the joins are either on the
partitioning key or to replicated tables. But suppose you have a join
between two tables which are sharded across the cluster but not on the
partitioning key. There's no way to push the join down, so all the
work comes back to the coordinator, which is possibly OK if such
queries are rare, but not so hot if they are frequent.
Let me be clearer about what the Citus Data paper shows. I said originally
that the data was sent to the coordinator, sorted, then resent to the
shards, but the document:
https://goo.gl/vJWF85
https://www.citusdata.com/blog/114-how-to-build-your-distributed-database
has the shards create the groups and the groups are sent to the other
shards. For example, to do COUNT(DISTINCT) if you have three shards,
then each shard breaks its data into 3 buckets (1B in size), then the
first bucket from each of the three shards goes to the first shard, and
the second bucket goes to the second shard, etc.
Basically, they are doing map-reduce, and the shards are creating
additional batches that get shipped to other shards. I can see FDWs not
working well in that case as you are really creating a new data layout
just for the query. This explains why the XC/XL people are saying they
would use FDWs if they existed at the time they started development,
while the Citus Data people are saying they couldn't use FDWs as they
currently exist. They probably both needed FDW improvements, but I
think the Citus Data features would need a lot more.
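The bucket-shipping step can be sketched in miniature, with files standing in for shards and network transfer, and a value's length modulo 3 standing in for a real hash function (all of the data below is invented for illustration):

```shell
# Repartition-style COUNT(DISTINCT): each "shard" file splits its values
# into 3 hash buckets; bucket i from every shard is combined and counted
# on shard i, and the per-bucket distinct counts sum to the global answer,
# because each distinct value hashes to exactly one bucket.
workdir=$(mktemp -d) && cd "$workdir"
printf 'a\nbb\nccc\n' > shard1.txt
printf 'a\ndd\neee\n' > shard2.txt
printf 'bb\nffff\n'   > shard3.txt
for shard in shard1.txt shard2.txt shard3.txt; do
  # toy hash: value length mod 3 picks the destination bucket
  awk -v src="$shard" '{ b = length($0) % 3; print > ("bucket_" b "_from_" src) }' "$shard"
done
total=0
for b in 0 1 2; do
  n=$(sort -u bucket_${b}_from_* | wc -l)   # distinct values within bucket b
  total=$((total + n))
done
echo "COUNT(DISTINCT) = $total"
```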
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Tue, Sep 1, 2015 at 10:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Sep 1, 2015 at 1:06 PM, Josh Berkus <josh@agliodbs.com> wrote:
You're assuming that our primary bottleneck for writes is IO. It's not
at present for most users, and it certainly won't be in the future. You
need to move your thinking on systems resources into the 21st century,
instead of solving the resource problems from 15 years ago.
Your experience doesn't match mine. I find that it's frequently
impossible to get the system to use all of the available CPU capacity,
either because you're bottlenecked on locks or because you are
bottlenecked on the I/O subsystem, and with the locking improvements
in newer versions, the former is becoming less and less common.
I think you're both right. I think that we need to fix the buffer
manager, to make its caching algorithm smarter. Since we're mostly
using the filesystem cache, this is particularly important for
PostgreSQL. We need to remember usage information for evicted blocks
for some period of time afterwards. This is, I suspect, largely a
problem specific to Postgres.
At the same time, I agree with Josh's assessment that long-term, we
are going to have the biggest problem with memory latency and memory
bandwidth, which are usually considered facets of CPU performance, and
with internal lock contention. Addressing the memory access bottleneck
dovetails with parallelism, in that it must be considered alongside
parallelism. Josh's "21st century" remark seems quite justified to me.
For a further example of this, check out my latest progress with
external sorting, which I plan to post later in the week. I/O isn't
the big problem there at all, and I now think we can make external
sorts close to internal sorts in performance across the board.
I imagine that Josh's experience is based on workloads that mostly fit
in shared_buffers, so I can see why you'd disagree if that was
something you've seen less of. I'll quote Hellerstein and Stonebraker
in 2007 [1]:
"Copying data in memory can be a serious bottleneck. Copies contribute
latency, consume CPU cycles, and can flood the CPU data cache. This fact is
often a surprise to people who have not operated or implemented a
database system, and assume that main-memory operations are “free”
compared to disk I/O. But in practice, throughput in a well-tuned
transaction processing DBMS is typically not I/O-bound. This is
achieved in high-end installations by purchasing sufficient disks and
RAM so that repeated page requests are absorbed by the buffer pool,
and disk I/Os are shared across the disk arms at a rate that can feed
the data appetite of all the processors in the system."
Our real future bottlenecks are:
* ability to handle more than a few hundred connections
* locking limits on the scalability of writes
* ability to manage large RAM and data caches
I do agree that all of those things are problems, FWIW.
These seem like our long term problems, to me.
[1]: http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf, page 213
--
Peter Geoghegan
On 2015-09-02 00:09, Josh Berkus wrote:
On 09/01/2015 02:29 PM, Tomas Vondra wrote:
For example assume that there are just two shards in two separate data
centers, connected by a link with limited bandwidth. Now, let's assume
you always keep a local replica for failover. So you have A1+A2 in DC1,
B1+B2 in DC2. If you're in DC1, then writing data to B1 means you also
have to write data to B2 and wait for it. So either you send the data to
each node separately (consuming 2x the bandwidth), or send it to B1 and
let it propagate to B2 e.g. through sync rep.
So while you may be right in single-DC deployments, with multi-DC
deployments the situation is quite different - not only that the network
bandwidth is not unlimited, but because latencies within DC may be a
fraction of latencies between the locations (to the extent that the
increase due to syncrep may be just noise). So the local replication may
be actually way faster.
I'm not seeing how the above is better using syncrep than using shard
copying?
Shard copying usually assumes that the origin node does the copy - the
data has to go twice through the slow connection. With replication you
can replicate locally over a fast connection.
I can imagine forwarding the data between B1 and B2 even with a purely
sharding solution, but at that point you effectively re-implemented
syncrep.
Not really, the mechanism is different and the behavior is different.
One critical deficiency in using binary syncrep is that you can't do
round-robin redundancy at all; every redundant node has to be an exact
mirror of another node. In a good HA distributed system, you want
multiple shards per node, and you want each shard to be replicated to a
different node, so that in the event of node failure you're not dumping
the full load on one other server.
This assumes that we use binary replication, but we can reasonably use
logical replication which can quite easily do filtering of what's
replicated where.
IMHO the design has to address the multi-DC setups somehow. I think that
many of the customers who are so concerned about scaling to many shards
are also concerned about availability in case of DC outages, no?
Certainly. But users located in a single DC shouldn't pay the same
overhead as users who are geographically spread.
Agreed, so we should support both ways, but I don't think it's necessary
to support both ways in version 0.1. It's just important to not paint
ourselves into a corner with design decisions that would make one of the
ways impossible.
If we write to multiple copies as a part of the sharding feature,
then that can be parallelized, so that we are waiting only as long as
the slowest write (or in failure cases, as long as the shard
timeout). Further, we can check for shard-copy health and update
shard availability data with each user request, so that the ability
to see stale/bad data is minimized.
Again, this assumes infinite network bandwidth.
In what way is the total network bandwidth used in the system different
for shard copying than for sync replication?
Again, when shards are distributed over multiple DCs (or actually even
multiple racks) the bandwidth and latency of a local copy will be much
better than those of a remote copy, so local replication can have a
much lower impact on cluster performance than a remote shard copy will.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Sep 1, 2015 at 9:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Sep 1, 2015 at 4:15 AM, Andres Freund <andres@anarazel.de>
wrote:
On 2015-08-31 20:54:51 -0400, Bruce Momjian wrote:
Uh, we already have a list of things we need to add to FDWs to make
them work, and Citus Data has provided a document of more things that
are needed, https://goo.gl/vJWF85. I am not sure how much bigger a red
flag you want to confirm that everyone agrees that major FDW
improvements are a requirement for this.

Several people saying that the FDW infrastructure isn't sufficient
right now is pretty far from implying that all of them agree that the
FDW API is the way to go.
I'm not sure myself. If it works out it's going to save us some work
and make it more realistic to get there sometime not too far off. But
I'm afraid that the resulting system will feel like our current
partitioning implementation. Yes, it kinda works, but it's hard to get
started, it doesn't support too many features, and you're kind of
afraid your relatives will see what you've done.
I'm not averse to making the "connect to the remote nodes" part of
this solution use something other than the FDW infrastructure at some
point in time if somebody's prepared to build something better. On
the other hand, I think it's extremely clear that the FDW
infrastructure has a large amount of potential upon which we have
thoroughly failed to capitalize. Patches have already been written
for UPDATE/DELETE pushdown and for join pushdown.
Will pushing down writes (Update/Delete) be sufficient to maintain sane
locking behaviour and deadlock detection for writes that occur on multiple
shards? For example, it could easily be the case that a single UPDATE
statement affects multiple shards and causes a deadlock due to waits
across the nodes. Unless we have some distributed lock manager or
some other way to know about locks taken across shards, it could be
difficult to detect deadlocks.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Sep 1, 2015 at 4:25 PM, Bruce Momjian <bruce@momjian.us> wrote:
The document opens a big question --- when queries can't be processed in
a traditional top-down fashion, Citus has the goal of sending groups of
results up to the coordinator, reordering them, then sending them back
to the shards for further processing, basically using the shards as
compute engines because the shards are no longer using local data to do
their computations. The two examples they give are COUNT(DISTINCT) and
a join across two sharded tables ("CANADA").

I assumed these queries were going to be solved by sending as digested
data as possible to the coordinator, and having the coordinator complete
any remaining processing. I think we are going to need to decide if
such "sending data back to shards" is something we are ever going to
implement. I can see FDWs _not_ working well for that use-case.
One related point to think about is how we envision handling statement
requests: do we want a centralized coordinator which processes all
requests, or could the requests be received by any node?
I think both kinds of systems have their own pros and cons: a
centralized-coordinator kind of system might be limited by the number of
simultaneous requests the coordinator can handle, and if we go the other
way and allow requests to be processed by each individual node, then we
have to think about replicating all metadata on all nodes.
Collecting statistics about different objects is another thing which
can differ depending on the strategy we choose for handling requests.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2015-09-02 PM 01:28, Amit Kapila wrote:
On Tue, Sep 1, 2015 at 9:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I'm not averse to making the "connect to the remote nodes" part of
this solution use something other than the FDW infrastructure at some
point in time if somebody's prepared to build something better. On
the other hand, I think it's extremely clear that the FDW
infrastructure has a large amount of potential upon which we have
thoroughly failed to capitalize. Patches have already been written
for UPDATE/DELETE pushdown and for join pushdown.

Will pushing down writes (Update/Delete) be sufficient to maintain sane
locking behaviour and deadlock detection for writes that occur on multiple
shards? For example, it could easily be the case that a single UPDATE
statement affects multiple shards and causes a deadlock due to waits
across the nodes. Unless we have some distributed lock manager or
some other way to know about locks taken across shards, it could be
difficult to detect deadlocks.
I wonder if Ashutosh's atomic foreign transactions patch would address any
issues inherent in such cases...
Thanks,
Amit
On 2015/09/02 14:28, Amit Langote wrote:
On 2015-09-02 PM 01:28, Amit Kapila wrote:
On Tue, Sep 1, 2015 at 9:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I'm not averse to making the "connect to the remote nodes" part of
this solution use something other than the FDW infrastructure at some
point in time if somebody's prepared to build something better. On
the other hand, I think it's extremely clear that the FDW
infrastructure has a large amount of potential upon which we have
thoroughly failed to capitalize. Patches have already been written
for UPDATE/DELETE pushdown and for join pushdown.
Will pushing down writes (Update/Delete) be sufficient to maintain sane
locking behaviour and deadlock detection for writes that occur on multiple
shards? For example, it could easily be the case that a single UPDATE
statement affects multiple shards and causes a deadlock due to waits
across the nodes. Unless we have some distributed lock manager or
some other way to know about locks taken across shards, it could be
difficult to detect deadlocks.
I wonder if Ashutosh's atomic foreign transactions patch would address any
issues inherent in such cases...
The UPDATE/DELETE pushdown, which I've proposed, would ensure the sane
behaviour for inherited UPDATEs/DELETEs, as existing non-pushed-down
UPDATE/DELETE does, because inheritance_planner guarantees that all
backends lock inheritance children in the same order to avoid needless
deadlocks.
Best regards,
Etsuro Fujita
On Wed, Sep 2, 2015 at 11:35 AM, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>
wrote:
On 2015/09/02 14:28, Amit Langote wrote:
On 2015-09-02 PM 01:28, Amit Kapila wrote:
On Tue, Sep 1, 2015 at 9:48 PM, Robert Haas <robertmhaas@gmail.com>
wrote:
I'm not averse to making the "connect to the remote nodes" part of
this solution use something other than the FDW infrastructure at some
point in time if somebody's prepared to build something better. On
the other hand, I think it's extremely clear that the FDW
infrastructure has a large amount of potential upon which we have
thoroughly failed to capitalize. Patches have already been written
for UPDATE/DELETE pushdown and for join pushdown.

Will pushing down writes (Update/Delete) be sufficient to maintain sane
locking behaviour and deadlock detection for writes that occur on multiple
shards? For example, it could easily be the case that a single UPDATE
statement affects multiple shards and causes a deadlock due to waits
across the nodes. Unless we have some distributed lock manager or
some other way to know about locks taken across shards, it could be
difficult to detect deadlocks.

I wonder if Ashutosh's atomic foreign transactions patch would address any
issues inherent in such cases...
The UPDATE/DELETE pushdown, which I've proposed, would ensure the sane
behaviour for inherited UPDATEs/DELETEs, as existing non-pushed-down
UPDATE/DELETE does, because inheritance_planner guarantees that all
backends lock inheritance children in the same order to avoid needless
deadlocks.
Will it be able to do that for row-level locks? Row-level locking occurs
during the update of a row, so will it be possible to ensure the order of
locks on rows?

Will it handle deadlocks across different table partitions? Consider
a case like the one below:

T1
1. Updates row R1 of T1 on shard S1
2. Updates row R2 of T2 on shard S2

T2
1. Updates row R2 of T2 on shard S2
2. Updates row R1 of T1 on shard S1
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2015-09-02 PM 03:25, Amit Kapila wrote:
On Wed, Sep 2, 2015 at 11:35 AM, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>
The UPDATE/DELETE pushdown, which I've proposed, would ensure the sane
behaviour for inherited UPDATEs/DELETEs, as existing non-pushed-down
UPDATE/DELETE does, because inheritance_planner guarantees that all
backends lock inheritance children in the same order to avoid needless
deadlocks.

Will it be able to do that for row-level locks? Row-level locking occurs
during the update of a row, so will it be possible to ensure the order of
locks on rows?

Will it handle deadlocks across different table partitions? Consider
a case like the one below:

T1
1. Updates row R1 of T1 on shard S1
2. Updates row R2 of T2 on shard S2

T2
1. Updates row R2 of T2 on shard S2
2. Updates row R1 of T1 on shard S1
As long as shards are processed in the same order in different
transactions, ISTM, this issue should not arise? I can imagine it becoming
a concern if parallel shard processing enters the scene. Am I missing
something?
Thanks,
Amit
On 2015/09/02 15:40, Amit Langote wrote:
On 2015-09-02 PM 03:25, Amit Kapila wrote:
On Wed, Sep 2, 2015 at 11:35 AM, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>
The UPDATE/DELETE pushdown, which I've proposed, would ensure the sane
behaviour for inherited UPDATEs/DELETEs, as existing non-pushed-down
UPDATE/DELETE does, because inheritance_planner guarantees that all
backends lock inheritance children in the same order to avoid needless
deadlocks.
Will it be able to do that for row-level locks? Row-level locking occurs
during the update of a row, so will it be possible to ensure the order of
locks on rows?

Will it handle deadlocks across different table partitions? Consider
a case like the one below:

T1
1. Updates row R1 of T1 on shard S1
2. Updates row R2 of T2 on shard S2

T2
1. Updates row R2 of T2 on shard S2
2. Updates row R1 of T1 on shard S1
As long as shards are processed in the same order in different
transactions, ISTM, this issue should not arise? I can imagine it becoming
a concern if parallel shard processing enters the scene. Am I missing
something?
Yeah, I think so, too.
Sorry, maybe my explanation above was not enough, but for inherited
UPDATEs/DELETEs, the table modification is also ensured to be done in
the same order. So, as Amit Langote said, both transactions will do the
updates in the same order.
Best regards,
Etsuro Fujita
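The guarantee being discussed here (every backend locks the inheritance children in one global order) is the classic lock-ordering deadlock-avoidance trick. A minimal sketch in Python, where the shard names and the update helper are illustrative stand-ins rather than any proposed API:

```python
import threading

# One lock per inheritance child/shard; in the scheme described above,
# inheritance_planner makes every backend take these in one global order.
shard_locks = {"t1": threading.Lock(), "t2": threading.Lock()}

def update_shards(shards):
    """Acquire shard locks in a single global order (sorted by name),
    regardless of the order in which the statement names the shards."""
    acquired = []
    try:
        for name in sorted(shards):  # the fixed global order is the key
            shard_locks[name].acquire()
            acquired.append(name)
        return list(acquired)        # the actual update would happen here
    finally:
        for name in reversed(acquired):
            shard_locks[name].release()

# Two concurrent transactions naming the shards in opposite order still
# both lock t1 before t2, so neither can hold t2 while waiting for t1.
results = []
a = threading.Thread(target=lambda: results.append(update_shards(["t1", "t2"])))
b = threading.Thread(target=lambda: results.append(update_shards(["t2", "t1"])))
a.start(); b.start(); a.join(); b.join()
print(results)  # both acquired in the order ['t1', 't2']
```

Note this only covers locks taken within one statement's plan; as the rest of the thread points out, it does not help when the conflicting locks come from separate statements in a transaction.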
Amit Langote wrote:
On 2015-09-02 PM 03:25, Amit Kapila wrote:
Will it handle deadlocks across different table partitions? Consider
a case like the one below:

T1
1. Updates row R1 of T1 on shard S1
2. Updates row R2 of T2 on shard S2

T2
1. Updates row R2 of T2 on shard S2
2. Updates row R1 of T1 on shard S1
As long as shards are processed in the same order in different
transactions, ISTM, this issue should not arise? I can imagine it becoming
a concern if parallel shard processing enters the scene. Am I missing
something?
That would only hold for a single query, right?
If 1. and 2. in the above example come from different queries within one
transaction, you cannot guarantee that shards are processed in the same order.
So T1 and T2 could deadlock.
Yours,
Laurenz Albe
On 2015-09-02 PM 04:07, Albe Laurenz wrote:
Amit Langote wrote:
On 2015-09-02 PM 03:25, Amit Kapila wrote:
Will it handle deadlocks across different table partitions? Consider
a case like the one below:

T1
1. Updates row R1 of T1 on shard S1
2. Updates row R2 of T2 on shard S2

T2
1. Updates row R2 of T2 on shard S2
2. Updates row R1 of T1 on shard S1

As long as shards are processed in the same order in different
transactions, ISTM, this issue should not arise? I can imagine it becoming
a concern if parallel shard processing enters the scene. Am I missing
something?

That would only hold for a single query, right?
If 1. and 2. in the above example come from different queries within one
transaction, you cannot guarantee that shards are processed in the same
order. So T1 and T2 could deadlock.
Sorry, I failed to see why that would be the case. Could you elaborate?
Thanks,
Amit
On 2015/09/02 16:40, Amit Langote wrote:
On 2015-09-02 PM 04:07, Albe Laurenz wrote:
Amit Langote wrote:
On 2015-09-02 PM 03:25, Amit Kapila wrote:
Will it handle deadlocks across different table partitions? Consider
a case like the one below:

T1
1. Updates row R1 of T1 on shard S1
2. Updates row R2 of T2 on shard S2

T2
1. Updates row R2 of T2 on shard S2
2. Updates row R1 of T1 on shard S1

As long as shards are processed in the same order in different
transactions, ISTM, this issue should not arise? I can imagine it becoming
a concern if parallel shard processing enters the scene. Am I missing
something?

That would only hold for a single query, right?
If 1. and 2. in the above example come from different queries within one
transaction, you cannot guarantee that shards are processed in the same
order. So T1 and T2 could deadlock.
Sorry, I failed to see why that would be the case. Could you elaborate?
I think Laurenz would assume that the updates 1. and 2. in the above
transactions are performed *in a non-inherited manner*. If that's
right, T1 and T2 could deadlock, but I think we assume here to run
transactions over shards *in an inherited manner*.
Best regards,
Etsuro Fujita
Etsuro Fujita wrote:
On 2015/09/02 16:40, Amit Langote wrote:
On 2015-09-02 PM 04:07, Albe Laurenz wrote:
Amit Langote wrote:
On 2015-09-02 PM 03:25, Amit Kapila wrote:
Will it handle deadlocks across different table partitions? Consider
a case like the one below:

T1
1. Updates row R1 of T1 on shard S1
2. Updates row R2 of T2 on shard S2

T2
1. Updates row R2 of T2 on shard S2
2. Updates row R1 of T1 on shard S1

As long as shards are processed in the same order in different
transactions, ISTM, this issue should not arise? I can imagine it becoming
a concern if parallel shard processing enters the scene. Am I missing
something?

That would only hold for a single query, right?
If 1. and 2. in the above example come from different queries within one
transaction, you cannot guarantee that shards are processed in the same
order. So T1 and T2 could deadlock.
Sorry, I failed to see why that would be the case. Could you elaborate?
I think Laurenz would assume that the updates 1. and 2. in the above
transactions are performed *in a non-inherited manner*. If that's
right, T1 and T2 could deadlock, but I think we assume here to run
transactions over shards *in an inherited manner*.
Yes, but does every update affect all shards?
If I say "UPDATE t1 SET col = 1 WHERE id = 42" and the row with id 42
happens to be on shard S1, the update would only affect that shard, right?
Now if "UPDATE t2 SET col = 1 WHERE id = 42" would only take place on
shard S2, and two transactions issue both updates in different order,
one transaction would be waiting for a lock on shard S1, while the other
would be waiting for a lock on shard S2, right?
But maybe I'm missing something fundamental.
Yours,
Laurenz Albe
On 2015-09-02 PM 05:07, Etsuro Fujita wrote:
On 2015/09/02 16:40, Amit Langote wrote:
On 2015-09-02 PM 04:07, Albe Laurenz wrote:
That would only hold for a single query, right?
If 1. and 2. in the above example come from different queries within one
transaction, you cannot guarantee that shards are processed in the same
order. So T1 and T2 could deadlock.
Sorry, I failed to see why that would be the case. Could you elaborate?
I think Laurenz would assume that the updates 1. and 2. in the above
transactions are performed *in a non-inherited manner*. If that's right,
T1 and T2 could deadlock, but I think we assume here to run transactions
over shards *in an inherited manner*.
I think Albe may have a point here...
Even the inherited updates case appears to cause a deadlock if the updates
are in different queries. Demonstrated below:
-- setup
CREATE TABLE t(a int);
CREATE TABLE t1() INHERITS(t);
CREATE TABLE t2() INHERITS(t);
INSERT INTO t1 VALUES (1);
INSERT INTO t2 VALUES (2);
-- in session 1
BEGIN;
UPDATE t SET a = a + 1 WHERE a = 1;
<ok>
-- in session 2
BEGIN;
UPDATE t SET a = a + 1 WHERE a = 2;
<ok>
-- back in session 1
UPDATE t SET a = a + 1 WHERE a = 2;
<waits>
-- back in session 2
UPDATE t SET a = a + 1 WHERE a = 1;
<deadlock is detected>
Thanks,
Amit
On Wed, Sep 2, 2015 at 12:49 AM, Josh Berkus <josh@agliodbs.com> wrote:
On 09/01/2015 11:36 AM, Tomas Vondra wrote:
We want multiple copies of shards created by the sharding system itself.
Having a separate, and completely orthogonal, redundancy system to the
sharding system is overly burdensome on the DBA and makes low-data-loss
HA impossible.

IMHO it'd be quite unfortunate if the design would make it impossible to
combine those two features (e.g. creating standbys for shards and
failing over to them).

It's true that solving HA at the sharding level (by keeping multiple
copies of each shard) may be simpler than combining sharding and
standbys, but I don't see why it makes low-data-loss HA impossible.

Other way around, that is, having replication standbys as the only
method of redundancy requires either high data loss or high latency for
all writes.

In the case of async rep, every time we fail over a node, the entire
cluster would need to roll back to the last common known-good replay
point, hence high data loss.

In the case of sync rep, we are required to wait for at least double
the network lag time in order to do a single write ... making
write-scalability quite difficult.

Further, if using replication the sharding system would have no way to
(a) find out immediately if a copy was bad and (b) fail over quickly to
a copy of the shard if the first requested copy was not responding.
With async replication, we also can't use multiple copies of the same
shard as a way to balance read workloads.

If we write to multiple copies as a part of the sharding feature, then
that can be parallelized, so that we are waiting only as long as the
slowest write (or in failure cases, as long as the shard timeout).
Further, we can check for shard-copy health and update shard
availability data with each user request, so that the ability to see
stale/bad data is minimized.
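The write path described in the quoted paragraph (parallel writes to every shard copy, bounded by the slowest acknowledging copy or a timeout) can be sketched with a thread pool. `SHARD_TIMEOUT`, the copy names, and the `write_one` callback are illustrative assumptions, not an actual pg_shard interface:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError

SHARD_TIMEOUT = 2.0  # seconds; illustrative value

def write_to_copies(copies, row, write_one):
    """Send the same write to every copy of a shard in parallel.

    Returns the set of copies that acknowledged in time; the caller can
    mark the others as stale in the shard-availability metadata.
    """
    healthy = set()
    with ThreadPoolExecutor(max_workers=len(copies)) as pool:
        futures = {pool.submit(write_one, c, row): c for c in copies}
        try:
            for fut in as_completed(futures, timeout=SHARD_TIMEOUT):
                if fut.exception() is None:
                    healthy.add(futures[fut])
        except TimeoutError:
            pass  # copies missing the deadline are treated as unhealthy
    return healthy

# Usage with a fake writer: copy "c3" is unreachable, but the write as a
# whole still completes as soon as the surviving copies acknowledge.
def fake_write(copy, row):
    if copy == "c3":
        raise IOError("copy unreachable")

ok = write_to_copies(["c1", "c2", "c3"], {"id": 1}, fake_write)
print(sorted(ok))  # ['c1', 'c2']
```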
XC (and I guess XL and pgPool II as well) did this by firing the same DML
statement at all the copies after resolving any volatile references (e.g.
now()) in the DML, so that all the copies get the same values. That method
however needed some row identifier which can identify the same row on all
the replicas. The primary key is usually used as the row identifier, but
not all use cases which require shards to be replicated have a primary key
in their sharded tables.
There are obvious problems with multiplexing writes, which you can
figure out if you knock pg_shard around a bit. But I really think that
solving those problems is the only way to go.

Mind you, I see a strong place for binary replication and BDR for
multi-region redundancy; you really don't want that to be part of the
sharding system if you're aiming for write scalability.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2015-09-02 PM 06:41, Amit Langote wrote:
I think Albe may have a point here...
Even the inherited updates case appears to cause a deadlock if the updates
are in different queries. Demonstrated below:

-- setup
CREATE TABLE t(a int);
CREATE TABLE t1() INHERITS(t);
CREATE TABLE t2() INHERITS(t);
INSERT INTO t1 VALUES (1);
INSERT INTO t2 VALUES (2);

-- in session 1
BEGIN;
UPDATE t SET a = a + 1 WHERE a = 1;
<ok>

-- in session 2
BEGIN;
UPDATE t SET a = a + 1 WHERE a = 2;
<ok>

-- back in session 1
UPDATE t SET a = a + 1 WHERE a = 2;
<waits>

-- back in session 2
UPDATE t SET a = a + 1 WHERE a = 1;
<deadlock is detected>
Which, I now realize, is not the worry Amit Kapila expresses.
The deadlock was *indeed detected* in this case, with all the locks in the
same PG instance. In a sharded environment with multiple PG instances,
that becomes tricky. A DLM (distributed lock manager/deadlock detector)
seems indeed necessary, as Amit K. suspects.
Thanks,
Amit
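A distributed deadlock detector of the kind being suggested typically unions the per-node wait-for graphs and searches the result for a cycle. A minimal sketch using Amit Kapila's T1/T2 example, where the graph representation (a dict of waiter -> holders edges) is an illustrative assumption:

```python
# Each node reports its local wait-for edges (waiter -> holders) to the
# detector; a deadlock spanning nodes shows up as a cycle in the union.
def find_deadlock(edges):
    """Return a transaction on a cycle of the combined wait-for graph,
    or None if there is no cycle."""
    visited, on_stack = set(), set()

    def dfs(txn):
        visited.add(txn)
        on_stack.add(txn)
        for holder in edges.get(txn, ()):
            if holder in on_stack:
                return holder          # back edge: cycle found
            if holder not in visited:
                hit = dfs(holder)
                if hit is not None:
                    return hit
        on_stack.discard(txn)
        return None

    for txn in edges:
        if txn not in visited:
            hit = dfs(txn)
            if hit is not None:
                return hit
    return None

# In the example, each shard locally sees only one edge, so no single
# instance can detect a cycle; the union reveals T1 <-> T2.
shard_s1 = {"T2": {"T1"}}   # on S1, T2 waits for T1's lock on R1
shard_s2 = {"T1": {"T2"}}   # on S2, T1 waits for T2's lock on R2
combined = {"T1": {"T2"}, "T2": {"T1"}}

print(find_deadlock(shard_s1), find_deadlock(shard_s2))  # None None
print(find_deadlock(combined) is not None)               # True
```

This is only the detection half; choosing a victim and aborting it across nodes is a separate problem.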
On Wed, Sep 2, 2015 at 3:55 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
wrote:
On 2015-09-02 PM 06:41, Amit Langote wrote:
I think Albe may have a point here...
Even the inherited updates case appears to cause a deadlock if the updates
are in different queries. Demonstrated below:

-- setup
CREATE TABLE t(a int);
CREATE TABLE t1() INHERITS(t);
CREATE TABLE t2() INHERITS(t);
INSERT INTO t1 VALUES (1);
INSERT INTO t2 VALUES (2);

-- in session 1
BEGIN;
UPDATE t SET a = a + 1 WHERE a = 1;
<ok>

-- in session 2
BEGIN;
UPDATE t SET a = a + 1 WHERE a = 2;
<ok>

-- back in session 1
UPDATE t SET a = a + 1 WHERE a = 2;
<waits>

-- back in session 2
UPDATE t SET a = a + 1 WHERE a = 1;
<deadlock is detected>

Which, I now realize, is not the worry Amit Kapila expresses.
The deadlock was *indeed detected* in this case, with all the locks in the
same PG instance. In a sharded environment with multiple PG instances,
that becomes tricky. A DLM (distributed lock manager/deadlock detector)
seems indeed necessary, as Amit K. suspects.
Right. XC/XL did not address this issue and they rely on statement timeouts
to break distributed deadlocks.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Sep 2, 2015 at 4:19 PM, Pavan Deolasee <pavan.deolasee@gmail.com>
wrote:
On Wed, Sep 2, 2015 at 3:55 PM, Amit Langote <
Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2015-09-02 PM 06:41, Amit Langote wrote:
Which, I now realize, is not the worry Amit Kapila expresses.
The deadlock was *indeed detected* in this case, with all the locks in
the
same PG instance. In a sharded environment with multiple PG instances,
that becomes tricky. DLM (distributed lock manager/deadlock detector)
seems indeed necessary, as Amit K. suspects.

Right. XC/XL did not address this issue and they rely on statement
timeouts to break distributed deadlocks.
I think that will be difficult for the application to decide, and it then
needs to pick the same value for all statements, which is tricky because
different statements can take different amounts of time. I think it is
better to have a real solution for this problem: deadlocks should be
detected.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2015/09/02 20:42, Amit Kapila wrote:
On Wed, Sep 2, 2015 at 4:19 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

On Wed, Sep 2, 2015 at 3:55 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2015-09-02 PM 06:41, Amit Langote wrote:
Which, I now realize, is not the worry Amit Kapila expresses.
The deadlock was *indeed detected* in this case, with all the locks
in the same PG instance. In a sharded environment with multiple PG
instances, that becomes tricky. A DLM (distributed lock
manager/deadlock detector) seems indeed necessary, as Amit K. suspects.
Ah, you are right.
Right. XC/XL did not address this issue and they rely on statement
timeouts to break distributed deadlocks.
I think that will be difficult for the application to decide, and it then
needs to pick the same value for all statements, which is tricky because
different statements can take different amounts of time. I think it is
better to have a real solution for this problem: deadlocks should be
detected.
+1
Best regards,
Etsuro Fujita
On Tue, Sep 1, 2015 at 06:11:45PM -0400, Bruce Momjian wrote:
Let me be clearer about what the Citus Data paper shows. I said originally
that the data was sent to the coordinator, sorted, then resent to the
shards, but the document:

https://goo.gl/vJWF85
https://www.citusdata.com/blog/114-how-to-build-your-distributed-database

has the shards create the groups, and the groups are sent to the other
shards. For example, to do COUNT(DISTINCT) if you have three shards,
then each shard breaks its data into 3 buckets (1B in size), then the
first bucket from each of the three shards goes to the first shard, and
the second bucket goes to the second shard, etc.

Basically, they are doing map-reduce, and the shards are creating
additional batches that get shipped to other shards. I can see FDWs not
working well in that case as you are really creating a new data layout
just for the query. This explains why the XC/XL people are saying they
would use FDWs if they existed at the time they started development,
while the Citus Data people are saying they couldn't use FDWs as they
currently exist. They probably both needed FDW improvements, but I
think the Citus Data features would need a lot more.
To expand on this, using FDWs, it means each shard would create a
temporary table on the other shards and send some of its data to those
shards. Once a shard gets all its data from the other shards, it will
process the data and send the result to the collector.
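The repartitioning step described above can be sketched as a hash shuffle; the in-memory lists stand in for shards, and the bucketing scheme is an illustrative assumption about how such an exchange could work:

```python
# Distributed COUNT(DISTINCT): each shard hash-partitions its values into
# one bucket per shard, buckets are exchanged so that every distinct value
# lands on exactly one shard, and the coordinator just sums the counts.
def count_distinct(shards):
    n = len(shards)
    # Map phase: each shard splits its local values into n buckets.
    buckets = [[set() for _ in range(n)] for _ in shards]
    for i, shard in enumerate(shards):
        for v in shard:
            buckets[i][hash(v) % n].add(v)
    # Shuffle + reduce phase: shard j receives bucket j from every shard;
    # the union is its share of the distinct values, counted locally.
    partial_counts = [
        len(set().union(*(buckets[i][j] for i in range(n))))
        for j in range(n)
    ]
    return sum(partial_counts)  # coordinator only adds small integers

shards = [[1, 2, 2, 3], [3, 4, 4], [1, 5]]
print(count_distinct(shards))  # 5 distinct values: {1, 2, 3, 4, 5}
```

The point of the shuffle is that no shard or coordinator ever has to hold the full distinct set, which is exactly the data movement an FDW has no natural hook for.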
That certainly seems like something FDWs would not do well. Frankly, I
am unclear how Citus Data was able to do this with only backend hooks.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Tue, Sep 1, 2015 at 7:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Sep 1, 2015 at 12:00 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

My worry is that if we start implementing them again from scratch, it
will take a few years before we get them in a usable state. What XC/XL
lacked is probably a Robert Haas or a Tom Lane who could look at the
work and suggest major edits. If that had happened, the quality of the
product could have been much better today. I don't mean to denigrate the
developers who worked on XC/XL, but there is no harm in accepting that
if someone with a much better understanding of the whole system had been
part of the team, that would have positively impacted the project. Is
that an angle worth exploring? Does it make sense to commit some more
resources to, say, XC or XL and try to improve the quality of the
product even further? To be honest, XL is in far far better shape
(haven't really tried XC in a while) and some more QA/polishing can
make it production ready much sooner.
From my point of view, and EnterpriseDB's point of view, anything that
doesn't go into the core PostgreSQL distribution isn't really getting
us where we need to be. If there's code in XL that would be valuable
to merge into core PostgreSQL, then let's do it. If the code cannot
be used but there are lessons we can learn that will make what does go
into core PostgreSQL better, let's learn them. However, I don't think
it's serving anybody very well that we have the XC fork, and multiple
forks of the XC fork, floating around out there and people are working
on those instead of working on core PostgreSQL. The reality is that
we don't have enough brainpower to spread it across 2 or 3 or 4 or 5
different projects and have all of them be good. The reality is,
also, that horizontal scalability isn't an optional feature. There
was a point in time at which the PostgreSQL project's official policy
on replication was that it did not belong in core. That was a bad
policy; thankfully, it was reversed, and the result was Hot Standby
and Streaming Replication, incredibly important technologies without
which we would not be where we are today. Horizontal scalability is
just as essential.
Agree with you, Robert.
One lesson we got from XL is that we need a testing framework for
clusters, so any cluster project should at least pass functional and
performance testing. XL was very easy to break, and I wonder how many
corner cases still exist. We tried several other approaches, and while
reading the papers was fun, in practice we found many devilish details
which made the paper be just a paper.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 09/01/2015 04:14 PM, Petr Jelinek wrote:
On 2015-09-02 00:09, Josh Berkus wrote:
On 09/01/2015 02:29 PM, Tomas Vondra wrote:
So while you may be right in single-DC deployments, with multi-DC
deployments the situation is quite different - not only is the network
bandwidth not unlimited, but the latencies within a DC may be a
fraction of the latencies between the locations (to the extent that the
increase due to syncrep may be just noise). So the local replication may
actually be way faster.

I'm not seeing how the above is better using syncrep than using shard
copying?

Shard copying usually assumes that the origin node does the copy - the
data has to go twice through the slow connection. With replication you
can replicate locally over a fast connection.
Ah, I was thinking of the case of having a single set of copies in the
remote DC, but of course that isn't going to be the case with a highly
redundant setup.
Basically this seems to be saying that, in an ideal setup, we'd have
some kind of synchronous per-shard replication. We don't have that at
present (sync rep is whole-node, and BDR is asynchronous). There's also
the question of how to deal with failures and taking bad nodes out of
circulation in such a setup, especially considering that the writes
could be coming from multiple other nodes.
Not really, the mechanism is different and the behavior is different.
One critical deficiency in using binary syncrep is that you can't do
round-robin redundancy at all; every redundant node has to be an exact
mirror of another node. In a good HA distributed system, you want
multiple shards per node, and you want each shard to be replicated to a
different node, so that in the event of node failure you're not dumping
the full load on one other server.

This assumes that we use binary replication, but we can reasonably use
logical replication, which can quite easily do filtering of what's
replicated where.

Is there a way to do logical synchronous replication? I didn't think
there was.
IMHO the design has to address the multi-DC setups somehow. I think that
many of the customers who are so concerned about scaling to many shards
are also concerned about availability in case of DC outages, no?

Certainly. But users located in a single DC shouldn't pay the same
overhead as users who are geographically spread.

Agreed, so we should support both ways, but I don't think it's necessary
to support both ways in version 0.1. It's just important to not paint
ourselves into a corner with design decisions that would make one of the
ways impossible.
Exactly!
Let me explain why I'm so vocal on this point. Postgres-XC didn't deal
with the redundancy/node replacement at all until after version 1.0.
Then, when they tried to address it, they discovered that the code was
chock full of assumptions that "1 node == 1 shard", and breaking that
assumption would require a total refactor of the code (which never
happened). I don't want to see a repeat of that mistake.
Even if it's only on paper, any new sharding design needs to address
these questions:
1. How do we ensure no/minimal data is lost if we lose a node?
2. How do we replace a lost node (without taking the cluster down)?
2. a. how do we allow an out-of-sync node to "catch up"?
3. How do we maintain metadata about good/bad nodes (and shard locations)?
4. How do we add nodes to expand the cluster?
There doesn't need to be code for all of the above from version 0.1, but
there needs to be a plan to tackle those problems. Otherwise, we'll
just end up with another dead-end, not-useful-in-production technology.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 1, 2015 at 11:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
It would be a bad idea to cling blindly to the FDW infrastructure if
it's fundamentally inadequate to do what we want. On the other hand,
it would also be a bad idea to set about recreating it without a
really good reason, and - just to take one example - the fact that it
doesn't currently push down DML operations to the remote side is not a
really good reason to rewrite the whole thing. On the contrary, it's
a reason to put some energy into the already-written patch which
implements that optimization.
The problem with FDW for these purposes as I see it is that too much
intelligence is relegated to the implementer of the API. There needs
to be a mechanism so that the planner can rewrite the remote query and
then do some after-the-fact processing. This is exactly what Citus does;
if you send out AVG(foo) it rewrites that to SUM(foo) and COUNT(foo)
so that aggregation can be properly weighted to the result. To do
this (distributed OLAP-type processing) right, the planner needs to
*know* that this table is in fact distributed and also know that it
can make SQL-compatible adjustments to the query.
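The AVG(foo) rewrite described above can be sketched in a few lines (a hypothetical illustration, not Citus code): each shard returns a partial (sum, count) pair, and the coordinator recombines them, so the result stays correctly weighted by per-shard row counts.

```python
# Hypothetical sketch of the AVG -> SUM/COUNT rewrite. Each shard
# computes partial aggregates; the coordinator recombines them so the
# average is weighted by per-shard row counts.

def shard_partial_avg(rows):
    """What each shard returns after the rewrite: (SUM(foo), COUNT(foo))."""
    vals = [r for r in rows if r is not None]
    return (sum(vals), len(vals))

def coordinator_avg(partials):
    """Recombine partial aggregates into the final AVG."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None

# Two shards with different row counts: naively averaging the two shard
# averages (3.0 and 30.0) would give 16.5; the weighted recombination
# gives the correct (15 + 60) / (5 + 2).
partials = [shard_partial_avg([1, 2, 3, 4, 5]),   # (15, 5)
            shard_partial_avg([10, 50])]          # (60, 2)
print(coordinator_avg(partials))
```

This is why the planner has to know the table is distributed: only then can it emit the SUM/COUNT form to the shards and keep the final division for itself.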
This strikes me as a bit of a conflict of interest with FDW, which
seems to want to hide the fact that it's foreign; the FDW
implementation makes its own optimization decisions, which might make
sense for single table queries but breaks down in the face of joins.
merlin
On Wed, Sep 2, 2015 at 1:59 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Tue, Sep 1, 2015 at 11:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
It would be a bad idea to cling blindly to the FDW infrastructure if
it's fundamentally inadequate to do what we want. On the other hand,
it would also be a bad idea to set about recreating it without a
really good reason, and - just to take one example - the fact that it
doesn't currently push down DML operations to the remote side is not a
really good reason to rewrite the whole thing. On the contrary, it's
a reason to put some energy into the already-written patch which
implements that optimization.
The problem with FDW for these purposes as I see it is that too much
intelligence is relegated to the implementer of the API. There needs
to be a mechanism so that the planner can rewrite the remote query and
then do some after-the-fact processing. This is exactly what Citus does;
if you send out AVG(foo) it rewrites that to SUM(foo) and COUNT(foo)
so that aggregation can be properly weighted to the result. To do
this (distributed OLAP-type processing) right, the planner needs to
*know* that this table is in fact distributed and also know that it
can make SQL-compatible adjustments to the query.
This strikes me as a bit of a conflict of interest with FDW, which
seems to want to hide the fact that it's foreign; the FDW
implementation makes its own optimization decisions, which might make
sense for single table queries but breaks down in the face of joins.
Well, I don't think that ALL of the logic should go into the FDW. In
particular, in this example, parallel aggregate needs the same query
rewrite, so the logic for that should live in core so that both
parallel and distributed queries get the benefit.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Sep 2, 2015 at 1:57 PM, Josh Berkus <josh@agliodbs.com> wrote:
Even if it's only on paper, any new sharding design needs to address
these questions:
1. How do we ensure no/minimal data is lost if we lose a node?
2. How do we replace a lost node (without taking the cluster down)?
2. a. how do we allow an out-of-sync node to "catch up"?
3. How do we maintain metadata about good/bad nodes (and shard locations)?
4. How do we add nodes to expand the cluster?
There doesn't need to be code for all of the above from version 0.1, but
there needs to be a plan to tackle those problems. Otherwise, we'll
just end up with another dead-end, not-useful-in-production technology.
This is a good point, and I think I agree with it. Let me make a few
observations:
1. None of this stuff matters very much when the data is strictly
read-only. You don't lose any data because you made enough copies at
some point in the distant past to ensure that you wouldn't. You
replace a lost node by taking a new copy. Nodes never need to catch up
because there are no changes happening. To bring up a new node,
you make a copy of an existing node (which doesn't change in the
meantime). So most of these concerns are about how to handle writes.
2. None of this stuff matters when you only have one copy of the data.
Your system is low-availability, but you just don't care for whatever
reason. The issue arises when you have multiple copies of the data,
and the data is being changed. Now, you have to worry about the
copies getting out of sync with each other, especially when failures
happen.
3. IIUC, Postgres-XC handles this problem by reducing at least
volatile functions, maybe all functions, to constants. Then it
generates an SQL statement to be sent to the data node to make the
appropriate change. If there's more than one copy of the data, we
send a separate copy of the SQL statement to every node. I'm not sure
exactly what happens if some of those nodes are not available, but I
don't think it's anything good. Fundamentally, this model doesn't
allow for many good options in that case.
4. Therefore, I think that we should instead use logical replication,
which might be either synchronous or asynchronous. When you modify
one copy of the data, that change will then be replicated to all other
nodes. If you are OK with eventual consistency, this replication can
be asynchronous, and nodes that are off-line will catch up when they
are on-line. If you are not OK with that, then you must replicate
synchronously to every node before transaction commit; or at least you
must replicate synchronously to every node that is currently on-line.
This presents some challenges: logical decoding currently can't
replicate transactions that are still in process - replication starts
when the transaction commits. Also, we don't have any way for
synchronous replication to wait for multiple nodes. But in theory
those seem like limitations that can be lifted. Also, the GTM needs
to be aware that this stuff is happening, or it will DTWT. That too
seems like a problem that can be solved.
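The "wait for multiple nodes" piece mentioned above amounts to a quorum check on standby acknowledgements before a commit is allowed to return. A minimal hypothetical sketch (invented names, not the actual syncrep code):

```python
# Hypothetical sketch of synchronous replication across multiple nodes:
# commit may return only once at least `quorum` standbys have confirmed
# flushing the commit record.

def quorum_reached(acks, quorum):
    """acks: mapping standby -> True once it confirmed the commit flush."""
    confirmed = sum(1 for ok in acks.values() if ok)
    return confirmed >= quorum

acks = {"standby1": True, "standby2": False, "standby3": True}
print(quorum_reached(acks, 2))  # True: 2 of 3 standbys have acknowledged
print(quorum_reached(acks, 3))  # False: the committing backend keeps waiting
```

In a real implementation the committing backend would block on this predicate (re-evaluating as acknowledgements arrive) rather than poll it once.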
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 09/02/2015 11:41 AM, Robert Haas wrote:
On Wed, Sep 2, 2015 at 1:57 PM, Josh Berkus <josh@agliodbs.com> wrote:
Even if it's only on paper, any new sharding design needs to address
these questions:
1. How do we ensure no/minimal data is lost if we lose a node?
2. How do we replace a lost node (without taking the cluster down)?
2. a. how do we allow an out-of-sync node to "catch up"?
3. How do we maintain metadata about good/bad nodes (and shard locations)?
4. How do we add nodes to expand the cluster?
There doesn't need to be code for all of the above from version 0.1, but
there needs to be a plan to tackle those problems. Otherwise, we'll
just end up with another dead-end, not-useful-in-production technology.
This is a good point, and I think I agree with it. Let me make a few
observations:
1. None of this stuff matters very much when the data is strictly
read-only.
Yep.
2. None of this stuff matters when you only have one copy of the data.
Your system is low-availability, but you just don't care for whatever
reason.
Uh-huh.
3. IIUC, Postgres-XC handles this problem by reducing at least
volatile functions, maybe all functions, to constants. Then it
generates an SQL statement to be sent to the data node to make the
appropriate change. If there's more than one copy of the data, we
send a separate copy of the SQL statement to every node. I'm not sure
exactly what happens if some of those nodes are not available, but I
don't think it's anything good. Fundamentally, this model doesn't
allow for many good options in that case.
pg_shard also sends the data to each node, and automatically notices
which nodes are not responding and takes them out of availability.
There isn't a "catch up" feature yet (AFAIK), or any attempt to reduce
volatile functions.
For that matter, last I worked on it Greenplum also did multiplexing via
the writing node (or via the data loader). So this is a popular
approach; it has a number of drawbacks, though, of which volatile
functions are a major one.
4. Therefore, I think that we should instead use logical replication,
which might be either synchronous or asynchronous. When you modify
one copy of the data, that change will then be replicated to all other
nodes. If you are OK with eventual consistency, this replication can
be asynchronous, and nodes that are off-line will catch up when they
are on-line. If you are not OK with that, then you must replicate
synchronously to every node before transaction commit; or at least you
must replicate synchronously to every node that is currently on-line.
This presents some challenges: logical decoding currently can't
replicate transactions that are still in process - replication starts
when the transaction commits. Also, we don't have any way for
synchronous replication to wait for multiple nodes.
Well, there is a WIP patch for that, which IMHO would be much improved
by having a concrete use-case like this one. What nobody is working on
-- and we've vetoed in the past -- is a way of automatically failing and
removing from replication any node which repeatedly fails to sync, which
would be a requirement for this model.
You'd also need a way to let the connection nodes know when a replica
has fallen behind so that they can be taken out of
load-balancing/sharding for read queries. For the synchronous model,
that would be "fallen behind at all"; for asynchronous it would be
"fallen more than ### behind".
But in theory
those seem like limitations that can be lifted. Also, the GTM needs
to be aware that this stuff is happening, or it will DTWT. That too
seems like a problem that can be solved.
Yeah? I'd assume that a GTM would be antithetical to two-stage copying.
I'm not a big fan of a GTM at all, frankly; it makes clusters much
harder to set up, and becomes a SPOF.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Wed, Sep 2, 2015 at 3:03 PM, Josh Berkus <josh@agliodbs.com> wrote:
4. Therefore, I think that we should instead use logical replication,
which might be either synchronous or asynchronous. When you modify
one copy of the data, that change will then be replicated to all other
nodes. If you are OK with eventual consistency, this replication can
be asynchronous, and nodes that are off-line will catch up when they
are on-line. If you are not OK with that, then you must replicate
synchronously to every node before transaction commit; or at least you
must replicate synchronously to every node that is currently on-line.
This presents some challenges: logical decoding currently can't
replicate transactions that are still in process - replication starts
when the transaction commits. Also, we don't have any way for
synchronous replication to wait for multiple nodes.
Well, there is a WIP patch for that, which IMHO would be much improved
by having a concrete use-case like this one. What nobody is working on
-- and we've vetoed in the past -- is a way of automatically failing and
removing from replication any node which repeatedly fails to sync, which
would be a requirement for this model.
Yep. It's clear to me we need that in general, not just for sharding.
To me, the key is to make sure there's a way for the cluster-ware to
know about the state transitions. Currently, when the synchronous
standby changes, PostgreSQL doesn't tell anyone. That's a problem.
You'd also need a way to let the connection nodes know when a replica
has fallen behind so that they can be taken out of
load-balancing/sharding for read queries. For the synchronous model,
that would be "fallen behind at all"; for asynchronous it would be
"fallen more than ### behind".
How is that different from the previous thing? Just that we'd treat
"lagging" as "down" beyond some threshold? That doesn't seem like a
mandatory feature.
But in theory
those seem like limitations that can be lifted. Also, the GTM needs
to be aware that this stuff is happening, or it will DTWT. That too
seems like a problem that can be solved.
Yeah? I'd assume that a GTM would be antithetical to two-stage copying.
I don't think so. If transaction A writes data on X which is
replicated to Y and then commits, a new snapshot which shows A as
committed can't be used on Y until A's changes have been replicated
there. That could be enforced by having the commit of A wait for
replication, or by having an attempt by a later transaction to use the
snapshot on Y wait until replication completes, or some even more
sophisticated strategy that considers whether the replication backlog
touches the same data that the new transaction will read. It's
complicated, but it doesn't seem intractable.
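The second option described above (making the snapshot wait on node Y) can be sketched in terms of replay positions (a hypothetical illustration; names invented): a snapshot is safe to use on a node only once the node has replayed every commit the snapshot considers visible.

```python
# Hypothetical sketch: transaction A commits at LSN 120 on node X and is
# replicated to node Y. A snapshot that shows A as committed must not be
# used on Y until Y has replayed past A's commit LSN.

def snapshot_usable_on(node_replayed_lsn, visible_commit_lsns):
    """True once the node has replayed every commit visible in the snapshot."""
    return all(lsn <= node_replayed_lsn for lsn in visible_commit_lsns)

snapshot = [80, 120]                       # commit LSNs visible in snapshot
print(snapshot_usable_on(100, snapshot))   # False: LSN 120 not yet on Y; wait
print(snapshot_usable_on(150, snapshot))   # True: safe to read on Y now
```

The more sophisticated strategy mentioned (checking whether the replication backlog touches the data the new transaction will read) would refine this check, but the waiting condition is the same shape.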
I'm not a big fan of a GTM at all, frankly; it makes clusters much
harder to set up, and becomes a SPOF.
I partially agree. I think it's very important that the GTM is an
optional feature of whatever we end up with, rather than an
indispensable component. People who don't want it shouldn't have to
pay the price in performance and administrative complexity. But at
the same time, I think a lot of people will want it, because without
it, the fact that sharding is in use is much less transparent to the
application.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 09/02/2015 08:27 PM, Robert Haas wrote:
On Wed, Sep 2, 2015 at 1:59 PM, Merlin Moncure <mmoncure@gmail.com>
wrote:
This strikes me as a bit of a conflict of interest with FDW, which
seems to want to hide the fact that it's foreign; the FDW
implementation makes its own optimization decisions, which might
make sense for single table queries but breaks down in the face of
joins.
+1 to these concerns
Well, I don't think that ALL of the logic should go into the FDW.
Then maybe we shouldn't call this "FDW-based sharding" (or "FDW
approach" or whatever was used in this thread so far) because that kinda
implies that the proposal is to build on FDW.
In my mind, FDW is a wonderful tool to integrate PostgreSQL with
external data sources, and it's nicely shaped for this purpose, which
implies the abstractions and assumptions in the code.
The truth however is that many current uses of the FDW API are actually
using it for different purposes because there's no other way to do that,
not because FDWs are the "right way". And this includes the attempts to
build sharding on FDW, I think.
Situations like this result in "improvements" of the API that seem to
improve the API for the second group, but make the life harder for the
original FDW API audience by making the API needlessly complex. And I
say "seem to improve" because the second group eventually runs into the
fundamental abstractions and assumptions the API is based on anyway.
And based on the discussions at pgcon, I think this is the main reason
why people cringe when they hear "FDW" and "sharding" in the same sentence.
I'm not opposed to reusing the FDW infrastructure, of course.
In particular, in this example, parallel aggregate needs the same
query rewrite, so the logic for that should live in core so that
both parallel and distributed queries get the benefit.
I'm not sure the parallel query is a great example here - maybe I'm
wrong, but I think it's a fairly isolated piece of code, and we have a
pretty clear idea of the two use cases.
I'm sure it's non-trivial to design it well for both cases, but I think
the questions for FDW/sharding will be much more about abstract concepts
than particular technical solutions.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 09/02/2015 12:30 PM, Robert Haas wrote:
On Wed, Sep 2, 2015 at 3:03 PM, Josh Berkus <josh@agliodbs.com> wrote:
4. Therefore, I think that we should instead use logical replication,
which might be either synchronous or asynchronous. When you modify
one copy of the data, that change will then be replicated to all other
nodes. If you are OK with eventual consistency, this replication can
be asynchronous, and nodes that are off-line will catch up when they
are on-line. If you are not OK with that, then you must replicate
synchronously to every node before transaction commit; or at least you
must replicate synchronously to every node that is currently on-line.
This presents some challenges: logical decoding currently can't
replicate transactions that are still in process - replication starts
when the transaction commits. Also, we don't have any way for
synchronous replication to wait for multiple nodes.
Well, there is a WIP patch for that, which IMHO would be much improved
by having a concrete use-case like this one. What nobody is working on
-- and we've vetoed in the past -- is a way of automatically failing and
removing from replication any node which repeatedly fails to sync, which
would be a requirement for this model.
Yep. It's clear to me we need that in general, not just for sharding.
To me, the key is to make sure there's a way for the cluster-ware to
know about the state transitions. Currently, when the synchronous
standby changes, PostgreSQL doesn't tell anyone. That's a problem.
There are many parts of our replication which are still effectively
unmonitorable. For example, there's still no way to tell from the
replica that it's lost contact with the master except by tailing the
log. If we try to build bigger systems on top of these components,
we'll find that we need to add a lot of instrumentation.
You'd also need a way to let the connection nodes know when a replica
has fallen behind so that they can be taken out of
load-balancing/sharding for read queries. For the synchronous model,
that would be "fallen behind at all"; for asynchronous it would be
"fallen more than ### behind".How is that different from the previous thing? Just that we'd treat
"lagging" as "down" beyond some threshold? That doesn't seem like a
mandatory feature.
It's a mandatory feature if you want to load-balance reads. We have to
know which nodes not to send reads to because they are out of sync.
Yeah? I'd assume that a GTM would be antithetical to two-stage copying.
I don't think so. If transaction A writes data on X which is
replicated to Y and then commits, a new snapshot which shows A as
committed can't be used on Y until A's changes have been replicated
there. That could be enforced by having the commit of A wait for
replication, or by having an attempt by a later transaction to use the
snapshot on Y wait until replication completes, or some even more
sophisticated strategy that considers whether the replication backlog
touches the same data that the new transaction will read. It's
complicated, but it doesn't seem intractable.
I need to see this on a chalkboard to understand it.
I'm not a big fan of a GTM at all, frankly; it makes clusters much
harder to set up, and becomes a SPOF.
I partially agree. I think it's very important that the GTM is an
optional feature of whatever we end up with, rather than an
indispensable component. People who don't want it shouldn't have to
pay the price in performance and administrative complexity. But at
the same time, I think a lot of people will want it, because without
it, the fact that sharding is in use is much less transparent to the
application.
If it can be optional, then we're pretty close to covering most use
cases with one general infrastructure. That would be nice.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Wed, Sep 2, 2015 at 02:41:46PM -0400, Robert Haas wrote:
4. Therefore, I think that we should instead use logical replication,
which might be either synchronous or asynchronous. When you modify
one copy of the data, that change will then be replicated to all other
nodes. If you are OK with eventual consistency, this replication can
be asynchronous, and nodes that are off-line will catch up when they
are on-line. If you are not OK with that, then you must replicate
synchronously to every node before transaction commit; or at least you
must replicate synchronously to every node that is currently on-line.
This presents some challenges: logical decoding currently can't
replicate transactions that are still in process - replication starts
when the transaction commits. Also, we don't have any way for
synchronous replication to wait for multiple nodes. But in theory
those seem like limitations that can be lifted. Also, the GTM needs
to be aware that this stuff is happening, or it will DTWT. That too
seems like a problem that can be solved.
Can you explain why logical replication is better than binary
replication for this use-case?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Wed, Sep 2, 2015 at 12:03:36PM -0700, Josh Berkus wrote:
Well, there is a WIP patch for that, which IMHO would be much improved
by having a concrete use-case like this one. What nobody is working on
-- and we've vetoed in the past -- is a way of automatically failing and
removing from replication any node which repeatedly fails to sync, which
would be a requirement for this model.
You'd also need a way to let the connection nodes know when a replica
has fallen behind so that they can be taken out of
load-balancing/sharding for read queries. For the synchronous model,
that would be "fallen behind at all"; for asynchronous it would be
"fallen more than ### behind".
I think this gets back to the idea of running an administrative alert
command when we switch to using a different server for
synchronous_standby_names. We can't just keep requiring external
tooling to identify things that the database knows easily and can send
an alert. Removing failed nodes is also something we should do and
notify users about.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Wed, Sep 2, 2015 at 6:56 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Sep 2, 2015 at 02:41:46PM -0400, Robert Haas wrote:
4. Therefore, I think that we should instead use logical replication,
which might be either synchronous or asynchronous. When you modify
one copy of the data, that change will then be replicated to all other
nodes. If you are OK with eventual consistency, this replication can
be asynchronous, and nodes that are off-line will catch up when they
are on-line. If you are not OK with that, then you must replicate
synchronously to every node before transaction commit; or at least you
must replicate synchronously to every node that is currently on-line.
This presents some challenges: logical decoding currently can't
replicate transactions that are still in process - replication starts
when the transaction commits. Also, we don't have any way for
synchronous replication to wait for multiple nodes. But in theory
those seem like limitations that can be lifted. Also, the GTM needs
to be aware that this stuff is happening, or it will DTWT. That too
seems like a problem that can be solved.
Can you explain why logical replication is better than binary
replication for this use-case?
Uh, well, for the same reasons it is better in many other cases.
Particularly, you probably don't want to replicate all the data on
machine A to machine B, just some of it.
Typically, sharding solutions store multiple copies of each piece of
data. So let's say you have 4 machines. You divide the data into 12
chunks. Each machine is the write-master for 3 of those chunks, but
has secondary copies of 3 others. So maybe things start out like
this:
machine #1: master for chunks 1, 2, 3; also has copies of chunks 4, 7, 10
machine #2: master for chunks 4, 5, 6; also has copies of chunks 1, 8, 11
machine #3: master for chunks 7, 8, 9; also has copies of chunks 2, 5, 12
machine #4: master for chunks 10, 11, 12; also has copies of chunks 3, 6, 9
If machine #1 is run over by a rabid triceratops, you can make machine
#2 the master for chunk 1, machine #3 the master for chunk 2, and
machine #4 the master for chunk 3. The write load therefore remains
evenly divided. If you can only copy entire machines, you can't
achieve that in this situation.
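The layout above, and the failover that keeps the write load evenly spread, can be sketched directly (a hypothetical illustration of the scheme described, not any existing tool): each chunk the dead machine mastered is promoted on a survivor that already holds a secondary copy of it.

```python
# Hypothetical sketch of the 4-machine / 12-chunk layout above, and of
# promoting secondary copies when a machine fails so no single survivor
# absorbs the whole load.

masters = {1: [1, 2, 3], 2: [4, 5, 6], 3: [7, 8, 9], 4: [10, 11, 12]}
copies  = {1: [4, 7, 10], 2: [1, 8, 11], 3: [2, 5, 12], 4: [3, 6, 9]}

def fail_over(masters, copies, dead):
    """Reassign each chunk the dead machine mastered to a survivor that
    already holds a secondary copy of it."""
    new_masters = {m: list(ch) for m, ch in masters.items() if m != dead}
    for chunk in masters[dead]:
        survivor = next(m for m, ch in copies.items()
                        if m != dead and chunk in ch)
        new_masters[survivor].append(chunk)
    return new_masters

print(fail_over(masters, copies, dead=1))
# chunk 1 -> machine 2, chunk 2 -> machine 3, chunk 3 -> machine 4:
# {2: [4, 5, 6, 1], 3: [7, 8, 9, 2], 4: [10, 11, 12, 3]}
```

Because the secondary copies are spread round-robin, each surviving machine picks up exactly one extra chunk, which is precisely what whole-node physical replication cannot provide.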
I'm not saying that the above is exactly what we're going to end up
with, or even necessarily close. But a big part of the point of
sharding is that not all the machines have the same data - otherwise
you are not write scaling. But it will frequently be the case, for
various reasons, that they have *overlapping* sets of data. Logical
replication can handle that; physical replication can't.
In Postgres-XC, all tables are either sharded (part of the table is
present on each node) or distributed (all of the table is present on
every node). Clearly, there's no way to use physical replication in
that scenario except if you are OK with having two copies of every
node. But that's not a very good solution.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 3, 2015 at 3:41 AM, Robert Haas wrote:
3. IIUC, Postgres-XC handles this problem by reducing at least
volatile functions, maybe all functions, to constants. Then it
generates an SQL statement to be sent to the data node to make the
appropriate change. If there's more than one copy of the data, we
send a separate copy of the SQL statement to every node. I'm not sure
exactly what happens if some of those nodes are not available, but I
don't think it's anything good. Fundamentally, this model doesn't
allow for many good options in that case.
I don't recall that. Immutable functions are switched to constants in
the query sent to datanodes. Volatile and stable functions are
evaluated locally after fetching the results from the remote node. Not
that efficient for warehouse loads. My 2c.
4. Therefore, I think that we should instead use logical replication,
which might be either synchronous or asynchronous. When you modify
one copy of the data, that change will then be replicated to all other
nodes. If you are OK with eventual consistency, this replication can
be asynchronous, and nodes that are off-line will catch up when they
are on-line. If you are not OK with that, then you must replicate
synchronously to every node before transaction commit; or at least you
must replicate synchronously to every node that is currently on-line.
This presents some challenges: logical decoding currently can't
replicate transactions that are still in process - replication starts
when the transaction commits. Also, we don't have any way for
synchronous replication to wait for multiple nodes.
That's something that the quorum synchronous patch would address.
Still, having the possibility to be synchronous across multiple nodes
does not seem like something at the top of the list.
Also, the GTM needs to be aware that this stuff is happening, or it will DTWT. That too seems like a problem that can be solved.
If I understood correctly, yes it is: with its centralized transaction
facility, each node is aware of the transaction status via the global
snapshot.
--
Michael
On 09/02/2015 03:56 PM, Bruce Momjian wrote:
On Wed, Sep 2, 2015 at 02:41:46PM -0400, Robert Haas wrote:
4. Therefore, I think that we should instead use logical replication,
which might be either synchronous or asynchronous. When you modify
one copy of the data, that change will then be replicated to all other
nodes. If you are OK with eventual consistency, this replication can
be asynchronous, and nodes that are off-line will catch up when they
are on-line. If you are not OK with that, then you must replicate
synchronously to every node before transaction commit; or at least you
must replicate synchronously to every node that is currently on-line.
This presents some challenges: logical decoding currently can't
replicate transactions that are still in process - replication starts
when the transaction commits. Also, we don't have any way for
synchronous replication to wait for multiple nodes. But in theory
those seem like limitations that can be lifted. Also, the GTM needs
to be aware that this stuff is happening, or it will DTWT. That too
seems like a problem that can be solved.
Can you explain why logical replication is better than binary
replication for this use-case?
Selectivity?
JD
--
Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.
On Wed, Sep 2, 2015 at 07:50:25PM -0700, Joshua Drake wrote:
Can you explain why logical replication is better than binary
replication for this use-case?
Selectivity?
I was assuming you would just create identical slaves to handle failure,
rather than moving selected data around.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Wed, Sep 2, 2015 at 09:03:25PM -0400, Robert Haas wrote:
Can you explain why logical replication is better than binary
replication for this use-case?
Uh, well, for the same reasons it is better in many other cases.
Particularly, you probably don't want to replicate all the data on
machine A to machine B, just some of it.
Typically, sharding solutions store multiple copies of each piece of
data. So let's say you have 4 machines. You divide the data into 12
chunks. Each machine is the write-master for 3 of those chunks, but
has secondary copies of 3 others. So maybe things start out like
this:
machine #1: master for chunks 1, 2, 3; also has copies of chunks 4, 7, 10
machine #2: master for chunks 4, 5, 6; also has copies of chunks 1, 8, 11
machine #3: master for chunks 7, 8, 9; also has copies of chunks 2, 5, 12
machine #4: master for chunks 10, 11, 12; also has copies of chunks 3, 6, 9
If machine #1 is run over by a rabid triceratops, you can make machine
#2 the master for chunk 1, machine #3 the master for chunk 2, and
machine #4 the master for chunk 3. The write load therefore remains
evenly divided. If you can only copy entire machines, you can't
achieve that in this situation.
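The layout Robert describes generalizes to n machines each mastering n-1 chunks. A minimal sketch of such a placement function (purely illustrative; the function and its names are invented here, not part of any Postgres component):

```python
def layout(n=4):
    """Place n*(n-1) chunks across n machines: each machine masters n-1
    chunks, and each chunk's secondary copy lands on a distinct other
    machine, so a failed machine's write load spreads across all
    survivors (Robert's 4-machine / 12-chunk example)."""
    masters = {m: [] for m in range(1, n + 1)}
    secondaries = {m: [] for m in range(1, n + 1)}
    chunk = 1
    for m in range(1, n + 1):
        # hand one secondary copy to every other machine in turn
        for host in (x for x in range(1, n + 1) if x != m):
            masters[m].append(chunk)
            secondaries[host].append(chunk)
            chunk += 1
    return masters, secondaries
```

With n=4 this reproduces the table above: machine #1 masters chunks 1-3 and holds copies of chunks 4, 7, and 10, so losing it promotes a different survivor per chunk.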
I see the advantage of this now. My original idea is that each shard
would have its own standby for disaster recovery, but your approach
above, which I know is typical, allows the shards to back up each other.
You could say shard 2 is the backup for shard 1, but then if shard 1
goes bad, the entire workload of shard 1 goes to shard 2. With the
above approach, the load of shard 1 is shared by all the shards.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Thu, Sep 3, 2015 at 8:28 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Sep 2, 2015 at 07:50:25PM -0700, Joshua Drake wrote:
Can you explain why logical replication is better than binary
replication for this use-case?
Selectivity?
I was assuming you would just create identical slaves to handle failure,
rather than moving selected data around.
Yes, I also think so; otherwise, when the shard goes down and its replica
has to take its place, it will take more time to make the replica
available, as it won't have all the data the original shard had.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Sep 2, 2015 at 9:04 PM, Oleg Bartunov <obartunov@gmail.com> wrote:
One lesson we got from XL is that we need a testing framework for clusters,
so any cluster project should at least pass functional and performance
testing.
+1. In early XC days, we focused a lot on adding newer features and
supporting as many PG features as possible. That took its toll on the
testing and QA. It was a mistake, though my feeling is we tried to correct
that to some extent with XL. We did a 9.5 merge, which of course was a big
deal, but other than that, more time is being spent on improving stability
and performance.
XL was very easy to break and I'm wondering how many corner cases still
exist.
Your team reported 2 or 3 major issues which I think we were able to fix
quite quickly. But if there are more such issues which your team has
recorded somewhere, I would request you to send them to the XL mailing
list. I would definitely want to look at them and address them.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
One lesson we got from XL is that we need a testing framework for clusters,
so any cluster project should at least pass functional and performance
testing.
+1. In early XC days, we focused a lot on adding newer features and
supporting as many PG features as possible. That took its toll on the
testing and QA. It was a mistake, though my feeling is we tried to correct
that to some extent with XL. We did a 9.5 merge, which of course was a big
deal, but other than that, more time is being spent on improving stability
and performance.
Agreed. Any cluster project needs a cluster testing framework.
The pgpool-II project runs a "build farm" which runs cluster regression
tests every day. The tests include several combinations of pgpool-II and
PostgreSQL versions using docker. It still needs more tests, but
even with limited test cases, it is pretty useful for detecting bugs.
Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
On 2015-09-02 19:57, Josh Berkus wrote:
On 09/01/2015 04:14 PM, Petr Jelinek wrote:
On 2015-09-02 00:09, Josh Berkus wrote:
Not really, the mechanism is different and the behavior is different.
One critical deficiency in using binary syncrep is that you can't do
round-robin redundancy at all; every redundant node has to be an exact
mirror of another node. In a good HA distributed system, you want
multiple shards per node, and you want each shard to be replicated to a
different node, so that in the event of node failure you're not dumping
the full load on one other server.
This assumes that we use binary replication, but we can reasonably use
logical replication which can quite easily do filtering of what's
replicated where.
Is there a way to do logical synchronous replication? I didn't think
there was.
Yes, logical replication has similar syncrep properties to the
binary one (feedback works the same way).
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Sep 3, 2015 at 10:33:12AM +0200, Petr Jelinek wrote:
On 2015-09-02 19:57, Josh Berkus wrote:
On 09/01/2015 04:14 PM, Petr Jelinek wrote:
On 2015-09-02 00:09, Josh Berkus wrote:
Not really, the mechanism is different and the behavior is different.
One critical deficiency in using binary syncrep is that you can't do
round-robin redundancy at all; every redundant node has to be an exact
mirror of another node. In a good HA distributed system, you want
multiple shards per node, and you want each shard to be replicated to a
different node, so that in the event of node failure you're not dumping
the full load on one other server.
This assumes that we use binary replication, but we can reasonably use
logical replication which can quite easily do filtering of what's
replicated where.
Is there a way to do logical synchronous replication? I didn't think
there was.
Yes, logical replication has similar syncrep properties to the
binary one (feedback works the same way).
Yes, I assumed that. Logical replication uses WAL, so if you are
synchronous with WAL, logical replication is synchronous too. However,
of course, it is synchronous in being durable, not synchronous in terms
of applying the WAL. This is true of binary and logical replication.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
Hi,
On 09/03/2015 05:02 AM, Amit Kapila wrote:
On Thu, Sep 3, 2015 at 8:28 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Sep 2, 2015 at 07:50:25PM -0700, Joshua Drake wrote:
Can you explain why logical replication is better than binary
replication for this use-case?
Selectivity?
I was assuming you would just create identical slaves to handle
failure, rather than moving selected data around.
Yes, I also think so; otherwise, when the shard goes down and its
replica has to take its place, it will take more time to
make the replica available, as it won't have all the data the original
shard had.
Not really, the idea is that you don't need to create the replica
immediately. The system recognizes that the primary shard location is
unavailable and redirects the tasks to the "replicas." So the time to
recreate the failed node is not that critical.
It needs to be done in a smart way to prevent some typical issues like
suddenly doubling the load on replicas due to failure of the primary
location. By using a different group of nodes for each "data segment" you
can eliminate this, because the group of nodes to handle the additional
load will be larger.
The other issue then of course is that the groups of nodes must not be
entirely random, otherwise the cluster would suffer data loss in case of
an outage of an arbitrary group of K nodes (where K is the number of replicas
for each piece of data).
It's also non-trivial to do this when you have to consider racks, data
centers etc.
With regular slaves you can't do any of this - no matter what you do,
you have to load balance the additional load only on the slaves.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Josh Berkus <josh@agliodbs.com> wrote:
You'd also need a way to let the connection nodes know when a replica
has fallen behind so that they can be taken out of
load-balancing/sharding for read queries. For the synchronous model,
that would be "fallen behind at all"; for asynchronous it would be
"fallen more than ### behind".
How is that different from the previous thing? Just that we'd treat
"lagging" as "down" beyond some threshold? That doesn't seem like a
mandatory feature.
It's a mandatory feature if you want to load-balance reads. We have to
know which nodes not to send reads to because they are out of sync.
There is another approach to this that we should consider how (if?)
we are going to cover: database affinity. I have seen cases where
there are multiple databases which are targets of asynchronous
replication, with a web application load balancing among them. The
application kept track of which copy each connection was using, so
that when they were not exactly in sync the user never saw "time
moving backward". Two different users might see versions of the
data from different points in time, but that generally doesn't
matter, especially if the difference is just a few minutes. If one
copy got too far behind for some reason, they would load-shift to
the other servers (time still moves forward, only there is a "jump"
forward at the shift). This would allow the tardy database to be
dedicated to catching up again.
Bottom line is that this very smooth behavior required two features
-- the ability for the application to control database affinity,
and the ability to shift that affinity gracefully (with no down
time).
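Kevin's scheme can be sketched as a small routing layer that pins each session to one copy and only shifts it when that copy falls too far behind; all names and thresholds here are invented for illustration, not an existing Postgres or pooler API:

```python
class AffinityRouter:
    """Keep each session on one replica so its view of time never moves
    backward; shift the session (one forward 'jump') only when that
    replica's lag exceeds the threshold."""

    def __init__(self, lag_by_replica, max_lag_seconds=300):
        self.lag = lag_by_replica      # replica name -> current lag (s)
        self.max_lag = max_lag_seconds
        self.affinity = {}             # session id -> replica name

    def route(self, session_id):
        replica = self.affinity.get(session_id)
        if replica is None or self.lag[replica] > self.max_lag:
            # load-shift: pick the least-lagged replica, freeing the
            # tardy one to dedicate itself to catching up
            replica = min(self.lag, key=self.lag.get)
            self.affinity[session_id] = replica
        return replica
```

The sticky affinity gives the smooth behavior Kevin describes; the lag check gives the graceful load-shift.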
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 3, 2015 at 03:40:40PM +0200, Tomas Vondra wrote:
Not really, the idea is that you don't need to create the replica
immediately. The system recognizes that the primary shard location is
unavailable and redirects the tasks to the "replicas." So the time
to recreate the failed node is not that critical.
It needs to be done in a smart way to prevent some typical issues
like suddenly doubling the load on replicas due to failure of the
primary location. By using a different group of nodes for each "data
segment" you can eliminate this, because the group of nodes to
handle the additional load will be larger.
The other issue then of course is that the groups of nodes must not
be entirely random, otherwise the cluster would suffer data loss in
case of an outage of an arbitrary group of K nodes (where K is the
number of replicas for each piece of data).
It's also non-trivial to do this when you have to consider racks,
data centers etc.
With regular slaves you can't do any of this - no matter what you
do, you have to load balance the additional load only on the slaves.
Yes, and imagine doing this with FDWs, updating the catalog table
location of the FDW as part of the failover process --- interesting.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Thu, Sep 3, 2015 at 6:57 AM, Bruce Momjian <bruce@momjian.us> wrote:
Yes, I assumed that. Logical replication uses WAL, so if you are
synchronous with WAL, logical replication is synchronous too. However,
of course, it is synchronous in being durable, not synchronous in terms
of applying the WAL. This is true of binary and logical replication.
But, Thomas Munro is fixing it!
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-09-03 12:10:08 -0400, Robert Haas wrote:
On Thu, Sep 3, 2015 at 6:57 AM, Bruce Momjian <bruce@momjian.us> wrote:
Yes, I assumed that. Logical replication uses WAL, so if you are
synchronous with WAL, logical replication is synchronous too. However,
of course, it is synchronous in being durable, not synchronous in terms
of applying the WAL. This is true of binary and logical replication.
Actually that's not really true - it's just a question of which LSNs you
return. For UDR/BDR the relevant LSN is the LSN of the last durably
committed transaction. And thus they wait for apply, not anything else.
But, Thomas Munro is fixing it!
+ many to that effort.
Greetings,
Andres Freund
On 09/03/2015 03:57 AM, Bruce Momjian wrote:
Yes, logical replication has similar syncrep properties to the
binary one (feedback works the same way).
Oh? What does UDR/BDR currently support for sync?
Yes, I assumed that. Logical replication uses WAL, so if you are
synchronous with WAL, logical replication is synchronous too. However,
of course, it is synchronous in being durable, not synchronous in terms
of applying the WAL. This is true of binary and logical replication.
Well, there's no such thing as simultaneity in scalable architectures.
But users are already used to that ... anybody who load-balances to read
slaves knows about lag. The only way* to ensure near-simultaneity is to
have some kind of single-node, single-process GTM for the cluster, and
then your actual scalability goes bye-bye.
The bigger issue we'll need to address with this is the fight between
lag and load-balancing, which would become a much worse issue with
read-load-balanced shards which are transparent to the user. They'd see
the effects of lag, without having actually chosen to use this or that
replica. This is the other reason to look at logical replication;
presumably with logrep, we can be more discriminating about what
activities cause lag (for one thing, vacuum won't).
Also:
On 09/03/2015 07:00 AM, Kevin Grittner wrote:
There is another approach to this that we should consider how (if?)
we are going to cover: database affinity. I have seen cases where
there are multiple databases which are targets of asynchronous
replication, with a web application load balancing among them. The
application kept track of which copy each connection was using, so
that when they were not exactly in sync the user never saw "time
moving backward". Two different users might see versions of the
data from different points in time, but that generally doesn't
matter, especially if the difference is just a few minutes. If one
copy got too far behind for some reason, they would load-shift to
the other servers (time still moves forward, only there is a "jump"
forward at the shift). This would allow the tardy database to be
dedicated to catching up again.
Bottom line is that this very smooth behavior required two features
-- the ability for the application to control database affinity,
and the ability to shift that affinity gracefully (with no down
time).
Yes. Frankly, it would be *easier* to code things so that the same
session always gets its requests load balanced to the same copies;
making that a feature, too, is nice.
(* there are actually other ways to come close to simultaneity, but they
are much more complicated)
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Hey Robert,
Now the question is, where should the code that does all of this live?
postgres_fdw? Some new, sharding-specific FDW? In core? I don't
know for sure, but what I do know is that we could make a lot of
progress over where we are today by just improving postgres_fdw, and I
don't think those improvements are even all that difficult. If we
decide we need to implement something new, it's going to be a huge
project that will take years to complete, with uncertain results. I'd
rather have a postgres_fdw-based implementation that is imperfect and
can't handle some kinds of queries in 9.6 than a promise that by 9.9
we'll have something really great that handles MPP perfectly.
Distributed shuffles (Map/Reduce) are hard. When we looked at using FDWs
for pg_shard, we thought that Map/Reduce would require a comprehensive
revamp of the APIs.
For Citus, a second part of the question concerns us as FDW writers. We implemented
cstore_fdw, json_fdw, and mongo_fdw, and these wrappers don't benefit from
even the simple join pushdown that doesn't require Map/Reduce.
The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of these
have support for joins:
https://wiki.postgresql.org/wiki/Foreign_data_wrappers
Best,
Ozgun
On Sat, Sep 5, 2015 at 4:22 AM, Ozgun Erdogan <ozgun@citusdata.com> wrote:
Hey Robert,
Now the question is, where should the code that does all of this live?
postgres_fdw? Some new, sharding-specific FDW? In core? I don't
know for sure, but what I do know is that we could make a lot of
progress over where we are today by just improving postgres_fdw, and I
don't think those improvements are even all that difficult. If we
decide we need to implement something new, it's going to be a huge
project that will take years to complete, with uncertain results. I'd
rather have a postgres_fdw-based implementation that is imperfect and
can't handle some kinds of queries in 9.6 than a promise that by 9.9
we'll have something really great that handles MPP perfectly.
Distributed shuffles (Map/Reduce) are hard. When we looked at using FDWs
for pg_shard, we thought that Map/Reduce would require a comprehensive
revamp of the APIs.
For Citus, a second part of the question concerns us as FDW writers. We implemented
cstore_fdw, json_fdw, and mongo_fdw, and these wrappers don't benefit from
even the simple join pushdown that doesn't require Map/Reduce.
I didn't get this. Join pushdown infrastructure (chiefly the set of hooks
provided in the join planning paths) is part of 9.5. Isn't that sufficient to
implement join push-down for the above FDWs? Or are FDW writers facing problems
while implementing those hooks? In either case, that should be reported on
hackers.
The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of these
have support for joins:
https://wiki.postgresql.org/wiki/Foreign_data_wrappers
Best,
Ozgun
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Monday, September 7, 2015, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Sat, Sep 5, 2015 at 4:22 AM, Ozgun Erdogan <ozgun@citusdata.com> wrote:
Hey Robert,
Now the question is, where should the code that does all of this live?
postgres_fdw? Some new, sharding-specific FDW? In core? I don't
know for sure, but what I do know is that we could make a lot of
progress over where we are today by just improving postgres_fdw, and I
don't think those improvements are even all that difficult. If we
decide we need to implement something new, it's going to be a huge
project that will take years to complete, with uncertain results. I'd
rather have a postgres_fdw-based implementation that is imperfect and
can't handle some kinds of queries in 9.6 than a promise that by 9.9
we'll have something really great that handles MPP perfectly.
Distributed shuffles (Map/Reduce) are hard. When we looked at using FDWs
for pg_shard, we thought that Map/Reduce would require a comprehensive
revamp of the APIs.
For Citus, a second part of the question concerns us as FDW writers. We
implemented cstore_fdw, json_fdw, and mongo_fdw, and these wrappers don't
benefit from even the simple join pushdown that doesn't require Map/Reduce.
I didn't get this. Join pushdown infrastructure (chiefly the set of hooks
provided in the join planning paths) is part of 9.5. Isn't that sufficient to
implement join push-down for the above FDWs? Or are FDW writers facing problems
while implementing those hooks? In either case, that should be reported on
hackers.
I don't think any FDW writer (other than postgres_fdw) has tried to
implement join push down in the respective FDWs using the new API.
The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of these
have support for joins:
https://wiki.postgresql.org/wiki/Foreign_data_wrappers
Best,
Ozgun
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
--
Ahsan Hadi
Snr Director Product Development
EnterpriseDB Corporation
The Enterprise Postgres Company
Phone: +92-51-8358874
Mobile: +92-333-5162114
Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb
On Thu, Sep 3, 2015 at 7:03 AM, Josh Berkus <josh@agliodbs.com> wrote:
On 09/02/2015 11:41 AM, Robert Haas wrote:
4. Therefore, I think that we should instead use logical replication,
which might be either synchronous or asynchronous. When you modify
one copy of the data, that change will then be replicated to all other
nodes. If you are OK with eventual consistency, this replication can
be asynchronous, and nodes that are off-line will catch up when they
are on-line. If you are not OK with that, then you must replicate
synchronously to every node before transaction commit; or at least you
must replicate synchronously to every node that is currently on-line.
This presents some challenges: logical decoding currently can't
replicate transactions that are still in process - replication starts
when the transaction commits. Also, we don't have any way for
synchronous replication to wait for multiple nodes.
Well, there is a WIP patch for that, which IMHO would be much improved
by having a concrete use-case like this one. What nobody is working on
-- and we've vetoed in the past -- is a way of automatically failing and
removing from replication any node which repeatedly fails to sync, which
would be a requirement for this model.
You'd also need a way to let the connection nodes know when a replica
has fallen behind so that they can be taken out of
load-balancing/sharding for read queries. For the synchronous model,
that would be "fallen behind at all"; for asynchronous it would be
"fallen more than ### behind".
I have been thinking about that problem in the context of
synchronous_commit = apply, and while trying to review the multiple
synchronous patch. How are you supposed to actually make use of
remote-apply semantics without a way to find a replica that is consistent?
And what does consistent mean? I'm going to say consistency means: it sees
at least all effects of all preceding COMMITs that returned successfully.
It's trivial in a no-timeout, single standby topology (if commit returned,
the sole sync replica has applied your transaction and replied), but beyond
that it obviously requires some more infrastructure and concepts. Here is
my suggestion:
Imagine if we could configure setups like this:
1. I have 4 servers called london1, london2, paris1, paris2 (see nearby
multiple sync server thread/patch).
2. I have synchronous_commit = apply (see nearby thread/patch)
3. Currently london1 is master, with the other 3 in the synchronous
replication set, and I want a minimum replication set of size 2 so I can
lose 1 of those and continue
So far so good, now for the double-vapourware part:
4. The replicas know whether they are currently part of the master's
synchronous replication set or not because it tells them
5. While waiting for replicas, the master only waits up to X milliseconds,
and if no reply is received from a given server it drops that server from
the sync rep set, like a RAID controller dropping an unresponsive element
from a RAID array, but still returns control to the user if 2 nodes (our
configured minimum) did reply
6. If the sync rep set reaches our minimum size 2 because of a node being
dropped, then you can no longer drop one, so commit hangs (work is blocked
until enough servers connect and catch up again)
7. If a replica sees that it hasn't received WAL records or pings from the
master with timestamps from the past Y milliseconds, or receives a message
explicitly telling it it's been dropped from the sync rep set, it will
start rejecting my queries on the basis that it's potentially out of date
8. If the master sees suitable apply reply messages stream in from a node
that was dropped but has now caught up (possibly having reconnected), it
will explicitly tell it that it's back in the sync rep set and start
waiting for it again
9. X is sufficiently larger than Y so that, combined with the 'you've been
dropped from/rejoined the sync rep set' messages and well sync'd system
clocks, it should not be possible for a replica to expose snapshots that
don't include all effects from transactions whose COMMIT command returned
on the master. (If you don't have the concept of a dynamic replication set
which replicas are dropped from and rejoin, then the master can't drop a
node and carry on, like a RAID controller would, unless it's happy to wait
for any old N nodes to reply. Waiting for any N nodes to reply may be OK
for log-flush-only sync rep, but if you want to use apply sync rep and have
guarantees about visibility, and you simply wait for any N nodes to reply,
then the nodes themselves don't know if they are up to date with master or
not (they don't know if they were one of the ones that master waited for
some transaction they haven't even heard about yet), so they may show users
old data. Also, if you don't have the master telling the replicas that it
considers them to be in or out of the replication set, they don't know
exactly when the master decides to consider them in again after they
rejoin.)
Now I can connect to any server and specify my requirement to see all
committed-on-the-master transactions (meaning: the COMMIT command returned
success to the client), and either get an error telling me that it can't
guarantee that at that moment (because it has been told it's not in the
sync rep set or hasn't heard from the master recently enough), or happily
proceed to query the database. I can send my writes to the master node,
and do all my reads on any node I like, and be sure they include whatever I
just committed (or someone else just committed and told me about). I can
also use FDW to query the replicas from the master and know that they can
see everything already committed (but of course not uncommitted changes;
I'm guessing you get that on the GTM based systems).
The main problem I can see so far with this scheme is that you can see
things on the replicas *before* the COMMIT returns. Is that a problem? I
suspect all solutions to that problem involve centralised snapshot control
(through a GTM or through the master).
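The drop/rejoin rules in points 4-8 can be condensed into a toy state machine; this is only my paraphrase of the proposal (all names and the set-based bookkeeping are invented), not working replication code:

```python
class SyncRepSet:
    """Toy model of the dynamic sync rep set: the master drops
    non-responders from the set (like a RAID controller dropping an
    unresponsive element), but never below the configured minimum."""

    def __init__(self, members, min_size=2):
        self.members = set(members)
        self.min_size = min_size

    def commit(self, replied):
        """replied: nodes that acked apply within the X ms timeout.
        Returns True if the commit may return to the client."""
        for node in self.members - set(replied):
            if len(self.members) > self.min_size:
                # dropped node must start rejecting reads (point 7)
                self.members.discard(node)
        # at min_size we can't drop anyone else: commit blocks until
        # the remaining members reply (point 6)
        return self.members <= set(replied)

    def rejoin(self, node):
        # master saw the node catch up and tells it it's back (point 8)
        self.members.add(node)
```

For the london/paris example: with a minimum set size of 2, losing paris2 drops it from the set and commits keep returning; losing a second node would leave commits blocked until someone rejoins.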
The timeout and ping based drop/join idea is inspired by a non-Postgres
system that was presented at CHAR(14) last year that is due to be released
as open source one day (and whose author egged me on to try implementing
that synchronous_commit = apply patch), and I guess loosely RAID. Maybe
there is better terminology or a name in database literature for this
approach, I'm not sure, and maybe it has unacceptable holes. It's a lot
less radical than the GTM/MPP systems, since it just adds a few bells and
whistles to the existing single master replication model, and obviously
there are plenty more problems to solve to make really useful clustering
technology, like master reelection, query routing/node location, load
balancing and so forth. But the master and the replicas have the
information they need to do that.
--
Thomas Munro
http://www.enterprisedb.com
On Fri, Sep 4, 2015 at 6:52 PM, Ozgun Erdogan <ozgun@citusdata.com> wrote:
Distributed shuffles (Map/Reduce) are hard. When we looked at using FDWs for
pg_shard, we thought that Map/Reduce would require a comprehensive revamp of
the APIs.
Well, so you've said. But what kind of API do you want to see?
Taking control at some very high-level hook like ExecutorRun() is not
really a maintainable solution - it's fine if you've only got one guy
doing it, perhaps, but if you have several FDWs talking to different
kinds of remote systems, they can't all seize overall control.
For Citus, a second part of the question is as FDW writers. We implemented
cstore_fdw, json_fdw, and mongo_fdw, and these wrappers don't benefit from
even the simple join pushdown that doesn't require Map/Reduce.
The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of these
have support for joins:
https://wiki.postgresql.org/wiki/Foreign_data_wrappers
What do you mean by "support for joins"? Do you mean that only 18 of
the remote data sources can do joins? If so, why does that matter?
I'd be quite happy if a join pushdown or "distributed shuffle" API had
as many as 18 users - I'd be quite happy if it had one (postgres_fdw).
The fact that not all FDWs can support every operation because of
limitations on the remote side isn't a reason not to support those
operations when the remote side is capable.
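The principle in the paragraph above can be sketched as a planner decision (a simplified model; all names are invented, not the real FDW API): the planner offers an operation to the wrapper, the wrapper accepts it only if the remote side is capable, and an incapable remote just means a local fallback, which is no argument against having the API.

```python
def plan_join(fdw_capabilities, join):
    """Push the join down if the remote side can do it, else join locally."""
    if "join" in fdw_capabilities:
        return ("remote_join", join)       # pushed down: one remote query
    return ("local_join",                  # fallback: fetch both sides,
            ("scan", join[0]),             # then join on this node
            ("scan", join[1]))

pg_like = {"join", "aggregate"}            # e.g. a postgres_fdw-like remote
flat_file = set()                          # e.g. a file wrapper: no joins

assert plan_join(pg_like, ("t1", "t2"))[0] == "remote_join"
assert plan_join(flat_file, ("t1", "t2"))[0] == "local_join"
```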
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2015-09-08 19:52, Robert Haas wrote:
On Fri, Sep 4, 2015 at 6:52 PM, Ozgun Erdogan <ozgun@citusdata.com> wrote:
For Citus, a second part of the question is as FDW writers. We implemented
cstore_fdw, json_fdw, and mongo_fdw, and these wrappers don't benefit from
even the simple join pushdown that doesn't require Map/Reduce.
The PostgreSQL wiki lists 85 foreign data wrappers, and only 18 of these
have support for joins:
https://wiki.postgresql.org/wiki/Foreign_data_wrappers
What do you mean by "support for joins"? Do you mean that only 18 of
the remote data sources can do joins? If so, why does that matter?
I'd be quite happy if a join pushdown or "distributed shuffle" API had
as many as 18 users - I'd be quite happy if it had one (postgres_fdw).
The fact that not all FDWs can support every operation because of
limitations on the remote side isn't a reason not to support those
operations when the remote side is capable.
Agreed. While I personally don't think FDWs are the long-term answer for
sharding, I do believe that the infrastructure that is being developed
for FDWs (join/aggregate pushdown) is needed anyway and there are many
common issues that need solving in this area for FDWs, sharding and
parallel query.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Sep 7, 2015 at 1:33 PM, Ahsan Hadi <ahsan.hadi@enterprisedb.com> wrote:
On Monday, September 7, 2015, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
On Sat, Sep 5, 2015 at 4:22 AM, Ozgun Erdogan <ozgun@citusdata.com> wrote:
Hey Robert,
Now the question is, where should the code that does all of this live?
postgres_fdw? Some new, sharding-specific FDW? In core? I don't
know for sure, but what I do know is that we could make a lot of
progress over where we are today by just improving postgres_fdw, and I
don't think those improvements are even all that difficult. If we
decide we need to implement something new, it's going to be a huge
project that will take years to complete, with uncertain results. I'd
rather have a postgres_fdw-based implementation that is imperfect and
can't handle some kinds of queries in 9.6 than a promise that by 9.9
we'll have something really great that handles MPP perfectly.
Distributed shuffles (Map/Reduce) are hard. When we looked at using FDWs
for pg_shard, we thought that Map/Reduce would require a comprehensive
revamp of the APIs.
For Citus, a second part of the question is as FDW writers. We
implemented cstore_fdw, json_fdw, and mongo_fdw, and these wrappers don't
benefit from even the simple join pushdown that doesn't require Map/Reduce.
I didn't get this. Join pushdown infrastructure (chiefly a set of hooks
provided in the join planning paths) is part of 9.5. Isn't that sufficient
to implement join pushdown for the above FDWs? Or are FDW writers facing
problems while implementing those hooks? In either case that should be
reported on hackers.
I don't think any FDW writer (other than postgres_fdw) has tried to
implement join pushdown in the respective FDWs using the new API.
Well, 'jdbc2_fdw' seems to implement deparsing at some level:
https://github.com/heimir-sverrisson/jdbc2_fdw/blob/master/deparse.c
...but this was likely a copy/paste job from postgres_fdw. It
should not escape notice that the deparsing strategy has dependencies on
the optimizer and the parser. This is not good; asking the FDW
implementations to implement SQL optimizations is not a sustainable
solution. They should be indicating, for example, "I support ANSI-92
SQL", and the postgres planner should be deparsing the foreign table
definition and rewriting it, not the other way around.
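The capability-declaration idea floated above can be sketched roughly as follows (a hypothetical design, not an existing API; every name here is invented): instead of each wrapper copying postgres_fdw's deparse logic, a wrapper declares which SQL dialect its remote speaks, and core does the deparsing once, falling back to local execution when no dialect is declared.

```python
# One core-owned deparser per declared dialect (hypothetical).
CORE_DEPARSERS = {
    "ansi92": lambda tables, quals: (
        "SELECT * FROM " + " JOIN ".join(tables)
        + (" ON " + " AND ".join(quals) if quals else "")
    ),
}

def deparse_for_fdw(declared_dialect, tables, quals):
    """Core deparses on the wrapper's behalf; None means execute locally."""
    deparser = CORE_DEPARSERS.get(declared_dialect)
    if deparser is None:
        return None                       # remote speaks no known SQL dialect
    return deparser(tables, quals)

sql = deparse_for_fdw("ansi92", ["t1", "t2"], ["t1.id = t2.id"])
assert sql == "SELECT * FROM t1 JOIN t2 ON t1.id = t2.id"
assert deparse_for_fdw("mongo", ["c1"], []) is None   # local fallback
```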
merlin