Directory pg_replslot is not properly cleaned

Started by Fabrízio de Royes Mello over 8 years ago, 9 messages

fabriziomello@gmail.com

over 8 years ago

1 attachment(s)

Hi all,

This week I ran into an out-of-disk-space problem on an 8TB production cluster.
During the investigation we noticed that pg_replslot was the culprit, growing
by more than 1TB in less than one hour.

We're using PostgreSQL 9.5.6 with pglogical 1.2.2 replicating to a new 9.6
instance and planning the upgrade soon.

What did I do? I freed some disk space, just enough to start PostgreSQL and
begin the investigation. During startup recovery, the files inside
pg_replslot were removed entirely, so our out-of-disk-space problem
disappeared. The server then came up and the physical slaves attached to the
master normally, but the logical slaves didn't: they stayed stalled in the
'catchup' state.

At that point the "pg_replslot" directory started growing fast again, which
forced us to drop the logical replication slot, and we lost the logical
slave.

After some googling I found this thread [1] about a similar issue, reported by
Dmitriy Sarafannikov and answered by Andres and Álvaro.

I ran the test case provided by Dmitriy [1] against these branches:
- REL9_4_STABLE
- REL9_5_STABLE
- REL9_6_STABLE
- master

In all of these tests the issue remains, including with the new logical
replication feature (CREATE PUBLICATION/CREATE SUBSCRIPTION). Only after a
restart was "pg_replslot" properly cleaned. The typo in
ReorderBufferIterTXNInit that Dmitriy complained about has been fixed, but the
issue remains.

It seems no one complained about this issue again and the thread was lost.

The attached patch is a reworked version of Dmitriy's patch that seems to
solve the issue. I confess I don't know enough about the replication slot code
to really know whether it's the best solution.

Regards,

[1]: /messages/by-id/1457621358.355011041@f382.i.mail.ru

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL

Timbira: http://www.timbira.com.br
Blog: http://fabriziomello.github.io
Linkedin: http://br.linkedin.com/in/fabriziomello
Twitter: http://twitter.com/fabriziomello
Github: http://github.com/fabriziomello

Attachments:

cleanup_subxacts_v0.patch (text/x-patch; charset=US-ASCII)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 524946a..a538715 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1142,7 +1142,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	Assert(found);
 
 	/* remove entries spilled to disk */
-	if (txn->nentries != txn->nentries_mem)
+	if (txn->nentries != txn->nentries_mem || txn->is_known_as_subxact)
 		ReorderBufferRestoreCleanup(rb, txn);
 
 	/* deallocate */
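
The extra term in the patch only changes behavior when the two counters are equal. A minimal sketch of that difference follows, assuming a toy struct (`ToyTxn`, hypothetical) that carries just the three fields the condition reads; it is not the real ReorderBufferTXN, and the helper names are invented for illustration. It says nothing about why a subtransaction might end up with matching counters while files remain on disk; it only shows which transactions each condition selects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields the condition reads;
 * NOT PostgreSQL's ReorderBufferTXN. */
typedef struct ToyTxn
{
    size_t nentries;            /* total changes recorded for the txn */
    size_t nentries_mem;        /* changes currently held in memory */
    bool   is_known_as_subxact; /* set once the txn is known to be a subxact */
} ToyTxn;

/* Condition before the patch: clean up on disk only if the counters differ. */
bool
needs_disk_cleanup_old(const ToyTxn *txn)
{
    assert(txn != NULL);
    return txn->nentries != txn->nentries_mem;
}

/* Condition from cleanup_subxacts_v0.patch: also clean up any known subxact. */
bool
needs_disk_cleanup_patched(const ToyTxn *txn)
{
    assert(txn != NULL);
    return txn->nentries != txn->nentries_mem || txn->is_known_as_subxact;
}
```

For a known subtransaction whose counters are equal, the first function returns false and the second returns true. For a top-level transaction with equal counters both return false, and whenever the counters differ both return true.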

Fabrízio de Royes Mello

fabriziomello@gmail.com

over 8 years ago

In reply to: Fabrízio de Royes Mello (#1)

Re: Directory pg_replslot is not properly cleaned

On Fri, Jun 2, 2017 at 6:32 PM, Fabrízio de Royes Mello <
fabriziomello@gmail.com> wrote:

The attached patch is a reworked version of Dmitriy's patch that seems to
solve the issue. I confess I don't know enough about the replication slot code
to really know whether it's the best solution.

Just adding Dmitriy to the conversation... the previous email I provided was
wrong.

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL

Fabrízio de Royes Mello

fabriziomello@gmail.com

over 8 years ago

In reply to: Fabrízio de Royes Mello (#2)

Re: Directory pg_replslot is not properly cleaned

On Fri, Jun 2, 2017 at 6:37 PM, Fabrízio de Royes Mello <
fabriziomello@gmail.com> wrote:

Just adding Dmitriy to the conversation... the previous email I provided was
wrong.

Does anyone have any thoughts on this critical issue?

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL

Andres Freund

andres@anarazel.de

over 8 years ago

In reply to: Fabrízio de Royes Mello (#3)

Re: Directory pg_replslot is not properly cleaned

On June 7, 2017 11:29:28 AM PDT, "Fabrízio de Royes Mello" <fabriziomello@gmail.com> wrote:

Does anyone have any thoughts on this critical issue?

I plan to look into it over the next few days.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Fabrízio de Royes Mello

fabriziomello@gmail.com

over 8 years ago

In reply to: Andres Freund (#4)

Re: Directory pg_replslot is not properly cleaned

On Wed, Jun 7, 2017 at 3:30 PM, Andres Freund <andres@anarazel.de> wrote:

I plan to look into it over the next few days.

Thanks...

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL

Andres Freund

andres@anarazel.de

over 8 years ago

In reply to: Fabrízio de Royes Mello (#5)

Re: Directory pg_replslot is not properly cleaned

Hi,

On 2017-06-07 15:46:45 -0300, Fabrízio de Royes Mello wrote:

Just adding Dmitriy to the conversation... the previous email I provided was
wrong.

Does anyone have any thoughts on this critical issue?

I plan to look into it over the next few days.

Thanks...

As noted in
http://archives.postgresql.org/message-id/20170619023014.qx7zjmnkzy3fwpfl%40alap3.anarazel.de
I've pushed a fix for this. Sorry for it taking this long.

The fix will be included in the next set of minor releases.

Greetings,

Andres Freund

Fabrízio de Royes Mello

fabriziomello@gmail.com

over 8 years ago

In reply to: Andres Freund (#6)

Re: Directory pg_replslot is not properly cleaned

On Sun, Jun 18, 2017 at 11:32 PM, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2017-06-07 15:46:45 -0300, Fabrízio de Royes Mello wrote:

Just adding Dmitriy to the conversation... the previous email I provided was
wrong.

Does anyone have any thoughts on this critical issue?

I plan to look into it over the next few days.

Thanks...

As noted in

http://archives.postgresql.org/message-id/20170619023014.qx7zjmnkzy3fwpfl%40alap3.anarazel.de

I've pushed a fix for this. Sorry for it taking this long.

Don't worry... thank you so much.

The fix will be included in the next set of minor releases.

Do you know when the next minor versions will be released? Depending on the
schedule, I'll patch the customer's current version myself, because we need
"pglogical" running stably.

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL

Michael Paquier

michael.paquier@gmail.com

over 8 years ago

In reply to: Fabrízio de Royes Mello (#7)

Re: Directory pg_replslot is not properly cleaned

On Mon, Jun 19, 2017 at 10:58 PM, Fabrízio de Royes Mello
<fabriziomello@gmail.com> wrote:

Do you know when the next minor versions will be released? Depending on the
schedule, I'll patch the customer's current version myself, because we need
"pglogical" running stably.

The next round is planned for the 10th of August:
https://www.postgresql.org/developer/roadmap/
--
Michael

Fabrízio de Royes Mello

fabriziomello@gmail.com

over 8 years ago

In reply to: Michael Paquier (#8)

Re: Directory pg_replslot is not properly cleaned

On Mon, Jun 19, 2017 at 11:02 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Mon, Jun 19, 2017 at 10:58 PM, Fabrízio de Royes Mello
<fabriziomello@gmail.com> wrote:

Do you know when the next minor versions will be released? Depending on the
schedule, I'll patch the customer's current version myself, because we need
"pglogical" running stably.

The next round is planned for the 10th of August:
https://www.postgresql.org/developer/roadmap/

I had completely forgotten about this web page... thanks!

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
