PostgreSQL Volume Question

Started by Data Ace, almost 8 years ago, 23 messages, general

dataace9@gmail.com

almost 8 years ago

Hi, I'm new to the community.

Recently, I've been involved in a project that develops a social network
data analysis service (and my client's DBMS is based on PostgreSQL).
I need to gather a huge volume of unstructured raw data for this project, and
the problem is that with PostgreSQL, it would be so difficult to handle this
kind of data. Are there any PG extension modules or methods that are
recommended for my project?

Thanks in advance.

Ravi Krishna

sravikrishna3@gmail.com

almost 8 years ago

In reply to: Data Ace (#1)

Re: PostgreSQL Volume Question

Hi, I'm new to the community.

Recently, I've been involved in a project that develops a social network data analysis service (and my client's DBMS is based on PostgreSQL).
I need to gather a huge volume of unstructured raw data for this project, and the problem is that with PostgreSQL, it would be so difficult to handle this kind of data. Are there any PG extension modules or methods that are recommended for my project?

Can you give a number for "huge volume", and how did you conclude that PG cannot handle it?

Adrian Klaver

adrian.klaver@aklaver.com

almost 8 years ago

In reply to: Data Ace (#1)

Re: PostgreSQL Volume Question

On 06/14/2018 02:33 PM, Data Ace wrote:

Hi, I'm new to the community.

Recently, I've been involved in a project that develops a social network
data analysis service (and my client's DBMS is based on PostgreSQL).
I need to gather a huge volume of unstructured raw data for this project,
and the problem is that with PostgreSQL, it would be so difficult to
handle this kind of data. Are there any PG extension modules or methods
that are recommended for my project?

In addition to Ravi's questions:

What does the data look like?

What Postgres version?

How is the data going to get from A <--> B, local or remotely or both?

Is there another database or program involved in the process?

Thanks in advance.

--
Adrian Klaver
adrian.klaver@aklaver.com

Melvin Davidson

melvin6925@gmail.com

almost 8 years ago

In reply to: Adrian Klaver (#3)

Re: PostgreSQL Volume Question

On Thu, Jun 14, 2018 at 6:30 PM, Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 06/14/2018 02:33 PM, Data Ace wrote:

Hi, I'm new to the community.

Recently, I've been involved in a project that develops a social network
data analysis service (and my client's DBMS is based on PostgreSQL).
I need to gather a huge volume of unstructured raw data for this project,
and the problem is that with PostgreSQL, it would be so difficult to handle
this kind of data. Are there any PG extension modules or methods that are
recommended for my project?

In addition to Ravi's questions:

What does the data look like?

What Postgres version?

How is the data going to get from A <--> B, local or remotely or both?

Is there another database or program involved in the process?

Thanks in advance.

--
Adrian Klaver
adrian.klaver@aklaver.com

In addition to Ravi's and Adrian's questions:

What is the hardware configuration?

--
*Melvin Davidson*
*Maj. Database & Exploration Specialist*
*Universe Exploration Command – UXC*
Employment by invitation only!

Steven Lembark

lembark@wrkhors.com

almost 8 years ago

In reply to: Data Ace (#1)

Re: PostgreSQL Volume Question

On Thu, 14 Jun 2018 14:33:54 -0700
Data Ace <dataace9@gmail.com> wrote:

Hi, I'm new to the community.

Recently, I've been involved in a project that develops a social
network data analysis service (and my client's DBMS is based on
PostgreSQL). I need to gather a huge volume of unstructured raw data
for this project, and the problem is that with PostgreSQL, it would
be so difficult to handle this kind of data. Are there any PG
extension modules or methods that are recommended for my project?

"huge" by modern standards is Petabytes, which might require some
specialized database service for a data lake.

Short of that, look up the "jsonb" data type in Postgres.
The nice thing about using PG for this is that you can keep enough
identifying data and metadata in a relational system, where it is easier
to query, and the documents in jsonb, where they are still accessible.

--
Steven Lembark 1505 National Ave
Workhorse Computing Rockford, IL 61103
lembark@wrkhors.com +1 888 359 3508
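
The layout suggested above (relational columns for identifying metadata, with the raw document stored next to them) can be sketched roughly as follows. This is only an illustration under stated assumptions: SQLite from Python's standard library stands in for PostgreSQL so the sketch runs without a server, the `raw_post` table and its columns are hypothetical, and the document is decoded in Python, whereas in PostgreSQL the `doc` column would be declared `jsonb` and could be queried in SQL with operators such as `doc->>'text'`.

```python
import json
import sqlite3

# SQLite is only a stand-in here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE raw_post (
        post_id   INTEGER PRIMARY KEY,
        author    TEXT NOT NULL,   -- relational metadata, easy to filter and index
        posted_at TEXT NOT NULL,   -- ISO 8601 timestamp
        doc       TEXT NOT NULL    -- raw document (would be jsonb in PostgreSQL)
    )
    """
)

posts = [
    (1, "alice", "2018-06-14T21:33:54", {"text": "hello", "tags": ["intro"]}),
    (2, "bob", "2018-06-15T16:26:00", {"text": "graphs?", "likes": 3}),
    (3, "alice", "2018-06-19T11:38:40", {"text": "joins are slow", "likes": 7}),
]
conn.executemany(
    "INSERT INTO raw_post VALUES (?, ?, ?, ?)",
    [(pid, author, ts, json.dumps(doc)) for pid, author, ts, doc in posts],
)


def posts_by_author(author):
    """Filter on a relational column, then read one field from each document."""
    rows = conn.execute(
        "SELECT post_id, doc FROM raw_post WHERE author = ? ORDER BY post_id",
        (author,),
    ).fetchall()
    # Fields missing from a document come back as None instead of raising.
    return [(post_id, json.loads(doc).get("text")) for post_id, doc in rows]


alice_posts = posts_by_author("alice")
print(alice_posts)
```

The point of the split is that the filter (`author = ?`) runs against an ordinary column, while each document keeps whatever irregular fields it arrived with.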

Data Ace

dataace9@gmail.com

almost 8 years ago

In reply to: Melvin Davidson (#4)

Re: PostgreSQL Volume Question

Well, I think my question drifted away from what I intended because of my
poor understanding and wording :(

Actually, I have 1TB of data and hardware with enough capacity to handle this
amount of data, but the problem is that it needs too many join operations
and the analysis process is going too slow right now.

I've searched and found that a graph model fits network data such as
social data nicely in terms of query performance.

Should I change my DB (I mean my DB for analysis)? Or do I need some other
solution or an extension?

Thanks

Melvin Davidson

melvin6925@gmail.com

almost 8 years ago

In reply to: Data Ace (#6)

Re: PostgreSQL Volume Question

On Fri, Jun 15, 2018 at 12:26 PM, Data Ace <dataace9@gmail.com> wrote:

Well I think my question is somewhat away from my intention cause of my
poor understanding and questioning :(

Actually, I have 1TB data and have hardware spec enough to handle this
amount of data, but the problem is that it needs too many join operations
and the analysis process is going too slow right now.

I've searched and found that graph model nicely fits for network data like
social data in query performance.

Should I change my DB (I mean my DB for analysis)? or do I need some other
solutions or any extension?

Thanks

At this point, you are still giving general instead of specific answers.
It is most important to answer Adrian's and my questions:
What does the data look like?
What Postgres version?
How is the data going to get from A <--> B, local or remotely or both?
Is there another database or program involved in the process?
What is the O/S?
What does the hardware configuration look like?

the problem is that it needs too many join operations and the analysis
process is going too slow right now.

So what is the structure of the tables involved, including indexes?

What is the actual query?

We cannot help unless you give us specifics to work with.

--
*Melvin Davidson*
*Maj. Database & Exploration Specialist*
*Universe Exploration Command – UXC*
Employment by invitation only!

Ron

ronljohnsonjr@gmail.com

almost 8 years ago

In reply to: Data Ace (#6)

Re: PostgreSQL Volume Question

On 06/15/2018 11:26 AM, Data Ace wrote:

Well I think my question is somewhat away from my intention cause of my
poor understanding and questioning :(

Actually, I have 1TB data and have hardware spec enough to handle this
amount of data, but the problem is that it needs too many join operations
and the analysis process is going too slow right now.

I've searched and found that graph model nicely fits for network data like
social data in query performance.

If your data is hierarchical, then storing it in a network database is
perfectly reasonable. I'm not sure, though, that there are many network
databases for Linux. Raima is the only one I can think of.

Should I change my DB (I mean my DB for analysis)? or do I need some other
solutions or any extension?

Thanks

--
Angular momentum makes the world go 'round.

Pierre Timmermans

ptim007@yahoo.com

almost 8 years ago

In reply to: Ron (#8)

using pg_basebackup for point in time recovery

Hi,

I find the documentation about pg_basebackup misleading: it states that standalone hot backups cannot be used for point-in-time recovery, but I don't see why. If one has a combination of the nightly pg_basebackup and the archived WALs, then it is totally OK to do point-in-time recovery, I assume? (Of course, recovery.conf must be edited manually to set the restore_command and the recovery target time.)

Here is the doc. The sentence that I find misleading is "There are backups that cannot be used for point-in-time recovery"; mentioning that they are faster than pg_dump dumps adds to the confusion (since pg_dump dumps cannot be used for PITR).

Doc: https://www.postgresql.org/docs/current/static/continuous-archiving.html
It is possible to use PostgreSQL's backup facilities to produce standalone hot backups. These are backups that cannot be used for point-in-time recovery, yet are typically much faster to backup and restore than pg_dump dumps. (They are also much larger than pg_dump dumps, so in some cases the speed advantage might be negated.)
As with base backups, the easiest way to produce a standalone hot backup is to use the pg_basebackup tool. If you include the -X parameter when calling it, all the write-ahead log required to use the backup will be included in the backup automatically, and no special action is required to restore the backup.
Thanks and regards,

Pierre

#10

Michael Paquier

michael@paquier.xyz

almost 8 years ago

In reply to: Pierre Timmermans (#9)

Re: using pg_basebackup for point in time recovery

Hi Pierre,

On Tue, Jun 19, 2018 at 12:03:58PM +0000, Pierre Timmermans wrote:

Here is the doc, the sentence that I find misleading is "There are
backups that cannot be used for point-in-time recovery", also
mentioning that they are faster than pg_dumps add to confusion (since
pg_dumps cannot be used for PITR):
https://www.postgresql.org/docs/current/static/continuous-archiving.html

Yes, it is indeed perfectly possible to use such backups to do a PITR
as long as you have a WAL archive able to replay up to the point where
you want the replay to happen, so I agree that this is a bit confusing.
This part of the documentation has been here since the beginning of time,
well, since 6559c4a2 to be exact. Perhaps we would want to reword this
sentence as follows:
"These are backups that could be used for point-in-time recovery if
combined with a WAL archive able to recover up to the wanted recovery
point. These backups are typically much faster to backup and restore
than pg_dump for large deployments but can result as well in larger
backup sizes, so the speed of one method or the other is to evaluate
carefully first."

I am open to better suggestions of course.
--
Michael
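
For readers following along, the two recovery.conf settings mentioned in this exchange (restore_command and the recovery target time) can be sketched as below. This is a minimal illustration, not a tested restore procedure: it assumes the recovery.conf mechanism used at the time of this thread (PostgreSQL 11 and earlier), the helper name `render_recovery_conf` and the archive directory `/mnt/wal_archive` are hypothetical, and a plain `cp` is assumed as the way to fetch archived WAL segments.

```python
from datetime import datetime


def render_recovery_conf(archive_dir, target_time):
    """Return recovery.conf text for a point-in-time restore.

    archive_dir is where archived WAL segments are kept (hypothetical path);
    target_time is the moment at which replay should stop.
    """
    # %f and %p are placeholders that PostgreSQL replaces with the WAL file
    # name and the path to copy it to.
    restore_command = f"cp {archive_dir}/%f %p"
    stamp = target_time.strftime("%Y-%m-%d %H:%M:%S")
    lines = [
        f"restore_command = '{restore_command}'",
        f"recovery_target_time = '{stamp}'",
    ]
    return "\n".join(lines) + "\n"


conf_text = render_recovery_conf("/mnt/wal_archive", datetime(2018, 6, 19, 12, 0, 0))
print(conf_text)
```

The generated file would be placed in the data directory of the restored base backup before starting the server; the WAL archive has to reach at least the requested target time for the replay to get there.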

#11

Pierre Timmermans

ptim007@yahoo.com

almost 8 years ago

In reply to: Michael Paquier (#10)

Re: using pg_basebackup for point in time recovery

Hi Michael
Thanks for the confirmation. Your rewording removes the confusion. I would maybe take the opportunity to restate that pg_dump cannot be used for PITR, so something along the lines of:
"These are backups that could be used for point-in-time recovery if
combined with a WAL archive able to recover up to the wanted recovery
point. These backups are typically much faster to backup and restore
than pg_dump for large deployments but can result as well in larger
backup sizes, so the speed of one method or the other is to evaluate
carefully first. Consider also that pg_dump backups cannot be used for point-in-time recovery."

Maybe the confusion stems from the fact that if you restore a standalone (self-contained) pg_basebackup then, by default, recovery is done with the recovery_target immediate option, so if one needs point-in-time recovery, one has to edit recovery.conf and bring in the archives.

Thanks and regards,
Pierre

#12

Thomas Kellerer

spam_eater@gmx.net

almost 8 years ago

In reply to: Data Ace (#6)

Re: PostgreSQL Volume Question

Data Ace schrieb am 15.06.2018 um 18:26:

Well I think my question is somewhat away from my intention cause of
my poor understanding and questioning :(

Actually, I have 1TB data and have hardware spec enough to handle
this amount of data, but the problem is that it needs too many join
operations and the analysis process is going too slow right now.

I've searched and found that graph model nicely fits for network data
like social data in query performance.

Should I change my DB (I mean my DB for analysis)? or do I need some
other solutions or any extension?

AgensGraph is a Postgres fork implementing a graph database that supports
Cypher as the query language while still supporting SQL
(and even queries mixing both).

I have never used it, but maybe it's worth a try.

http://bitnine.net/agensgraph/

Thomas
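
Short of switching engines, multi-hop relationship queries of the kind discussed in this thread can also be written in plain SQL with a recursive common table expression instead of a chain of explicit self-joins. The sketch below is an illustration under assumptions: SQLite from Python's standard library stands in for PostgreSQL so it runs without a server (PostgreSQL also accepts WITH RECURSIVE, though a driver such as psycopg2 uses `%s` placeholders rather than `?`), and the `follows` table and its sample rows are hypothetical.

```python
import sqlite3

# SQLite is only a stand-in; the "follows" edge table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE follows (src TEXT NOT NULL, dst TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO follows VALUES (?, ?)",
    [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("dan", "ann"), ("bob", "dan")],
)

REACHABLE_SQL = """
WITH RECURSIVE reach(person, hops) AS (
    -- Base case: everyone the starting person follows directly.
    SELECT dst, 1 FROM follows WHERE src = ?
    UNION
    -- Recursive step: follow one more edge, up to the hop limit.
    SELECT f.dst, r.hops + 1
    FROM reach AS r
    JOIN follows AS f ON f.src = r.person
    WHERE r.hops < ?
)
SELECT person, MIN(hops) AS hops
FROM reach
GROUP BY person
ORDER BY hops, person
"""


def reachable(start, max_hops):
    """Return (person, shortest hop count) pairs reachable within max_hops."""
    return conn.execute(REACHABLE_SQL, (start, max_hops)).fetchall()


within_two = reachable("ann", 2)
within_three = reachable("ann", 3)
print(within_two)
print(within_three)
```

The hop limit keeps the recursion bounded even though the sample graph contains a cycle. Whether this performs acceptably on 1TB of data depends on the indexes and the query plan, which is why the table structure and the actual query requested earlier in the thread matter.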

#13

Michael Paquier

michael@paquier.xyz

almost 8 years ago

In reply to: Pierre Timmermans (#11)

Re: using pg_basebackup for point in time recovery

Hi Pierre,

On Wed, Jun 20, 2018 at 08:06:31AM +0000, Pierre Timmermans wrote:

Hi Michael

You should avoid top-posting on the Postgres lists, this is not the
usual style used by people around :)

Thanks for the confirmation. Your rewording removes the confusion. I
would maybe take the opportunity to re-instate that pg_dump cannot be
used for PITR, so in the line of
"These are backups that could be used for point-in-time recovery if
combined with a WAL archive able to recover up to the wanted recovery
point. These backups are typically much faster to backup and restore
than pg_dump for large deployments but can result as well in larger
backup sizes, so the speed of one method or the other is to evaluate
carefully first. Consider also that pg_dump backups cannot be used for
point-in-time recovery."

Attached is a patch which includes your suggestion. What do you think?
As that's an improvement, only HEAD would get that clarification.

Maybe the confusion stems from the fact that if you restore a
standalone (self-contained) pg_basebackup then - by default - recovery
is done with the recovery_target immediate option, so if one needs
point-in-time recovery he has to edit the recovery.conf and brings the
archives..

Perhaps. There is really nothing preventing one from adding a recovery.conf
afterwards, which is also why pg_basebackup -R exists. I do that as
well for some of the frameworks I work with and maintain.
--
Michael

#14

Ron

ronljohnsonjr@gmail.com

almost 8 years ago

In reply to: Michael Paquier (#13)

Re: using pg_basebackup for point in time recovery

On 06/21/2018 12:27 AM, Michael Paquier wrote:
[snip]

Attached is a patch which includes your suggestion. What do you think?
As that's an improvement, only HEAD would get that clarification.

You've *got* to be kidding.

Fixing an ambiguously or poorly worded bit of *documentation* should
obviously be pushed to all affected versions.

--
Angular momentum makes the world go 'round.

#15

Pierre Timmermans

ptim007@yahoo.com

almost 8 years ago

In reply to: Michael Paquier (#13)

Re: using pg_basebackup for point in time recovery

Hi Michael
On Thursday, June 21, 2018, 7:28:13 AM GMT+2, Michael Paquier <michael@paquier.xyz> wrote:

You should avoid top-posting on the Postgres lists, this is not the
usual style used by people around :)

Will do, but Yahoo Mail! does not seem to like that, so I am typing the > myself

Attached is a patch which includes your suggestion. What do you think?
As that's an improvement, only HEAD would get that clarification.

Yes, I think it is now perfectly clear. I much appreciate having the chance to contribute to the docs, by the way; it is very nice.

Perhaps. There is really nothing preventing one to add a recovery.conf
afterwards, which is also why pg_basebackup -R exists. I do that as
well for some of the framework I work with and maintain.

I just went to the doc to check about this -R option :-)
Pierre

#16

Ravi Krishna

srkrishna@yahoo.com

almost 8 years ago

In reply to: Pierre Timmermans (#15)

Re: using pg_basebackup for point in time recovery

You should avoid top-posting on the Postgres lists, this is not the
usual style used by people around :)

Will do, but Yahoo Mail! does not seem to like that, so I am typing the > myself

Same here even though I use Mac mail. But it is not yahoo alone.
Most of the web email clients have resorted to top posting. I miss the old
days of Outlook Express which was so '>' friendly. I think Gmail allows
'>' when you click on the dots to expand the mail you are replying to, but it messes
up in justifying and formatting it.

The best for '>': Unix elm :-)

#17

Vik Fearing

vik@postgresfriends.org

almost 8 years ago

In reply to: Michael Paquier (#13)

Re: using pg_basebackup for point in time recovery

On 21/06/18 07:27, Michael Paquier wrote:

Attached is a patch which includes your suggestion. What do you think?
As that's an improvement, only HEAD would get that clarification.

Say what? If the clarification applies to previous versions, as it
does, it should be backpatched. This isn't a change in behavior, it's a
change in the description of existing behavior.
--
Vik Fearing +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#18

David G. Johnston

david.g.johnston@gmail.com

almost 8 years ago

In reply to: Vik Fearing (#17)

Re: using pg_basebackup for point in time recovery

On Thu, Jun 21, 2018 at 4:26 PM, Vik Fearing <vik.fearing@2ndquadrant.com>
wrote:

On 21/06/18 07:27, Michael Paquier wrote:

Attached is a patch which includes your suggestion. What do you think?
As that's an improvement, only HEAD would get that clarification.

Say what? If the clarification applies to previous versions, as it
does, it should be backpatched. This isn't a change in behavior, it's a
change in the description of existing behavior.

Generally only actual bug fixes get back-patched; but I'd have to say this
looks like it could easily be classified as one.

Before: These are backups that cannot be used for PITR
After: These are backups that could be used for PITR if ...

Changing a cannot to a can seems like we are fixing a bug in the
documentation.

Some comments on the patch itself:

"recover up to the wanted recovery point." - "desired recovery point" reads
better to me

====
"These backups are typically much faster to backup and restore" - "These
backups are typically much faster to create and restore"; avoid repeated
use of the word backup

"but can result as well in larger backup sizes" - "but can result in larger
backup sizes", drop the unnecessary 'as well'

"sizes, so the speed of one method or the other is to evaluate carefully
first" - that is just wrong as-is; suggest just removing it.
====

To cover the last three items as a whole I'd suggest:

"These backups are typically much faster to create and restore, but
generate larger file sizes, compared to pg_dump."

For the last sentence I'd suggest:

"Note that because WAL cannot be applied on top of a restored pg_dump
backup it is considered a cold backup and cannot be used for
point-in-time-recovery."

I like adding "cold backup" here to help contrast and explain why a base
backup is considered a "hot backup". The rest is style to make that flow
better.

David J.

#19

Michael Paquier

michael@paquier.xyz

almost 8 years ago

In reply to: Ravi Krishna (#16)

Re: using pg_basebackup for point in time recovery

On Thu, Jun 21, 2018 at 04:42:00PM -0400, Ravi Krishna wrote:

Same here even though I use Mac mail. But it is not yahoo alone.
Most of the web email clients have resorted to top posting. I miss
the old days of Outlook Express which was so '>' friendly. I think
Gmail allows '>' when you click on the dots to expand the mail you
are replying to, but it messes up in justifying and formatting it.

Those products have good practices when it comes to breaking and redefining
what the concept behind emails is...
--
Michael

#20

Michael Paquier

michael@paquier.xyz

almost 8 years ago

In reply to: David G. Johnston (#18)

Re: using pg_basebackup for point in time recovery

On Thu, Jun 21, 2018 at 04:50:38PM -0700, David G. Johnston wrote:

Generally only actual bug fixes get back-patched; but I'd have to say
this looks like it could easily be classified as one.

Everybody is against me here ;)

Some comments on the patch itself:

"recover up to the wanted recovery point." - "desired recovery point" reads
better to me

====
"These backups are typically much faster to backup and restore" - "These
backups are typically much faster to create and restore"; avoid repeated
use of the word backup

Okay.

"but can result as well in larger backup sizes" - "but can result in larger
backup sizes", drop the unnecessary 'as well'

Okay.

I like adding "cold backup" here to help contrast and explain why a base
backup is considered a "hot backup". The rest is style to make that flow
better.

Indeed. The section uses hot backups a lot.

What do all folks here think about the updated attached?
--
Michael

#21