PostgreSQL Volume Question
Hi, I'm new to the community.
Recently, I've been involved in a project that develops a social network
data analysis service (and my client's DBMS is based on PostgreSQL).
I need to gather a huge volume of unstructured raw data for this project, and
the problem is that with PostgreSQL, it would be so difficult to handle this
kind of data. Are there any PG extension modules or methods that are
recommended for my project?
Thanks in advance.
Can you put a number on "huge volume", and how did you conclude that PG cannot handle it?
On 06/14/2018 02:33 PM, Data Ace wrote:
Hi, I'm new to the community.
Recently, I've been involved in a project that develops a social network
data analysis service (and my client's DBMS is based on PostgreSQL).
I need to gather a huge volume of unstructured raw data for this project,
and the problem is that with PostgreSQL, it would be so difficult to
handle this kind of data. Are there any PG extension modules or methods
that are recommended for my project?
In addition to Ravi's questions:
What does the data look like?
What Postgres version?
How is the data going to get from A <--> B, local or remotely or both?
Is there another database or program involved in the process?
Thanks in advance.
--
Adrian Klaver
adrian.klaver@aklaver.com
In addition to Ravi's and Adrian's questions:
What is the hardware configuration?
--
*Melvin Davidson*
*Maj. Database & Exploration Specialist*
*Universe Exploration Command – UXC*
Employment by invitation only!
On Thu, 14 Jun 2018 14:33:54 -0700
Data Ace <dataace9@gmail.com> wrote:
Hi, I'm new to the community.
Recently, I've been involved in a project that develops a social
network data analysis service (and my client's DBMS is based on
PostgreSQL). I need to gather a huge volume of unstructured raw data
for this project, and the problem is that with PostgreSQL, it would
be so difficult to handle this kind of data. Are there any PG
extension modules or methods that are recommended for my project?
"huge" by modern standards is Petabytes, which might require some
specialized database service for a data lake.
Short of that, look up the "jsonb" data type in Postgres.
The nice thing about using PG for this is that you can keep enough
identifying data and metadata in a relational system, where it is easier
to query, and the documents in jsonb, where they are still accessible.
--
Steven Lembark 1505 National Ave
Workhorse Computing Rockford, IL 61103
lembark@wrkhors.com +1 888 359 3508
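A minimal sketch of the layout Steven describes, not a definitive schema. The table name `raw_posts`, its columns, and the shape of the sample record are all hypothetical; the `%s` placeholders assume a driver that uses that parameter style. The idea is only that a few identifying fields are pulled out into ordinary columns while the full document is kept as-is in a jsonb column.

```python
import json

# Hypothetical table: a few relational columns for filtering and joining,
# plus the untouched source document in a jsonb column.
CREATE_TABLE_SQL = """
CREATE TABLE raw_posts (
    post_id   bigint PRIMARY KEY,
    author_id bigint NOT NULL,
    posted_at timestamptz NOT NULL,
    doc       jsonb NOT NULL
)
"""

# Parameterized insert; the placeholder style is an assumption about the driver.
INSERT_SQL = (
    "INSERT INTO raw_posts (post_id, author_id, posted_at, doc) "
    "VALUES (%s, %s, %s, %s::jsonb)"
)


def to_row(raw_json):
    """Split one raw JSON record into relational metadata plus the document."""
    record = json.loads(raw_json)
    return (
        record["id"],             # identifying key, kept relational
        record["author"]["id"],   # metadata used for joins
        record["created_at"],     # metadata used for range filters
        json.dumps(record, sort_keys=True),  # whole document, stored as jsonb
    )


# Hypothetical sample record, for illustration only.
sample = (
    '{"id": 7, "author": {"id": 42, "name": "ann"}, '
    '"created_at": "2018-06-14T14:33:54-07:00", "text": "hello"}'
)
row = to_row(sample)
```

The tuple returned by `to_row` lines up with the four placeholders in `INSERT_SQL`, so it could be handed to a driver's execute call together with that statement.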
Well, I think my question drifted away from what I meant because of my
poor understanding and phrasing :(
Actually, I have 1TB of data and hardware sufficient to handle this
amount of data, but the problem is that the analysis needs too many join
operations and is running too slowly right now.
I've searched and found that a graph model fits network data such as
social data well in terms of query performance.
Should I change my DB (I mean my DB for analysis)? Or do I need some other
solution or an extension?
Thanks
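To make the "too many join operations" point concrete, here is a small sketch of a multi-hop traversal written as one recursive query instead of one self-join per hop. Everything here is an assumption: the actual schema was never posted, so the `follows(user_id, friend_id)` edge table is hypothetical, and Python's built-in sqlite3 is used only so the example is self-contained. PostgreSQL also supports `WITH RECURSIVE`, though placeholder syntax and parameter typing differ by driver. This is not a claim that it would fix the performance problem described above.

```python
import sqlite3


def reachable_within(conn, start, max_hops):
    """Return (user_id, fewest_hops) for users reachable in <= max_hops steps."""
    sql = """
    WITH RECURSIVE reach(user_id, hops) AS (
        SELECT ?, 0
        UNION
        SELECT f.friend_id, r.hops + 1
        FROM reach AS r
        JOIN follows AS f ON f.user_id = r.user_id
        WHERE r.hops < ?
    )
    SELECT user_id, MIN(hops) FROM reach GROUP BY user_id ORDER BY user_id
    """
    return conn.execute(sql, (start, max_hops)).fetchall()


# Hypothetical edge data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE follows (user_id INTEGER, friend_id INTEGER)")
conn.executemany(
    "INSERT INTO follows VALUES (?, ?)",
    [(1, 2), (2, 3), (3, 4), (1, 3)],
)

two_hops = reachable_within(conn, 1, 2)
one_hop = reachable_within(conn, 1, 1)
```

The hop limit bounds the recursion, so the query terminates even if the edge table contains cycles.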
On Fri, Jun 15, 2018 at 12:26 PM, Data Ace <dataace9@gmail.com> wrote:
Well I think my question is somewhat away from my intention cause of my
poor understanding and questioning :(
Actually, I have 1TB data and have hardware spec enough to handle this
amount of data, but the problem is that it needs too many join operations
and the analysis process is going too slow right now.
I've searched and found that graph model nicely fits for network data like
social data in query performance.
Should I change my DB (I mean my DB for analysis)? or do I need some other
solutions or any extension?
Thanks
At this point, you are still giving general answers instead of specific ones.
It is most important to answer Adrian's and my questions:
What does the data look like?
What Postgres version?
How is the data going to get from A <--> B, local or remotely or both?
Is there another database or program involved in the process?
What is the O/S?
What does the hardware configuration look like?
the problem is that it needs too many join operations and the analysis
process is going too slow right now.
So what is the structure of the tables involved, including indexes?
What is the actual query?
We cannot help unless you give us specifics to work with.
--
*Melvin Davidson*
*Maj. Database & Exploration Specialist*
*Universe Exploration Command – UXC*
Employment by invitation only!
On 06/15/2018 11:26 AM, Data Ace wrote:
Well I think my question is somewhat away from my intention cause of my
poor understanding and questioning :(
Actually, I have 1TB data and have hardware spec enough to handle this
amount of data, but the problem is that it needs too many join operations
and the analysis process is going too slow right now.
I've searched and found that graph model nicely fits for network data like
social data in query performance.
If your data is hierarchical, then storing it in a network database is
perfectly reasonable. I'm not sure, though, that there are many network
databases for Linux. Raima is the only one I can think of.
Should I change my DB (I mean my DB for analysis)? or do I need some other
solutions or any extension?
Thanks
--
Angular momentum makes the world go 'round.
Hi,
I find the documentation about pg_basebackup misleading: it states that standalone hot backups cannot be used for point-in-time recovery, but I don't get the point. If one has a combination of the nightly pg_basebackup and the archived WALs, then it is totally OK to do point-in-time recovery, I assume? (Of course, recovery.conf must be changed manually to set the restore_command and the recovery target time.)
Here is the doc. The sentence that I find misleading is "There are backups that cannot be used for point-in-time recovery"; mentioning that they are faster than pg_dump dumps also adds to the confusion (since pg_dump dumps cannot be used for PITR).
Doc: https://www.postgresql.org/docs/current/static/continuous-archiving.html
It is possible to use PostgreSQL's backup facilities to produce standalone hot backups. These are backups that cannot be used for point-in-time recovery, yet are typically much faster to backup and restore than pg_dump dumps. (They are also much larger than pg_dump dumps, so in some cases the speed advantage might be negated.)
As with base backups, the easiest way to produce a standalone hot backup is to use the pg_basebackup tool. If you include the -X parameter when calling it, all the write-ahead log required to use the backup will be included in the backup automatically, and no special action is required to restore the backup.
Thanks and regards,
Pierre
Hi Pierre,
On Tue, Jun 19, 2018 at 12:03:58PM +0000, Pierre Timmermans wrote:
Here is the doc, the sentence that I find misleading is "There are
backups that cannot be used for point-in-time recovery", also
mentioning that they are faster than pg_dumps add to confusion (since
pg_dumps cannot be used for PITR):
https://www.postgresql.org/docs/current/static/continuous-archiving.html
Yes, it is indeed perfectly possible to use such backups to do a PITR
as long as you have a WAL archive able to replay up to the point where
you want the replay to happen, so I agree that this is a bit confusing.
This part of the documentation has been here since the beginning of time,
well, 6559c4a2 to be exact. Perhaps we would want to reword this
sentence as follows:
"These are backups that could be used for point-in-time recovery if
combined with a WAL archive able to recover up to the wanted recovery
point. These backups are typically much faster to backup and restore
than pg_dump for large deployments but can result as well in larger
backup sizes, so the speed of one method or the other is to evaluate
carefully first."
I am open to better suggestions of course.
--
Michael
Hi Michael
Thanks for the confirmation. Your rewording removes the confusion. I would maybe take the opportunity to restate that pg_dump cannot be used for PITR, so along the lines of:
"These are backups that could be used for point-in-time recovery if
combined with a WAL archive able to recover up to the wanted recovery
point. These backups are typically much faster to backup and restore
than pg_dump for large deployments but can result as well in larger
backup sizes, so the speed of one method or the other is to evaluate
carefully first. Consider also that pg_dump backups cannot be used for point-in-time recovery."
Maybe the confusion stems from the fact that if you restore a standalone (self-contained) pg_basebackup then, by default, recovery is done with the recovery_target immediate option, so if one needs point-in-time recovery, one has to edit recovery.conf and bring in the archives.
Thanks and regards,
Pierre
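As a rough illustration of the recovery.conf edit Pierre mentions, the sketch below renders the two settings named in the thread, `restore_command` and `recovery_target_time`. The archive directory and timestamp are placeholder assumptions, and the `cp`-based restore command simply follows the usual documentation-style example. This applies to releases that still use recovery.conf; from PostgreSQL 12 on, the same settings live in postgresql.conf.

```python
def render_recovery_conf(archive_dir, target_time):
    """Return recovery.conf text for a point-in-time restore.

    archive_dir: directory holding the archived WAL files (assumed local path).
    target_time: timestamp to recover up to, in a format PostgreSQL accepts.
    """
    return (
        # %f and %p are expanded by PostgreSQL: WAL file name and target path.
        f"restore_command = 'cp {archive_dir}/%f %p'\n"
        f"recovery_target_time = '{target_time}'\n"
    )


# Placeholder values, for illustration only.
conf = render_recovery_conf("/mnt/server/archivedir", "2018-06-19 12:00:00+00")
```

The resulting text would be written to recovery.conf in the restored data directory before starting the server.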
Data Ace wrote on 15.06.2018 at 18:26:
Well I think my question is somewhat away from my intention cause of
my poor understanding and questioning :(
Actually, I have 1TB data and have hardware spec enough to handle
this amount of data, but the problem is that it needs too many join
operations and the analysis process is going too slow right now.
I've searched and found that graph model nicely fits for network data
like social data in query performance.
Should I change my DB (I mean my DB for analysis)? or do I need some
other solutions or any extension?
AgensGraph is a Postgres fork implementing a graph database that supports
Cypher as the query language while still supporting SQL
(and even queries mixing both).
I have never used it, but maybe it's worth a try.
http://bitnine.net/agensgraph/
Thomas
Hi Pierre,
On Wed, Jun 20, 2018 at 08:06:31AM +0000, Pierre Timmermans wrote:
Hi Michael
You should avoid top-posting on the Postgres lists, this is not the
usual style used by people around :)
Thanks for the confirmation. Your rewording removes the confusion. I
would maybe take the opportunity to re-instate that pg_dump cannot be
used for PITR, so in the line of
"These are backups that could be used for point-in-time recovery if
combined with a WAL archive able to recover up to the wanted recovery
point. These backups are typically much faster to backup and restore
than pg_dump for large deployments but can result as well in larger
backup sizes, so the speed of one method or the other is to evaluate
carefully first. Consider also that pg_dump backups cannot be used for
point-in-time recovery."
Attached is a patch which includes your suggestion. What do you think?
As that's an improvement, only HEAD would get that clarification.
Maybe the confusion stems from the fact that if you restore a
standalone (self-contained) pg_basebackup then - by default - recovery
is done with the recovery_target immediate option, so if one needs
point-in-time recovery he has to edit the recovery.conf and brings the
archives..
Perhaps. There is really nothing preventing one from adding a recovery.conf
afterwards, which is also why pg_basebackup -R exists. I do that as
well for some of the frameworks I work with and maintain.
--
Michael
Attachments:
pitr-docs.patch (text/x-diff)
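For readers following along, a small sketch of how the pg_basebackup options mentioned in this thread fit together. The helper name and target directory are hypothetical; `-D`, `-X stream`, and `-R` are real pg_basebackup options, but connection options (host, user) are left out and would be needed in practice.

```python
def basebackup_command(target_dir, write_recovery_conf=False):
    """Build an argument list for a standalone hot backup with pg_basebackup."""
    # -X stream includes the WAL needed to make the backup self-contained.
    cmd = ["pg_basebackup", "-D", target_dir, "-X", "stream"]
    if write_recovery_conf:
        # -R asks pg_basebackup to write a minimal recovery configuration.
        cmd.append("-R")
    return cmd


# Hypothetical target directory, for illustration only.
standalone = basebackup_command("/backups/nightly")
with_conf = basebackup_command("/backups/nightly", write_recovery_conf=True)
```

The list could be passed to `subprocess.run(cmd, check=True)`. For point-in-time recovery the WAL archive is still required on top of such a backup, as discussed above.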
On 06/21/2018 12:27 AM, Michael Paquier wrote:
[snip]
Attached is a patch which includes your suggestion. What do you think?
As that's an improvement, only HEAD would get that clarification.
You've *got* to be kidding.
Fixing an ambiguously or poorly worded bit of *documentation* should
obviously be pushed to all affected versions.
--
Angular momentum makes the world go 'round.
Hi Michael
On Thursday, June 21, 2018, 7:28:13 AM GMT+2, Michael Paquier <michael@paquier.xyz> wrote:
You should avoid top-posting on the Postgres lists, this is not the
usual style used by people around :)
Will do, but Yahoo Mail! does not seem to like that, so I am typing the > myself
Attached is a patch which includes your suggestion. What do you think?
As that's an improvement, only HEAD would get that clarification.
Yes, I think it is now perfectly clear. I much appreciate having the chance to contribute to the doc, by the way; it is very nice.
Perhaps. There is really nothing preventing one to add a recovery.conf
afterwards, which is also why pg_basebackup -R exists. I do that as
well for some of the framework I work with and maintain.
I just went to the doc to check about this -R option :-)
Pierre
You should avoid top-posting on the Postgres lists, this is not the
usual style used by people around :)
Will do, but Yahoo Mail! does not seem to like that, so I am typing the > myself
Same here even though I use Mac mail. But it is not yahoo alone.
Most of the web email clients have resorted to top posting. I miss the old
days of Outlook Express which was so '>' friendly. I think Gmail allows
'>' when you click on the dots to expand the mail you are replying to, but it messes
up in justifying and formatting it.
The best for '>': Unix elm :-)
On 21/06/18 07:27, Michael Paquier wrote:
Attached is a patch which includes your suggestion. What do you think?
As that's an improvement, only HEAD would get that clarification.
Say what? If the clarification applies to previous versions, as it
does, it should be backpatched. This isn't a change in behavior, it's a
change in the description of existing behavior.
--
Vik Fearing +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Thu, Jun 21, 2018 at 4:26 PM, Vik Fearing <vik.fearing@2ndquadrant.com>
wrote:
On 21/06/18 07:27, Michael Paquier wrote:
Attached is a patch which includes your suggestion. What do you think?
As that's an improvement, only HEAD would get that clarification.
Say what? If the clarification applies to previous versions, as it
does, it should be backpatched. This isn't a change in behavior, it's a
change in the description of existing behavior.
Generally only actual bug fixes get back-patched; but I'd have to say this
looks like it could easily be classified as one.
Before: These are backups that cannot be used for PITR
After: These are backups that could be used for PITR if ...
Changing a cannot to a can seems like we are fixing a bug in the
documentation.
Some comments on the patch itself:
"recover up to the wanted recovery point." - "desired recovery point" reads
better to me
====
"These backups are typically much faster to backup and restore" - "These
backups are typically much faster to create and restore"; avoid repeated
use of the word backup
"but can result as well in larger backup sizes" - "but can result in larger
backup sizes", drop the unnecessary 'as well'
"sizes, so the speed of one method or the other is to evaluate carefully
first" - that is just wrong as-is; suggest just removing it.
====
To cover the last three items as a whole I'd suggest:
"These backups are typically much faster to create and restore, but
generate larger file sizes, compared to pg_dump."
For the last sentence I'd suggest:
"Note that because WAL cannot be applied on top of a restored pg_dump
backup it is considered a cold backup and cannot be used for
point-in-time-recovery."
I like adding "cold backup" here to help contrast and explain why a base
backup is considered a "hot backup". The rest is style to make that flow
better.
David J.
On Thu, Jun 21, 2018 at 04:42:00PM -0400, Ravi Krishna wrote:
Same here even though I use Mac mail. But it is not yahoo alone.
Most of the web email clients have resorted to top posting. I miss
the old days of Outlook Express which was so '>' friendly. I think
Gmail allows '>' when you click on the dots to expand the mail you
are replying to, but it messes up in justifying and formatting it.
Those products have good practices when it comes to breaking and redefining
what the concept behind emails is...
--
Michael
On Thu, Jun 21, 2018 at 04:50:38PM -0700, David G. Johnston wrote:
Generally only actual bug fixes get back-patched; but I'd have to say
this looks like it could easily be classified as one.
Everybody is against me here ;)
Some comments on the patch itself:
"recover up to the wanted recovery point." - "desired recovery point" reads
better to me
====
"These backups are typically much faster to backup and restore" - "These
backups are typically much faster to create and restore"; avoid repeated
use of the word backup
Okay.
"but can result as well in larger backup sizes" - "but can result in larger
backup sizes", drop the unnecessary 'as well'
Okay.
I like adding "cold backup" here to help contrast and explain why a base
backup is considered a "hot backup". The rest is style to make that flow
better.
Indeed. The section uses hot backups a lot.
What do folks here think about the updated patch attached?
--
Michael