PG vs ElasticSearch for Logs
I want to store logs in a simple table.
Here are my columns:
Primary-key (auto generated)
timestamp
host
service-on-host
loglevel
msg
json (optional)
I am unsure which DB to choose: Postgres, ElasticSearch or ...?
We don't have high traffic. About 200k rows per day.
My heart beats for postgres. We have been using it for several years.
On the other hand, the sentence "Don't store logs in a DB" is
somewhere in my head.....
What do you think?
--
Thomas Guettler http://www.thomas-guettler.de/
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 8/19/2016 12:32 AM, Thomas Güttler wrote:
What do you think?
I store most of my logs in flat textfiles syslog style, and use grep for
adhoc querying.
200K rows/day, that's 1.4 million/week, 6 million/month, pretty soon
you're talking big tables.
in fact that's several rows/second on a 24/7 basis
--
john r pierce, recycling bits in santa cruz
On 19.08.2016 at 09:42, John R Pierce wrote:
On 8/19/2016 12:32 AM, Thomas Güttler wrote:
What do you think?
I store most of my logs in flat textfiles syslog style, and use grep for adhoc querying.
200K rows/day, that's 1.4 million/week, 6 million/month, pretty soon you're talking big tables.
in fact that's several rows/second on a 24/7 basis
There is no need to store them more than 6 weeks in my current use case.
I think indexing in postgres is much faster than grep.
And queries including json data are not possible with grep (or at least very hard to type)
My concern is which DB (or indexing) to use ...
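(For illustration: the kind of json-aware filter that grep cannot express is a one-liner in SQL. This is a sketch; the table and jsonb column names are assumptions, not from the thread.)

```sql
-- yesterday's errors for one user, filtering on the optional json column
SELECT ts, host, msg
FROM logs
WHERE loglevel = 'ERROR'
  AND json_data ->> 'user_id' = '42'
  AND ts > now() - interval '1 day'
ORDER BY ts;
```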
Regards,
Thomas
--
Thomas Guettler http://www.thomas-guettler.de/
On Fri, Aug 19, 2016 at 4:58 PM Thomas Güttler <guettliml@thomas-guettler.de>
wrote:
How will you be using the logs? What kind of queries? What kind of
searches?
Correlating events and logs from various sources could be really easy with
joins, count and summary operations.
The kind of volume you are anticipating should be fine with Postgres, but
before you really decide, you need to figure out what you would
want to do with this data once it is in Postgres.
Regards,
Thomas
--
Thomas Guettler http://www.thomas-guettler.de/
--
Best Regards
Sameer Kumar | DB Solution Architect
*ASHNIK PTE. LTD.*
101 Cecil Street, #11-11 Tong Eng Building, Singapore 069 533
T: +65 6438 3504 | M: +65 8110 0350
Skype: sameer.ashnik | www.ashnik.com
On 19.08.2016 at 11:21, Sameer Kumar wrote:
On Fri, Aug 19, 2016 at 4:58 PM Thomas Güttler <guettliml@thomas-guettler.de> wrote:
How will you be using the logs? What kind of queries? What kind of searches?
Correlating events and logs from various sources could be really easy with joins, count and summary operations.
Wishes grow with possibilities. First I want to do simple queries about hosts and timestamps. Then some simple
substring matches.
Up to now, no structured logging (the json column) gets created. But if it gets filled, we will find use cases
where we currently use ssh+grep.
Up to now we need no stemming or language support.
The kind of volume you are anticipating should be fine with Postgres, but before you really decide, you need to
figure out what you would want to do with this data once it is in Postgres.
The goal is still a bit fuzzy: a better overview.
Thank you for your feedback ("The kind of volume you are anticipating should be fine with Postgres").
I guess I will use postgres, especially since the Django ORM supports JSON in Postgres:
https://docs.djangoproject.com/en/1.10/ref/contrib/postgres/fields/#jsonfield
Regards,
Thomas Güttler
--
Thomas Guettler http://www.thomas-guettler.de/
On 19.08.2016 at 10:57, Thomas Güttler wrote:
On 19.08.2016 at 09:42, John R Pierce wrote:
[-------------]
in fact that's several rows/second on a 24/7 basis
There is no need to store them more than 6 weeks in my current use case.
I think indexing in postgres is much faster than grep.
Not so much IMHE(xperience).
1. if you know what you are looking for: grep's the best
2. if you don't ... grep (or more/less/vi) is the best.
Only when you routinely update/join/etc does an RDBMS really shine. But that's
not what you normally do with your logs. Right?
But then again, there is an additional benefit of having everything
under one hood - so standardising on a single repository (like an rdbms)
has its benefits.
regards,
-R
Thomas Güttler <guettliml@thomas-guettler.de> wrote:
How will you be using the logs? What kind of queries? What kind of searches?
Correlating events and logs from various sources could be really easy with joins, count and summary operations.
Wishes grow with possibilities. First I want to do simple queries about
hosts and timestamps. Then some simple substring matches.
for append-only tables like this consider 9.5 and BRIN indexes for
timestamp searches. But if you delete after N weeks, BRIN might not work
properly because of vacuum and re-use of space within the table.
Do you know BRIN?
So, in your case, consider partitioning, maybe per month. That way you can
also avoid a mess with table and index bloat.
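A minimal sketch of the suggested setup (the table layout follows the columns listed at the top of the thread; all names are illustrative):

```sql
CREATE TABLE logs (
    id        bigserial PRIMARY KEY,
    ts        timestamptz NOT NULL,
    host      text,
    service   text,
    loglevel  text,
    msg       text,
    json_data jsonb
);

-- A BRIN index stores only min/max summaries per block range, so it stays
-- tiny and works well when ts correlates with physical row order
-- (i.e. append-only insert order).
CREATE INDEX logs_ts_brin ON logs USING brin (ts);
```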
Greetings from Dresden to Chemnitz (is this still valid?)
Regards, Andreas Kretschmer
--
Andreas Kretschmer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Aug 19, 2016 at 12:44 PM, Andreas Kretschmer
<akretschmer@spamfence.net> wrote:
for append-only tables like this consider 9.5 and BRIN indexes for
timestamp searches. But if you delete after N weeks, BRIN might not work
properly because of vacuum and re-use of space within the table.
Do you know BRIN?
So, in your case, consider partitioning, maybe per month. That way you can
also avoid a mess with table and index bloat.
If done properly he can use both. For 6 weeks I would use seven
partitions, current + 6 previous, and drop the oldest partition weekly, so
effectively they become append-only and he can use BRIN too.
Even better, if he normally inserts in batches (it happens to me with
some log-like data; I rotate the file and insert all the rotated data
periodically) he can use a staging table: 1 master, and inheriting
from it seven constrained week partitions plus one unconstrained
staging partition. Insertions go into staging and are moved with a
small delay to the corresponding partition, using an ordered select
so they go in perfect order into their final resting place, and staging can
be vacuumed just after that (if they are log lines and the maximum
delay is X, you just move every row older than that from staging to the
partition with whatever period is best). The staging partition is normally
small and cached and can be processed quite fast (with 200k/day, an
hourly movement will leave staging with fewer than about 10k rows if
the distribution is somewhat uniform).
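A rough sketch of this scheme, using 9.5-style inheritance partitioning (the table names, week boundaries, and the one-hour delay are all illustrative assumptions):

```sql
-- parent table as before; one unconstrained staging child, seven weekly children
CREATE TABLE logs_staging () INHERITS (logs);
CREATE TABLE logs_2016w33 (
    CHECK (ts >= '2016-08-15' AND ts < '2016-08-22')
) INHERITS (logs);

-- periodic move: pull settled rows out of staging, insert them in
-- timestamp order so they land physically sorted (good for BRIN),
-- then staging can be vacuumed while it is still small
WITH moved AS (
    DELETE FROM logs_staging
    WHERE ts < now() - interval '1 hour'
    RETURNING *
)
INSERT INTO logs_2016w33 SELECT * FROM moved ORDER BY ts;

-- weekly retention: dropping the oldest partition is instant, no bloat
DROP TABLE logs_2016w27;
```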
Francisco Olarte.
On 19.08.2016 at 12:44, Andreas Kretschmer wrote:
Thank you very much for these hints. I did not know BRIN before.
Greetings from Dresden to Chemnitz (is this still valid?)
Yes, I am in Chemnitz/Germany. Everything (kids, wife, friends, sports, job) is fine.
I hope the same goes for you?
Regards,
Thomas Güttler
--
Thomas Guettler http://www.thomas-guettler.de/
On 8/19/2016 3:44 AM, Andreas Kretschmer wrote:
So, in your case, consider partitioning, maybe per month. So you can
also avoid mess with table and index bloat.
with his 6 week retention, i'd partition by week.
--
john r pierce, recycling bits in santa cruz
On Fri, Aug 19, 2016 at 2:32 AM, Thomas Güttler
<guettliml@thomas-guettler.de> wrote:
I want to store logs in a simple table.
[...]
I am unsure which DB to choose: Postgres, ElasticSearch or ...?
We use SOLR (which is similar to ElasticSearch) here for json document
retrieval. Agreeing to do this was one of the biggest mistakes in my
professional career. This choice was somewhat forced because at the
time jsonb was not baked. In my opinion, jsonb outclasses these types
of services, particularly if you are already invested in postgres. The
specifics of your requirements also play into this decision,
naturally. The bottom line though is that these kinds of systems are
not nearly as fast or robust as they claim to be particularly if you
wander off the use cases they are engineered for (like needing
transactions or joins for example). They also tend to be fairly
opaque in how they operate and the supporting tooling is laughable
relative to established database systems.
Postgres OTOH can be made to do pretty much anything given sufficient
expertise and a progressive attitude.
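As a sketch of what "jsonb outclasses these types of services" can look like in practice (the index choice and all names here are assumptions on my part, not from the thread):

```sql
-- a GIN index supports containment (@>) and key-existence (?) on jsonb
CREATE INDEX logs_json_gin ON logs USING gin (json_data);

-- all events whose json payload contains this sub-document
SELECT ts, host, msg
FROM logs
WHERE json_data @> '{"component": "auth", "result": "failure"}'
ORDER BY ts DESC
LIMIT 100;
```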
merlin
On 8/19/2016 2:32 AM, Thomas Güttler wrote:
I want to store logs in a simple table.
[...]
On the other hand, the sentence "Don't store logs in a DB" is
somewhere in my head.....
What do you think?
I played with ElasticSearch a little, mostly because I wanted to use
Kibana which looks really pretty. I dumped a ton of logs into it, and
made a pretty dashboard ... but in the end it didn't really help me, and
wasn't that useful. My problem is, I don't want to have to go look at
it. If something goes bad, then I want an email alert, at which point
I'm going to go run top, and tail the logs.
Another problem I had with kibana/ES is the syntax to search stuff is
different than I'm used to. It made it hard to find stuff in kibana.
Right now, I have a perl script that reads apache logs and fires off
updates into PG to keep stats. But it's an hourly summary, which the
website turns around and queries the stats to show pretty usage graphs.
In the end, PG or ES, all depends on what you want.
-Andy
On 19/08/16 10:57, Thomas Güttler wrote:
What do you think?
I store most of my logs in flat textfiles syslog style, and use grep for adhoc querying.
200K rows/day, that's 1.4 million/week, 6 million/month, pretty soon you're talking big tables.
in fact that's several rows/second on a 24/7 basis
There is no need to store them more than 6 weeks in my current use case.
Hi,
to me this kind of data looks like something Postgres can handle with ease.
We're talking about 8.4M rows here.
Coincidentally, I was trying out the new parallel query feature in the
9.6 beta just now and decided to use your numbers as a test case :)
I can create 8.4M records having a timestamp and a random ~ 250 character string
in ~ 31 seconds:
pg96=# select now() + (sec / 200000.0 * 86400.0 || ' seconds')::interval as ts,
pg96-# repeat(random()::text, 15) as msg
pg96-# into t1
pg96-# from generate_series(1, 6 * 7 * 200000) as sec;
SELECT 8400000
Time: 30858.274 ms
Table size is 2.4 GB.
This gives about 6 weeks. A query to scan the whole thing on the narrow column
takes ~ 400 msec, like this:
pg96=# select min(ts), max(ts) from t1;
min | max
-------------------------------+-------------------------------
2016-08-19 20:17:24.921333+00 | 2016-09-30 20:17:24.489333+00
(1 row)
Time: 409.468 ms
Even running an unanchored regular expression (!) on the wider column is doable:
pg96=# select count(*) from t1 where msg ~ '12345';
count
-------
955
(1 row)
Time: 3146.838 ms
If you have some filter, not everything needs to be regexped and this gets pretty fast:
pg96=# select count(*) from t1 where ts between '2016-08-25' and '2016-08-26' and msg ~ '12345';
count
-------
24
(1 row)
Time: 391.577 ms
All this is without indices. Your data is more structured than my test, so undoubtedly you will
get some gain from indices...
Here is something more analytical - basically same as the count(*) above:
pg96=# select ts::date, count(*) from t1 where msg ~ '12345' group by ts::date order by ts::date;
ts | count
------------+-------
2016-08-19 | 26
2016-08-20 | 28
[...]
2016-09-28 | 21
2016-09-29 | 33
(42 rows)
Time: 3157.010 ms
Note, however, that I'm using the 9.6 beta with the parallel query feature: the sequential scans with the regexp are
run in parallel on 6 workers... this gives me a speed-up of a factor 4-5 (machine has 8 logical CPUs) and
the whole table fits in cache. For a use case as this, the parallel query feature in 9.6 is so good it's almost
like cheating ;)
Bye,
Chris.
On Sat, 20 Aug 2016, 2:00 a.m. Andy Colson, <andy@squeakycode.net> wrote:
On 8/19/2016 2:32 AM, Thomas Güttler wrote:
I want to store logs in a simple table.
[...]
What do you think?
I played with ElasticSearch a little, mostly because I wanted to use
Kibana which looks really pretty. I dumped a ton of logs into it, and
made a pretty dashboard ... but in the end it didn't really help me, and
wasn't that useful. My problem is, I don't want to have to go look at
it. If something goes bad, then I want an email alert, at which point
I'm going to go run top, and tail the logs.
There are tools in the Elastic Stack which could have helped you achieve
email alerts and top-like views or tailing of logfiles.
Another problem I had with kibana/ES is the syntax to search stuff is
different than I'm used to. It made it hard to find stuff in kibana.
Right now, I have a perl script that reads apache logs and fires off
updates into PG to keep stats. But it's an hourly summary, which the
website turns around and queries the stats to show pretty usage graphs.
In the end, PG or ES, all depends on what you want.
-Andy
--
Best Regards
Sameer Kumar | DB Solution Architect
*ASHNIK PTE. LTD.*
101 Cecil Street, #11-11 Tong Eng Building, Singapore 069 533
T: +65 6438 3504 | M: +65 8110 0350
Skype: sameer.ashnik | www.ashnik.com
Thank you Chris for looking at my issue in such detail.
Yes, the parallel feature rocks.
Regards,
Thomas Güttler
On 19.08.2016 at 22:40, Chris Mair wrote:
[...]
--
Thomas Guettler http://www.thomas-guettler.de/
On 19.08.2016 at 19:59, Andy Colson wrote:
On 8/19/2016 2:32 AM, Thomas Güttler wrote:
I want to store logs in a simple table.
[...]
On the other hand, the sentence "Don't store logs in a DB" is
somewhere in my head.....
What do you think?
I played with ElasticSearch a little, mostly because I wanted to use Kibana which looks really pretty. I dumped a ton
of logs into it, and made a pretty dashboard ... but in the end it didn't really help me, and wasn't that useful. My
problem is, I don't want to have to go look at it. If something goes bad, then I want an email alert, at which point
I'm going to go run top, and tail the logs.
Another problem I had with kibana/ES is the syntax to search stuff is different than I'm used to. It made it hard to
find stuff in kibana.
Right now, I have a perl script that reads apache logs and fires off updates into PG to keep stats. But it's an hourly
summary, which the website turns around and queries the stats to show pretty usage graphs.
You use Perl to read apache logs. Does this work?
Forwarding logs reliably is not easy. Logs are streams, files in unix are not streams. Sooner or later
the files get rotated. RELP exists, but AFAIK its usage is not widespread:
https://en.wikipedia.org/wiki/Reliable_Event_Logging_Protocol
Let's see how to get the logs into postgres ....
In the end, PG or ES, all depends on what you want.
Most of my logs start from an HTTP request. I want a unique id per request
in every log line which gets created. This way I can trace the request,
even if its impact spans several hosts and systems which do not receive HTTP requests.
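Once every log line carries such a request id, the cross-host trace is a single query. A sketch (the column name, json key, and id value are assumed, not from the thread):

```sql
-- expression index so per-request lookups don't scan six weeks of logs
CREATE INDEX logs_request_id_idx ON logs ((json_data ->> 'request_id'));

-- the whole life of one request, across every host and service
SELECT ts, host, service, loglevel, msg
FROM logs
WHERE json_data ->> 'request_id' = 'd3b07384-badc-4a04'
ORDER BY ts;
```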
Regards,
Thomas Güttler
--
Thomas Guettler http://www.thomas-guettler.de/
On Mon, 22 Aug 2016, 3:40 p.m. Thomas Güttler, <guettliml@thomas-guettler.de>
wrote:
You may decide not to use Elasticsearch but take a look at other components
of Elastic Stack like logstash and beats. They can be helpful even when you
use Postgres as the end point. Otherwise (IMHO), you would spend a lot of
time writing scripts and jobs to capture and stream logs. If I were you, I
would not want to do that.
--
Best Regards
Sameer Kumar | DB Solution Architect
*ASHNIK PTE. LTD.*
101 Cecil Street, #11-11 Tong Eng Building, Singapore 069 533
T: +65 6438 3504 | M: +65 8110 0350
Skype: sameer.ashnik | www.ashnik.com
On 8/22/2016 2:39 AM, Thomas Güttler wrote:
On 19.08.2016 at 19:59, Andy Colson wrote:
On 8/19/2016 2:32 AM, Thomas Güttler wrote:
I want to store logs in a simple table.
[...]
Right now, I have a perl script that reads apache logs and fires off
updates into PG to keep stats. But it's an hourly summary, which the
website turns around and queries the stats to show pretty usage graphs.
You use Perl to read apache logs. Does this work?
Forwarding logs reliably is not easy. Logs are streams, files in unix
are not streams. Sooner or later the files get rotated. RELP exists,
but AFAIK its usage is not widespread:
https://en.wikipedia.org/wiki/Reliable_Event_Logging_Protocol
Let's see how to get the logs into postgres ....
In the end, PG or ES, all depends on what you want.
Most of my logs start from an HTTP request. I want a unique id per request
in every log line which gets created. This way I can trace the request,
even if its impact spans several hosts and systems which do not
receive HTTP requests.
Regards,
Thomas Güttler
Thomas Güttler
I don't read the file. In apache.conf:
# v, countyia, ip, sess, ts, url, query, status
LogFormat
"3,%{countyName}e,%a,%{VCSID}C,%{%Y-%m-%dT%H:%M:%S%z}t,\"%U\",\"%q\",%>s"
csv3
CustomLog "|/usr/local/bin/statSender.pl -r 127.0.0.1" csv3
I think I read somewhere that if you pipe to a script (like above) and
you don't read fast enough, it could slow apache down. That's why the
script above dumps to redis first. That way I can move processes
around, restart the database, etc, etc, and not break apache in any way.
The important part of the script:
while (my $x = <>)
{
    chomp($x);
    next unless ($x);
  try_again:
    if ($redis)
    {
        # push the log line; reconnect and retry on failure
        eval {
            $redis->lpush($qname, $x);
        };
        if ($@)
        {
            $redis = redis_connect();
            goto try_again;
        }
        # just silence this one
        eval {
            $redis->ltrim($qname, 0, 1000);
        };
    }
}
Any other machine, or even multiple, then reads from redis and inserts
into PG.
You can see, in my script, I trim the queue to 1000 items, but that's
because I'm not as worried about losing results. Your setup would
probably be different. I also set up redis to not save anything to disk,
again, because I don't mind if I lose a few hits here or there. But
you get the idea.
-Andy
Certainly Postgres is capable of handling this volume just fine. Throw in
some partition rotation handling and you have a solution.
If you want to play with something different, check out Graylog, which is
backed by Elasticsearch. A bit more work to set up than a single Postgres
table, but it has been a success for us storing syslog, app logs, and
Postgres logs from several hundred network devices, Windows and Linux
servers. Rotation is handled based on your requirements and drilling down
to the details is trivial. Alerting is baked in as well. It could well be
overkill for your needs, but I don't know what your environment looks like.
T
On Mon, Aug 22, 2016 at 7:03 AM, Andy Colson <andy@squeakycode.net> wrote:
You may decide not to use Elasticsearch but take a look at other components of Elastic Stack like logstash and beats.
They can be helpful even when you use Postgres as the end point. Otherwise (IMHO), you would spend a lot of time writing
scripts and jobs to capture and stream logs. If I were you, I would not want to do that.
I wanted to know what Beats is. I found this:
{{{
Beats is the platform for building lightweight, open source data shippers for many types of data you want to enrich
with Logstash, search and analyze in Elasticsearch, and visualize in Kibana.
}}}
Oh what fun. How does "enrich data" work? Imagine you have N bytes. There
are several ways to create N+M bytes from N bytes. I call this bloating,
not enriching. You can analyze data, compress it, visualize it.... But AFAIK
you can't enrich data.
Reading the next pages of the advertisement:
{{{
enriched and combined with other data sources using Logstash
}}}
OK, this way data can be enriched by using other data sources.
Still, I see no use case for Beats.
Do you use it? How does it help you?
I guess I'll take the boring route. I have never used it, but it looks solid:
https://en.wikipedia.org/wiki/Reliable_Event_Logging_Protocol
Regards,
Thomas Güttler
--
Thomas Guettler http://www.thomas-guettler.de/