Performance of ByteA: ascii vs binary

Started by Thomas Güttlerabout 7 years ago19 messagesgeneral
Jump to latest
#1Thomas Güttler
guettliml@thomas-guettler.de

I did some benchmarking and in my setup there was major
performance difference.

I tested a ByteA column.

If I used ascii data the tests took 52 seconds.
If I used random binary data the test took 250 seconds.

binary data is (roughly) five times slower than ascii data?

Is this a know fact, or is there something wrong with my benchmark?

I used Python and psycopg2.

Regards,
Thomas Güttler

--
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines

#2Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Thomas Güttler (#1)
Re: Performance of ByteA: ascii vs binary

On 3/18/19 7:33 AM, Thomas Güttler wrote:

I did some benchmarking and in my setup there was major
performance difference.

I tested a ByteA column.

What was the test?

If I used ascii data the tests took 52 seconds.
If I used random binary data the test took 250 seconds.

binary data is (roughly) five times slower than ascii data?

Is this a know fact, or is there something wrong with my benchmark?

I used Python and psycopg2.

Regards,
  Thomas Güttler

--
Adrian Klaver
adrian.klaver@aklaver.com

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Adrian Klaver (#2)
Re: Performance of ByteA: ascii vs binary

Adrian Klaver <adrian.klaver@aklaver.com> writes:

On 3/18/19 7:33 AM, Thomas Güttler wrote:

If I used ascii data the tests took 52 seconds.
If I used random binary data the test took 250 seconds.

This doesn't seem terribly surprising in bytea_output = escape
mode. Probably with bytea_output = hex the performance would
be less data-dependent.

regards, tom lane

#4Thomas Güttler
guettliml@thomas-guettler.de
In reply to: Tom Lane (#3)
minio server. Was: Performance of ByteA: ascii vs binary

Am 18.03.19 um 15:49 schrieb Tom Lane:

Adrian Klaver <adrian.klaver@aklaver.com> writes:

On 3/18/19 7:33 AM, Thomas Güttler wrote:

If I used ascii data the tests took 52 seconds.
If I used random binary data the test took 250 seconds.

This doesn't seem terribly surprising in bytea_output = escape
mode. Probably with bytea_output = hex the performance would
be less data-dependent.

Thank you for your fast reply.

I guess we will use a minio server to store the blobs.

--
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines

#5Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Thomas Güttler (#4)
Re: minio server. Was: Performance of ByteA: ascii vs binary

On 3/18/19 7:59 AM, Thomas Güttler wrote:

Am 18.03.19 um 15:49 schrieb Tom Lane:

Adrian Klaver <adrian.klaver@aklaver.com> writes:

On 3/18/19 7:33 AM, Thomas Güttler wrote:

If I used ascii data the tests took 52 seconds.
If I used random binary data the test took 250 seconds.

This doesn't seem terribly surprising in bytea_output = escape
mode.  Probably with bytea_output = hex the performance would
be less data-dependent.

Thank you for your fast reply.

I guess we will use a minio server to store the blobs.

According to this:

http://initd.org/psycopg/docs/usage.html?highlight=bytea#adapt-binary

psycopg2 will use 'hex' by default.

Have you tried the test outside Python/psycopg2 e.g in psql to see if
the performance hit still exists?

--
Adrian Klaver
adrian.klaver@aklaver.com

#6Thomas Güttler
guettliml@thomas-guettler.de
In reply to: Adrian Klaver (#5)
Re: minio server. Was: Performance of ByteA: ascii vs binary

Am 18.03.19 um 17:52 schrieb Adrian Klaver:

On 3/18/19 7:59 AM, Thomas Güttler wrote:

Am 18.03.19 um 15:49 schrieb Tom Lane:

Adrian Klaver <adrian.klaver@aklaver.com> writes:

On 3/18/19 7:33 AM, Thomas Güttler wrote:

If I used ascii data the tests took 52 seconds.
If I used random binary data the test took 250 seconds.

This doesn't seem terribly surprising in bytea_output = escape
mode.  Probably with bytea_output = hex the performance would
be less data-dependent.

Thank you for your fast reply.

I guess we will use a minio server to store the blobs.

According to this:

http://initd.org/psycopg/docs/usage.html?highlight=bytea#adapt-binary

psycopg2 will use 'hex' by default.

Have you tried the test outside Python/psycopg2 e.g in psql to see if the performance hit still exists?

No, I did not test it outside Python/psycopg2.

Since our code will use Python/psycopg2 I did not run a test outside.

The decision was made. Now I am busy learning minio. I use it for the first time.

Regards,
Thomas Güttler

--
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines

#7Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Thomas Güttler (#1)
Re: Performance of ByteA: ascii vs binary

On 2019-03-18 15:33:17 +0100, Thomas Güttler wrote:

I did some benchmarking and in my setup there was major
performance difference.

I tested a ByteA column.

If I used ascii data the tests took 52 seconds.
If I used random binary data the test took 250 seconds.

binary data is (roughly) five times slower than ascii data?

Is this a know fact, or is there something wrong with my benchmark?

I used Python and psycopg2.

I don't see this here (Debian 9, Python 3,5, psycopg2 2.7.6, PostgreSQL
9.5).

I modified my bench-bytea script (https://github.com/hjp/blob-bench) to
restrict the byte values to printable ASCII (32 .. 126). There was
absolutely no difference, as the attached graph shows.

hp

--
_ | Peter J. Holzer | we build much bigger, better disasters now
|_|_) | | because we have much more sophisticated
| | | hjp@hjp.at | management tools.
__/ | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/&gt;

Attachments:

ascii-binary.pngimage/pngDownload
#8Thomas Güttler
guettliml@thomas-guettler.de
In reply to: Peter J. Holzer (#7)
Re: Performance of ByteA: ascii vs binary

Am 19.03.19 um 20:37 schrieb Peter J. Holzer:

On 2019-03-18 15:33:17 +0100, Thomas G�ttler wrote:

I did some benchmarking and in my setup there was major
performance difference.

I tested a ByteA column.

If I used ascii data the tests took 52 seconds.
If I used random binary data the test took 250 seconds.

binary data is (roughly) five times slower than ascii data?

Is this a know fact, or is there something wrong with my benchmark?

I used Python and psycopg2.

I don't see this here (Debian 9, Python 3,5, psycopg2 2.7.6, PostgreSQL
9.5).

I modified my bench-bytea script (https://github.com/hjp/blob-bench) to
restrict the byte values to printable ASCII (32 .. 126). There was
absolutely no difference, as the attached graph shows.

Strange. I saw a big difference.
What did you test?
I tested inserts.

Regards,
Thomas G�ttler

--
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines

#9Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Thomas Güttler (#8)
Re: Performance of ByteA: ascii vs binary

On 2019-03-20 13:20:57 +0100, Thomas Güttler wrote:

Am 19.03.19 um 20:37 schrieb Peter J. Holzer:

On 2019-03-18 15:33:17 +0100, Thomas Güttler wrote:

I did some benchmarking and in my setup there was major
performance difference.

I tested a ByteA column.

If I used ascii data the tests took 52 seconds.
If I used random binary data the test took 250 seconds.

binary data is (roughly) five times slower than ascii data?

Is this a know fact, or is there something wrong with my benchmark?

I used Python and psycopg2.

I don't see this here (Debian 9, Python 3,5, psycopg2 2.7.6, PostgreSQL
9.5).

I modified my bench-bytea script (https://github.com/hjp/blob-bench) to
restrict the byte values to printable ASCII (32 .. 126). There was
absolutely no difference, as the attached graph shows.

Strange. I saw a big difference.
What did you test?
I tested inserts.

The graph with the quantiles was for selects. For inserts I made only a
scatterplot and there didn't seem to be any difference either. I'll
check the quantiles, too, but I expect to see at most a small
difference. A large difference as the one you saw would show up clearly
in a scatterplot, too.

My test ist in github (see URL above). Do see the same difference with
my test on your system?

hp

--
_ | Peter J. Holzer | we build much bigger, better disasters now
|_|_) | | because we have much more sophisticated
| | | hjp@hjp.at | management tools.
__/ | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/&gt;

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter J. Holzer (#9)
Re: Performance of ByteA: ascii vs binary

"Peter J. Holzer" <hjp-pgsql@hjp.at> writes:

On 2019-03-20 13:20:57 +0100, Thomas Güttler wrote:

Strange. I saw a big difference.
What did you test?
I tested inserts.

The graph with the quantiles was for selects.

Hmm, so there are two different code paths being considered here
-- the OP is apparently talking about the speed of bytea_in,
while the rest of us have been thinking about bytea_out.

The question about whether the text representation is hex or
"escape" style still applies, though.

regards, tom lane

#11Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tom Lane (#10)
Re: Performance of ByteA: ascii vs binary

On 3/21/19 6:49 AM, Tom Lane wrote:

"Peter J. Holzer" <hjp-pgsql@hjp.at> writes:

On 2019-03-20 13:20:57 +0100, Thomas Güttler wrote:

Strange. I saw a big difference.
What did you test?
I tested inserts.

The graph with the quantiles was for selects.

Hmm, so there are two different code paths being considered here
-- the OP is apparently talking about the speed of bytea_in,
while the rest of us have been thinking about bytea_out.

The question about whether the text representation is hex or
"escape" style still applies, though.

Yes, it would be nice to see the test code.

regards, tom lane

--
Adrian Klaver
adrian.klaver@aklaver.com

#12Thomas Güttler
guettliml@thomas-guettler.de
In reply to: Thomas Güttler (#1)
Script which shows performance of ByteA: ascii vs binary

Thank you for asking several times for a benchmark.

I wrote it now and it is visible: inserting random bytes into bytea is much slower,
if you use the psycopg2 defaults.

Here is the chart:
https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.png

And here is the script which creates the chart:
https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.py

Please let me know if there is something wrong or could get improved.

Regards,
Thomas Güttler

Am 18.03.19 um 15:33 schrieb Thomas Güttler:

I did some benchmarking and in my setup there was major
performance difference.

I tested a ByteA column.

If I used ascii data the tests took 52 seconds.
If I used random binary data the test took 250 seconds.

binary data is (roughly) five times slower than ascii data?

Is this a know fact, or is there something wrong with my benchmark?

I used Python and psycopg2.

Regards,
  Thomas Güttler

--
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines

#13Francisco Olarte
folarte@peoplecall.com
In reply to: Thomas Güttler (#12)
Re: Script which shows performance of ByteA: ascii vs binary

Thomas:

On Fri, Mar 22, 2019 at 11:22 AM Thomas Güttler
<guettliml@thomas-guettler.de> wrote:

Thank you for asking several times for a benchmark.
I wrote it now and it is visible: inserting random bytes into bytea is much slower,
if you use the psycopg2 defaults.
Here is the chart:
https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.png
And here is the script which creates the chart:
https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.py

I'm not too sure, but I read ( in the code ) you are measuring a
nearly not compressible urandom data againtst a highly compressible (
'x'*i ) data,
are you sure the difference is not due to data being compressed and
generating much less disk usage in toast-tables/wal?

Francisco Olarte.

In reply to: Francisco Olarte (#13)
Re: Script which shows performance of ByteA: ascii vs binary

On Fri, Mar 22, 2019 at 01:40:28PM +0100, Francisco Olarte wrote:

Thomas:

On Fri, Mar 22, 2019 at 11:22 AM Thomas G�ttler
<guettliml@thomas-guettler.de> wrote:

Thank you for asking several times for a benchmark.
I wrote it now and it is visible: inserting random bytes into bytea is much slower,
if you use the psycopg2 defaults.
Here is the chart:
https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.png
And here is the script which creates the chart:
https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.py

I'm not too sure, but I read ( in the code ) you are measuring a
nearly not compressible urandom data againtst a highly compressible (
'x'*i ) data,
are you sure the difference is not due to data being compressed and
generating much less disk usage in toast-tables/wal?

Francisco Olarte.

+1

Regards,
Ken

#15Thomas Güttler
guettliml@thomas-guettler.de
In reply to: Francisco Olarte (#13)
Re: Script which shows performance of ByteA: ascii vs binary

Am 22.03.19 um 13:40 schrieb Francisco Olarte:

Thomas:

On Fri, Mar 22, 2019 at 11:22 AM Thomas Güttler
<guettliml@thomas-guettler.de> wrote:

Thank you for asking several times for a benchmark.
I wrote it now and it is visible: inserting random bytes into bytea is much slower,
if you use the psycopg2 defaults.
Here is the chart:
https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.png
And here is the script which creates the chart:
https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.py

I'm not too sure, but I read ( in the code ) you are measuring a
nearly not compressible urandom data againtst a highly compressible (
'x'*i ) data,
are you sure the difference is not due to data being compressed and
generating much less disk usage in toast-tables/wal?

+1

for this case toast-tables/wal is a detail of the implementation.
This tests does not care about the "why it takes longer". It just generates
a performance chart.

Yes, it does exactly what you say: it compares
nearly not compressible urandom data against a highly compressible data.

In my case, will get nearly random data (binary PDF, JPG, ...). And that's why
I wanted to benchmark it.

Regards,
Thomas

--
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines

#16Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Thomas Güttler (#15)
Re: Script which shows performance of ByteA: ascii vs binary

On 3/22/19 6:04 AM, Thomas Güttler wrote:

Am 22.03.19 um 13:40 schrieb Francisco Olarte:

Thomas:

On Fri, Mar 22, 2019 at 11:22 AM Thomas Güttler
<guettliml@thomas-guettler.de> wrote:

Thank you for asking several times for a benchmark.
I wrote it now and it is visible: inserting random bytes into bytea
is much slower,
if you use the psycopg2 defaults.
Here is the chart:

https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.png

And here is the script which creates the chart:

https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.py

I'm not too sure, but I read ( in the code ) you are measuring a
nearly not compressible urandom data againtst a highly compressible (
'x'*i ) data,
are you sure the difference is not due to data being compressed and
generating much less disk usage in toast-tables/wal?

+1

for this case toast-tables/wal is a detail of the implementation.
This tests does not care about the "why it takes longer". It just generates
a performance chart.

TOAST is tunable, might want to take a look at:

https://www.postgresql.org/docs/11/storage-toast.html

Yes, it does exactly what you say: it compares
nearly not compressible urandom data against a highly compressible data.

In my case, will get nearly random data (binary PDF, JPG, ...). And
that's why
I wanted to benchmark it.

Regards,
  Thomas

--
Adrian Klaver
adrian.klaver@aklaver.com

#17Francisco Olarte
folarte@peoplecall.com
In reply to: Thomas Güttler (#15)
Re: Script which shows performance of ByteA: ascii vs binary

Thomas:

On Fri, Mar 22, 2019 at 2:03 PM Thomas Güttler
<guettliml@thomas-guettler.de> wrote:

I'm not too sure, but I read ( in the code ) you are measuring a
nearly not compressible urandom data againtst a highly compressible (

...

for this case toast-tables/wal is a detail of the implementation.
This tests does not care about the "why it takes longer". It just generates
a performance chart.
Yes, it does exactly what you say: it compares
nearly not compressible urandom data against a highly compressible data.
In my case, will get nearly random data (binary PDF, JPG, ...). And that's why
I wanted to benchmark it.

Well, if all you wanted is to benchmark, a performance chart, you know
it. I assumed you wanted to know more, like where the bottleneck is
and how to try to avoid it. My fault.

I was specifically confused because, IIRC, you said ascii data took
much longer than binary, which is a completely different test ( you
can test ascii vs binary by generating
chunks of random numbers in say, the 0 to 127 and sending that data
once as is and once more shifting left by one every byte. That should
test ascii-binary differences, but
if you test random vs uniform I think the problem is in the
randomness, you could just test sending '\x92'*i, or something
similar, but if you are just interested in the benchmark, you have it
done.

Just one thing, a single graph with two labels "ascii" and "random" is
misleading, as constant/random is orthogonal with ascii/binary, but as
I said, my fault, all the text about psycopg and other stuff led me to
think you wanted some kind of diagnosis and improvements.

Regards.

#18Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Francisco Olarte (#13)
Re: Script which shows performance of ByteA: ascii vs binary

On 2019-03-22 13:40:28 +0100, Francisco Olarte wrote:

On Fri, Mar 22, 2019 at 11:22 AM Thomas Güttler
<guettliml@thomas-guettler.de> wrote:

Thank you for asking several times for a benchmark.
I wrote it now and it is visible: inserting random bytes into bytea
is much slower, if you use the psycopg2 defaults.
Here is the chart:
https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.png
And here is the script which creates the chart:
https://github.com/guettli/misc/blob/master/bench-bytea-inserts-postrgres.py

I'm not too sure, but I read ( in the code ) you are measuring a
nearly not compressible urandom data againtst a highly compressible (
'x'*i ) data,

Yes, that seems to be the main difference. My "ascii" test creates
random data in the range [32, 126], which is not very compressible, and
I didn't see much of a difference to the full range (10th percentile and
median were the same, 90th percentile was noticeably better). If I
create "random" data in the range [120, 120], I also get a large speedup
(about 3.5 times). Interestingly the difference vanishes for large (> 10
MB) blobs.

are you sure the difference is not due to data being compressed and
generating much less disk usage in toast-tables/wal?

Yes, I think that's it: He is basically measuring how fast his CPU
can compress a stream of constant bytes. Not very meaningful.

Another difference I noticed between our benchmarks is that I used a
plain bytes object while he used a psycopg2.Binary object. Those might
be serialized differently, but since the speed difference is adequately
explained by the (lack of) randomness, I am not going to investigate
this.

hp

--
_ | Peter J. Holzer | we build much bigger, better disasters now
|_|_) | | because we have much more sophisticated
| | | hjp@hjp.at | management tools.
__/ | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/&gt;

In reply to: Thomas Güttler (#1)
Re: Performance of ByteA: ascii vs binary

Just theoretically assumption. PostgreSQL sometimes may optimise internal format of data and can sometimes zip data. ASCII data can be zipped better, then binary random data. Also PostgreSQL sometimes take decision to keep a column in an external file, if the column is still too big after zip. I don’t know what exactly happens in your case, but here can be a reason.

Show quoted text

18 марта 2019 г., в 17:33, Thomas Güttler <guettliml@thomas-guettler.de> написал(а):

I did some benchmarking and in my setup there was major
performance difference.

I tested a ByteA column.

If I used ascii data the tests took 52 seconds.
If I used random binary data the test took 250 seconds.

binary data is (roughly) five times slower than ascii data?

Is this a know fact, or is there something wrong with my benchmark?

I used Python and psycopg2.

Regards,
Thomas Güttler

--
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines