[WIP] Zipfian distribution in pgbench
Hello!
PostgreSQL shows very bad results in YCSB Workload A (50% SELECTs and 50% UPDATEs of a random row by PK) when benchmarking with a large number of clients using a Zipfian distribution. MySQL also shows a decline, but it is not as significant as in PostgreSQL. MongoDB does not decline at all. If pgbench had a Zipfian-distribution random number generator, everyone would be able to do research on this topic without using YCSB.
This is why I am currently working on a random_zipfian function.
The bottleneck of the algorithm I use is that it calculates the zeta function (computing the partial sum has linear complexity in the number of items; see https://en.wikipedia.org/wiki/Riemann_zeta_function). This may cause problems when generating a huge number of values over large ranges.
That’s why I added caching for the zeta value. It works well when random_zipfian is called with the same parameters throughout the script. For example:
…
\set a random_zipfian(1, 100, 1.2)
\set b random_zipfian(1, 100, 1.2)
…
Otherwise, the second call overwrites the cache of the first, and caching makes no sense:
…
\set a random_zipfian(1, 100, 1.2)
\set b random_zipfian(1, 200, 1.4)
…
Hence my question: should I implement caching of zeta values for calls with different parameters, or not?
P.S. I am attaching the patch and scripts, an analogue of YCSB Workload A.
Run the benchmark with:
$ pgbench -f ycsb_read_zipf.sql -f ycsb_update_zipf.sql
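For reference, the scripts follow the usual pgbench pattern, roughly like this (the key range and the Zipfian parameter shown here are only illustrative, not the exact attached values):

ycsb_read_zipf.sql:
\set aid random_zipfian(1, 100000 * :scale, 1.01)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

ycsb_update_zipf.sql:
\set aid random_zipfian(1, 100000 * :scale, 1.01)
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;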
At scale = 10 (1 million rows) it gives the following results on a machine with 144 cores (with synchronous_commit=off):
nclients        tps
       1     8842.401870
       2    18358.140869
       4    45999.378785
       8    88713.743199
      16   170166.998212
      32   290069.221493
      64   178128.030553
     128    88712.825602
     256    38364.937573
     512    13512.765878
    1000     6188.136736
Attachments:
pgbench-zipf-01v.patch
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 64b043b48a..1dea6e4b17 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -1014,6 +1014,14 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
<entry>an integer between <literal>1</> and <literal>10</></>
</row>
<row>
+ <entry><literal><function>random_zipfian(<replaceable>lb</>, <replaceable>ub</>, <replaceable>parameter</>)</></></>
+ <entry>integer</>
+ <entry> Zipfian-distributed random integer in <literal>[lb, ub]</>,
+ see below</>
+ <entry><literal>random_zipfian(1, 10, 1.2)</></>
+ <entry>an integer between <literal>1</> and <literal>10</></>
+ </row>
+ <row>
<entry><literal><function>sqrt(<replaceable>x</>)</></></>
<entry>double</>
<entry>square root</>
@@ -1094,6 +1102,24 @@ f(x) = PHI(2.0 * parameter * (x - mu) / (max - min + 1)) /
of the Box-Muller transform.
</para>
</listitem>
+
+ <listitem>
+ <para>
+ For the Zipfian distribution, <replaceable>parameter</>
+ defines how skewed the distribution is. A physical sense of this
+ parameter is the following: if the value <literal>1</> is generated
+ <literal>N</> times, then <literal>2</> is generated
+ <literal>2 ^ parameter</> times less often, <literal>3</> is
+ generated <literal>3 ^ parameter</> times less often, ...,
+ <literal>X</> is generated <literal>X ^ parameter</> times less often.
+ </para>
+ <para>
+ Intuitively, the larger the <replaceable>parameter</>, the more
+ frequently values close to the beginning of the interval are drawn.
+ The closer <replaceable>parameter</> is to 0,
+ the flatter (more uniform) the distribution.
+ </para>
+ </listitem>
</itemizedlist>
<para>
diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y
index b3a2d9bfd3..25d5ad48e5 100644
--- a/src/bin/pgbench/exprparse.y
+++ b/src/bin/pgbench/exprparse.y
@@ -191,6 +191,9 @@ static const struct
{
"random_exponential", 3, PGBENCH_RANDOM_EXPONENTIAL
},
+ {
+ "random_zipfian", 3, PGBENCH_RANDOM_ZIPFIAN
+ },
/* keep as last array element */
{
NULL, 0, 0
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 4d364a1db4..9e515abe9b 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -94,6 +94,7 @@ static int pthread_join(pthread_t th, void **thread_return);
#define DEFAULT_NXACTS 10 /* default nxacts */
#define MIN_GAUSSIAN_PARAM 2.0 /* minimum parameter for gauss */
+#define MIN_ZIPFIAN_PARAM 1.0000001 /* minimum parameter for zipfian */
int nxacts = 0; /* number of transactions per client */
int duration = 0; /* duration in seconds */
@@ -184,6 +185,14 @@ char *dbName;
char *logfile_prefix = NULL;
const char *progname;
+/* Variables used in zipfian_random */
+double zipf_zetan; /* zeta(n) */
+double zipf_prev_theta; /* theta parameter of previous execution */
+double zipf_prev_nitems; /* number of items (ub - lb + 1) in previous execution */
+double zipf_alpha;
+double zipf_beta;
+double zipf_eta;
+
#define WSEP '@' /* weight separator */
volatile bool timer_exceeded = false; /* flag from signal handler */
@@ -737,6 +746,58 @@ getPoissonRand(TState *thread, int64 center)
return (int64) (-log(uniform) * ((double) center) + 0.5);
}
+/* helper function for getZipfianRand */
+static double
+zeta(int64 n, double theta)
+{
+ int64 i; /* int64, since n may exceed the range of int */
+ double ans = 0.0;
+
+ for (i = 1; i <= n; i++)
+ ans += pow(1. / (double)i, theta);
+ return (ans);
+}
+
+/* helper function for getZipfianRand. Computes random variable in range 1..n */
+static int64
+zipfn(TState *thread, int64 n, double theta)
+{
+ double u = pg_erand48(thread->random_state);
+ double uz = u * zipf_zetan;
+
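+ /*
+ * Inverse transform sampling: the two most frequent values are
+ * handled exactly, the rest via the analytic approximation of
+ * Gray et al., "Quickly Generating Billion-Record Synthetic
+ * Databases" (SIGMOD '94), which this generator follows.
+ */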
+ if (uz < 1.)
+ return 1;
+ if (uz < 1 + zipf_beta)
+ return 2;
+ return 1 + (int64)(n * pow(zipf_eta * u - zipf_eta + 1., zipf_alpha));
+}
+
+/* random number generator: zipfian distribution from min to max inclusive */
+static int64
+getZipfianRand(TState *thread, int64 min, int64 max, double theta)
+{
+ int64 n = max - min + 1;
+ double zipf_zeta2;
+
+ Assert(theta > MIN_ZIPFIAN_PARAM);
+
+ /* update cached variables if current parameters are different from previous */
+ if (n != zipf_prev_nitems || theta != zipf_prev_theta)
+ {
+ zipf_prev_theta = theta;
+ zipf_prev_nitems = n;
+
+ zipf_zeta2 = zeta(2, theta);
+ zipf_zetan = zeta(n, theta);
+
+ zipf_alpha = 1. / (1 - theta);
+ zipf_beta = pow(0.5, theta);
+ zipf_eta = (1. - pow(2. / n, 1 - theta)) / (1. - zipf_zeta2 / zipf_zetan);
+ }
+
+ return min + zipfn(thread, n, theta) - 1;
+}
+
/*
* Initialize the given SimpleStats struct to all zeroes
*/
@@ -1567,6 +1628,7 @@ evalFunc(TState *thread, CState *st,
case PGBENCH_RANDOM:
case PGBENCH_RANDOM_EXPONENTIAL:
case PGBENCH_RANDOM_GAUSSIAN:
+ case PGBENCH_RANDOM_ZIPFIAN:
{
int64 imin,
imax;
@@ -1617,6 +1679,18 @@ evalFunc(TState *thread, CState *st,
setIntValue(retval,
getGaussianRand(thread, imin, imax, param));
}
+ else if (func == PGBENCH_RANDOM_ZIPFIAN)
+ {
+ if (param <= MIN_ZIPFIAN_PARAM)
+ {
+ fprintf(stderr,
+ "zipfian parameter must be greater than %f "
+ "(not %f)\n", MIN_ZIPFIAN_PARAM, param);
+ return false;
+ }
+ setIntValue(retval,
+ getZipfianRand(thread, imin, imax, param));
+ }
else /* exponential */
{
if (param <= 0.0)
diff --git a/src/bin/pgbench/pgbench.h b/src/bin/pgbench/pgbench.h
index abc13e9463..1a29f1260c 100644
--- a/src/bin/pgbench/pgbench.h
+++ b/src/bin/pgbench/pgbench.h
@@ -75,7 +75,8 @@ typedef enum PgBenchFunction
PGBENCH_SQRT,
PGBENCH_RANDOM,
PGBENCH_RANDOM_GAUSSIAN,
- PGBENCH_RANDOM_EXPONENTIAL
+ PGBENCH_RANDOM_EXPONENTIAL,
+ PGBENCH_RANDOM_ZIPFIAN
} PgBenchFunction;
typedef struct PgBenchExpr PgBenchExpr;
Hello Alik,
> PostgreSQL shows very bad results in YCSB Workload A (50% SELECTs and
> 50% UPDATEs of a random row by PK) when benchmarking with a large number
> of clients using a Zipfian distribution. MySQL also shows a decline, but
> it is not as significant as in PostgreSQL. MongoDB does not decline at
> all. If pgbench had a Zipfian-distribution random number generator,
> everyone would be able to do research on this topic without using YCSB.
Your description is not very precise. What version of Postgres is used? If
there is a decline, compared to which version? Is there a link to these
results?
> This is why I am currently working on a random_zipfian function.
> The bottleneck of the algorithm I use is that it calculates the zeta
> function (computing the partial sum has linear complexity in the number
> of items; see https://en.wikipedia.org/wiki/Riemann_zeta_function). This
> may cause problems when generating a huge number of values over large
> ranges.
Indeed, the function computation is overly expensive, and the numerical
precision of the implementation is doubtful.
If there is no better way to compute this function, ISTM that it should be
summed in reverse order to accumulate small values first, from (1/n)^s +
... + (1/2)^s. As 1/1 == 1, the corresponding term is 1, so there is no
point in calling pow for that one; it could be:
double ans = 0.0;
for (i = n; i >= 2; i--)
    ans += pow(1. / i, theta);
return 1.0 + ans;
> That’s why I added caching for the zeta value. It works well when
> random_zipfian is called with the same parameters throughout the script.
> For example: [...]
> Hence my question: should I implement caching of zeta values for calls
> with different parameters, or not?
I do not envision the random_zipfian function being used widely, but if
it is useful to someone, that is fine with me. Could an inexpensive
exponential distribution be used instead for the same benchmarking
purpose?
If the function, when actually used, is likely to be called with
different parameters, then some caching beyond the last value would seem
in order. Maybe a small fixed-size array?
However, it should be somehow thread-safe, which does not seem to be the
case with the current implementation. Maybe a per-thread cache? Or use a
lock only to update a shared cache? At least it should avoid locking to
read values...
> P.S. I am attaching the patch and scripts, an analogue of YCSB Workload A.
> Run the benchmark with:
> $ pgbench -f ycsb_read_zipf.sql -f ycsb_update_zipf.sql
Given the explanations, the random draw mostly hits values at the
beginning of the interval, so when the number of clients goes higher, one
just gets locking contention on the updated rows?
ISTM that also having the tps achieved with a flat distribution would
allow checking this hypothesis.
> At scale = 10 (1 million rows) it gives the following results on a
> machine with 144 cores (with synchronous_commit=off):
>
> nclients        tps
>        1     8842.401870
>        2    18358.140869
>        4    45999.378785
>        8    88713.743199
>       16   170166.998212
>       32   290069.221493
>       64   178128.030553
>      128    88712.825602
>      256    38364.937573
>      512    13512.765878
>     1000     6188.136736
--
Fabien.
On Fri, Jul 7, 2017 at 3:45 AM, Alik Khilazhev
<a.khilazhev@postgrespro.ru> wrote:
> PostgreSQL shows very bad results in YCSB Workload A (50% SELECTs and 50% UPDATEs of a random row by PK) when benchmarking with a large number of clients using a Zipfian distribution. MySQL also shows a decline, but it is not as significant as in PostgreSQL. MongoDB does not decline at all.
How is that possible? In a Zipfian distribution, no matter how big
the table is, almost all of the updates will be concentrated on a
handful of rows - and updates to any given row are necessarily
serialized, or so I would think. Maybe MongoDB can be fast there
since there are no transactions, so it can just lock the row, slam in
the new value, and unlock the row, all (I suppose) without writing WAL
or doing anything hard. But MySQL is going to have to hold the row
lock until transaction commit just like we do, or so I would think.
Is it just that their row locking is way faster than ours?
I'm more curious about why we're performing badly than I am about a
general-purpose random_zipfian function. :-)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jul 7, 2017 at 5:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> How is that possible? In a Zipfian distribution, no matter how big
> the table is, almost all of the updates will be concentrated on a
> handful of rows - and updates to any given row are necessarily
> serialized, or so I would think. Maybe MongoDB can be fast there
> since there are no transactions, so it can just lock the row, slam in
> the new value, and unlock the row, all (I suppose) without writing WAL
> or doing anything hard.
If you're not using the WiredTiger storage engine, then the locking
is at the document level, which means that a Zipfian distribution is
no worse than any other as far as lock contention goes. That's one
possible explanation. Another is that index-organized tables
naturally have much better locality, which matters at every level of
the memory hierarchy.
> I'm more curious about why we're performing badly than I am about a
> general-purpose random_zipfian function. :-)
I'm interested in both. I think that a random_zipfian function would
be quite helpful for modeling certain kinds of performance problems,
like CPU cache misses incurred at the page level.
--
Peter Geoghegan
On Fri, Jul 7, 2017 at 12:45 AM, Alik Khilazhev
<a.khilazhev@postgrespro.ru> wrote:
> At scale = 10 (1 million rows) it gives the following results on a machine with 144 cores (with synchronous_commit=off):
>
> nclients        tps
>        1     8842.401870
>        2    18358.140869
>        4    45999.378785
>        8    88713.743199
>       16   170166.998212
>       32   290069.221493
>       64   178128.030553
>      128    88712.825602
>      256    38364.937573
>      512    13512.765878
>     1000     6188.136736
Is it possible for you to instrument the number of B-Tree page
accesses using custom instrumentation for pgbench_accounts_pkey?
If that seems like too much work, then it would still be interesting
to see what the B-Tree keyspace looks like before and after varying
the "nclient" count from, say, 32 to 128. Maybe there is a significant
difference in how balanced or skewed it is in each case. Or, the index
could simply be more bloated.
There is a query that I sometimes use, that itself uses pageinspect,
to summarize the keyspace quickly. It shows you the highkey for every
internal page, starting from the root and working down to the lowest
internal page level (the one just before the leaf level -- level 1),
in logical/keyspace order. You can use it to visualize the
distribution of values. It could easily include the leaf level, too,
but that's less interesting and tends to make the query take ages. I
wonder what the query will show here.
Here is the query:
with recursive index_details as (
select
'pgbench_accounts_pkey'::text idx
),
size_in_pages_index as (
select
(pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
from
index_details
),
page_stats as (
select
index_details.*,
stats.*
from
index_details,
size_in_pages_index,
lateral (select i from generate_series(1, size_pages - 1) i) series,
lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
select
*
from
page_stats
where
type != 'l'),
meta_stats as (
select
*
from
index_details s,
lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
select
*
from
internal_page_stats
order by
btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
select
1,
blkno,
btpo
from
internal_items
where
btpo_prev = 0
and btpo = (select level from meta_stats)
union
select
case when level = btpo then o.item + 1 else 1 end,
blkno,
btpo
from
internal_items i,
ordered_internal_items o
where
i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
idx,
btpo as level,
item as l_item,
blkno,
btpo_prev,
btpo_next,
btpo_flags,
type,
live_items,
dead_items,
avg_item_size,
page_size,
free_size,
-- Only non-rightmost pages have high key.
case when btpo_next != 0 then (select data from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
ordered_internal_items o
join internal_items i on o.blk = i.blkno
order by btpo desc, item;
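In case anyone wants to run it: the query relies on the pageinspect
extension (bt_metap, bt_page_stats, bt_page_items), and the (2^13)
divisor assumes the default 8 kB block size, so first:

create extension if not exists pageinspect;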
--
Peter Geoghegan
Peter Geoghegan wrote:
> Here is the query:
> with recursive index_details as (
> select
> 'pgbench_accounts_pkey'::text idx
> ), [...]
Hmm, this seems potentially very useful. Care to upload it to
https://wiki.postgresql.org/wiki/Category:Snippets ?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jul 7, 2017 at 12:59 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Hmm, this seems potentially very useful. Care to upload it to
> https://wiki.postgresql.org/wiki/Category:Snippets ?
Sure. I've added it here, under "index maintenance":
https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index
It would be a nice additional touch if there was an easy way of taking
the on-disk representation of index tuples (in this case that would be
little-endian signed integers from bt_page_items()), and from that
output actual typed values. Maybe just for a few select datatypes.
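For a single int4 key you can already kludge it in SQL, exploiting the
little-endian byte order. A sketch: '39 30 00 00' stands for the first
four bytes of a data column (here the value 12345); any trailing
alignment padding in the column would have to be trimmed off first.

select ('x' || string_agg(h, '' order by ord desc))::bit(32)::int as decoded
from unnest(string_to_array('39 30 00 00', ' ')) with ordinality as t(h, ord);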
--
Peter Geoghegan
Hello, Fabien!
> Your description is not very precise. What version of Postgres is used? If there is a decline, compared to which version? Is there a link to these results?
The benchmark was done on master (v10). I am attaching an image with the results.
> Indeed, the function computation is overly expensive, and the numerical precision of the implementation is doubtful.
> If there is no better way to compute this function, ISTM that it should be summed in reverse order to accumulate small values first, from (1/n)^s + ... + (1/2)^s. As 1/1 == 1, the corresponding term is 1, so there is no point in calling pow for that one; it could be:
> double ans = 0.0;
> for (i = n; i >= 2; i--)
>     ans += pow(1. / i, theta);
> return 1.0 + ans;
You are right, it’s better to reverse the order.
> If the function, when actually used, is likely to be called with different parameters, then some caching beyond the last value would seem in order. Maybe a small fixed-size array?
> However, it should be somehow thread-safe, which does not seem to be the case with the current implementation. Maybe a per-thread cache? Or use a lock only to update a shared cache? At least it should avoid locking to read values…
Yes, I forgot about thread-safety. I will implement a per-thread cache with a small fixed-size array.
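Something along these lines, perhaps (a rough sketch only: the names,
cache size, and eviction policy are illustrative; it reuses the zeta()
helper from the patch, and one such cache would live in each thread's
TState, so no locking is needed):

#define ZIPF_CACHE_SIZE 8 /* cache cells per thread; size is arbitrary */

typedef struct
{
    int64  nitems; /* number of items (ub - lb + 1) */
    double theta;  /* zipfian parameter */
    double zetan;  /* cached zeta(nitems, theta) */
    double alpha;  /* 1 / (1 - theta) */
    double beta;   /* pow(0.5, theta) */
    double eta;    /* interpolation constant of the generator */
} ZipfCell;

typedef struct
{
    ZipfCell cells[ZIPF_CACHE_SIZE];
    int      nb_cells; /* number of cells in use */
    int      current;  /* next victim for round-robin eviction */
} ZipfCache;           /* one instance per thread, hence no locks */

/* look up (nitems, theta); on a miss, compute and cache the constants */
static ZipfCell *
zipfCacheLookup(ZipfCache *cache, int64 nitems, double theta)
{
    int       i;
    ZipfCell *cell;

    for (i = 0; i < cache->nb_cells; i++)
        if (cache->cells[i].nitems == nitems &&
            cache->cells[i].theta == theta)
            return &cache->cells[i];

    /* miss: take a free slot if any, else evict round-robin */
    if (cache->nb_cells < ZIPF_CACHE_SIZE)
        i = cache->nb_cells++;
    else
    {
        i = cache->current;
        cache->current = (cache->current + 1) % ZIPF_CACHE_SIZE;
    }
    cell = &cache->cells[i];
    cell->nitems = nitems;
    cell->theta = theta;
    cell->zetan = zeta(nitems, theta);
    cell->alpha = 1. / (1. - theta);
    cell->beta = pow(0.5, theta);
    cell->eta = (1. - pow(2. / nitems, 1. - theta)) /
                (1. - zeta(2, theta) / cell->zetan);
    return cell;
}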
> Given the explanations, the random draw mostly hits values at the beginning of the interval, so when the number of clients goes higher, one just gets locking contention on the updated rows?
Yes, exactly.
> ISTM that also having the tps achieved with a flat distribution would allow checking this hypothesis.
On Workload A with a uniform distribution, PostgreSQL shows better results than MongoDB and MySQL (see attachment). You can also notice that for a small number of clients, the type of distribution does not affect tps on MySQL.
It is also important to mention that Postgres ran with synchronous_commit=off; to match that level of durability, MongoDB ran with writeConcern=1&journaled=false. In this mode there is a possibility of losing all changes from the last second. If we ran Postgres with maximum durability, MongoDB would lag far behind.
---
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
Attachments:
ycsb_workload_a.png (plot of YCSB Workload A results)