Re: gaussian distribution pgbench

Started by Mitsumasa KONDO over 11 years ago, 40 messages

kondo.mitsumasa@gmail.com

over 11 years ago

1 attachment(s)

Hello Fabien-san,

I have checked your v13 patch and tested the new exponential distribution
generating algorithm. It works fine, with little or no overhead compared
with the previous version.
Great work! And I agree with your proposal.

I'm also interested in your "decile percents" output, as in the
following:

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=20
~
decile percents: 86.5% 11.7% 1.6% 0.2% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
~
[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
~
decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
~
[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=5
~
decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%
~

I think this output makes it easy to understand the exponential
distribution for a given threshold parameter, and I agree with it.
So I added a decile percents output for the gaussian distribution
as well.
Here are some examples.

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --gaussian=20
~
decile percents: 0.0% 0.0% 0.0% 0.0% 50.0% 50.0% 0.0% 0.0% 0.0% 0.0%
~
[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --gaussian=10
~
decile percents: 0.0% 0.0% 0.0% 2.3% 47.7% 47.7% 2.3% 0.0% 0.0% 0.0%
~
[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --gaussian=5
~
decile percents: 0.0% 0.1% 2.1% 13.6% 34.1% 34.1% 13.6% 2.1% 0.1% 0.0%

I think that it is easier to understand than before. The decile percents sum to exactly 100%.

However, I don't like the "highest/lowest percent" output, because users
will confuse it with the decile percents, and nobody can understand these
figures.

Here is an example with --exponential=5:

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=5
~
decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%
highest/lowest percent of the range: 4.9% 0.0%
~

I could not understand "4.9%, 0.0%" when I first saw it.
Only after checking the source code did I understand it :( That is not good design...
#Why does this parameter use 100?
So I'd like to remove it, if you agree. It would be simpler.

The attached patch is the fixed version; please check it.
#Of course, the World Cup is being held now, so I'm in no hurry at all.

Best regards,
--
Mitsumasa KONDO

Attachments:

gaussian_and_exponential_pgbench_v14.patch (text/x-diff; charset=US-ASCII)

*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
***************
*** 41,46 ****
--- 41,47 ----
  #include <math.h>
  #include <signal.h>
  #include <sys/time.h>
+ #include <assert.h>
  #ifdef HAVE_SYS_SELECT_H
  #include <sys/select.h>
  #endif
***************
*** 98,103 **** static int	pthread_join(pthread_t th, void **thread_return);
--- 99,106 ----
  #define LOG_STEP_SECONDS	5	/* seconds between log messages */
  #define DEFAULT_NXACTS	10		/* default nxacts */
  
+ #define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+ 
  int			nxacts = 0;			/* number of transactions per client */
  int			duration = 0;		/* duration in seconds */
  
***************
*** 171,176 **** bool		is_connect;			/* establish connection for each transaction */
--- 174,187 ----
  bool		is_latencies;		/* report per-command latencies */
  int			main_pid;			/* main process id used in log filename */
  
+ /* gaussian distribution tests: */
+ double		stdev_threshold;   /* standard deviation threshold */
+ bool        use_gaussian = false;
+ 
+ /* exponential distribution tests: */
+ double		exp_threshold;   /* threshold for exponential */
+ bool		use_exponential = false;
+ 
  char	   *pghost = "";
  char	   *pgport = "";
  char	   *login = NULL;
***************
*** 332,337 **** static char *select_only = {
--- 343,430 ----
  	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
  };
  
+ /* --exponential case */
+ static char *exponential_tpc_b = {
+ 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ 	"\\setrandom bid 1 :nbranches\n"
+ 	"\\setrandom tid 1 :ntellers\n"
+ 	"\\setrandom delta -5000 5000\n"
+ 	"BEGIN;\n"
+ 	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ 	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+ 	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+ 	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ 	"END;\n"
+ };
+ 
+ /* --exponential with -N case */
+ static char *exponential_simple_update = {
+ 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ 	"\\setrandom bid 1 :nbranches\n"
+ 	"\\setrandom tid 1 :ntellers\n"
+ 	"\\setrandom delta -5000 5000\n"
+ 	"BEGIN;\n"
+ 	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ 	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ 	"END;\n"
+ };
+ 
+ /* --exponential with -S case */
+ static char *exponential_select_only = {
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ };
+ 
+ /* --gaussian case */
+ static char *gaussian_tpc_b = {
+ 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setrandom aid 1 :naccounts gaussian :stdev_threshold\n"
+ 	"\\setrandom bid 1 :nbranches\n"
+ 	"\\setrandom tid 1 :ntellers\n"
+ 	"\\setrandom delta -5000 5000\n"
+ 	"BEGIN;\n"
+ 	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ 	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+ 	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+ 	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ 	"END;\n"
+ };
+ 
+ /* --gaussian with -N case */
+ static char *gaussian_simple_update = {
+ 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setrandom aid 1 :naccounts gaussian :stdev_threshold\n"
+ 	"\\setrandom bid 1 :nbranches\n"
+ 	"\\setrandom tid 1 :ntellers\n"
+ 	"\\setrandom delta -5000 5000\n"
+ 	"BEGIN;\n"
+ 	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ 	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ 	"END;\n"
+ };
+ 
+ /* --gaussian with -S case */
+ static char *gaussian_select_only = {
+ 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ 	"\\setrandom aid 1 :naccounts gaussian :stdev_threshold\n"
+ 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ };
+ 
  /* Function prototypes */
  static void setalarm(int seconds);
  static void *threadRun(void *arg);
***************
*** 375,380 **** usage(void)
--- 468,475 ----
  		   "  -v, --vacuum-all         vacuum all four standard tables before tests\n"
  		   "  --aggregate-interval=NUM aggregate data over NUM seconds\n"
  		   "  --sampling-rate=NUM      fraction of transactions to log (e.g. 0.01 for 1%%)\n"
+ 		   "  --exponential=NUM        exponential distribution with NUM threshold parameter\n"
+ 		   "  --gaussian=NUM           gaussian distribution with NUM threshold parameter\n"
  		   "\nCommon options:\n"
  		   "  -d, --debug              print debugging output\n"
  	  "  -h, --host=HOSTNAME      database server host or socket directory\n"
***************
*** 471,476 **** getrand(TState *thread, int64 min, int64 max)
--- 566,641 ----
  	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
  }
  
+ /* 
+  * random number generator: exponential distribution from min to max inclusive.
+  * the threshold is so that the density of probability for the last cut-off max
+  * value is exp(-exp_threshold).
+  */
+ static int64
+ getExponentialrand(TState *thread, int64 min, int64 max, double exp_threshold)
+ {
+ 	double cut, uniform, rand;
+ 	assert(exp_threshold > 0.0);
+ 	cut = exp(-exp_threshold);
+ 	/* erand in [0, 1), uniform in (0, 1] */
+ 	uniform = 1.0 - pg_erand48(thread->random_state);
+ 	/*
+ 	 * inner expression in (cut, 1] (if exp_threshold > 0),
+ 	 * rand in [0, 1)
+ 	 */
+ 	assert((1.0 - cut) != 0.0);
+ 	rand = - log(cut + (1.0 - cut) * uniform) / exp_threshold;
+ 	/* return int64 random number between min and max */
+ 	return min + (int64)((max - min + 1) * rand);
+ }
+ 
+ /* random number generator: gaussian distribution from min to max inclusive */
+ static int64
+ getGaussianrand(TState *thread, int64 min, int64 max, double stdev_threshold)
+ {
+ 	double		stdev;
+ 	double		rand;
+ 
+ 	/*
+ 	 * Get user specified random number from this loop, with
+ 	 * -stdev_threshold <= stdev < stdev_threshold
+ 	 *
+ 	 * This loop is executed until the number is in the expected range.
+ 	 *
+ 	 * As the minimum threshold is 2.0, the probability of looping is low:
+ 	 * sqrt(-2 ln(r)) <= 2 => r >= e^{-2} ~ 0.135, then when taking the average
+ 	 * sine multiplier as 2/pi, we have an 8.6% looping probability in the
+ 	 * worst case. For a 5.0 threshold value, the looping probability
+ 	 * is about e^{-5} * 2 / pi ~ 0.43%.
+ 	 */
+ 	do
+ 	{
+ 		/*
+ 		 * pg_erand48 generates [0,1), but for the basic version of the
+ 		 * Box-Muller transform the two uniformly distributed random numbers
+ 		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+ 		 */
+ 		double rand1 = 1.0 - pg_erand48(thread->random_state);
+ 		double rand2 = 1.0 - pg_erand48(thread->random_state);
+ 
+ 		/* Box-Muller basic form transform */
+ 		double var_sqrt = sqrt(-2.0 * log(rand1));
+ 		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+ 
+ 		/* 
+  		 * we may try with cos, but there may be a bias induced if the previous
+ 		 * value fails the test? To be on the safe side, let us try over.
+ 		 */
+ 	}
+ 	while (stdev < -stdev_threshold || stdev >= stdev_threshold);
+ 
+ 	/* stdev is in [-threshold, threshold), normalization to [0,1) */
+ 	rand = (stdev + stdev_threshold) / (stdev_threshold * 2.0);
+ 
+ 	/* return int64 random number between min and max */
+ 	return min + (int64)((max - min + 1) * rand);
+ }
+ 
  /* call PQexec() and exit() on failure */
  static void
  executeStatement(PGconn *con, const char *sql)
***************
*** 1319,1324 **** top:
--- 1484,1490 ----
  			char	   *var;
  			int64		min,
  						max;
+ 			double		threshold = 0;
  			char		res[64];
  
  			if (*argv[2] == ':')
***************
*** 1364,1374 **** top:
  			}
  
  			/*
! 			 * getrand() needs to be able to subtract max from min and add one
! 			 * to the result without overflowing.  Since we know max > min, we
! 			 * can detect overflow just by checking for a negative result. But
! 			 * we must check both that the subtraction doesn't overflow, and
! 			 * that adding one to the result doesn't overflow either.
  			 */
  			if (max - min < 0 || (max - min) + 1 < 0)
  			{
--- 1530,1540 ----
  			}
  
  			/*
! 			 * Generate random number functions need to be able to subtract
! 			 * max from min and add one to the result without overflowing.
! 			 * Since we know max > min, we can detect overflow just by checking
! 			 * for a negative result. But we must check both that the subtraction
! 			 * doesn't overflow, and that adding one to the result doesn't overflow either.
  			 */
  			if (max - min < 0 || (max - min) + 1 < 0)
  			{
***************
*** 1377,1386 **** top:
  				return true;
  			}
  
  #ifdef DEBUG
! 			printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
  #endif
! 			snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
  
  			if (!putVariable(st, argv[0], argv[1], res))
  			{
--- 1543,1605 ----
  				return true;
  			}
  
+ 			if (argc == 4) /* uniform */
+ 			{
  #ifdef DEBUG
! 				printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
  #endif
! 				snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
! 			}
! 			else if ((pg_strcasecmp(argv[4], "gaussian") == 0) ||
! 				 (pg_strcasecmp(argv[4], "exponential") == 0))
! 			{
! 				if (*argv[5] == ':')
! 				{
! 					if ((var = getVariable(st, argv[5] + 1)) == NULL)
! 					{
! 						fprintf(stderr, "%s: invalid threshold number %s\n", argv[0], argv[5]);
! 						st->ecnt++;
! 						return true;
! 					}
! 					threshold = strtod(var, NULL);
! 				}
! 				else
! 					threshold = strtod(argv[5], NULL);
! 
! 				if (pg_strcasecmp(argv[4], "gaussian") == 0)
! 				{
! 					if (threshold < MIN_GAUSSIAN_THRESHOLD)
! 					{
! 						fprintf(stderr, "%s: gaussian threshold must be more than %f\n", argv[5], MIN_GAUSSIAN_THRESHOLD);
! 						st->ecnt++;
! 						return true;
! 					}
! #ifdef DEBUG
! 					printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getGaussianrand(thread, min, max, threshold));
! #endif
! 					snprintf(res, sizeof(res), INT64_FORMAT, getGaussianrand(thread, min, max, threshold));
! 				}
! 				else if (pg_strcasecmp(argv[4], "exponential") == 0)
! 				{
! 					if (threshold <= 0.0)
! 					{
! 						fprintf(stderr, "%s: exponential threshold must be strictly positive\n", argv[5]);
! 						st->ecnt++;
! 						return true;
! 					}
! #ifdef DEBUG
! 					printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getExponentialrand(thread, min, max, threshold));
! #endif
! 					snprintf(res, sizeof(res), INT64_FORMAT, getExponentialrand(thread, min, max, threshold));
! 				}
! 			}
! 			else /* uniform with extra arguments */
! 			{
! #ifdef DEBUG
! 				printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
! #endif
! 				snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
! 			}
  
  			if (!putVariable(st, argv[0], argv[1], res))
  			{
***************
*** 1920,1928 **** process_commands(char *buf)
  				exit(1);
  			}
  
! 			for (j = 4; j < my_commands->argc; j++)
! 				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
! 						my_commands->argv[0], my_commands->argv[j]);
  		}
  		else if (pg_strcasecmp(my_commands->argv[0], "set") == 0)
  		{
--- 2139,2172 ----
  				exit(1);
  			}
  
! 			if (my_commands->argc == 4 ) /* uniform */
! 			{
! 				/* nothing to do */
! 			}
! 			else if ((pg_strcasecmp(my_commands->argv[4], "gaussian") == 0) ||
! 				 (pg_strcasecmp(my_commands->argv[4], "exponential") == 0))
! 			{
! 				if (my_commands->argc < 6)
! 				{
! 					fprintf(stderr, "%s(%s): missing argument\n", my_commands->argv[0], my_commands->argv[4]);
! 					exit(1);
! 				}
! 
! 				for (j = 6; j < my_commands->argc; j++)
! 					fprintf(stderr, "%s(%s): extra argument \"%s\" ignored\n",
! 							my_commands->argv[0], my_commands->argv[4], my_commands->argv[j]);
! 			}
! 			else /* uniform with extra argument */
! 			{
! 				int arg_pos = 4;
! 
! 				if (pg_strcasecmp(my_commands->argv[4], "uniform") == 0)
! 					arg_pos++;
! 
! 				for (j = arg_pos; j < my_commands->argc; j++)
! 					fprintf(stderr, "%s(uniform): extra argument \"%s\" ignored\n",
! 							my_commands->argv[0], my_commands->argv[j]);
! 			}
  		}
  		else if (pg_strcasecmp(my_commands->argv[0], "set") == 0)
  		{
***************
*** 2178,2183 **** process_builtin(char *tb)
--- 2422,2439 ----
  	return my_commands;
  }
  
+ /* 
+  * compute the probability of the truncated exponential random generation
+  * to draw values in the i-th slot of the range.
+  */
+ static double exponentialProbability(int i, int slots, double threshold)
+ {
+ 	assert(1 <= i && i <= slots);
+ 	return (exp(- threshold * (i - 1) / slots) - exp(- threshold * i / slots)) /
+ 		(1.0 - exp(- threshold));
+ }
+ 
+ 
  /* print out results */
  static void
  printResults(int ttype, int64 normal_xacts, int nclients,
***************
*** 2197,2212 **** printResults(int ttype, int64 normal_xacts, int nclients,
  						(INSTR_TIME_GET_DOUBLE(conn_total_time) / nthreads));
  
  	if (ttype == 0)
! 		s = "TPC-B (sort of)";
  	else if (ttype == 2)
! 		s = "Update only pgbench_accounts";
  	else if (ttype == 1)
! 		s = "SELECT only";
  	else
  		s = "Custom query";
  
  	printf("transaction type: %s\n", s);
  	printf("scaling factor: %d\n", scale);
  	printf("query mode: %s\n", QUERYMODE[querymode]);
  	printf("number of clients: %d\n", nclients);
  	printf("number of threads: %d\n", nthreads);
--- 2453,2521 ----
  						(INSTR_TIME_GET_DOUBLE(conn_total_time) / nthreads));
  
  	if (ttype == 0)
! 	{
! 		if (use_gaussian)
! 			s = "Gaussian distribution TPC-B (sort of)";
! 		else if (use_exponential)
! 			s = "Exponential distribution TPC-B (sort of)";
! 		else
! 			s = "TPC-B (sort of)";
! 	}
  	else if (ttype == 2)
! 	{
! 		if (use_gaussian)
! 			s = "Gaussian distribution update only pgbench_accounts";
! 		else if (use_exponential)
! 			s = "Exponential distribution update only pgbench_accounts";
! 		else
! 			s = "Update only pgbench_accounts";
! 	}
  	else if (ttype == 1)
! 	{
! 		if (use_gaussian)
! 			s = "Gaussian distribution SELECT only";
! 		else if (use_exponential)
! 			s = "Exponential distribution SELECT only";
! 		else
! 			s = "SELECT only";
! 	}
  	else
  		s = "Custom query";
  
  	printf("transaction type: %s\n", s);
  	printf("scaling factor: %d\n", scale);
+ 
+ 	/* output in gaussian distribution benchmark */
+ 	if (use_gaussian)
+ 	{
+ 		int i;
+ 		printf("standard deviation threshold: %.5f\n", stdev_threshold);
+ 		printf("decile percents:");
+ 		for (i = 2; i <= 20; i = i + 2)
+ 			printf(" %.1f%%", (double) 50 * (erf (stdev_threshold * (1 - 0.1 * (i - 2)) / sqrt(2.0)) -
+ 				erf (stdev_threshold * (1 - 0.1 * i) / sqrt(2.0))) /
+ 				erf (stdev_threshold / sqrt(2.0)));
+ 		printf("\n");
+ //		printf("access probability of top 20%%, 10%% and 5%% records: %.5f %.5f %.5f\n",
+ //			(double) ((erf (stdev_threshold * 0.2 / sqrt(2.0))) / (erf (stdev_threshold / sqrt(2.0)))),
+ //			(double) ((erf (stdev_threshold * 0.1 / sqrt(2.0))) / (erf (stdev_threshold / sqrt(2.0)))),
+ //			(double) ((erf (stdev_threshold * 0.05 / sqrt(2.0))) / (erf (stdev_threshold / sqrt(2.0)))));
+ 	}
+ 	/* output in exponential distribution benchmark */
+ 	else if (use_exponential)
+ 	{
+ 		int i;
+ 		printf("exponential threshold: %.5f\n", exp_threshold);
+ 		printf("decile percents:");
+ 		for (i = 1; i <= 10; i++)
+ 			printf(" %.1f%%",
+ 				   100.0 * exponentialProbability(i, 10, exp_threshold));
+ 		printf("\n");
+ 		printf("highest/lowest percent of the range: %.1f%% %.1f%%\n",
+ 			   100.0 * exponentialProbability(1, 100, exp_threshold),
+ 			   100.0 * exponentialProbability(100, 100, exp_threshold));
+ 	}
+ 
  	printf("query mode: %s\n", QUERYMODE[querymode]);
  	printf("number of clients: %d\n", nclients);
  	printf("number of threads: %d\n", nthreads);
***************
*** 2337,2342 **** main(int argc, char **argv)
--- 2646,2653 ----
  		{"unlogged-tables", no_argument, &unlogged_tables, 1},
  		{"sampling-rate", required_argument, NULL, 4},
  		{"aggregate-interval", required_argument, NULL, 5},
+ 		{"gaussian", required_argument, NULL, 6},
+ 		{"exponential", required_argument, NULL, 7},
  		{"rate", required_argument, NULL, 'R'},
  		{NULL, 0, NULL, 0}
  	};
***************
*** 2617,2622 **** main(int argc, char **argv)
--- 2928,2952 ----
  				}
  #endif
  				break;
+ 			case 6:
+ 				use_gaussian = true;
+ 				stdev_threshold = atof(optarg);
+ 				if(stdev_threshold < MIN_GAUSSIAN_THRESHOLD)
+ 				{
+ 					fprintf(stderr, "--gaussian=NUM must be more than %f: %f\n",
+ 							MIN_GAUSSIAN_THRESHOLD, stdev_threshold);
+ 					exit(1);
+ 				}
+ 				break;
+ 			case 7:
+ 				use_exponential = true;
+ 				exp_threshold = atof(optarg);
+ 				if(exp_threshold <= 0.0)
+ 				{
+ 					fprintf(stderr, "--exponential=NUM must be more than 0.0\n");
+ 					exit(1);
+ 				}
+ 				break;
  			default:
  				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
  				exit(1);
***************
*** 2814,2819 **** main(int argc, char **argv)
--- 3144,3171 ----
  		}
  	}
  
+ 	/* set :stdev_threshold variable */
+ 	if(getVariable(&state[0], "stdev_threshold") == NULL)
+ 	{
+ 		snprintf(val, sizeof(val), "%lf", stdev_threshold);
+ 		for (i = 0; i < nclients; i++)
+ 		{
+ 			if (!putVariable(&state[i], "startup", "stdev_threshold", val))
+ 				exit(1);
+ 		}
+ 	}
+ 
+ 	/* set :exp_threshold variable */
+ 	if(getVariable(&state[0], "exp_threshold") == NULL)
+ 	{
+ 		snprintf(val, sizeof(val), "%lf", exp_threshold);
+ 		for (i = 0; i < nclients; i++)
+ 		{
+ 			if (!putVariable(&state[i], "startup", "exp_threshold", val))
+ 				exit(1);
+ 		}
+ 	}
+ 
  	if (!is_no_vacuum)
  	{
  		fprintf(stderr, "starting vacuum...");
***************
*** 2839,2855 **** main(int argc, char **argv)
  	switch (ttype)
  	{
  		case 0:
! 			sql_files[0] = process_builtin(tpc_b);
  			num_files = 1;
  			break;
  
  		case 1:
! 			sql_files[0] = process_builtin(select_only);
  			num_files = 1;
  			break;
  
  		case 2:
! 			sql_files[0] = process_builtin(simple_update);
  			num_files = 1;
  			break;
  
--- 3191,3222 ----
  	switch (ttype)
  	{
  		case 0:
! 			if (use_gaussian)
! 				sql_files[0] = process_builtin(gaussian_tpc_b);
! 			else if (use_exponential)
! 				sql_files[0] = process_builtin(exponential_tpc_b);
! 			else
! 				sql_files[0] = process_builtin(tpc_b);
  			num_files = 1;
  			break;
  
  		case 1:
! 			if (use_gaussian)
! 				sql_files[0] = process_builtin(gaussian_select_only);
! 			else if (use_exponential)
! 				sql_files[0] = process_builtin(exponential_select_only);
! 			else
! 				sql_files[0] = process_builtin(select_only);
  			num_files = 1;
  			break;
  
  		case 2:
! 			if (use_gaussian)
! 				sql_files[0] = process_builtin(gaussian_simple_update);
! 			else if (use_exponential)
! 				sql_files[0] = process_builtin(exponential_simple_update);
! 			else
! 				sql_files[0] = process_builtin(simple_update);
  			num_files = 1;
  			break;
  
*** a/doc/src/sgml/pgbench.sgml
--- b/doc/src/sgml/pgbench.sgml
***************
*** 307,312 **** pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
--- 307,327 ----
       </varlistentry>
  
       <varlistentry>
+       <term><option>--exponential</option><replaceable>threshold</></term>
+       <listitem>
+        <para>
+          Run exponential distribution pgbench test using this threshold parameter.
+          The threshold controls the distribution of access frequency on the
+          <structname>pgbench_accounts</> table.
+          See the <literal>\setrandom</> documentation below for details about
+          the impact of the threshold value.
+          When set, this option applies to all test variants (<option>-N</> for
+          skipping updates, or <option>-S</> for selects).
+        </para>
+       </listitem>
+      </varlistentry>
+ 
+      <varlistentry>
        <term><option>-f</option> <replaceable>filename</></term>
        <term><option>--file=</option><replaceable>filename</></term>
        <listitem>
***************
*** 320,325 **** pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
--- 335,355 ----
       </varlistentry>
  
       <varlistentry>
+       <term><option>--gaussian</option><replaceable>threshold</></term>
+       <listitem>
+        <para>
+          Run gaussian distribution pgbench test using this threshold parameter.
+          The threshold controls the distribution of access frequency on the
+          <structname>pgbench_accounts</> table.
+          See the <literal>\setrandom</> documentation below for details about
+          the impact of the threshold value.
+          When set, this option applies to all test variants (<option>-N</> for
+          skipping updates, or <option>-S</> for selects).
+        </para>
+       </listitem>
+      </varlistentry>
+ 
+      <varlistentry>
        <term><option>-j</option> <replaceable>threads</></term>
        <term><option>--jobs=</option><replaceable>threads</></term>
        <listitem>
***************
*** 748,755 **** pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
  
     <varlistentry>
      <term>
!      <literal>\setrandom <replaceable>varname</> <replaceable>min</> <replaceable>max</></literal>
!     </term>
  
      <listitem>
       <para>
--- 778,785 ----
  
     <varlistentry>
      <term>
!      <literal>\setrandom <replaceable>varname</> <replaceable>min</> <replaceable>max</> [ uniform | [ { gaussian | exponential } <replaceable>threshold</> ] ]</literal>
!      </term>
  
      <listitem>
       <para>
***************
*** 761,769 **** pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
       </para>
  
       <para>
        Example:
  <programlisting>
! \setrandom aid 1 :naccounts
  </programlisting></para>
      </listitem>
     </varlistentry>
--- 791,834 ----
       </para>
  
       <para>
+       The default random distribution is uniform. The gaussian and exponential
+       options allow changing the distribution. The mandatory
+       <replaceable>threshold</> double value controls the actual distribution.
+      </para>
+ 
+      <para>
+       With the gaussian option, the larger the <replaceable>threshold</>,
+       the more frequently values close to the middle of the interval are drawn,
+       and the less frequently values close to the <replaceable>min</> and
+       <replaceable>max</> bounds.
+       In other words, the larger the <replaceable>threshold</>,
+       the narrower the access range around the middle.
+       The smaller the threshold, the smoother the access pattern
+       distribution. The minimum threshold is 2.0 for performance.
+      </para>
+ 
+      <para>
+       With the exponential option, the <replaceable>threshold</> parameter
+       controls the distribution by truncating an exponential distribution at
+       a specific value, and then projecting onto integers between the bounds.
+       To be precise, the <replaceable>threshold</> is so that the density of
+       probability of the exponential distribution at the <replaceable>max</>
+       cut-off value is exp(-threshold), the density at the <replaceable>min</>
+       value being 1.
+       Intuitively, the larger the threshold, the more frequently values close to
+       <replaceable>min</> are accessed, and the less frequently values close to
+       <replaceable>max</> are accessed.
+       A crude approximation of the distribution is that the most frequent 1%
+       values are drawn <replaceable>threshold</>% of the time.
+       The closer to 0.0 the threshold, the flatter (more uniform) the access
+       distribution.
+       The threshold value must be strictly positive with the exponential option.
+      </para>
+ 
+      <para>
        Example:
  <programlisting>
! \setrandom aid 1 :naccounts gaussian 5.0
  </programlisting></para>
      </listitem>
     </varlistentry>

Fabien COELHO

coelho@cri.ensmp.fr

over 11 years ago

In reply to: Mitsumasa KONDO (#1)

1 attachment(s)

Hello Mitsumasa-san,

> I'm also interested in your "decile percents" output, as in the
> following:
> decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%

Sure, I'm really fine with that.

> I think that it is easier to understand than before. The decile percents sum to exactly 100%.

That's a good property:-)

> However, I don't like the "highest/lowest percent" output, because users
> will confuse it with the decile percents, and nobody can understand these
> figures. I could not understand "4.9%, 0.0%" when I first saw it.
> Only after checking the source code did I understand it :( That is not
> good design... #Why does this parameter use 100?

What else? People have ten fingers and like powers of 10, and are used to
percents?

> So I'd like to remove it, if you agree. It would be simpler.

I think that for the exponential distribution it helps, especially for
high thresholds, to have the lowest/highest percent density. For low
thresholds, the deciles are also definitely useful. So I'm fine with both
outputs as you have put them.

I have just updated the wording so that it may be clearer:

decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0%
probability of first/last percent of the range: 11.3% 0.0%

> The attached patch is the fixed version; please check it.

Attached is v15, which just fixes a typo and includes the above wording
update. I'm validating it for committers.

> #Of course, the World Cup is being held now, so I'm in no hurry at all.

I'm not a soccer kind of person, so it does not influence my
availability :-)

Suggested commit message:

Add drawing random integers with a Gaussian or truncated exponential
distribution to pgbench.

Test variants with these distributions are also provided and triggered
with options "--gaussian=..." and "--exponential=...".

Have a nice day/night,

--
Fabien.

Attachments:

gaussian_and_exponential_pgbench_v15.patch (text/x-diff)

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..3541b7e 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -41,6 +41,7 @@
 #include <math.h>
 #include <signal.h>
 #include <sys/time.h>
+#include <assert.h>
 #ifdef HAVE_SYS_SELECT_H
 #include <sys/select.h>
 #endif
@@ -98,6 +99,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -171,6 +174,14 @@ bool		is_connect;			/* establish connection for each transaction */
 bool		is_latencies;		/* report per-command latencies */
 int			main_pid;			/* main process id used in log filename */
 
+/* gaussian distribution tests: */
+double		stdev_threshold;   /* standard deviation threshold */
+bool        use_gaussian = false;
+
+/* exponential distribution tests: */
+double		exp_threshold;   /* threshold for exponential */
+bool		use_exponential = false;
+
 char	   *pghost = "";
 char	   *pgport = "";
 char	   *login = NULL;
@@ -332,6 +343,88 @@ static char *select_only = {
 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
 };
 
+/* --exponential case */
+static char *exponential_tpc_b = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --exponential with -N case */
+static char *exponential_simple_update = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --exponential with -S case */
+static char *exponential_select_only = {
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+};
+
+/* --gaussian case */
+static char *gaussian_tpc_b = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts gaussian :stdev_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --gaussian with -N case */
+static char *gaussian_simple_update = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts gaussian :stdev_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --gaussian with -S case */
+static char *gaussian_select_only = {
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts gaussian :stdev_threshold\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+};
+
 /* Function prototypes */
 static void setalarm(int seconds);
 static void *threadRun(void *arg);
@@ -375,6 +468,8 @@ usage(void)
 		   "  -v, --vacuum-all         vacuum all four standard tables before tests\n"
 		   "  --aggregate-interval=NUM aggregate data over NUM seconds\n"
 		   "  --sampling-rate=NUM      fraction of transactions to log (e.g. 0.01 for 1%%)\n"
+		   "  --exponential=NUM        exponential distribution with NUM threshold parameter\n"
+		   "  --gaussian=NUM           gaussian distribution with NUM threshold parameter\n"
 		   "\nCommon options:\n"
 		   "  -d, --debug              print debugging output\n"
 	  "  -h, --host=HOSTNAME      database server host or socket directory\n"
@@ -471,6 +566,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * The threshold is such that the probability density at the cut-off max
+ * value is exp(-exp_threshold).
+ */
+static int64
+getExponentialrand(TState *thread, int64 min, int64 max, double exp_threshold)
+{
+	double cut, uniform, rand;
+	assert(exp_threshold > 0.0);
+	cut = exp(-exp_threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread->random_state);
+	/*
+	 * inner expression in (cut, 1] (if exp_threshold > 0),
+	 * rand in [0, 1)
+	 */
+	assert((1.0 - cut) != 0.0);
+	rand = - log(cut + (1.0 - cut) * uniform) / exp_threshold;
+	/* return an int64 random number between min and max inclusive */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianrand(TState *thread, int64 min, int64 max, double stdev_threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -stdev_threshold <= stdev < stdev_threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) <= 2 => r >= e^{-2} ~ 0.135, then when taking the average
+	 * sine multiplier as 2/pi, we have an 8.6% looping probability in the
+	 * worst case. For a 5.0 threshold value, the looping probability
+	 * is about e^{-5} * 2 / pi ~ 0.43%.
+	 */
+	do
+	{
+		/*
+		 * pg_erand48 generates [0,1), but for the basic version of the
+		 * Box-Muller transform the two uniformly distributed random numbers
+		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+		 */
+		double rand1 = 1.0 - pg_erand48(thread->random_state);
+		double rand2 = 1.0 - pg_erand48(thread->random_state);
+
+		/* Box-Muller basic form transform */
+		double var_sqrt = sqrt(-2.0 * log(rand1));
+		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+		/*
+		 * We could retry with cos, but that might induce a bias if the previous
+		 * value failed the test. To be on the safe side, draw again.
+		 */
+	}
+	while (stdev < -stdev_threshold || stdev >= stdev_threshold);
+
+	/* stdev is in [-threshold, threshold), normalization to [0,1) */
+	rand = (stdev + stdev_threshold) / (stdev_threshold * 2.0);
+
+	/* return an int64 random number between min and max inclusive */
+	return min + (int64)((max - min + 1) * rand);
+}
+
 /* call PQexec() and exit() on failure */
 static void
 executeStatement(PGconn *con, const char *sql)
@@ -1319,6 +1484,7 @@ top:
 			char	   *var;
 			int64		min,
 						max;
+			double		threshold = 0;
 			char		res[64];
 
 			if (*argv[2] == ':')
@@ -1364,11 +1530,11 @@ top:
 			}
 
 			/*
-			 * getrand() needs to be able to subtract max from min and add one
-			 * to the result without overflowing.  Since we know max > min, we
-			 * can detect overflow just by checking for a negative result. But
-			 * we must check both that the subtraction doesn't overflow, and
-			 * that adding one to the result doesn't overflow either.
+			 * The random number generation functions need to be able to
+			 * subtract min from max and add one to the result without
+			 * overflowing.  Since we know max > min, we can detect overflow just
+			 * by checking for a negative result. But we must check both that the
+			 * subtraction doesn't overflow, and that adding one doesn't either.
 			 */
 			if (max - min < 0 || (max - min) + 1 < 0)
 			{
@@ -1377,10 +1543,63 @@ top:
 				return true;
 			}
 
+			if (argc == 4) /* uniform */
+			{
+#ifdef DEBUG
+				printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
+#endif
+				snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+			}
+			else if ((pg_strcasecmp(argv[4], "gaussian") == 0) ||
+				 (pg_strcasecmp(argv[4], "exponential") == 0))
+			{
+				if (*argv[5] == ':')
+				{
+					if ((var = getVariable(st, argv[5] + 1)) == NULL)
+					{
+						fprintf(stderr, "%s: invalid threshold number %s\n", argv[0], argv[5]);
+						st->ecnt++;
+						return true;
+					}
+					threshold = strtod(var, NULL);
+				}
+				else
+					threshold = strtod(argv[5], NULL);
+
+				if (pg_strcasecmp(argv[4], "gaussian") == 0)
+				{
+					if (threshold < MIN_GAUSSIAN_THRESHOLD)
+					{
+						fprintf(stderr, "%s: gaussian threshold must be at least %f\n", argv[5], MIN_GAUSSIAN_THRESHOLD);
+						st->ecnt++;
+						return true;
+					}
+#ifdef DEBUG
+					printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getGaussianrand(thread, min, max, threshold));
+#endif
+					snprintf(res, sizeof(res), INT64_FORMAT, getGaussianrand(thread, min, max, threshold));
+				}
+				else if (pg_strcasecmp(argv[4], "exponential") == 0)
+				{
+					if (threshold <= 0.0)
+					{
+						fprintf(stderr, "%s: exponential threshold must be strictly positive\n", argv[5]);
+						st->ecnt++;
+						return true;
+					}
 #ifdef DEBUG
-			printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
+					printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getExponentialrand(thread, min, max, threshold));
 #endif
-			snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+					snprintf(res, sizeof(res), INT64_FORMAT, getExponentialrand(thread, min, max, threshold));
+				}
+			}
+			else /* uniform with extra arguments */
+			{
+#ifdef DEBUG
+				printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
+#endif
+				snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+			}
 
 			if (!putVariable(st, argv[0], argv[1], res))
 			{
@@ -1920,9 +2139,34 @@ process_commands(char *buf)
 				exit(1);
 			}
 
-			for (j = 4; j < my_commands->argc; j++)
-				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
-						my_commands->argv[0], my_commands->argv[j]);
+			if (my_commands->argc == 4) /* uniform */
+			{
+				/* nothing to do */
+			}
+			else if ((pg_strcasecmp(my_commands->argv[4], "gaussian") == 0) ||
+				 (pg_strcasecmp(my_commands->argv[4], "exponential") == 0))
+			{
+				if (my_commands->argc < 6)
+				{
+					fprintf(stderr, "%s(%s): missing argument\n", my_commands->argv[0], my_commands->argv[4]);
+					exit(1);
+				}
+
+				for (j = 6; j < my_commands->argc; j++)
+					fprintf(stderr, "%s(%s): extra argument \"%s\" ignored\n",
+							my_commands->argv[0], my_commands->argv[4], my_commands->argv[j]);
+			}
+			else /* uniform with extra argument */
+			{
+				int arg_pos = 4;
+
+				if (pg_strcasecmp(my_commands->argv[4], "uniform") == 0)
+					arg_pos++;
+
+				for (j = arg_pos; j < my_commands->argc; j++)
+					fprintf(stderr, "%s(uniform): extra argument \"%s\" ignored\n",
+							my_commands->argv[0], my_commands->argv[j]);
+			}
 		}
 		else if (pg_strcasecmp(my_commands->argv[0], "set") == 0)
 		{
@@ -2178,6 +2422,18 @@ process_builtin(char *tb)
 	return my_commands;
 }
 
+/*
+ * compute the probability of the truncated exponential random generation
+ * to draw values in the i-th slot of the range.
+ */
+static double exponentialProbability(int i, int slots, double threshold)
+{
+	assert(1 <= i && i <= slots);
+	return (exp(- threshold * (i - 1) / slots) - exp(- threshold * i / slots)) /
+		(1.0 - exp(- threshold));
+}
+
+
 /* print out results */
 static void
 printResults(int ttype, int64 normal_xacts, int nclients,
@@ -2197,16 +2453,69 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 						(INSTR_TIME_GET_DOUBLE(conn_total_time) / nthreads));
 
 	if (ttype == 0)
-		s = "TPC-B (sort of)";
+	{
+		if (use_gaussian)
+			s = "Gaussian distribution TPC-B (sort of)";
+		else if (use_exponential)
+			s = "Exponential distribution TPC-B (sort of)";
+		else
+			s = "TPC-B (sort of)";
+	}
 	else if (ttype == 2)
-		s = "Update only pgbench_accounts";
+	{
+		if (use_gaussian)
+			s = "Gaussian distribution update only pgbench_accounts";
+		else if (use_exponential)
+			s = "Exponential distribution update only pgbench_accounts";
+		else
+			s = "Update only pgbench_accounts";
+	}
 	else if (ttype == 1)
-		s = "SELECT only";
+	{
+		if (use_gaussian)
+			s = "Gaussian distribution SELECT only";
+		else if (use_exponential)
+			s = "Exponential distribution SELECT only";
+		else
+			s = "SELECT only";
+	}
 	else
 		s = "Custom query";
 
 	printf("transaction type: %s\n", s);
 	printf("scaling factor: %d\n", scale);
+
+	/* output in gaussian distribution benchmark */
+	if (use_gaussian)
+	{
+		int i;
+		printf("standard deviation threshold: %.5f\n", stdev_threshold);
+		printf("decile percents:");
+		for (i = 2; i <= 20; i = i + 2)
+			printf(" %.1f%%", (double) 50 * (erf (stdev_threshold * (1 - 0.1 * (i - 2)) / sqrt(2.0)) -
+				erf (stdev_threshold * (1 - 0.1 * i) / sqrt(2.0))) /
+				erf (stdev_threshold / sqrt(2.0)));
+		printf("\n");
+//		printf("access probability of top 20%%, 10%% and 5%% records: %.5f %.5f %.5f\n",
+//			(double) ((erf (stdev_threshold * 0.2 / sqrt(2.0))) / (erf (stdev_threshold / sqrt(2.0)))),
+//			(double) ((erf (stdev_threshold * 0.1 / sqrt(2.0))) / (erf (stdev_threshold / sqrt(2.0)))),
+//			(double) ((erf (stdev_threshold * 0.05 / sqrt(2.0))) / (erf (stdev_threshold / sqrt(2.0)))));
+	}
+	/* output in exponential distribution benchmark */
+	else if (use_exponential)
+	{
+		int i;
+		printf("exponential threshold: %.5f\n", exp_threshold);
+		printf("decile percents:");
+		for (i = 1; i <= 10; i++)
+			printf(" %.1f%%",
+				   100.0 * exponentialProbability(i, 10, exp_threshold));
+		printf("\n");
+		printf("highest/lowest percent of the range: %.1f%% %.1f%%\n",
+			   100.0 * exponentialProbability(1, 100, exp_threshold),
+			   100.0 * exponentialProbability(100, 100, exp_threshold));
+	}
+
 	printf("query mode: %s\n", QUERYMODE[querymode]);
 	printf("number of clients: %d\n", nclients);
 	printf("number of threads: %d\n", nthreads);
@@ -2337,6 +2646,8 @@ main(int argc, char **argv)
 		{"unlogged-tables", no_argument, &unlogged_tables, 1},
 		{"sampling-rate", required_argument, NULL, 4},
 		{"aggregate-interval", required_argument, NULL, 5},
+		{"gaussian", required_argument, NULL, 6},
+		{"exponential", required_argument, NULL, 7},
 		{"rate", required_argument, NULL, 'R'},
 		{NULL, 0, NULL, 0}
 	};
@@ -2617,6 +2928,25 @@ main(int argc, char **argv)
 				}
 #endif
 				break;
+			case 6:
+				use_gaussian = true;
+				stdev_threshold = atof(optarg);
+				if(stdev_threshold < MIN_GAUSSIAN_THRESHOLD)
+				{
+					fprintf(stderr, "--gaussian=NUM must be at least %f: %f\n",
+							MIN_GAUSSIAN_THRESHOLD, stdev_threshold);
+					exit(1);
+				}
+				break;
+			case 7:
+				use_exponential = true;
+				exp_threshold = atof(optarg);
+				if(exp_threshold <= 0.0)
+				{
+					fprintf(stderr, "--exponential=NUM must be more than 0.0\n");
+					exit(1);
+				}
+				break;
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 				exit(1);
@@ -2814,6 +3144,28 @@ main(int argc, char **argv)
 		}
 	}
 
+	/* set :stdev_threshold variable */
+	if(getVariable(&state[0], "stdev_threshold") == NULL)
+	{
+		snprintf(val, sizeof(val), "%lf", stdev_threshold);
+		for (i = 0; i < nclients; i++)
+		{
+			if (!putVariable(&state[i], "startup", "stdev_threshold", val))
+				exit(1);
+		}
+	}
+
+	/* set :exp_threshold variable */
+	if(getVariable(&state[0], "exp_threshold") == NULL)
+	{
+		snprintf(val, sizeof(val), "%lf", exp_threshold);
+		for (i = 0; i < nclients; i++)
+		{
+			if (!putVariable(&state[i], "startup", "exp_threshold", val))
+				exit(1);
+		}
+	}
+
 	if (!is_no_vacuum)
 	{
 		fprintf(stderr, "starting vacuum...");
@@ -2839,17 +3191,32 @@ main(int argc, char **argv)
 	switch (ttype)
 	{
 		case 0:
-			sql_files[0] = process_builtin(tpc_b);
+			if (use_gaussian)
+				sql_files[0] = process_builtin(gaussian_tpc_b);
+			else if (use_exponential)
+				sql_files[0] = process_builtin(exponential_tpc_b);
+			else
+				sql_files[0] = process_builtin(tpc_b);
 			num_files = 1;
 			break;
 
 		case 1:
-			sql_files[0] = process_builtin(select_only);
+			if (use_gaussian)
+				sql_files[0] = process_builtin(gaussian_select_only);
+			else if (use_exponential)
+				sql_files[0] = process_builtin(exponential_select_only);
+			else
+				sql_files[0] = process_builtin(select_only);
 			num_files = 1;
 			break;
 
 		case 2:
-			sql_files[0] = process_builtin(simple_update);
+			if (use_gaussian)
+				sql_files[0] = process_builtin(gaussian_simple_update);
+			else if (use_exponential)
+				sql_files[0] = process_builtin(exponential_simple_update);
+			else
+				sql_files[0] = process_builtin(simple_update);
 			num_files = 1;
 			break;
 
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index 4367563..3a561a9 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -307,6 +307,21 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </varlistentry>
 
      <varlistentry>
+      <term><option>--exponential=</option><replaceable>threshold</></term>
+      <listitem>
+       <para>
+         Run exponential distribution pgbench test using this threshold parameter.
+         The threshold controls the distribution of access frequency on the
+         <structname>pgbench_accounts</> table.
+         See the <literal>\setrandom</> documentation below for details about
+         the impact of the threshold value.
+         When set, this option applies to all test variants (<option>-N</> for
+         skipping updates, or <option>-S</> for selects).
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-f</option> <replaceable>filename</></term>
       <term><option>--file=</option><replaceable>filename</></term>
       <listitem>
@@ -320,6 +335,21 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </varlistentry>
 
      <varlistentry>
+      <term><option>--gaussian=</option><replaceable>threshold</></term>
+      <listitem>
+       <para>
+         Run gaussian distribution pgbench test using this threshold parameter.
+         The threshold controls the distribution of access frequency on the
+         <structname>pgbench_accounts</> table.
+         See the <literal>\setrandom</> documentation below for details about
+         the impact of the threshold value.
+         When set, this option applies to all test variants (<option>-N</> for
+         skipping updates, or <option>-S</> for selects).
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-j</option> <replaceable>threads</></term>
       <term><option>--jobs=</option><replaceable>threads</></term>
       <listitem>
@@ -748,8 +778,8 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
 
    <varlistentry>
     <term>
-     <literal>\setrandom <replaceable>varname</> <replaceable>min</> <replaceable>max</></literal>
-    </term>
+     <literal>\setrandom <replaceable>varname</> <replaceable>min</> <replaceable>max</> [ uniform | [ { gaussian | exponential } <replaceable>threshold</> ] ]</literal>
+     </term>
 
     <listitem>
      <para>
@@ -761,9 +791,44 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </para>
 
      <para>
+      The default random distribution is uniform. The gaussian and exponential
+      options select a different distribution. The mandatory
+      <replaceable>threshold</> double value controls the actual distribution.
+     </para>
+
+     <para>
+      With the gaussian option, the larger the <replaceable>threshold</>,
+      the more frequently values close to the middle of the interval are drawn,
+      and the less frequently values close to the <replaceable>min</> and
+      <replaceable>max</> bounds.
+      In other words, the larger the <replaceable>threshold</>,
+      the narrower the access range around the middle.
+      The smaller the threshold, the smoother the access pattern
+      distribution. The minimum threshold is 2.0 for performance reasons.
+     </para>
+
+     <para>
+      With the exponential option, the <replaceable>threshold</> parameter
+      controls the distribution by truncating an exponential distribution at
+      a specific value, and then projecting onto integers between the bounds.
+      To be precise, the <replaceable>threshold</> is such that the density of
+      probability of the exponential distribution at the <replaceable>max</>
+      cut-off value is exp(-threshold), the density at the <replaceable>min</>
+      value being 1.
+      Intuitively, the larger the threshold, the more frequently values close to
+      <replaceable>min</> are accessed, and the less frequently values close to
+      <replaceable>max</> are accessed.
+      A crude approximation of the distribution is that the most frequent 1%
+      of values are drawn <replaceable>threshold</>% of the time.
+      The closer to 0.0 the threshold, the flatter (more uniform) the access
+      distribution.
+      The threshold value must be strictly positive with the exponential option.
+     </para>
+
+     <para>
       Example:
 <programlisting>
-\setrandom aid 1 :naccounts
+\setrandom aid 1 :naccounts gaussian 5.0
 </programlisting></para>
     </listitem>
    </varlistentry>

Fabien COELHO

coelho@cri.ensmp.fr

over 11 years ago

In reply to: Fabien COELHO (#2)

1 attachment(s)

I have just updated the wording so that it may be clearer:

Oops, I have sent the wrong patch, without the wording fix. Here is the
real updated version, which I tested.

probability of first/last percent of the range: 11.3% 0.0%
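
As a reading aid for the two numbers in that line: assuming it is still computed as in the earlier patch, that is with exponentialProbability(1, 100, t) and exponentialProbability(100, 100, t) for a threshold t, the values are the shares of draws landing in the first and last hundredth of the range. The symbols p_first and p_last are just labels for this note.

```latex
p_{\text{first}} = \frac{1 - e^{-t/100}}{1 - e^{-t}},
\qquad
p_{\text{last}} = \frac{e^{-99t/100} - e^{-t}}{1 - e^{-t}}

% Worked check for t = 5:
%   p_first = (1 - e^{-0.05}) / (1 - e^{-5}) \approx 0.0488 / 0.9933 \approx 0.049
%   p_last  = (e^{-4.95} - e^{-5}) / (1 - e^{-5}) \approx 0.00035
```

For t = 5 this gives about 4.9% and 0.0%, matching the output for --exponential=5 quoted earlier in the thread. Since 1 - e^{-t/100} is close to t/100 when t/100 is small, p_first is roughly t percent, which is the "crude approximation" given in the documentation part of the patch. The threshold behind the 11.3% figure is not stated here; a value near 12 would produce it.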

--
Fabien.

Attachments:

gaussian_and_exponential_pgbench_v15b.patch (text/x-diff)

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..f8ad17e 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -41,6 +41,7 @@
 #include <math.h>
 #include <signal.h>
 #include <sys/time.h>
+#include <assert.h>
 #ifdef HAVE_SYS_SELECT_H
 #include <sys/select.h>
 #endif
@@ -98,6 +99,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -171,6 +174,14 @@ bool		is_connect;			/* establish connection for each transaction */
 bool		is_latencies;		/* report per-command latencies */
 int			main_pid;			/* main process id used in log filename */
 
+/* gaussian distribution tests: */
+double		stdev_threshold;   /* standard deviation threshold */
+bool		use_gaussian = false;
+
+/* exponential distribution tests: */
+double		exp_threshold;   /* threshold for exponential */
+bool		use_exponential = false;
+
 char	   *pghost = "";
 char	   *pgport = "";
 char	   *login = NULL;
@@ -332,6 +343,88 @@ static char *select_only = {
 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
 };
 
+/* --exponential case */
+static char *exponential_tpc_b = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --exponential with -N case */
+static char *exponential_simple_update = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --exponential with -S case */
+static char *exponential_select_only = {
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+};
+
+/* --gaussian case */
+static char *gaussian_tpc_b = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts gaussian :stdev_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --gaussian with -N case */
+static char *gaussian_simple_update = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts gaussian :stdev_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --gaussian with -S case */
+static char *gaussian_select_only = {
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts gaussian :stdev_threshold\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+};
+
 /* Function prototypes */
 static void setalarm(int seconds);
 static void *threadRun(void *arg);
@@ -375,6 +468,8 @@ usage(void)
 		   "  -v, --vacuum-all         vacuum all four standard tables before tests\n"
 		   "  --aggregate-interval=NUM aggregate data over NUM seconds\n"
 		   "  --sampling-rate=NUM      fraction of transactions to log (e.g. 0.01 for 1%%)\n"
+		   "  --exponential=NUM        exponential distribution with NUM threshold parameter\n"
+		   "  --gaussian=NUM           gaussian distribution with NUM threshold parameter\n"
 		   "\nCommon options:\n"
 		   "  -d, --debug              print debugging output\n"
 	  "  -h, --host=HOSTNAME      database server host or socket directory\n"
@@ -471,6 +566,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * The threshold is such that the probability density at the cut-off max
+ * value is exp(-exp_threshold).
+ */
+static int64
+getExponentialrand(TState *thread, int64 min, int64 max, double exp_threshold)
+{
+	double cut, uniform, rand;
+	assert(exp_threshold > 0.0);
+	cut = exp(-exp_threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread->random_state);
+	/*
+	 * inner expression in (cut, 1] (if exp_threshold > 0),
+	 * rand in [0, 1)
+	 */
+	assert((1.0 - cut) != 0.0);
+	rand = - log(cut + (1.0 - cut) * uniform) / exp_threshold;
+	/* return an int64 random number between min and max inclusive */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianrand(TState *thread, int64 min, int64 max, double stdev_threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -stdev_threshold <= stdev < stdev_threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) <= 2 => r >= e^{-2} ~ 0.135, then when taking the average
+	 * sine multiplier as 2/pi, we have an 8.6% looping probability in the
+	 * worst case. For a 5.0 threshold value, the looping probability
+	 * is about e^{-5} * 2 / pi ~ 0.43%.
+	 */
+	do
+	{
+		/*
+		 * pg_erand48 generates [0,1), but for the basic version of the
+		 * Box-Muller transform the two uniformly distributed random numbers
+		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+		 */
+		double rand1 = 1.0 - pg_erand48(thread->random_state);
+		double rand2 = 1.0 - pg_erand48(thread->random_state);
+
+		/* Box-Muller basic form transform */
+		double var_sqrt = sqrt(-2.0 * log(rand1));
+		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+		/*
+		 * We could retry with cos, but that might induce a bias if the previous
+		 * value failed the test. To be on the safe side, draw again.
+		 */
+	}
+	while (stdev < -stdev_threshold || stdev >= stdev_threshold);
+
+	/* stdev is in [-threshold, threshold), normalization to [0,1) */
+	rand = (stdev + stdev_threshold) / (stdev_threshold * 2.0);
+
+	/* return an int64 random number between min and max inclusive */
+	return min + (int64)((max - min + 1) * rand);
+}
+
 /* call PQexec() and exit() on failure */
 static void
 executeStatement(PGconn *con, const char *sql)
@@ -1319,6 +1484,7 @@ top:
 			char	   *var;
 			int64		min,
 						max;
+			double		threshold = 0;
 			char		res[64];
 
 			if (*argv[2] == ':')
@@ -1364,11 +1530,11 @@ top:
 			}
 
 			/*
-			 * getrand() needs to be able to subtract max from min and add one
-			 * to the result without overflowing.  Since we know max > min, we
-			 * can detect overflow just by checking for a negative result. But
-			 * we must check both that the subtraction doesn't overflow, and
-			 * that adding one to the result doesn't overflow either.
+			 * The random number generation functions need to be able to
+			 * subtract min from max and add one to the result without
+			 * overflowing.  Since we know max > min, we can detect overflow just
+			 * by checking for a negative result. But we must check both that the
+			 * subtraction doesn't overflow, and that adding one doesn't either.
 			 */
 			if (max - min < 0 || (max - min) + 1 < 0)
 			{
@@ -1377,10 +1543,63 @@ top:
 				return true;
 			}
 
+			if (argc == 4) /* uniform */
+			{
+#ifdef DEBUG
+				printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
+#endif
+				snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+			}
+			else if ((pg_strcasecmp(argv[4], "gaussian") == 0) ||
+				 (pg_strcasecmp(argv[4], "exponential") == 0))
+			{
+				if (*argv[5] == ':')
+				{
+					if ((var = getVariable(st, argv[5] + 1)) == NULL)
+					{
+						fprintf(stderr, "%s: invalid threshold number %s\n", argv[0], argv[5]);
+						st->ecnt++;
+						return true;
+					}
+					threshold = strtod(var, NULL);
+				}
+				else
+					threshold = strtod(argv[5], NULL);
+
+				if (pg_strcasecmp(argv[4], "gaussian") == 0)
+				{
+					if (threshold < MIN_GAUSSIAN_THRESHOLD)
+					{
+						fprintf(stderr, "%s: gaussian threshold must be at least %f\n", argv[5], MIN_GAUSSIAN_THRESHOLD);
+						st->ecnt++;
+						return true;
+					}
+#ifdef DEBUG
+					printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getGaussianrand(thread, min, max, threshold));
+#endif
+					snprintf(res, sizeof(res), INT64_FORMAT, getGaussianrand(thread, min, max, threshold));
+				}
+				else if (pg_strcasecmp(argv[4], "exponential") == 0)
+				{
+					if (threshold <= 0.0)
+					{
+						fprintf(stderr, "%s: exponential threshold must be strictly positive\n", argv[5]);
+						st->ecnt++;
+						return true;
+					}
 #ifdef DEBUG
-			printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
+					printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getExponentialrand(thread, min, max, threshold));
 #endif
-			snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+					snprintf(res, sizeof(res), INT64_FORMAT, getExponentialrand(thread, min, max, threshold));
+				}
+			}
+			else /* uniform with extra arguments */
+			{
+#ifdef DEBUG
+				printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
+#endif
+				snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+			}
 
 			if (!putVariable(st, argv[0], argv[1], res))
 			{
@@ -1920,9 +2139,34 @@ process_commands(char *buf)
 				exit(1);
 			}
 
-			for (j = 4; j < my_commands->argc; j++)
-				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
-						my_commands->argv[0], my_commands->argv[j]);
+			if (my_commands->argc == 4 ) /* uniform */
+			{
+				/* nothing to do */
+			}
+			else if ((pg_strcasecmp(my_commands->argv[4], "gaussian") == 0) ||
+				 (pg_strcasecmp(my_commands->argv[4], "exponential") == 0))
+			{
+				if (my_commands->argc < 6)
+				{
+					fprintf(stderr, "%s(%s): missing argument\n", my_commands->argv[0], my_commands->argv[4]);
+					exit(1);
+				}
+
+				for (j = 6; j < my_commands->argc; j++)
+					fprintf(stderr, "%s(%s): extra argument \"%s\" ignored\n",
+							my_commands->argv[0], my_commands->argv[4], my_commands->argv[j]);
+			}
+			else /* uniform with extra argument */
+			{
+				int arg_pos = 4;
+
+				if (pg_strcasecmp(my_commands->argv[4], "uniform") == 0)
+					arg_pos++;
+
+				for (j = arg_pos; j < my_commands->argc; j++)
+					fprintf(stderr, "%s(uniform): extra argument \"%s\" ignored\n",
+							my_commands->argv[0], my_commands->argv[j]);
+			}
 		}
 		else if (pg_strcasecmp(my_commands->argv[0], "set") == 0)
 		{
@@ -2178,6 +2422,18 @@ process_builtin(char *tb)
 	return my_commands;
 }
 
+/*
+ * compute the probability of the truncated exponential random generation
+ * to draw values in the i-th slot of the range.
+ */
+static double exponentialProbability(int i, int slots, double threshold)
+{
+	assert(1 <= i && i <= slots);
+	return (exp(- threshold * (i - 1) / slots) - exp(- threshold * i / slots)) /
+		(1.0 - exp(- threshold));
+}
+
+
 /* print out results */
 static void
 printResults(int ttype, int64 normal_xacts, int nclients,
@@ -2197,16 +2453,69 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 						(INSTR_TIME_GET_DOUBLE(conn_total_time) / nthreads));
 
 	if (ttype == 0)
-		s = "TPC-B (sort of)";
+	{
+		if (use_gaussian)
+			s = "Gaussian distribution TPC-B (sort of)";
+		else if (use_exponential)
+			s = "Exponential distribution TPC-B (sort of)";
+		else
+			s = "TPC-B (sort of)";
+	}
 	else if (ttype == 2)
-		s = "Update only pgbench_accounts";
+	{
+		if (use_gaussian)
+			s = "Gaussian distribution update only pgbench_accounts";
+		else if (use_exponential)
+			s = "Exponential distribution update only pgbench_accounts";
+		else
+			s = "Update only pgbench_accounts";
+	}
 	else if (ttype == 1)
-		s = "SELECT only";
+	{
+		if (use_gaussian)
+			s = "Gaussian distribution SELECT only";
+		else if (use_exponential)
+			s = "Exponential distribution SELECT only";
+		else
+			s = "SELECT only";
+	}
 	else
 		s = "Custom query";
 
 	printf("transaction type: %s\n", s);
 	printf("scaling factor: %d\n", scale);
+
+	/* output in gaussian distribution benchmark */
+	if (use_gaussian)
+	{
+		int i;
+		printf("standard deviation threshold: %.5f\n", stdev_threshold);
+		printf("decile percents:");
+		for (i = 2; i <= 20; i = i + 2)
+			printf(" %.1f%%", (double) 50 * (erf (stdev_threshold * (1 - 0.1 * (i - 2)) / sqrt(2.0)) -
+				erf (stdev_threshold * (1 - 0.1 * i) / sqrt(2.0))) /
+				erf (stdev_threshold / sqrt(2.0)));
+		printf("\n");
+//		printf("access probability of top 20%%, 10%% and 5%% records: %.5f %.5f %.5f\n",
+//			(double) ((erf (stdev_threshold * 0.2 / sqrt(2.0))) / (erf (stdev_threshold / sqrt(2.0)))),
+//			(double) ((erf (stdev_threshold * 0.1 / sqrt(2.0))) / (erf (stdev_threshold / sqrt(2.0)))),
+//			(double) ((erf (stdev_threshold * 0.05 / sqrt(2.0))) / (erf (stdev_threshold / sqrt(2.0)))));
+	}
+	/* output in exponential distribution benchmark */
+	else if (use_exponential)
+	{
+		int i;
+		printf("exponential threshold: %.5f\n", exp_threshold);
+		printf("decile percents:");
+		for (i = 1; i <= 10; i++)
+			printf(" %.1f%%",
+				   100.0 * exponentialProbability(i, 10, exp_threshold));
+		printf("\n");
+		printf("probability of first/last percent of the range: %.1f%% %.1f%%\n",
+			   100.0 * exponentialProbability(1, 100, exp_threshold),
+			   100.0 * exponentialProbability(100, 100, exp_threshold));
+	}
+
 	printf("query mode: %s\n", QUERYMODE[querymode]);
 	printf("number of clients: %d\n", nclients);
 	printf("number of threads: %d\n", nthreads);
@@ -2337,6 +2646,8 @@ main(int argc, char **argv)
 		{"unlogged-tables", no_argument, &unlogged_tables, 1},
 		{"sampling-rate", required_argument, NULL, 4},
 		{"aggregate-interval", required_argument, NULL, 5},
+		{"gaussian", required_argument, NULL, 6},
+		{"exponential", required_argument, NULL, 7},
 		{"rate", required_argument, NULL, 'R'},
 		{NULL, 0, NULL, 0}
 	};
@@ -2617,6 +2928,25 @@ main(int argc, char **argv)
 				}
 #endif
 				break;
+			case 6:
+				use_gaussian = true;
+				stdev_threshold = atof(optarg);
+				if(stdev_threshold < MIN_GAUSSIAN_THRESHOLD)
+				{
+					fprintf(stderr, "--gaussian=NUM must be more than %f: %f\n",
+							MIN_GAUSSIAN_THRESHOLD, stdev_threshold);
+					exit(1);
+				}
+				break;
+			case 7:
+				use_exponential = true;
+				exp_threshold = atof(optarg);
+				if(exp_threshold <= 0.0)
+				{
+					fprintf(stderr, "--exponential=NUM must be more than 0.0\n");
+					exit(1);
+				}
+				break;
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 				exit(1);
@@ -2814,6 +3144,28 @@ main(int argc, char **argv)
 		}
 	}
 
+	/* set :stdev_threshold variable */
+	if(getVariable(&state[0], "stdev_threshold") == NULL)
+	{
+		snprintf(val, sizeof(val), "%lf", stdev_threshold);
+		for (i = 0; i < nclients; i++)
+		{
+			if (!putVariable(&state[i], "startup", "stdev_threshold", val))
+				exit(1);
+		}
+	}
+
+	/* set :exp_threshold variable */
+	if(getVariable(&state[0], "exp_threshold") == NULL)
+	{
+		snprintf(val, sizeof(val), "%lf", exp_threshold);
+		for (i = 0; i < nclients; i++)
+		{
+			if (!putVariable(&state[i], "startup", "exp_threshold", val))
+				exit(1);
+		}
+	}
+
 	if (!is_no_vacuum)
 	{
 		fprintf(stderr, "starting vacuum...");
@@ -2839,17 +3191,32 @@ main(int argc, char **argv)
 	switch (ttype)
 	{
 		case 0:
-			sql_files[0] = process_builtin(tpc_b);
+			if (use_gaussian)
+				sql_files[0] = process_builtin(gaussian_tpc_b);
+			else if (use_exponential)
+				sql_files[0] = process_builtin(exponential_tpc_b);
+			else
+				sql_files[0] = process_builtin(tpc_b);
 			num_files = 1;
 			break;
 
 		case 1:
-			sql_files[0] = process_builtin(select_only);
+			if (use_gaussian)
+				sql_files[0] = process_builtin(gaussian_select_only);
+			else if (use_exponential)
+				sql_files[0] = process_builtin(exponential_select_only);
+			else
+				sql_files[0] = process_builtin(select_only);
 			num_files = 1;
 			break;
 
 		case 2:
-			sql_files[0] = process_builtin(simple_update);
+			if (use_gaussian)
+				sql_files[0] = process_builtin(gaussian_simple_update);
+			else if (use_exponential)
+				sql_files[0] = process_builtin(exponential_simple_update);
+			else
+				sql_files[0] = process_builtin(simple_update);
 			num_files = 1;
 			break;
 
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index 4367563..3a561a9 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -307,6 +307,21 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </varlistentry>
 
      <varlistentry>
+      <term><option>--exponential=</option><replaceable>threshold</></term>
+      <listitem>
+       <para>
+         Run exponential distribution pgbench test using this threshold parameter.
+         The threshold controls the distribution of access frequency on the
+         <structname>pgbench_accounts</> table.
+         See the <literal>\setrandom</> documentation below for details about
+         the impact of the threshold value.
+         When set, this option applies to all test variants (<option>-N</> for
+         skipping updates, or <option>-S</> for selects).
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-f</option> <replaceable>filename</></term>
       <term><option>--file=</option><replaceable>filename</></term>
       <listitem>
@@ -320,6 +335,21 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </varlistentry>
 
      <varlistentry>
+      <term><option>--gaussian=</option><replaceable>threshold</></term>
+      <listitem>
+       <para>
+         Run gaussian distribution pgbench test using this threshold parameter.
+         The threshold controls the distribution of access frequency on the
+         <structname>pgbench_accounts</> table.
+         See the <literal>\setrandom</> documentation below for details about
+         the impact of the threshold value.
+         When set, this option applies to all test variants (<option>-N</> for
+         skipping updates, or <option>-S</> for selects).
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-j</option> <replaceable>threads</></term>
       <term><option>--jobs=</option><replaceable>threads</></term>
       <listitem>
@@ -748,8 +778,8 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
 
    <varlistentry>
     <term>
-     <literal>\setrandom <replaceable>varname</> <replaceable>min</> <replaceable>max</></literal>
-    </term>
+     <literal>\setrandom <replaceable>varname</> <replaceable>min</> <replaceable>max</> [ uniform | [ { gaussian | exponential } <replaceable>threshold</> ] ]</literal>
+     </term>
 
     <listitem>
      <para>
@@ -761,9 +791,44 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </para>
 
      <para>
+      The default random distribution is uniform. The gaussian and exponential
+      options allow changing the distribution. The mandatory
+      <replaceable>threshold</> double value controls the actual distribution.
+     </para>
+
+     <para>
+      With the gaussian option, the larger the <replaceable>threshold</>,
+      the more frequently values close to the middle of the interval are drawn,
+      and the less frequently values close to the <replaceable>min</> and
+      <replaceable>max</> bounds.
+      In other words, the larger the <replaceable>threshold</>,
+      the narrower the access range around the middle.
+      The smaller the threshold, the smoother the access pattern
+      distribution. The minimum threshold is 2.0 for performance.
+     </para>
+
+     <para>
+      With the exponential option, the <replaceable>threshold</> parameter
+      controls the distribution by truncating an exponential distribution at
+      a specific value, and then projecting onto integers between the bounds.
+      To be precise, the <replaceable>threshold</> is such that the density of
+      probability of the exponential distribution at the <replaceable>max</>
+      cut-off value is exp(-threshold), the density at the <replaceable>min</>
+      value being 1.
+      Intuitively, the larger the threshold, the more frequently values close to
+      <replaceable>min</> are accessed, and the less frequently values close to
+      <replaceable>max</> are accessed.
+      A crude approximation of the distribution is that the most frequent 1%
+      values are drawn <replaceable>threshold</>% of the time.
+      The closer to 0.0 the threshold, the flatter (more uniform) the access
+      distribution.
+      The threshold value must be strictly positive with the exponential option.
+     </para>
+
+     <para>
       Example:
 <programlisting>
-\setrandom aid 1 :naccounts
+\setrandom aid 1 :naccounts gaussian 5.0
 </programlisting></para>
     </listitem>
    </varlistentry>

Gavin Flower

GavinFlower@archidevsys.co.nz

over 11 years ago

In reply to: Fabien COELHO (#2)

On 02/07/14 21:05, Fabien COELHO wrote:

Hello Mitsumasa-san,

And I'm also interested in your "decile percents" output like under
followings,
decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%

Sure, I'm really fine with that.

I think that it is easier than before. Sum of decile percents is just
100%.

That's a good property:-)

However, I don't prefer "highest/lowest percentage" because it will
be confused with decile percentage for users, and anyone cannot
understand this digits. I cannot understand "4.9%, 0.0%" when I see
the first time. Then, I checked the source code, I understood it:(
It's not good design... #Why this parameter use 100?

What else? People have ten fingers and like powers of 10, and are used
to percents?

So I'd like to remove it if you like. It will be more simple.

I think that for the exponential distribution it helps, especially for
high threshold, to have the lowest/highest percent density. For low
thresholds, the decile is also definitely useful. So I'm fine with
both outputs as you have put them.

I have just updated the wording so that it may be clearer:

decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0%
probability of first/last percent of the range: 11.3% 0.0%

Attached patch is fixed version, please confirm it.

Attached a v15 which just fixes a typo and the above wording update.
I'm validating it for committers.

#Of course, World Cup is being held now. I'm not hurry at all.

I'm not a soccer kind of person, so it does not influence my
availability.:-)

Suggested commit message:

Add drawing random integers with Gaussian or truncated exponential
distributions to pgbench.

Test variants with these distributions are also provided and triggered
with options "--gaussian=..." and "--exponential=...".

Have a nice day/night,

I would suggest that probabilities should NEVER be expressed in
percentages! As a percentage probability looks weird, and is never used
for serious statistical work - in my experience at least.

I think probabilities should be expressed in the range 0 ... 1 - i.e.
0.35 rather than 35%.

Cheers,
Gavin

Fabien COELHO

coelho@cri.ensmp.fr

over 11 years ago

In reply to: Gavin Flower (#4)

Hello Gavin,

decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0%
probability of first/last percent of the range: 11.3% 0.0%

I would suggest that probabilities should NEVER be expressed in percentages!
As a percentage probability looks weird, and is never used for serious
statistical work - in my experience at least.

I think probabilities should be expressed in the range 0 ... 1 - i.e. 0.35
rather than 35%.

I could agree about the mathematics, but ISTM that "11.5%" is more
readable and intuitive than "0.115".

I could change "probability" and replace it with "frequency" or maybe
"occurrence", what would you think about that?

--
Fabien.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Gavin Flower

GavinFlower@archidevsys.co.nz

over 11 years ago

In reply to: Fabien COELHO (#5)

On 03/07/14 20:58, Fabien COELHO wrote:

Hello Gavin,

decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0%
probability of first/last percent of the range: 11.3% 0.0%

I would suggest that probabilities should NEVER be expressed in
percentages! As a percentage probability looks weird, and is never
used for serious statistical work - in my experience at least.

I think probabilities should be expressed in the range 0 ... 1 - i.e.
0.35 rather than 35%.

I could agree about the mathematics, but ISTM that "11.5%" is more
readable and intuitive than "0.115".

I could change "probability" and replace it with "frequency" or maybe
"occurrence", what would you think about that?

You may well be hitting a situation, where you meet opposition whatever
you do! :-)

"frequency" implies a positive integer (though "relative frequency"
might be okay) - and if you use "occurrence", someone else is bound to
complain...

Though, I'd opt for "relative frequency", if you can't use values in the
range 0 ... 1 for probabilities, if %'s are used - so long as it does
not generate a flame war.

I suspect it may not be worth the grief to change.

Cheers,
Gavin

Fujii Masao

masao.fujii@gmail.com

over 11 years ago

In reply to: Fabien COELHO (#2)

On Wed, Jul 2, 2014 at 6:05 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Hello Mitsumasa-san,

And I'm also interested in your "decile percents" output like under
followings,
decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%

Sure, I'm really fine with that.

I think that it is easier than before. Sum of decile percents is just
100%.

That's a good property:-)

However, I don't prefer "highest/lowest percentage" because it will be
confused with decile percentage for users, and anyone cannot understand this
digits. I cannot understand "4.9%, 0.0%" when I see the first time. Then, I
checked the source code, I understood it:( It's not good design... #Why this
parameter use 100?

What else? People have ten fingers and like powers of 10, and are used to
percents?

So I'd like to remove it if you like. It will be more simple.

I think that for the exponential distribution it helps, especially for high
threshold, to have the lowest/highest percent density. For low thresholds,
the decile is also definitely useful. So I'm fine with both outputs as you
have put them.

I have just updated the wording so that it may be clearer:

decile percents: 69.9% 21.0% 6.3% 1.9% 0.6% 0.2% 0.1% 0.0% 0.0% 0.0%
probability of first/last percent of the range: 11.3% 0.0%

Attached patch is fixed version, please confirm it.

Attached a v15 which just fixes a typo and the above wording update. I'm
validating it for committers.

#Of course, World Cup is being held now. I'm not hurry at all.

I'm not a soccer kind of person, so it does not influence my
availability.:-)

Suggested commit message:

Add drawing random integers with Gaussian or truncated exponential
distributions to pgbench.

Test variants with these distributions are also provided and triggered
with options "--gaussian=..." and "--exponential=...".

IIRC we've not reached consensus about whether we should support
such options in pgbench. Several hackers objected to supporting them.
OTOH, we've almost reached consensus on supporting gaussian
and exponential options in \setrandom. So I think that you should
separate those two features into two patches, and we should apply
the \setrandom one first. Then we can discuss whether the other patch
should be applied or not.

Regards,

--
Fujii Masao

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Fujii Masao (#7)

On 2014-07-03 21:27:53 +0900, Fujii Masao wrote:

Add drawing random integers with Gaussian or truncated exponential
distributions to pgbench.

Test variants with these distributions are also provided and triggered
with options "--gaussian=..." and "--exponential=...".

IIRC we've not reached consensus about whether we should support
such options in pgbench. Several hackers objected to supporting them.

Yea. I certainly disagree with the patch in its current state because
it copies the same 15 lines several times with a two word
difference. Independent of whether we want those options, I don't think
that's going to fly.

OTOH, we've almost reached consensus on supporting gaussian
and exponential options in \setrandom. So I think that you should
separate those two features into two patches, and we should apply
the \setrandom one first. Then we can discuss whether the other patch
should be applied or not.

Sounds like a good plan.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Fabien COELHO

coelho@cri.ensmp.fr

over 11 years ago

In reply to: Andres Freund (#8)

Yea. I certainly disagree with the patch in its current state because
it copies the same 15 lines several times with a two word difference.
Independent of whether we want those options, I don't think that's going
to fly.

I liked a simple static string for the different variants, which means
replication. Factorizing out the (large) common part will mean malloc &
sprintf. Well, why not.
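
As a rough illustration of the "malloc & sprintf" approach mentioned above, a single script template could be specialized per distribution instead of keeping one static string per variant. This is a sketch only: the shortened select-only template, the macro SELECT_ONLY_TEMPLATE, and the function build_select_only_script are hypothetical and are not taken from the patch; the :stdev_threshold and :exp_threshold variable names are the ones the patch sets at startup.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Shortened, illustrative template; %s receives the distribution suffix. */
#define SELECT_ONLY_TEMPLATE \
	"\\set naccounts 100000 * :scale\n" \
	"\\setrandom aid 1 :naccounts%s\n" \
	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"

/*
 * Build one script variant. distribution is "gaussian", "exponential", or
 * NULL for uniform; threshold_var is the variable holding the threshold.
 * The caller frees the result. Returns NULL if malloc fails.
 */
static char *
build_select_only_script(const char *distribution, const char *threshold_var)
{
	char		suffix[128] = "";
	size_t		len;
	char	   *script;

	if (distribution != NULL)
		snprintf(suffix, sizeof(suffix), " %s :%s", distribution, threshold_var);

	/* "%s" in the template is replaced by suffix, so this is large enough */
	len = strlen(SELECT_ONLY_TEMPLATE) + strlen(suffix) + 1;
	script = malloc(len);
	if (script == NULL)
		return NULL;

	snprintf(script, len, SELECT_ONLY_TEMPLATE, suffix);
	return script;
}
```

Calling build_select_only_script("gaussian", "stdev_threshold") would yield a script whose \setrandom line ends in "gaussian :stdev_threshold", while passing NULL keeps the plain uniform line.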

OTOH, we've almost reached consensus on supporting gaussian
and exponential options in \setrandom. So I think that you should
separate those two features into two patches, and we should apply
the \setrandom one first. Then we can discuss whether the other patch
should be applied or not.

Sounds like a good plan.

Sigh. I'll do that as it seems to be a blocker...

The caveat that I have is that without these options there is:

(1) no feedback about the actual distributions in the final summary, which
depend on the threshold value, and

(2) no built-in means to test the feature, so the first patch is less
meaningful if the feature cannot be used simply and requires a custom
script.

--
Fabien.

#10

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Fabien COELHO (#9)

On 2014-07-04 11:59:23 +0200, Fabien COELHO wrote:

Yea. I certainly disagree with the patch in its current state because it
copies the same 15 lines several times with a two word difference.
Independent of whether we want those options, I don't think that's going
to fly.

I liked a simple static string for the different variants, which means
replication. Factorizing out the (large) common part will mean malloc &
sprintf. Well, why not.

It sucks from a maintenance POV. And I don't see the overhead of malloc
being relevant here...

OTOH, we've almost reached consensus on supporting gaussian
and exponential options in \setrandom. So I think that you should
separate those two features into two patches, and we should apply
the \setrandom one first. Then we can discuss whether the other patch
should be applied or not.

Sounds like a good plan.

Sigh. I'll do that as it seems to be a blocker...

I think we also need documentation about the actual mathematical
behaviour of the randomness generators.

+     <para>
+      With the gaussian option, the larger the <replaceable>threshold</>,
+      the more frequently values close to the middle of the interval are drawn,
+      and the less frequently values close to the <replaceable>min</> and
+      <replaceable>max</> bounds.
+      In other words, the larger the <replaceable>threshold</>,
+      the narrower the access range around the middle.
+      The smaller the threshold, the smoother the access pattern
+      distribution. The minimum threshold is 2.0 for performance.
+     </para>

The only way to actually understand the distribution here is to create a
table, insert random values, and then look at the result. That's not a
good thing.

The caveat that I have is that without these options there is:

(1) no feedback about the actual distributions in the final summary, which
depend on the threshold value, and

(2) no built-in means to test the feature, so the first patch is less
meaningful if the feature cannot be used simply and requires a custom script.

I personally agree that we likely want that as an additional
feature. Even if just because it makes the results easier to compare.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#11

Mitsumasa KONDO

kondo.mitsumasa@gmail.com

over 11 years ago

In reply to: Andres Freund (#10)

Hi,

2014-07-04 19:05 GMT+09:00 Andres Freund <andres@2ndquadrant.com>:

On 2014-07-04 11:59:23 +0200, Fabien COELHO wrote:

Yea. I certainly disagree with the patch in its current state because it
copies the same 15 lines several times with a two word difference.
Independent of whether we want those options, I don't think that's going
to fly.

I liked a simple static string for the different variants, which means
replication. Factorizing out the (large) common part will mean malloc &
sprintf. Well, why not.

It sucks from a maintenance POV. And I don't see the overhead of malloc
being relevant here...

OTOH, we've almost reached consensus on supporting gaussian
and exponential options in \setrandom. So I think that you should
separate those two features into two patches, and we should apply
the \setrandom one first. Then we can discuss whether the other patch
should be applied or not.

Sounds like a good plan.

Sigh. I'll do that as it seems to be a blocker...

I still agree with Fabien-san. I cannot understand why our logical proposal
isn't accepted...

I think we also need documentation about the actual mathematical
behaviour of the randomness generators.

+     <para>
+      With the gaussian option, the larger the <replaceable>threshold</>,
+      the more frequently values close to the middle of the interval are drawn,
+      and the less frequently values close to the <replaceable>min</> and
+      <replaceable>max</> bounds.
+      In other words, the larger the <replaceable>threshold</>,
+      the narrower the access range around the middle.
+      The smaller the threshold, the smoother the access pattern
+      distribution. The minimum threshold is 2.0 for performance.
+     </para>

The only way to actually understand the distribution here is to create a
table, insert random values, and then look at the result. That's not a
good thing.

That's right. That is why we created the command line options: they make
the parametrized distributions easy to understand.
When you want to see what a given parameter does to the distribution, you
can use the command line option as follows.

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.00000
decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=5
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 5.00000
decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%
highest/lowest percent of the range: 4.9% 0.0%

If you have a better method than ours, please share it with us.

The caveat that I have is that without these options there is:

(1) no feedback about the actual distributions in the final summary, which
depend on the threshold value, and

(2) no built-in means to test the feature, so the first patch is less
meaningful if the feature cannot be used simply and requires a custom
script.

I personally agree that we likely want that as an additional
feature. Even if just because it makes the results easier to compare.

If we can have a positive and logical discussion, I will agree with the
proposal to separate the patches.
However, I think that the most opposed hacker decided based on his feelings...
Actually, he didn't answer our proposal about understanding the
parametrized distribution...
So I also think it is a blocker. The command line feature is also needed.
Besides, is there another good method? Please share it with us.

Best regards,
--
Mitsumasa KONDO

#12

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Mitsumasa KONDO (#11)

On Sun, Jul 13, 2014 at 2:27 AM, Mitsumasa KONDO
<kondo.mitsumasa@gmail.com> wrote:

I still agree with Fabien-san. I cannot understand why our logical proposal
isn't accepted...

Well, I think the feedback has been pretty clear, honestly. Here's
what I'm unhappy about: I can't understand what these options are
actually doing.

And this isn't helping me a bit:

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.00000

decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%

I don't have a clue what that means. None.

Here is an example of an explanation that would make sense to me.
This is not the actual behavior of your patch, I'm quite sure, so this
is just an example of the *kind* of explanation that I think is
needed:

The --exponential option causes pgbench to select lower-numbered
account IDs exponentially more frequently than higher-numbered account
IDs. The argument to --exponential controls the strength of the
preference for lower-numbered account IDs, with a smaller value
indicating a stronger preference. Specifically, it is the percentage
of the total number of account IDs which will receive half the total
accesses. For example, with --exponential=10, half the accesses will
be to the smallest 10 percent of the account IDs; half the remaining
accesses will be to the next-smallest 10 percent of account IDs, and
so on. --exponential=50 therefore represents a completely flat
distribution; larger values are not allowed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13

Fabien COELHO

coelho@cri.ensmp.fr

over 11 years ago

In reply to: Robert Haas (#12)

Hello Robert,

Well, I think the feedback has been pretty clear, honestly. Here's
what I'm unhappy about: I can't understand what these options are
actually doing.

We can try to improve the documentation, once more!

However, ISTM that it is not the purpose of pgbench documentation to be a
primer about what is an exponential or gaussian distribution, so the idea
would yet be to have a relatively compact explanation, and that the
interested but clueless reader would document h..self from wikipedia or a
text book or a friend or a math teacher (who could be a friend as well:-).

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.00000

decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%

I don't have a clue what that means. None.

Maybe we could add in front of the decile/percent

"distribution of increasing account key values selected by pgbench:"

Here is an example of an explanation that would make sense to me.
This is not the actual behavior of your patch, I'm quite sure, so this
is just an example of the *kind* of explanation that I think is
needed:

This is more or less the approximate behavior of the patch, but for 1% of
the range, not 50%. However I'm not sure that the current documentation is
so bad.

The --exponential option causes pgbench to select lower-numbered
account IDs exponentially more frequently than higher-numbered account
IDs. The argument to --exponential controls the strength of the
preference for lower-numbered account IDs, with a smaller value
indicating a stronger preference. Specifically, it is the percentage
of the total number of account IDs which will receive half the total
accesses. For example, with --exponential=10, half the accesses will
be to the smallest 10 percent of the account IDs; half the remaining
accesses will be to the next-smallest 10 percent of account IDs, and
so on. --exponential=50 therefore represents a completely flat
distribution; larger values are not allowed.

--
Fabien.


#14

Fabien COELHO

coelho@cri.ensmp.fr

over 11 years ago

In reply to: Andres Freund (#10)

1 attachment(s)

Re: gaussian distribution pgbench -- part 1/2

pgbench with gaussian & exponential, part 1 of 2.

This patch is a subset of the previous patch: it adds only the two
new \setrandom gaussian and exponential variants, not the
adapted pgbench test cases, as suggested by Fujii Masao.
There is no new code and there are no code changes.

The corresponding documentation has been extended yet again with respect
to the initial patch, so that what is achieved is hopefully unambiguous
(there are two mathematical formulas, tasty!), in answer to Andres Freund's
comments, and partly to Robert Haas's comments as well.

This patch also provides several sql/pgbench scripts and a README, so
that the feature can be tested. I do not know whether these scripts
should make it to postgresql. I would say yes, otherwise there is no way
to test...

Part 2, which provides the adapted pgbench test cases, will come later.

--
Fabien.

Attachments:

gauss_A_17.patch (text/x-diff)

diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 0000000..4b8fd59
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as "expo" or "gauss"
+psql test < test-init.sql
+./pgbench -f test-XXX-run.sql -t 1000000 -P 1 test
+psql test < test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..a80c0a5 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -41,6 +41,7 @@
 #include <math.h>
 #include <signal.h>
 #include <sys/time.h>
+#include <assert.h>
 #ifdef HAVE_SYS_SELECT_H
 #include <sys/select.h>
 #endif
@@ -98,6 +99,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -471,6 +474,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-exp_threshold).
+ */
+static int64
+getExponentialrand(TState *thread, int64 min, int64 max, double exp_threshold)
+{
+	double cut, uniform, rand;
+	assert(exp_threshold > 0.0);
+	cut = exp(-exp_threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread->random_state);
+	/*
+	 * inner expression in (cut, 1] (if exp_threshold > 0),
+	 * rand in [0, 1)
+	 */
+	assert((1.0 - cut) != 0.0);
+	rand = - log(cut + (1.0 - cut) * uniform) / exp_threshold;
+	/* return an int64 random number between min and max inclusive */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianrand(TState *thread, int64 min, int64 max, double stdev_threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -stdev_threshold < stdev <= stdev_threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) <= 2 => r >= e^{-2} ~ 0.135, then when taking the average
+	 * sinus multiplier as 2/pi, we have a 8.6% looping probability in the
+	 * worst case. For a 5.0 threshold value, the looping probability
+	 * is about e^{-5} * 2 / pi ~ 0.43%.
+	 */
+	do
+	{
+		/*
+		 * pg_erand48 generates [0,1), but for the basic version of the
+		 * Box-Muller transform the two uniformly distributed random numbers
+		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+		 */
+		double rand1 = 1.0 - pg_erand48(thread->random_state);
+		double rand2 = 1.0 - pg_erand48(thread->random_state);
+
+		/* Box-Muller basic form transform */
+		double var_sqrt = sqrt(-2.0 * log(rand1));
+		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+		/*
+ 		 * we may try with cos, but there may be a bias induced if the previous
+		 * value fails the test? To be on the safe side, let us try over.
+		 */
+	}
+	while (stdev < -stdev_threshold || stdev >= stdev_threshold);
+
+	/* stdev is in [-threshold, threshold), normalization to [0,1) */
+	rand = (stdev + stdev_threshold) / (stdev_threshold * 2.0);
+
+	/* return an int64 random number between min and max inclusive */
+	return min + (int64)((max - min + 1) * rand);
+}
+
 /* call PQexec() and exit() on failure */
 static void
 executeStatement(PGconn *con, const char *sql)
@@ -1319,6 +1392,7 @@ top:
 			char	   *var;
 			int64		min,
 						max;
+			double		threshold = 0;
 			char		res[64];
 
 			if (*argv[2] == ':')
@@ -1364,11 +1438,11 @@ top:
 			}
 
 			/*
-			 * getrand() needs to be able to subtract max from min and add one
-			 * to the result without overflowing.  Since we know max > min, we
-			 * can detect overflow just by checking for a negative result. But
-			 * we must check both that the subtraction doesn't overflow, and
-			 * that adding one to the result doesn't overflow either.
+			 * Generate random number functions need to be able to subtract
+			 * max from min and add one to the result without overflowing.
+			 * Since we know max > min, we can detect overflow just by checking
+			 * for a negative result. But we must check both that the subtraction
+			 * doesn't overflow, and that adding one to the result doesn't overflow either.
 			 */
 			if (max - min < 0 || (max - min) + 1 < 0)
 			{
@@ -1377,10 +1451,63 @@ top:
 				return true;
 			}
 
+			if (argc == 4) /* uniform */
+			{
 #ifdef DEBUG
-			printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
+				printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
 #endif
-			snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+				snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+			}
+			else if ((pg_strcasecmp(argv[4], "gaussian") == 0) ||
+				 (pg_strcasecmp(argv[4], "exponential") == 0))
+			{
+				if (*argv[5] == ':')
+				{
+					if ((var = getVariable(st, argv[5] + 1)) == NULL)
+					{
+						fprintf(stderr, "%s: invalid threshold number %s\n", argv[0], argv[5]);
+						st->ecnt++;
+						return true;
+					}
+					threshold = strtod(var, NULL);
+				}
+				else
+					threshold = strtod(argv[5], NULL);
+
+				if (pg_strcasecmp(argv[4], "gaussian") == 0)
+				{
+					if (threshold < MIN_GAUSSIAN_THRESHOLD)
+					{
+						fprintf(stderr, "%s: gaussian threshold must be at least %f\n", argv[0], MIN_GAUSSIAN_THRESHOLD);
+						st->ecnt++;
+						return true;
+					}
+#ifdef DEBUG
+					printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getGaussianrand(thread, min, max, threshold));
+#endif
+					snprintf(res, sizeof(res), INT64_FORMAT, getGaussianrand(thread, min, max, threshold));
+				}
+				else if (pg_strcasecmp(argv[4], "exponential") == 0)
+				{
+					if (threshold <= 0.0)
+					{
+						fprintf(stderr, "%s: exponential threshold must be strictly positive\n", argv[0]);
+						st->ecnt++;
+						return true;
+					}
+#ifdef DEBUG
+					printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getExponentialrand(thread, min, max, threshold));
+#endif
+					snprintf(res, sizeof(res), INT64_FORMAT, getExponentialrand(thread, min, max, threshold));
+				}
+			}
+			else /* uniform with extra arguments */
+			{
+#ifdef DEBUG
+				printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
+#endif
+				snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+			}
 
 			if (!putVariable(st, argv[0], argv[1], res))
 			{
@@ -1920,9 +2047,34 @@ process_commands(char *buf)
 				exit(1);
 			}
 
-			for (j = 4; j < my_commands->argc; j++)
-				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
-						my_commands->argv[0], my_commands->argv[j]);
+			if (my_commands->argc == 4 ) /* uniform */
+			{
+				/* nothing to do */
+			}
+			else if ((pg_strcasecmp(my_commands->argv[4], "gaussian") == 0) ||
+				 (pg_strcasecmp(my_commands->argv[4], "exponential") == 0))
+			{
+				if (my_commands->argc < 6)
+				{
+					fprintf(stderr, "%s(%s): missing argument\n", my_commands->argv[0], my_commands->argv[4]);
+					exit(1);
+				}
+
+				for (j = 6; j < my_commands->argc; j++)
+					fprintf(stderr, "%s(%s): extra argument \"%s\" ignored\n",
+							my_commands->argv[0], my_commands->argv[4], my_commands->argv[j]);
+			}
+			else /* uniform with extra argument */
+			{
+				int arg_pos = 4;
+
+				if (pg_strcasecmp(my_commands->argv[4], "uniform") == 0)
+					arg_pos++;
+
+				for (j = arg_pos; j < my_commands->argc; j++)
+					fprintf(stderr, "%s(uniform): extra argument \"%s\" ignored\n",
+							my_commands->argv[0], my_commands->argv[j]);
+			}
 		}
 		else if (pg_strcasecmp(my_commands->argv[0], "set") == 0)
 		{
diff --git a/contrib/pgbench/test-expo-check.sql b/contrib/pgbench/test-expo-check.sql
new file mode 100644
index 0000000..fbf35fd
--- /dev/null
+++ b/contrib/pgbench/test-expo-check.sql
@@ -0,0 +1,14 @@
+-- val, min, max, threshold
+CREATE OR REPLACE FUNCTION
+expoProba(INTEGER, INTEGER, INTEGER, DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+  SELECT (exp(-$4*($1-$2)/($3-$2+1)) - exp(-$4*($1-$2+1)/($3-$2+1))) /
+         (1.0 - exp(-$4));
+$$ LANGUAGE SQL;
+
+SELECT SUM(cnt) FROM pgbench_dist;
+
+SELECT id, 1.0*cnt/SUM(cnt) OVER(), expoProba(id, 0, 99, 10.0)
+FROM pgbench_dist
+ORDER BY id;
+
diff --git a/contrib/pgbench/test-expo-run.sql b/contrib/pgbench/test-expo-run.sql
new file mode 100644
index 0000000..1d476bc
--- /dev/null
+++ b/contrib/pgbench/test-expo-run.sql
@@ -0,0 +1,2 @@
+\setrandom id 0 99 exponential 10.0
+UPDATE pgbench_dist SET cnt=cnt+1 WHERE id = :id;
diff --git a/contrib/pgbench/test-gauss-check.sql b/contrib/pgbench/test-gauss-check.sql
new file mode 100644
index 0000000..f46c21d
--- /dev/null
+++ b/contrib/pgbench/test-gauss-check.sql
@@ -0,0 +1,57 @@
+-- approximation with maximal error of 1.2e-07, according to
+-- https://en.wikipedia.org/wiki/Error_function#Numerical_approximation
+CREATE OR REPLACE FUNCTION erf(x DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+DECLARE
+  t DOUBLE PRECISION := 1.0 / ( 1.0 + 0.5 * ABS(x));
+  tau DOUBLE PRECISION;
+BEGIN
+  IF ABS(x) >= 6.0 THEN
+    -- avoid underflow error
+    tau := 0.0;
+  ELSE
+    -- use approximation
+    tau := t * exp(-x*x - 1.26551223
+         + t * (1.00002368
+         + t * (0.37409196
+         + t * (0.09678418
+         + t * (-0.18628806
+         + t * (0.27886807
+         + t * (-1.13520398
+         + t * (1.48851587
+         + t * (-0.82215223
+         + t *  0.17087277)))))))));
+  END IF;
+  IF x >= 0 THEN
+    RETURN 1.0 - tau;
+  ELSE
+    RETURN tau - 1.0;
+  END IF;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE OR REPLACE FUNCTION phi(DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+  SELECT 0.5 * ( 1.0 + erf( $1 / SQRT(2.0) ) );
+$$ LANGUAGE SQL;
+
+CREATE OR REPLACE FUNCTION
+gaussianProba(i INTEGER, mini INTEGER, maxi INTEGER, threshold DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+DECLARE
+  extent DOUBLE PRECISION;
+  mu DOUBLE PRECISION;
+BEGIN
+  extent := maxi - mini + 1.0;
+  mu := 0.5 * (maxi + mini);
+  RETURN (phi(2.0 * threshold * (i - mu + 0.5) / extent) -
+          phi(2.0 * threshold * (i - mu - 0.5) / extent))
+         -- truncated gaussian
+	 / ( 2.0 * phi(threshold) - 1.0 );
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT SUM(cnt) FROM pgbench_dist;
+SELECT id, 1.0*cnt/SUM(cnt) OVER(), gaussianProba(id, 0, 99, 2.0)
+FROM pgbench_dist
+ORDER BY id;
diff --git a/contrib/pgbench/test-gauss-run.sql b/contrib/pgbench/test-gauss-run.sql
new file mode 100644
index 0000000..984a3b4
--- /dev/null
+++ b/contrib/pgbench/test-gauss-run.sql
@@ -0,0 +1,2 @@
+\setrandom id 0 99 gaussian 2.0
+UPDATE pgbench_dist SET cnt=cnt+1 WHERE id = :id;
diff --git a/contrib/pgbench/test-init.sql b/contrib/pgbench/test-init.sql
new file mode 100644
index 0000000..84f7cc9
--- /dev/null
+++ b/contrib/pgbench/test-init.sql
@@ -0,0 +1,4 @@
+DROP TABLE IF EXISTS pgbench_dist;
+CREATE UNLOGGED TABLE pgbench_dist(id SERIAL PRIMARY KEY, cnt INTEGER NOT NULL DEFAULT 0);
+INSERT INTO pgbench_dist(id, cnt) 
+  SELECT i, 0 FROM generate_series(0, 99) AS i;
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index f264c24..babf88a 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -748,8 +748,8 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
 
    <varlistentry>
     <term>
-     <literal>\setrandom <replaceable>varname</> <replaceable>min</> <replaceable>max</></literal>
-    </term>
+     <literal>\setrandom <replaceable>varname</> <replaceable>min</> <replaceable>max</> [ uniform | [ { gaussian | exponential } <replaceable>threshold</> ] ]</literal>
+     </term>
 
     <listitem>
      <para>
@@ -761,9 +761,58 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </para>
 
      <para>
+      The default random distribution is uniform, that is all values in the
+      range are drawn with equal probability. The gaussian and exponential
+      options change this default. The mandatory
+      <replaceable>threshold</> double value controls the actual distribution
+      with gaussian or exponential.
+     </para>
+
+     <para>
+      With the gaussian option, the interval is mapped onto a normal
+      distribution truncated at <literal>-threshold</> on the left and
+      <literal>+threshold</> on the right.
+      The larger the <replaceable>threshold</>, the more frequently values
+      close to the middle of the interval are drawn, and the less frequently
+      values close to the <replaceable>min</> and <replaceable>max</> bounds.
+      In other worlds, the larger the <replaceable>threshold</>,
+      the narrower the access range around the middle.
+      The smaller the threshold, the smoother the access pattern distribution.
+      To be precise, if <literal>phi(x)</> is the cumulative distribution
+      function of the standard normal distribution, with mean <literal>mu</>
+      defined as <literal>(max + min) / 2.0</>, then value <replaceable>i</>
+      between <replaceable>min</> and <replaceable>max</> inclusive is drawn
+      with probability about:
+      <literal>
+        (phi(2.0 * threshold * (i - mu + 0.5) / (max - min + 1)) -
+         phi(2.0 * threshold * (i - mu - 0.5) / (max - min + 1))) /
+         (2.0 * phi(threshold) - 1.0)
+      </>
+      The minimum threshold is 2.0 for performance of the Box-Muller transform.
+     </para>
+
+     <para>
+      With the exponential option, the <replaceable>threshold</> parameter
+      controls the distribution by truncating an exponential distribution at
+      a specific value, and then projecting onto integers between the bounds.
+      To be precise, value <replaceable>i</> between <replaceable>min</> and
+      <replaceable>max</> inclusive is drawn with probability:
+      <literal>(exp(-threshold*(i-min)/(max+1-min)) -
+       exp(-threshold*(i+1-min)/(max+1-min))) / (1.0 - exp(-threshold))</>.
+      Intuitively, the larger the threshold, the more frequently values close to
+      <replaceable>min</> are accessed, and the less frequently values close to
+      <replaceable>max</> are accessed.
+      A crude approximation of the distribution is that the most frequent 1%
+      values are drawn <replaceable>threshold</>% of the time.
+      The closer to 0.0 the threshold, the flatter (more uniform) the access
+      distribution.
+      The threshold value must be strictly positive with the exponential option.
+     </para>
+
+     <para>
       Example:
 <programlisting>
-\setrandom aid 1 :naccounts
+\setrandom aid 1 :naccounts gaussian 5.0
 </programlisting></para>
     </listitem>
    </varlistentry>
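
For readers who want to experiment with the two generators without building
pgbench, the sampling logic of getExponentialrand() and getGaussianrand()
above can be sketched in Python. This is only an illustration and not part of
the patch: the names exponential_rand and gaussian_rand are made up here,
Python's random.Random stands in for pg_erand48, and the final min() clamp is
an extra guard against floating-point rounding that the C code does not have.

```python
import math
import random


def exponential_rand(rng, lo, hi, threshold):
    """Truncated exponential draw in [lo, hi] (inverse-transform sampling)."""
    assert threshold > 0.0
    cut = math.exp(-threshold)
    # rng.random() is in [0, 1), so uniform is in (0, 1]
    uniform = 1.0 - rng.random()
    # argument of log is in (cut, 1], hence r is in [0, 1)
    r = -math.log(cut + (1.0 - cut) * uniform) / threshold
    # clamp is an added safety net, not present in the C code
    return min(hi, lo + int((hi - lo + 1) * r))


def gaussian_rand(rng, lo, hi, threshold):
    """Truncated Gaussian draw in [lo, hi] (Box-Muller with rejection)."""
    assert threshold >= 2.0
    while True:
        # both uniforms are taken in (0, 1] so that log() is always defined
        r1 = 1.0 - rng.random()
        r2 = 1.0 - rng.random()
        z = math.sqrt(-2.0 * math.log(r1)) * math.sin(2.0 * math.pi * r2)
        # retry until the standard normal draw is within [-threshold, threshold)
        if -threshold <= z < threshold:
            break
    # map [-threshold, threshold) onto [0, 1), then onto the integer range
    r = (z + threshold) / (2.0 * threshold)
    return min(hi, lo + int((hi - lo + 1) * r))


if __name__ == "__main__":
    rng = random.Random(42)
    print([exponential_rand(rng, 0, 99, 10.0) for _ in range(10)])
    print([gaussian_rand(rng, 0, 99, 5.0) for _ in range(10)])
```

With enough draws, roughly 63% of the exponential values for threshold 10
should fall in the first tenth of the range, in line with the first decile
quoted in this thread.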

#15

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Fabien COELHO (#13)

On Wed, Jul 16, 2014 at 12:57 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Well, I think the feedback has been pretty clear, honestly. Here's
what I'm unhappy about: I can't understand what these options are
actually doing.

We can try to improve the documentation, once more!

However, ISTM that it is not the purpose of pgbench documentation to be a
primer about what is an exponential or gaussian distribution, so the idea
would yet be to have a relatively compact explanation, and that the
interested but clueless reader would document h..self from wikipedia or a
text book or a friend or a math teacher (who could be a friend as well:-).

Well, I think it's a balance. I agree that the pgbench documentation
shouldn't try to substitute for a text book or a math teacher, but I
also think that you shouldn't necessarily need to refer to a text book
or a math teacher in order to figure out how to use pgbench. Saying
"it's complicated, so we don't have to explain it" would be a cop out;
we need to *make* it simple. And if there's no way to do that, then
IMHO we should reject the patch in favor of some future patch that
implements something that will be easy for users to understand.

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.00000

decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%

I don't have a clue what that means. None.

Maybe we could add in front of the decile/percent

"distribution of increasing account key values selected by pgbench:"

I still wouldn't know what that meant. And it misses the point
anyway: if the documentation is good, this will be unnecessary. If
the documentation is bad, a printout that tries to illustrate it by
example is not an acceptable substitute.

Here is an example of an explanation that would make sense to me.
This is not the actual behavior of your patch, I'm quite sure, so this
is just an example of the *kind* of explanation that I think is
needed:

This is more or less the approximate behavior of the patch, but for 1% of
the range, not 50%. However I'm not sure that the current documentation is
so bad.

I think it isn't, because in the system I described, a larger value
indicates a flatter distribution, but in the documentation, a smaller
value indicates a flatter distribution. That having been said, I
agree the current documentation for the exponential distribution is
not too bad. But this part does not make sense:

+      A crude approximation of the distribution is that the most frequent 1%
+      values are drawn <replaceable>threshold</>% of the time.
+      The closer to 0.0 the threshold, the flatter (more uniform) the access
+      distribution.

Given the first statement, I'd expect the lowest possible threshold to
be 0.01, not 0.

The documentation for the Gaussian distribution is in somewhat worse
shape. Unlike the documentation for exponential, it makes no attempt
at all to give the user a clear idea what the distribution actually
looks like. The closest it comes is this:

+      In other worlds, the larger the <replaceable>threshold</>,
+      the narrower the access range around the middle.

But that's not really very close - there's no way for a user to judge
what impact the threshold parameter actually has except to try it.
Unlike the discussion of exponential, which contains a fairly-precise
mathematical characterization of the behavior, the Gaussian stuff has
nothing except a hand-wavy explanation that a higher threshold skews
the distribution more. (Also, the English expression is "in other
words" not "in other worlds" - but in fact the phrase has no business
in that sentence at all, because it is not reiterating the contents of
the previous sentence in different language, but rather making a new
point entirely. And the following sentence does not start with a
capital letter, though maybe that's because it was intended to be
incorporated into this sentence somehow.)

I think that you also need to consider which instances of the words
"gaussian" and "exponential" are referring to the option and which are
referring to the abstract mathematical concept. When you're talking
about the option, you should use all lower-case (as you've done) but
with <literal> tags or similar. When you're referring to the
mathematical distribution, Gaussian should be capitalized.

BTW, I agree with both Heikki's suggestion that we make these options
to setrandom only and not expose command-line options for them, and
with Andres's critique that the documentation of those options is far
too repetitive.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#16

Fabien COELHO

coelho@cri.ensmp.fr

over 11 years ago

In reply to: Robert Haas (#15)

However, ISTM that it is not the purpose of pgbench documentation to be a
primer about what is an exponential or gaussian distribution, so the idea
would yet be to have a relatively compact explanation, and that the
interested but clueless reader would document h..self from wikipedia or a
text book or a friend or a math teacher (who could be a friend as well:-).

Well, I think it's a balance. I agree that the pgbench documentation
shouldn't try to substitute for a text book or a math teacher, but I
also think that you shouldn't necessarily need to refer to a text book
or a math teacher in order to figure out how to use pgbench. Saying
"it's complicated, so we don't have to explain it" would be a cop out;
we need to *make* it simple. And if there's no way to do that, then
IMHO we should reject the patch in favor of some future patch that
implements something that will be easy for users to understand.

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.00000

decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%

I don't have a clue what that means. None.

Maybe we could add in front of the decile/percent

"distribution of increasing account key values selected by pgbench:"

I still wouldn't know what that meant. And it misses the point
anyway: if the documentation is good, this will be unnecessary. If
the documentation is bad, a printout that tries to illustrate it by
example is not an acceptable substitute.

The decile description is quite classic when discussing statistics.

Here is an example of an explanation that would make sense to me.
This is not the actual behavior of your patch, I'm quite sure, so this
is just an example of the *kind* of explanation that I think is
needed:

This is more or less the approximate behavior of the patch, but for 1% of
the range, not 50%. However I'm not sure that the current documentation is
so bad.

I think it isn't, because in the system I described, a larger value
indicates a flatter distribution, but in the documentation, a smaller
value indicates a flatter distribution.

Ok. But the general thrust was ok.

That having been said, I agree the current documentation for the
exponential distribution is not too bad. But this part does not make
sense:
+      A crude approximation of the distribution is that the most frequent 1%
+      values are drawn <replaceable>threshold</>% of the time.

I'm trying to be nice to the reader by providing some intuition.
I do not seem to succeed :-) I'm attempting to say that when
you draw from a range, say 1 to 1000, the first 1%, i.e. values 1 to 10,
are drawn about "threshold"% of the time.

If I draw from one hundred values:

\setrandom x 1 100 exponential 10.0

The 1 will be drawn about 10% of the time, and the 99 next values will
share the remaining 90%.
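
This figure can be checked against the probability formula given in the patch
documentation. A minimal sketch, assuming Python; expo_proba is a
hypothetical helper name that simply transcribes that formula:

```python
import math


def expo_proba(i, lo, hi, threshold):
    """Probability of drawing i in [lo, hi], per the documented formula."""
    n = hi - lo + 1
    return (math.exp(-threshold * (i - lo) / n)
            - math.exp(-threshold * (i + 1 - lo) / n)) / (1.0 - math.exp(-threshold))


# \setrandom x 1 100 exponential 10.0
p_first = expo_proba(1, 1, 100, 10.0)
p_rest = sum(expo_proba(i, 1, 100, 10.0) for i in range(2, 101))
print(f"P(x = 1)  = {p_first:.4f}")   # about 0.095
print(f"P(x >= 2) = {p_rest:.4f}")    # about 0.905
```

So "about 10%" is really about 9.5% here, which is the same number as the
"highest percent of the range" line in the summary output.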

+      The closer to 0.0 the threshold, the flatter (more uniform) the access
+      distribution.
Given the first statement, I'd expect the lowest possible threshold to
be 0.01, not 0.

This is in the sense of "epsilon": a small number close to 0 but different
from 0. The lowest possible threshold is the smallest
strictly positive number representable as a "double".
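
To illustrate the limit, the share of the first of 100 values can be
evaluated for shrinking thresholds. This is a sketch only:
first_value_share is a name invented for the example, and math.expm1 is used
merely to keep precision when the threshold is tiny.

```python
import math


def first_value_share(threshold, n=100):
    """Probability of the first of n values: (1 - e^(-t/n)) / (1 - e^(-t))."""
    return math.expm1(-threshold / n) / math.expm1(-threshold)


for t in (10.0, 1.0, 0.01, 1e-9):
    print(f"threshold {t:g}: first value drawn {100 * first_value_share(t):.4f}% of the time")
```

As the threshold goes to 0 the share approaches 1/100, i.e. the uniform
distribution, while any strictly positive threshold keeps it above 1/100.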

The documentation for the Gaussian distribution is in somewhat worse
shape. Unlike the documentation for exponential, it makes no attempt
at all to give the user a clear idea what the distribution actually
looks like. The closest it comes is this:
+      In other worlds, the larger the <replaceable>threshold</>,
+      the narrower the access range around the middle.
But that's not really very close - there's no way for a user to judge
what impact the threshold parameter actually has except to try it.
Unlike the discussion of exponential, which contains a fairly-precise
mathematical characterization of the behavior,

I have now added a precise formula for the Gaussian. Even with the formula,
you may still want to see the deciles to get an intuition.
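
As an illustration of how deciles follow from such a formula, the sketch
below integrates the truncated standard normal over each tenth of the range.
It is written in Python with hypothetical helper names (phi,
gaussian_deciles) and uses math.erf rather than the SQL approximation from
the test scripts.

```python
import math


def phi(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gaussian_deciles(threshold):
    """Share of draws per tenth of the range, truncated at +/- threshold."""
    norm = 2.0 * phi(threshold) - 1.0
    # decile boundaries mapped onto [-threshold, +threshold]
    edges = [threshold * (2.0 * k / 10.0 - 1.0) for k in range(11)]
    return [(phi(b) - phi(a)) / norm for a, b in zip(edges, edges[1:])]


for t in (2.0, 5.0):
    print(f"threshold {t}:", " ".join(f"{100 * p:.1f}%" for p in gaussian_deciles(t)))
```

For a threshold of 5 this gives 0.0% 0.1% 2.1% 13.6% 34.1% 34.1% 13.6% 2.1%
0.1% 0.0%, the same line as the --gaussian=5 decile output posted earlier in
this thread.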

I think that we assumed that the reader would know that a gaussian
distribution is the classic bell-shaped distribution, and if not .?he
would not be interested anyway.

the Gaussian stuff has
nothing except a hand-wavy explanation that a higher threshold skews
the distribution more. (Also, the English expression is "in other
words" not "in other worlds" - but in fact the phrase has no business
in that sentence at all, because it is not reiterating the contents of
the previous sentence in different language, but rather making a new
point entirely. And the following sentence does not start with a
capital letter, though maybe that's because it was intended to be
incorporated into this sentence somehow.)

I think that you also need to consider which instances of the words
"gaussian" and "exponential" are referring to the option and which are
referring to the abstract mathematical concept. When you're talking
about the option, you should use all lower-case (as you've done) but
with <literal> tags or similar. When you're referring to the
mathematical distribution, Gaussian should be capitalized.

BTW, I agree with both Heikki's suggestion that we make these options
to setrandom only and not expose command-line options for them, and
with Andres's critique that the documentation of those options is far
too repetitive.

I'll have yet another go at improving the documentation, especially the
gaussian part. However, you must allow that this is mathematics, and a
user who wants to use these distributions will be expected to have some
understanding of them beforehand.

Moreover, I cannot make it precise, intuitive and very short.

--
Fabien.


#17

Mitsumasa KONDO

kondo.mitsumasa@gmail.com

over 11 years ago

In reply to: Fabien COELHO (#16)

2014-07-18 5:13 GMT+09:00 Fabien COELHO <coelho@cri.ensmp.fr>:

However, ISTM that it is not the purpose of pgbench documentation to be a
primer about what is an exponential or gaussian distribution, so the idea
would yet be to have a relatively compact explanation, and that the
interested but clueless reader would document h..self from wikipedia or a
text book or a friend or a math teacher (who could be a friend as well:-).

Well, I think it's a balance. I agree that the pgbench documentation
shouldn't try to substitute for a text book or a math teacher, but I
also think that you shouldn't necessarily need to refer to a text book
or a math teacher in order to figure out how to use pgbench. Saying
"it's complicated, so we don't have to explain it" would be a cop out;
we need to *make* it simple. And if there's no way to do that, then
IMHO we should reject the patch in favor of some future patch that
implements something that will be easy for users to understand.

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10

starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.00000

decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%

I don't have a clue what that means. None.

Maybe we could add in front of the decile/percent

"distribution of increasing account key values selected by pgbench:"

I still wouldn't know what that meant. And it misses the point
anyway: if the documentation is good, this will be unnecessary. If
the documentation is bad, a printout that tries to illustrate it by
example is not an acceptable substitute.

The decile description is quite classic when discussing statistics.

Yes, perhaps. Fabien-san and I find it hard to believe that he does not
know what decile percentages are.
However, I think a fuller description of the deciles is needed.

For example, when the number of transactions is 10,000 (-t 10000),
the range of aid is 100,000,
and --exponential is 10, the decile percents are as follows, as you know.

decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%

They mean the following.
#number of accesses per range of aid (from the decile percents):
1 to 10,000 => 6,320 times
10,001 to 20,000 => 2,330 times
20,001 to 30,000 => 860 times
...
90,001 to 100,000 => 0 times

#number of accesses per range of aid (from the highest/lowest percent of the
range):
1 to 1,000 => 950 times
...
99,001 to 100,000 => 0 times

that's all.

This information makes the distribution of access probability easy to
understand, doesn't it?
Fabien-san and I have some background in mathematics, so perhaps we take
decile percentages for granted.
But if they are not common knowledge, I agree with adding these explanations
to the documentation.

Best regards,
--
Mitsumasa KONDO

#18

Fabien COELHO

coelho@cri.ensmp.fr

over 11 years ago

In reply to: Mitsumasa KONDO (#17)

What we are talking about is the "summary" at the end of the run, which is
expected to be compact, hence the terse few lines.

I'm not sure how to make it explicit without extending the summary too
much, so it would not be a summary anymore:-)

My initial assumption was that anyone interested enough in changing the
default uniform distribution for a test would know about deciles, but that
seems to be optimistic.

Maybe it would be okay to keep a terse summary but to expand the
documentation to explain what it means, as you suggested above...

--
Fabien.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19

Fabien COELHO

coelho@cri.ensmp.fr

over 11 years ago

In reply to: Robert Haas (#15)

2 attachment(s)

Please find attached 2 patches, which are a split of the patch discussed
in this thread.

(A) add gaussian & exponential options to pgbench \setrandom
the patch includes sql test files.

There is no change in the *code* from previous already reviewed
submissions, so I do not think that it needs another review on that
account.

However, I have (yet again) reworked the *documentation* (for Andres Freund
& Robert Haas). In particular, both descriptions now follow the same
structure (introduction, formula, intuition, rule of thumb and
constraint). I have differentiated the concept and the option by putting
the latter in <literal> tags, and added links to the corresponding
Wikipedia pages.

Please bear in mind that:
1. English is not my native language.
2. this is not easy reading... this is maths, to read slowly:-)
3. wordsmithing contributions are welcome.

I assume somehow that a user interested in gaussian & exponential
distributions must know a little bit about probabilities...

(B) add pgbench test variants with gauss & exponential.

I have reworked the patch so as to avoid copy pasting the 3 test cases, as
requested by Andres Freund, thus this is new, although quite simple, code.
I have also added explanations in the documentation about how to interpret
the "decile" outputs, so as to hopefully address Robert Haas's comments.

--
Fabien.

Attachments:

gauss_A_2.patchtext/x-diff; name=gauss_A_2.patchDownload

diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 0000000..6881256
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as "expo" or "gauss"
+psql test < test-init.sql
+./pgbench -M prepared -f test-XXX-run.sql -t 1000000 -P 1 -n test
+psql test < test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..a80c0a5 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -41,6 +41,7 @@
 #include <math.h>
 #include <signal.h>
 #include <sys/time.h>
+#include <assert.h>
 #ifdef HAVE_SYS_SELECT_H
 #include <sys/select.h>
 #endif
@@ -98,6 +99,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -471,6 +474,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-exp_threshold).
+ */
+static int64
+getExponentialrand(TState *thread, int64 min, int64 max, double exp_threshold)
+{
+	double cut, uniform, rand;
+	assert(exp_threshold > 0.0);
+	cut = exp(-exp_threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread->random_state);
+	/*
+	 * inner expresion in (cut, 1] (if exp_threshold > 0),
+	 * rand in [0, 1)
+	 */
+	assert((1.0 - cut) != 0.0);
+	rand = - log(cut + (1.0 - cut) * uniform) / exp_threshold;
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianrand(TState *thread, int64 min, int64 max, double stdev_threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -stdev_threshold < stdev <= stdev_threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) <= 2 => r >= e^{-2} ~ 0.135, then when taking the average
+	 * sinus multiplier as 2/pi, we have a 8.6% looping probability in the
+	 * worst case. For a 5.0 threshold value, the looping probability
+	 * is about e^{-5} * 2 / pi ~ 0.43%.
+	 */
+	do
+	{
+		/*
+		 * pg_erand48 generates [0,1), but for the basic version of the
+		 * Box-Muller transform the two uniformly distributed random numbers
+		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+		 */
+		double rand1 = 1.0 - pg_erand48(thread->random_state);
+		double rand2 = 1.0 - pg_erand48(thread->random_state);
+
+		/* Box-Muller basic form transform */
+		double var_sqrt = sqrt(-2.0 * log(rand1));
+		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+		/*
+ 		 * we may try with cos, but there may be a bias induced if the previous
+		 * value fails the test? To be on the safe side, let us try over.
+		 */
+	}
+	while (stdev < -stdev_threshold || stdev >= stdev_threshold);
+
+	/* stdev is in [-threshold, threshold), normalization to [0,1) */
+	rand = (stdev + stdev_threshold) / (stdev_threshold * 2.0);
+
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
 /* call PQexec() and exit() on failure */
 static void
 executeStatement(PGconn *con, const char *sql)
@@ -1319,6 +1392,7 @@ top:
 			char	   *var;
 			int64		min,
 						max;
+			double		threshold = 0;
 			char		res[64];
 
 			if (*argv[2] == ':')
@@ -1364,11 +1438,11 @@ top:
 			}
 
 			/*
-			 * getrand() needs to be able to subtract max from min and add one
-			 * to the result without overflowing.  Since we know max > min, we
-			 * can detect overflow just by checking for a negative result. But
-			 * we must check both that the subtraction doesn't overflow, and
-			 * that adding one to the result doesn't overflow either.
+			 * Generate random number functions need to be able to subtract
+			 * max from min and add one to the result without overflowing.
+			 * Since we know max > min, we can detect overflow just by checking
+			 * for a negative result. But we must check both that the subtraction
+			 * doesn't overflow, and that adding one to the result doesn't overflow either.
 			 */
 			if (max - min < 0 || (max - min) + 1 < 0)
 			{
@@ -1377,10 +1451,63 @@ top:
 				return true;
 			}
 
+			if (argc == 4) /* uniform */
+			{
 #ifdef DEBUG
-			printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
+				printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
 #endif
-			snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+				snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+			}
+			else if ((pg_strcasecmp(argv[4], "gaussian") == 0) ||
+				 (pg_strcasecmp(argv[4], "exponential") == 0))
+			{
+				if (*argv[5] == ':')
+				{
+					if ((var = getVariable(st, argv[5] + 1)) == NULL)
+					{
+						fprintf(stderr, "%s: invalid threshold number %s\n", argv[0], argv[5]);
+						st->ecnt++;
+						return true;
+					}
+					threshold = strtod(var, NULL);
+				}
+				else
+					threshold = strtod(argv[5], NULL);
+
+				if (pg_strcasecmp(argv[4], "gaussian") == 0)
+				{
+					if (threshold < MIN_GAUSSIAN_THRESHOLD)
+					{
+						fprintf(stderr, "%s: gaussian threshold must be more than %f\n", argv[5], MIN_GAUSSIAN_THRESHOLD);
+						st->ecnt++;
+						return true;
+					}
+#ifdef DEBUG
+					printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getGaussianrand(thread, min, max, threshold));
+#endif
+					snprintf(res, sizeof(res), INT64_FORMAT, getGaussianrand(thread, min, max, threshold));
+				}
+				else if (pg_strcasecmp(argv[4], "exponential") == 0)
+				{
+					if (threshold <= 0.0)
+					{
+						fprintf(stderr, "%s: exponential threshold must be strictly positive\n", argv[5]);
+						st->ecnt++;
+						return true;
+					}
+#ifdef DEBUG
+					printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getExponentialrand(thread, min, max, threshold));
+#endif
+					snprintf(res, sizeof(res), INT64_FORMAT, getExponentialrand(thread, min, max, threshold));
+				}
+			}
+			else /* uniform with extra arguments */
+			{
+#ifdef DEBUG
+				printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
+#endif
+				snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+			}
 
 			if (!putVariable(st, argv[0], argv[1], res))
 			{
@@ -1920,9 +2047,34 @@ process_commands(char *buf)
 				exit(1);
 			}
 
-			for (j = 4; j < my_commands->argc; j++)
-				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
-						my_commands->argv[0], my_commands->argv[j]);
+			if (my_commands->argc == 4 ) /* uniform */
+			{
+				/* nothing to do */
+			}
+			else if ((pg_strcasecmp(my_commands->argv[4], "gaussian") == 0) ||
+				 (pg_strcasecmp(my_commands->argv[4], "exponential") == 0))
+			{
+				if (my_commands->argc < 6)
+				{
+					fprintf(stderr, "%s(%s): missing argument\n", my_commands->argv[0], my_commands->argv[4]);
+					exit(1);
+				}
+
+				for (j = 6; j < my_commands->argc; j++)
+					fprintf(stderr, "%s(%s): extra argument \"%s\" ignored\n",
+							my_commands->argv[0], my_commands->argv[4], my_commands->argv[j]);
+			}
+			else /* uniform with extra argument */
+			{
+				int arg_pos = 4;
+
+				if (pg_strcasecmp(my_commands->argv[4], "uniform") == 0)
+					arg_pos++;
+
+				for (j = arg_pos; j < my_commands->argc; j++)
+					fprintf(stderr, "%s(uniform): extra argument \"%s\" ignored\n",
+							my_commands->argv[0], my_commands->argv[j]);
+			}
 		}
 		else if (pg_strcasecmp(my_commands->argv[0], "set") == 0)
 		{
diff --git a/contrib/pgbench/test-expo-check.sql b/contrib/pgbench/test-expo-check.sql
new file mode 100644
index 0000000..fbf35fd
--- /dev/null
+++ b/contrib/pgbench/test-expo-check.sql
@@ -0,0 +1,14 @@
+-- val, min, max, threshold
+CREATE OR REPLACE FUNCTION
+expoProba(INTEGER, INTEGER, INTEGER, DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+  SELECT (exp(-$4*($1-$2)/($3-$2+1)) - exp(-$4*($1-$2+1)/($3-$2+1))) /
+         (1.0 - exp(-$4));
+$$ LANGUAGE SQL;
+
+SELECT SUM(cnt) FROM pgbench_dist;
+
+SELECT id, 1.0*cnt/SUM(cnt) OVER(), expoProba(id, 0, 99, 10.0)
+FROM pgbench_dist
+ORDER BY id;
+
diff --git a/contrib/pgbench/test-expo-run.sql b/contrib/pgbench/test-expo-run.sql
new file mode 100644
index 0000000..1d476bc
--- /dev/null
+++ b/contrib/pgbench/test-expo-run.sql
@@ -0,0 +1,2 @@
+\setrandom id 0 99 exponential 10.0
+UPDATE pgbench_dist SET cnt=cnt+1 WHERE id = :id;
diff --git a/contrib/pgbench/test-gauss-check.sql b/contrib/pgbench/test-gauss-check.sql
new file mode 100644
index 0000000..7d56117
--- /dev/null
+++ b/contrib/pgbench/test-gauss-check.sql
@@ -0,0 +1,57 @@
+-- approximation with maximal error of 1.2 10E-07, as told from
+-- https://en.wikipedia.org/wiki/Error_function#Numerical_approximation
+CREATE OR REPLACE FUNCTION erf(x DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+DECLARE
+  t DOUBLE PRECISION := 1.0 / ( 1.0 + 0.5 * ABS(x));
+  tau DOUBLE PRECISION;
+BEGIN
+  IF ABS(x) >= 6.0 THEN
+    -- avoid underflow error
+    tau := 0.0;
+  ELSE
+    -- use approximation
+    tau := t * exp(-x*x - 1.26551223
+         + t * (1.00002368
+         + t * (0.37409196
+         + t * (0.09678418
+         + t * (-0.18628806
+         + t * (0.27886807
+         + t * (-1.13520398
+         + t * (1.48851587
+         + t * (-0.82215223
+         + t *  0.17087277)))))))));
+  END IF;
+  IF x >= 0 THEN
+    RETURN 1.0 - tau;
+  ELSE
+    RETURN tau - 1.0;
+  END IF;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE OR REPLACE FUNCTION PHI(DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+  SELECT 0.5 * ( 1.0 + erf( $1 / SQRT(2.0) ) );
+$$ LANGUAGE SQL;
+
+CREATE OR REPLACE FUNCTION
+gaussianProba(i INTEGER, mini INTEGER, maxi INTEGER, threshold DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+DECLARE
+  extent DOUBLE PRECISION;
+  mu DOUBLE PRECISION;
+BEGIN
+  extent := maxi - mini + 1.0;
+  mu := 0.5 * (maxi + mini);
+  RETURN (PHI(2.0 * threshold * (i - mini - mu + 0.5) / extent) -
+          PHI(2.0 * threshold * (i - mini - mu - 0.5) / extent))
+         -- truncated gaussian
+	 / ( 2.0 * PHI(threshold) - 1.0 );
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT SUM(cnt) FROM pgbench_dist;
+SELECT id, 1.0*cnt/SUM(cnt) OVER(), gaussianProba(id, 0, 99, 2.0)
+FROM pgbench_dist
+ORDER BY id;
diff --git a/contrib/pgbench/test-gauss-run.sql b/contrib/pgbench/test-gauss-run.sql
new file mode 100644
index 0000000..984a3b4
--- /dev/null
+++ b/contrib/pgbench/test-gauss-run.sql
@@ -0,0 +1,2 @@
+\setrandom id 0 99 gaussian 2.0
+UPDATE pgbench_dist SET cnt=cnt+1 WHERE id = :id;
diff --git a/contrib/pgbench/test-init.sql b/contrib/pgbench/test-init.sql
new file mode 100644
index 0000000..84f7cc9
--- /dev/null
+++ b/contrib/pgbench/test-init.sql
@@ -0,0 +1,4 @@
+DROP TABLE IF EXISTS pgbench_dist;
+CREATE UNLOGGED TABLE pgbench_dist(id SERIAL PRIMARY KEY, cnt INTEGER NOT NULL DEFAULT 0);
+INSERT INTO pgbench_dist(id, cnt) 
+  SELECT i, 0 FROM generate_series(0, 99) AS i;
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index f264c24..d6c49d4 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -748,8 +748,8 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
 
    <varlistentry>
     <term>
-     <literal>\setrandom <replaceable>varname</> <replaceable>min</> <replaceable>max</></literal>
-    </term>
+     <literal>\setrandom <replaceable>varname</> <replaceable>min</> <replaceable>max</> [ uniform | [ { gaussian | exponential } <replaceable>threshold</> ] ]</literal>
+     </term>
 
     <listitem>
      <para>
@@ -761,9 +761,75 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </para>
 
      <para>
+      The default random distribution is <literal>uniform</>, that is all
+      values in the range are drawn with equal probability.
+      The <literal>gaussian</> and <literal>exponential</> options change
+      this default, with a mandatory <replaceable>threshold</> double
+      value to control the actual distribution.
+     </para>
+
+     <para>
+      <!-- introduction -->
+      With the <literal>gaussian</> option, the interval is mapped onto a
+      standard <ulink url="http://en.wikipedia.org/wiki/Normal_distribution">normal distribution</ulink>
+      (the classical bell-shaped gaussian curve) truncated at
+      <literal>-threshold</> on the left and <literal>+threshold</>
+      on the right.
+      <!-- formula -->
+      To be precise, if <literal>PHI(x)</> is the cumulative distribution
+      function of the standard normal distribution, with mean <literal>mu</>
+      defined as <literal>(max + min) / 2.0</>, then value <replaceable>i</>
+      between <replaceable>min</> and <replaceable>max</> inclusive is drawn
+      with probability:
+      <literal>
+        (PHI(2.0 * threshold * (i - min - mu + 0.5) / (max - min + 1)) -
+         PHI(2.0 * threshold * (i - min - mu - 0.5) / (max - min + 1))) /
+         (2.0 * PHI(threshold) - 1.0)
+      </>
+      <!-- intuition -->
+      The larger the <replaceable>threshold</>, the more frequently values
+      close to the middle of the interval are drawn, and the less frequently
+      values close to the <replaceable>min</> and <replaceable>max</> bounds.
+      <!-- rule of thumb -->
+      With a gaussian distribution, about 67% of values are drawn from
+      the middle  <literal>1.0 / threshold</> and 95% in the middle
+      <literal>2.0 / threshold</>.
+      <!-- constraint -->
+      The minimum <replaceable>threshold</> is 2.0 for performance of
+      the Box-Muller transform.
+     </para>
+
+     <para>
+      <!-- introduction -->
+      With the <literal>exponential</> option, the <replaceable>threshold</>
+      parameter controls the distribution by truncating a quickly-decreasing
+      <ulink url="http://en.wikipedia.org/wiki/Exponential_distribution">exponential distribution</ulink>
+      at <replaceable>threshold</>, and then projecting onto integers between
+      the bounds.
+      <!-- formula -->
+      To be precise, value <replaceable>i</> between <replaceable>min</> and
+      <replaceable>max</> inclusive is drawn with probability:
+      <literal>(exp(-threshold*(i-min)/(max+1-min)) -
+       exp(-threshold*(i+1-min)/(max+1-min))) / (1.0 - exp(-threshold))</>.
+      <!-- intuition -->
+      Intuitively, the larger the <replaceable>threshold</>, the more
+      frequently values close to <replaceable>min</> are accessed, and the
+      less frequently values close to <replaceable>max</> are accessed.
+      The closer to 0 the threshold, the flatter (more uniform) the access
+      distribution.
+      <!-- rule of thumb -->
+      A crude approximation of the distribution is that the most frequent 1%
+      values in the range, close to <replaceable>min</>, are drawn
+      <replaceable>threshold</>%  of the time.
+      <!-- constraint -->
+      The <replaceable>threshold</> value must be strictly positive with the
+      <literal>exponential</> option.
+     </para>
+
+     <para>
       Example:
 <programlisting>
-\setrandom aid 1 :naccounts
+\setrandom aid 1 :naccounts gaussian 5.0
 </programlisting></para>
     </listitem>
    </varlistentry>

gauss_B_2.patchtext/x-diff; name=gauss_B_2.patchDownload

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index a80c0a5..6622d5b 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -174,6 +174,11 @@ bool		is_connect;			/* establish connection for each transaction */
 bool		is_latencies;		/* report per-command latencies */
 int			main_pid;			/* main process id used in log filename */
 
+/* gaussian/exponential distribution tests */
+double		threshold;          /* threshold for gaussian or exponential */
+bool        use_gaussian = false;
+bool		use_exponential = false;
+
 char	   *pghost = "";
 char	   *pgport = "";
 char	   *login = NULL;
@@ -295,11 +300,11 @@ static int	num_commands = 0;	/* total number of Command structs */
 static int	debug = 0;			/* debug flag */
 
 /* default scenario */
-static char *tpc_b = {
+static char *tpc_b_fmt = {
 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"\\setrandom bid 1 :nbranches\n"
 	"\\setrandom tid 1 :ntellers\n"
 	"\\setrandom delta -5000 5000\n"
@@ -313,11 +318,11 @@ static char *tpc_b = {
 };
 
 /* -N case */
-static char *simple_update = {
+static char *simple_update_fmt = {
 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"\\setrandom bid 1 :nbranches\n"
 	"\\setrandom tid 1 :ntellers\n"
 	"\\setrandom delta -5000 5000\n"
@@ -329,9 +334,9 @@ static char *simple_update = {
 };
 
 /* -S case */
-static char *select_only = {
+static char *select_only_fmt = {
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
 };
 
@@ -378,6 +383,8 @@ usage(void)
 		   "  -v, --vacuum-all         vacuum all four standard tables before tests\n"
 		   "  --aggregate-interval=NUM aggregate data over NUM seconds\n"
 		   "  --sampling-rate=NUM      fraction of transactions to log (e.g. 0.01 for 1%%)\n"
+		   "  --exponential=NUM        exponential distribution with NUM threshold parameter\n"
+		   "  --gaussian=NUM           gaussian distribution with NUM threshold parameter\n"
 		   "\nCommon options:\n"
 		   "  -d, --debug              print debugging output\n"
 	  "  -h, --host=HOSTNAME      database server host or socket directory\n"
@@ -477,36 +484,36 @@ getrand(TState *thread, int64 min, int64 max)
 /*
  * random number generator: exponential distribution from min to max inclusive.
  * the threshold is so that the density of probability for the last cut-off max
- * value is exp(-exp_threshold).
+ * value is exp(-threshold).
  */
 static int64
-getExponentialrand(TState *thread, int64 min, int64 max, double exp_threshold)
+getExponentialrand(TState *thread, int64 min, int64 max, double threshold)
 {
 	double cut, uniform, rand;
-	assert(exp_threshold > 0.0);
-	cut = exp(-exp_threshold);
+	assert(threshold > 0.0);
+	cut = exp(-threshold);
 	/* erand in [0, 1), uniform in (0, 1] */
 	uniform = 1.0 - pg_erand48(thread->random_state);
 	/*
-	 * inner expresion in (cut, 1] (if exp_threshold > 0),
+	 * inner expresion in (cut, 1] (if threshold > 0),
 	 * rand in [0, 1)
 	 */
 	assert((1.0 - cut) != 0.0);
-	rand = - log(cut + (1.0 - cut) * uniform) / exp_threshold;
+	rand = - log(cut + (1.0 - cut) * uniform) / threshold;
 	/* return int64 random number within between min and max */
 	return min + (int64)((max - min + 1) * rand);
 }
 
 /* random number generator: gaussian distribution from min to max inclusive */
 static int64
-getGaussianrand(TState *thread, int64 min, int64 max, double stdev_threshold)
+getGaussianrand(TState *thread, int64 min, int64 max, double threshold)
 {
 	double		stdev;
 	double		rand;
 
 	/*
 	 * Get user specified random number from this loop, with
-	 * -stdev_threshold < stdev <= stdev_threshold
+	 * -threshold < stdev <= threshold
 	 *
 	 * This loop is executed until the number is in the expected range.
 	 *
@@ -535,10 +542,10 @@ getGaussianrand(TState *thread, int64 min, int64 max, double stdev_threshold)
 		 * value fails the test? To be on the safe side, let us try over.
 		 */
 	}
-	while (stdev < -stdev_threshold || stdev >= stdev_threshold);
+	while (stdev < -threshold || stdev >= threshold);
 
 	/* stdev is in [-threshold, threshold), normalization to [0,1) */
-	rand = (stdev + stdev_threshold) / (stdev_threshold * 2.0);
+	rand = (stdev + threshold) / (threshold * 2.0);
 
 	/* return int64 random number within between min and max */
 	return min + (int64)((max - min + 1) * rand);
@@ -2330,6 +2337,18 @@ process_builtin(char *tb)
 	return my_commands;
 }
 
+/*
+ * compute the probability of the truncated exponential random generation
+ * to draw values in the i-th slot of the range.
+ */
+static double exponentialProbability(int i, int slots, double threshold)
+{
+	assert(1 <= i && i <= slots);
+	return (exp(- threshold * (i - 1) / slots) - exp(- threshold * i / slots)) /
+		(1.0 - exp(- threshold));
+}
+
+
 /* print out results */
 static void
 printResults(int ttype, int64 normal_xacts, int nclients,
@@ -2341,7 +2360,7 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 	double		time_include,
 				tps_include,
 				tps_exclude;
-	char	   *s;
+	char	   *s, *d;
 
 	time_include = INSTR_TIME_GET_DOUBLE(total_time);
 	tps_include = normal_xacts / time_include;
@@ -2357,8 +2376,45 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 	else
 		s = "Custom query";
 
-	printf("transaction type: %s\n", s);
+	if (use_gaussian)
+		d = "Gaussian distribution ";
+	else if (use_exponential)
+		d = "Exponential distribution ";
+	else
+		d = ""; /* default uniform case */
+
+	printf("transaction type: %s%s\n", d, s);
 	printf("scaling factor: %d\n", scale);
+
+	/* output in gaussian distribution benchmark */
+	if (use_gaussian)
+	{
+		int i;
+		printf("pgbench_account's aid selected with a truncated gaussian distribution\n");
+		printf("standard deviation threshold: %.5f\n", threshold);
+		printf("decile percents:");
+		for (i = 2; i <= 20; i = i + 2)
+			printf(" %.1f%%", (double) 50 * (erf (threshold * (1 - 0.1 * (i - 2)) / sqrt(2.0)) -
+				erf (threshold * (1 - 0.1 * i) / sqrt(2.0))) /
+				erf (threshold / sqrt(2.0)));
+		printf("\n");
+	}
+	/* output in exponential distribution benchmark */
+	else if (use_exponential)
+	{
+		int i;
+		printf("pgbench_account's aid selected with a truncated exponential distribution\n");
+		printf("exponential threshold: %.5f\n", threshold);
+		printf("decile percents:");
+		for (i = 1; i <= 10; i++)
+			printf(" %.1f%%",
+				   100.0 * exponentialProbability(i, 10, threshold));
+		printf("\n");
+		printf("probability of first/last percent of the range: %.1f%% %.1f%%\n",
+			   100.0 * exponentialProbability(1, 100, threshold),
+			   100.0 * exponentialProbability(100, 100, threshold));
+	}
+
 	printf("query mode: %s\n", QUERYMODE[querymode]);
 	printf("number of clients: %d\n", nclients);
 	printf("number of threads: %d\n", nthreads);
@@ -2489,6 +2545,8 @@ main(int argc, char **argv)
 		{"unlogged-tables", no_argument, &unlogged_tables, 1},
 		{"sampling-rate", required_argument, NULL, 4},
 		{"aggregate-interval", required_argument, NULL, 5},
+		{"gaussian", required_argument, NULL, 6},
+		{"exponential", required_argument, NULL, 7},
 		{"rate", required_argument, NULL, 'R'},
 		{NULL, 0, NULL, 0}
 	};
@@ -2769,6 +2827,25 @@ main(int argc, char **argv)
 				}
 #endif
 				break;
+			case 6:
+				use_gaussian = true;
+				threshold = atof(optarg);
+				if(threshold < MIN_GAUSSIAN_THRESHOLD)
+				{
+					fprintf(stderr, "--gaussian=NUM must be more than %f: %f\n",
+							MIN_GAUSSIAN_THRESHOLD, threshold);
+					exit(1);
+				}
+				break;
+			case 7:
+				use_exponential = true;
+				threshold = atof(optarg);
+				if(threshold <= 0.0)
+				{
+					fprintf(stderr, "--exponential=NUM must be more than 0.0\n");
+					exit(1);
+				}
+				break;
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 				exit(1);
@@ -2966,6 +3043,17 @@ main(int argc, char **argv)
 		}
 	}
 
+	/* set :threshold variable */
+	if(getVariable(&state[0], "threshold") == NULL)
+	{
+		snprintf(val, sizeof(val), "%lf", threshold);
+		for (i = 0; i < nclients; i++)
+		{
+			if (!putVariable(&state[i], "startup", "threshold", val))
+				exit(1);
+		}
+	}
+
 	if (!is_no_vacuum)
 	{
 		fprintf(stderr, "starting vacuum...");
@@ -2988,25 +3076,24 @@ main(int argc, char **argv)
 	srandom((unsigned int) INSTR_TIME_GET_MICROSEC(start_time));
 
 	/* process builtin SQL scripts */
-	switch (ttype)
-	{
-		case 0:
-			sql_files[0] = process_builtin(tpc_b);
-			num_files = 1;
-			break;
-
-		case 1:
-			sql_files[0] = process_builtin(select_only);
-			num_files = 1;
-			break;
-
-		case 2:
-			sql_files[0] = process_builtin(simple_update);
-			num_files = 1;
-			break;
-
-		default:
-			break;
+	if (ttype < 3)
+	{
+		char *fmt, *distribution, *queries;
+		int ret;
+		fmt = (ttype == 0)? tpc_b_fmt:
+			  (ttype == 1)? select_only_fmt:
+			  (ttype == 2)? simple_update_fmt: NULL;
+		assert(fmt != NULL);
+		distribution =
+			use_gaussian? " gaussian :threshold":
+			use_exponential? " exponential :threshold":
+			"" /* default uniform case */ ;
+		queries = pg_malloc(strlen(fmt) + strlen(distribution) + 1);
+		ret = sprintf(queries, fmt, distribution);
+		assert(ret >= 0);
+		sql_files[0] = process_builtin(queries);
+		num_files = 1;
+		pg_free(queries);
 	}
 
 	/* set up thread data structures */
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index d6c49d4..d217f90 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -307,6 +307,49 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </varlistentry>
 
      <varlistentry>
+      <term><option>--exponential=</option><replaceable>threshold</></term>
+      <listitem>
+       <para>
+         Run exponential distribution pgbench test using this threshold parameter.
+         The threshold controls the distribution of access frequency on the
+         <structname>pgbench_accounts</> table.
+         See the <literal>\setrandom</> documentation below for details about
+         the impact of the threshold value.
+         When set, this option applies to all test variants (<option>-N</> for
+         skipping updates, or <option>-S</> for selects).
+       </para>
+
+       <para>
+         When run, the output is expanded to show the distribution
+         depending on the <replaceable>threshold</> value:
+
+<screen>
+...
+pgbench_account's aid selected with a truncated exponential distribution
+exponential threshold: 5.00000
+decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%
+probability of first/last percent of the range: 4.9% 0.0%
+...
+</screen>
+
+         The figures are to be interpreted as follows.
+         If the scaling factor is 10, there are 1,000,000 accounts in
+         <literal>pgbench_accounts</>.
+         The first decile, with <literal>aid</> from 1 to 100,000, is
+         drawn 39.6% of the time, that is about 4 times more than average.
+         The second decile, from 100,001 to 200,000 is drawn 24.0% of the time,
+         that is 2.4 times more than average.
+         This continues up to the last decile, from 900,001 to 1,000,000,
+         which is drawn 0.4% of the time, well below average.
+         Moreover, the first percent of the range, that is <literal>aid</>
+         from 1 to 10,000, is drawn 4.9% of the time, that is 4.9 times more
+         than average, and the last percent, with <literal>aid</>
+         from 990,001 to 1,000,000, is drawn less than 0.1% of the time.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-f</option> <replaceable>filename</></term>
       <term><option>--file=</option><replaceable>filename</></term>
       <listitem>
@@ -320,6 +363,44 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </varlistentry>
 
      <varlistentry>
+      <term><option>--gaussian=</option><replaceable>threshold</></term>
+      <listitem>
+       <para>
+         Run gaussian distribution pgbench test using this threshold parameter.
+         The threshold controls the distribution of access frequency on the
+         <structname>pgbench_accounts</> table.
+         See the <literal>\setrandom</> documentation below for details about
+         the impact of the threshold value.
+         When set, this option applies to all test variants (<option>-N</> for
+         skipping updates, or <option>-S</> for selects).
+       </para>
+
+       <para>
+         When run, the output is expanded to show the distribution
+         depending on the <replaceable>threshold</> value:
+
+<screen>
+...
+pgbench_account's aid selected with a truncated gaussian distribution
+standard deviation threshold: 5.00000
+decile percents: 0.0% 0.1% 2.1% 13.6% 34.1% 34.1% 13.6% 2.1% 0.1% 0.0%
+...
+</screen>
+
+         The figures are to be interpreted as follows.
+         If the scaling factor is 10, there are 1,000,000 accounts in
+         <literal>pgbench_accounts</>.
+         The first decile, with <literal>aid</> from 1 to 100,000, is
+         drawn less than 0.1% of the time.
+         The second, from 100,001 to 200,000, is drawn about 0.1% of the time,
+         and so on up to the fifth decile, from 400,001 to 500,000, which
+         is drawn 34.1% of the time, about 3.4 times more than average;
+         the gaussian curve is then symmetric for the last five deciles.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-j</option> <replaceable>threads</></term>
       <term><option>--jobs=</option><replaceable>threads</></term>
       <listitem>

#20

Fabien COELHO

coelho@cri.ensmp.fr

over 11 years ago

In reply to: Fabien COELHO (#19)

2 attachment(s)

Please find attached 2 patches, which are a split of the patch discussed in
this thread.

Please find attached a very minor improvement to apply a code (variable
name) simplification directly in patch A so as to avoid a change in patch
B. The cumulated patch is the same as previous.

(A) add gaussian & exponential options to pgbench \setrandom
the patch includes sql test files.

There is no change in the *code* from previous already reviewed submissions,
so I do not think that it needs another review on that account.

However, I have (yet again) reworked the *documentation* (for Andres Freund &
Robert Haas); in particular, both descriptions now follow the same structure
(introduction, formula, intuition, rule of thumb and constraint). I have
differentiated the concept and the option by putting the latter in <literal>
tags, and added links to the corresponding Wikipedia pages.

Please bear in mind that:
1. English is not my native language.
2. this is not easy reading... this is maths, to read slowly:-)
3. wordsmithing contributions are welcome.

I assume somehow that a user interested in gaussian & exponential
distributions must know a little bit about probabilities...

(B) add pgbench test variants with gauss & exponential.

I have reworked the patch so as to avoid copy-pasting the 3 test cases, as
requested by Andres Freund, thus this is new, although quite simple, code. I
have also added explanations in the documentation about how to interpret the
"decile" outputs, so as to hopefully address Robert Haas's comments.

--
Fabien.

Attachments:

gauss_A_3.patch (text/x-diff)

diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 0000000..6881256
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as "expo" or "gauss"
+psql test < test-init.sql
+./pgbench -M prepared -f test-XXX-run.sql -t 1000000 -P 1 -n test
+psql test < test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..379ef24 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -41,6 +41,7 @@
 #include <math.h>
 #include <signal.h>
 #include <sys/time.h>
+#include <assert.h>
 #ifdef HAVE_SYS_SELECT_H
 #include <sys/select.h>
 #endif
@@ -98,6 +99,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -471,6 +474,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-threshold).
+ */
+static int64
+getExponentialrand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double cut, uniform, rand;
+	assert(threshold > 0.0);
+	cut = exp(-threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread->random_state);
+	/*
+	 * inner expression in (cut, 1] (if threshold > 0),
+	 * rand in [0, 1)
+	 */
+	assert((1.0 - cut) != 0.0);
+	rand = - log(cut + (1.0 - cut) * uniform) / threshold;
+	/* return int64 random number between min and max inclusive */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianrand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -threshold <= stdev < threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) <= 2 => r >= e^{-2} ~ 0.135, then when taking the average
+	 * sine multiplier as 2/pi, we have a 8.6% looping probability in the
+	 * worst case. For a 5.0 threshold value, the looping probability
+	 * is about e^{-5} * 2 / pi ~ 0.43%.
+	 */
+	do
+	{
+		/*
+		 * pg_erand48 generates [0,1), but for the basic version of the
+		 * Box-Muller transform the two uniformly distributed random numbers
+		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+		 */
+		double rand1 = 1.0 - pg_erand48(thread->random_state);
+		double rand2 = 1.0 - pg_erand48(thread->random_state);
+
+		/* Box-Muller basic form transform */
+		double var_sqrt = sqrt(-2.0 * log(rand1));
+		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+		/*
+ 		 * we may try with cos, but there may be a bias induced if the previous
+		 * value fails the test? To be on the safe side, let us try over.
+		 */
+	}
+	while (stdev < -threshold || stdev >= threshold);
+
+	/* stdev is in [-threshold, threshold), normalization to [0,1) */
+	rand = (stdev + threshold) / (threshold * 2.0);
+
+	/* return int64 random number between min and max inclusive */
+	return min + (int64)((max - min + 1) * rand);
+}
+
 /* call PQexec() and exit() on failure */
 static void
 executeStatement(PGconn *con, const char *sql)
@@ -1319,6 +1392,7 @@ top:
 			char	   *var;
 			int64		min,
 						max;
+			double		threshold = 0;
 			char		res[64];
 
 			if (*argv[2] == ':')
@@ -1364,11 +1438,11 @@ top:
 			}
 
 			/*
-			 * getrand() needs to be able to subtract max from min and add one
-			 * to the result without overflowing.  Since we know max > min, we
-			 * can detect overflow just by checking for a negative result. But
-			 * we must check both that the subtraction doesn't overflow, and
-			 * that adding one to the result doesn't overflow either.
+			 * The random number generation functions need to be able to subtract
+			 * min from max and add one to the result without overflowing.
+			 * Since we know max > min, we can detect overflow just by checking
+			 * for a negative result. But we must check both that the subtraction
+			 * doesn't overflow, and that adding one to the result doesn't overflow either.
 			 */
 			if (max - min < 0 || (max - min) + 1 < 0)
 			{
@@ -1377,10 +1451,63 @@ top:
 				return true;
 			}
 
+			if (argc == 4) /* uniform */
+			{
 #ifdef DEBUG
-			printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
+				printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
 #endif
-			snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+				snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+			}
+			else if ((pg_strcasecmp(argv[4], "gaussian") == 0) ||
+				 (pg_strcasecmp(argv[4], "exponential") == 0))
+			{
+				if (*argv[5] == ':')
+				{
+					if ((var = getVariable(st, argv[5] + 1)) == NULL)
+					{
+						fprintf(stderr, "%s: invalid threshold number %s\n", argv[0], argv[5]);
+						st->ecnt++;
+						return true;
+					}
+					threshold = strtod(var, NULL);
+				}
+				else
+					threshold = strtod(argv[5], NULL);
+
+				if (pg_strcasecmp(argv[4], "gaussian") == 0)
+				{
+					if (threshold < MIN_GAUSSIAN_THRESHOLD)
+					{
+						fprintf(stderr, "%s: gaussian threshold must be at least %f (not \"%s\")\n", argv[0], MIN_GAUSSIAN_THRESHOLD, argv[5]);
+						st->ecnt++;
+						return true;
+					}
+#ifdef DEBUG
+					printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getGaussianrand(thread, min, max, threshold));
+#endif
+					snprintf(res, sizeof(res), INT64_FORMAT, getGaussianrand(thread, min, max, threshold));
+				}
+				else if (pg_strcasecmp(argv[4], "exponential") == 0)
+				{
+					if (threshold <= 0.0)
+					{
+						fprintf(stderr, "%s: exponential threshold must be strictly positive (not \"%s\")\n", argv[0], argv[5]);
+						st->ecnt++;
+						return true;
+					}
+#ifdef DEBUG
+					printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getExponentialrand(thread, min, max, threshold));
+#endif
+					snprintf(res, sizeof(res), INT64_FORMAT, getExponentialrand(thread, min, max, threshold));
+				}
+			}
+			else /* uniform with extra arguments */
+			{
+#ifdef DEBUG
+				printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
+#endif
+				snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+			}
 
 			if (!putVariable(st, argv[0], argv[1], res))
 			{
@@ -1920,9 +2047,34 @@ process_commands(char *buf)
 				exit(1);
 			}
 
-			for (j = 4; j < my_commands->argc; j++)
-				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
-						my_commands->argv[0], my_commands->argv[j]);
+			if (my_commands->argc == 4 ) /* uniform */
+			{
+				/* nothing to do */
+			}
+			else if ((pg_strcasecmp(my_commands->argv[4], "gaussian") == 0) ||
+				 (pg_strcasecmp(my_commands->argv[4], "exponential") == 0))
+			{
+				if (my_commands->argc < 6)
+				{
+					fprintf(stderr, "%s(%s): missing argument\n", my_commands->argv[0], my_commands->argv[4]);
+					exit(1);
+				}
+
+				for (j = 6; j < my_commands->argc; j++)
+					fprintf(stderr, "%s(%s): extra argument \"%s\" ignored\n",
+							my_commands->argv[0], my_commands->argv[4], my_commands->argv[j]);
+			}
+			else /* uniform with extra argument */
+			{
+				int arg_pos = 4;
+
+				if (pg_strcasecmp(my_commands->argv[4], "uniform") == 0)
+					arg_pos++;
+
+				for (j = arg_pos; j < my_commands->argc; j++)
+					fprintf(stderr, "%s(uniform): extra argument \"%s\" ignored\n",
+							my_commands->argv[0], my_commands->argv[j]);
+			}
 		}
 		else if (pg_strcasecmp(my_commands->argv[0], "set") == 0)
 		{
diff --git a/contrib/pgbench/test-expo-check.sql b/contrib/pgbench/test-expo-check.sql
new file mode 100644
index 0000000..fbf35fd
--- /dev/null
+++ b/contrib/pgbench/test-expo-check.sql
@@ -0,0 +1,14 @@
+-- val, min, max, threshold
+CREATE OR REPLACE FUNCTION
+expoProba(INTEGER, INTEGER, INTEGER, DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+  SELECT (exp(-$4*($1-$2)/($3-$2+1)) - exp(-$4*($1-$2+1)/($3-$2+1))) /
+         (1.0 - exp(-$4));
+$$ LANGUAGE SQL;
+
+SELECT SUM(cnt) FROM pgbench_dist;
+
+SELECT id, 1.0*cnt/SUM(cnt) OVER(), expoProba(id, 0, 99, 10.0)
+FROM pgbench_dist
+ORDER BY id;
+
diff --git a/contrib/pgbench/test-expo-run.sql b/contrib/pgbench/test-expo-run.sql
new file mode 100644
index 0000000..1d476bc
--- /dev/null
+++ b/contrib/pgbench/test-expo-run.sql
@@ -0,0 +1,2 @@
+\setrandom id 0 99 exponential 10.0
+UPDATE pgbench_dist SET cnt=cnt+1 WHERE id = :id;
diff --git a/contrib/pgbench/test-gauss-check.sql b/contrib/pgbench/test-gauss-check.sql
new file mode 100644
index 0000000..7d56117
--- /dev/null
+++ b/contrib/pgbench/test-gauss-check.sql
@@ -0,0 +1,57 @@
+-- approximation with maximal error of 1.2E-07, according to
+-- https://en.wikipedia.org/wiki/Error_function#Numerical_approximation
+CREATE OR REPLACE FUNCTION erf(x DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+DECLARE
+  t DOUBLE PRECISION := 1.0 / ( 1.0 + 0.5 * ABS(x));
+  tau DOUBLE PRECISION;
+BEGIN
+  IF ABS(x) >= 6.0 THEN
+    -- avoid underflow error
+    tau := 0.0;
+  ELSE
+    -- use approximation
+    tau := t * exp(-x*x - 1.26551223
+         + t * (1.00002368
+         + t * (0.37409196
+         + t * (0.09678418
+         + t * (-0.18628806
+         + t * (0.27886807
+         + t * (-1.13520398
+         + t * (1.48851587
+         + t * (-0.82215223
+         + t *  0.17087277)))))))));
+  END IF;
+  IF x >= 0 THEN
+    RETURN 1.0 - tau;
+  ELSE
+    RETURN tau - 1.0;
+  END IF;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE OR REPLACE FUNCTION PHI(DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+  SELECT 0.5 * ( 1.0 + erf( $1 / SQRT(2.0) ) );
+$$ LANGUAGE SQL;
+
+CREATE OR REPLACE FUNCTION
+gaussianProba(i INTEGER, mini INTEGER, maxi INTEGER, threshold DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+DECLARE
+  extent DOUBLE PRECISION;
+  mu DOUBLE PRECISION;
+BEGIN
+  extent := maxi - mini + 1.0;
+  mu := 0.5 * (maxi + mini);
+  RETURN (PHI(2.0 * threshold * (i - mu + 0.5) / extent) -
+          PHI(2.0 * threshold * (i - mu - 0.5) / extent))
+         -- truncated gaussian
+	 / ( 2.0 * PHI(threshold) - 1.0 );
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT SUM(cnt) FROM pgbench_dist;
+SELECT id, 1.0*cnt/SUM(cnt) OVER(), gaussianProba(id, 0, 99, 2.0)
+FROM pgbench_dist
+ORDER BY id;
diff --git a/contrib/pgbench/test-gauss-run.sql b/contrib/pgbench/test-gauss-run.sql
new file mode 100644
index 0000000..984a3b4
--- /dev/null
+++ b/contrib/pgbench/test-gauss-run.sql
@@ -0,0 +1,2 @@
+\setrandom id 0 99 gaussian 2.0
+UPDATE pgbench_dist SET cnt=cnt+1 WHERE id = :id;
diff --git a/contrib/pgbench/test-init.sql b/contrib/pgbench/test-init.sql
new file mode 100644
index 0000000..84f7cc9
--- /dev/null
+++ b/contrib/pgbench/test-init.sql
@@ -0,0 +1,4 @@
+DROP TABLE IF EXISTS pgbench_dist;
+CREATE UNLOGGED TABLE pgbench_dist(id SERIAL PRIMARY KEY, cnt INTEGER NOT NULL DEFAULT 0);
+INSERT INTO pgbench_dist(id, cnt) 
+  SELECT i, 0 FROM generate_series(0, 99) AS i;
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index f264c24..d6c49d4 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -748,8 +748,8 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
 
    <varlistentry>
     <term>
-     <literal>\setrandom <replaceable>varname</> <replaceable>min</> <replaceable>max</></literal>
-    </term>
+     <literal>\setrandom <replaceable>varname</> <replaceable>min</> <replaceable>max</> [ uniform | [ { gaussian | exponential } <replaceable>threshold</> ] ]</literal>
+     </term>
 
     <listitem>
      <para>
@@ -761,9 +761,75 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </para>
 
      <para>
+      The default random distribution is <literal>uniform</>, that is all
+      values in the range are drawn with equal probability.
+      The <literal>gaussian</> and <literal>exponential</> options
+      change this default; each requires a <replaceable>threshold</> double
+      value that controls the actual distribution.
+     </para>
+
+     <para>
+      <!-- introduction -->
+      With the <literal>gaussian</> option, the interval is mapped onto a
+      standard <ulink url="http://en.wikipedia.org/wiki/Normal_distribution">normal distribution</ulink>
+      (the classical bell-shaped gaussian curve) truncated at
+      <literal>-threshold</> on the left and <literal>+threshold</>
+      on the right.
+      <!-- formula -->
+      To be precise, if <literal>PHI(x)</> is the cumulative distribution
+      function of the standard normal distribution, with mean <literal>mu</>
+      defined as <literal>(max + min) / 2.0</>, then value <replaceable>i</>
+      between <replaceable>min</> and <replaceable>max</> inclusive is drawn
+      with probability:
+      <literal>
+        (PHI(2.0 * threshold * (i - mu + 0.5) / (max - min + 1)) -
+         PHI(2.0 * threshold * (i - mu - 0.5) / (max - min + 1))) /
+         (2.0 * PHI(threshold) - 1.0)
+      </>
+      <!-- intuition -->
+      The larger the <replaceable>threshold</>, the more frequently values
+      close to the middle of the interval are drawn, and the less frequently
+      values close to the <replaceable>min</> and <replaceable>max</> bounds.
+      <!-- rule of thumb -->
+      With a gaussian distribution, about 67% of values are drawn from
+      the middle <literal>1.0 / threshold</> fraction of the interval, and
+      95% from the middle <literal>2.0 / threshold</> fraction.
+      <!-- constraint -->
+      The minimum <replaceable>threshold</> is 2.0 for performance of
+      the Box-Muller transform.
+     </para>
+
+     <para>
+      <!-- introduction -->
+      With the <literal>exponential</> option, the <replaceable>threshold</>
+      parameter controls the distribution by truncating a quickly-decreasing
+      <ulink url="http://en.wikipedia.org/wiki/Exponential_distribution">exponential distribution</ulink>
+      at <replaceable>threshold</>, and then projecting onto integers between
+      the bounds.
+      <!-- formula -->
+      To be precise, value <replaceable>i</> between <replaceable>min</> and
+      <replaceable>max</> inclusive is drawn with probability:
+      <literal>(exp(-threshold*(i-min)/(max+1-min)) -
+       exp(-threshold*(i+1-min)/(max+1-min))) / (1.0 - exp(-threshold))</>.
+      <!-- intuition -->
+      Intuitively, the larger the <replaceable>threshold</>, the more
+      frequently values close to <replaceable>min</> are accessed, and the
+      less frequently values close to <replaceable>max</> are accessed.
+      The closer to 0 the threshold, the flatter (more uniform) the access
+      distribution.
+      <!-- rule of thumb -->
+      A crude approximation of the distribution is that the most frequent 1%
+      of values in the range, close to <replaceable>min</>, are drawn
+      <replaceable>threshold</>% of the time.
+      <!-- constraint -->
+      The <replaceable>threshold</> value must be strictly positive with the
+      <literal>exponential</> option.
+     </para>
+
+     <para>
       Example:
 <programlisting>
-\setrandom aid 1 :naccounts
+\setrandom aid 1 :naccounts gaussian 5.0
 </programlisting></para>
     </listitem>
    </varlistentry>

gauss_B_3.patch (text/x-diff)

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 379ef24..6622d5b 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -174,6 +174,11 @@ bool		is_connect;			/* establish connection for each transaction */
 bool		is_latencies;		/* report per-command latencies */
 int			main_pid;			/* main process id used in log filename */
 
+/* gaussian/exponential distribution tests */
+double		threshold;          /* threshold for gaussian or exponential */
+bool        use_gaussian = false;
+bool		use_exponential = false;
+
 char	   *pghost = "";
 char	   *pgport = "";
 char	   *login = NULL;
@@ -295,11 +300,11 @@ static int	num_commands = 0;	/* total number of Command structs */
 static int	debug = 0;			/* debug flag */
 
 /* default scenario */
-static char *tpc_b = {
+static char *tpc_b_fmt = {
 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"\\setrandom bid 1 :nbranches\n"
 	"\\setrandom tid 1 :ntellers\n"
 	"\\setrandom delta -5000 5000\n"
@@ -313,11 +318,11 @@ static char *tpc_b = {
 };
 
 /* -N case */
-static char *simple_update = {
+static char *simple_update_fmt = {
 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"\\setrandom bid 1 :nbranches\n"
 	"\\setrandom tid 1 :ntellers\n"
 	"\\setrandom delta -5000 5000\n"
@@ -329,9 +334,9 @@ static char *simple_update = {
 };
 
 /* -S case */
-static char *select_only = {
+static char *select_only_fmt = {
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
 };
 
@@ -378,6 +383,8 @@ usage(void)
 		   "  -v, --vacuum-all         vacuum all four standard tables before tests\n"
 		   "  --aggregate-interval=NUM aggregate data over NUM seconds\n"
 		   "  --sampling-rate=NUM      fraction of transactions to log (e.g. 0.01 for 1%%)\n"
+		   "  --exponential=NUM        exponential distribution with NUM threshold parameter\n"
+		   "  --gaussian=NUM           gaussian distribution with NUM threshold parameter\n"
 		   "\nCommon options:\n"
 		   "  -d, --debug              print debugging output\n"
 	  "  -h, --host=HOSTNAME      database server host or socket directory\n"
@@ -2330,6 +2337,18 @@ process_builtin(char *tb)
 	return my_commands;
 }
 
+/*
+ * compute the probability of the truncated exponential random generation
+ * to draw values in the i-th slot of the range.
+ */
+static double exponentialProbability(int i, int slots, double threshold)
+{
+	assert(1 <= i && i <= slots);
+	return (exp(- threshold * (i - 1) / slots) - exp(- threshold * i / slots)) /
+		(1.0 - exp(- threshold));
+}
+
+
 /* print out results */
 static void
 printResults(int ttype, int64 normal_xacts, int nclients,
@@ -2341,7 +2360,7 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 	double		time_include,
 				tps_include,
 				tps_exclude;
-	char	   *s;
+	char	   *s, *d;
 
 	time_include = INSTR_TIME_GET_DOUBLE(total_time);
 	tps_include = normal_xacts / time_include;
@@ -2357,8 +2376,45 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 	else
 		s = "Custom query";
 
-	printf("transaction type: %s\n", s);
+	if (use_gaussian)
+		d = "Gaussian distribution ";
+	else if (use_exponential)
+		d = "Exponential distribution ";
+	else
+		d = ""; /* default uniform case */
+
+	printf("transaction type: %s%s\n", d, s);
 	printf("scaling factor: %d\n", scale);
+
+	/* output in gaussian distribution benchmark */
+	if (use_gaussian)
+	{
+		int i;
+		printf("pgbench_account's aid selected with a truncated gaussian distribution\n");
+		printf("standard deviation threshold: %.5f\n", threshold);
+		printf("decile percents:");
+		for (i = 2; i <= 20; i = i + 2)
+			printf(" %.1f%%", (double) 50 * (erf (threshold * (1 - 0.1 * (i - 2)) / sqrt(2.0)) -
+				erf (threshold * (1 - 0.1 * i) / sqrt(2.0))) /
+				erf (threshold / sqrt(2.0)));
+		printf("\n");
+	}
+	/* output in exponential distribution benchmark */
+	else if (use_exponential)
+	{
+		int i;
+		printf("pgbench_account's aid selected with a truncated exponential distribution\n");
+		printf("exponential threshold: %.5f\n", threshold);
+		printf("decile percents:");
+		for (i = 1; i <= 10; i++)
+			printf(" %.1f%%",
+				   100.0 * exponentialProbability(i, 10, threshold));
+		printf("\n");
+		printf("probability of first/last percent of the range: %.1f%% %.1f%%\n",
+			   100.0 * exponentialProbability(1, 100, threshold),
+			   100.0 * exponentialProbability(100, 100, threshold));
+	}
+
 	printf("query mode: %s\n", QUERYMODE[querymode]);
 	printf("number of clients: %d\n", nclients);
 	printf("number of threads: %d\n", nthreads);
@@ -2489,6 +2545,8 @@ main(int argc, char **argv)
 		{"unlogged-tables", no_argument, &unlogged_tables, 1},
 		{"sampling-rate", required_argument, NULL, 4},
 		{"aggregate-interval", required_argument, NULL, 5},
+		{"gaussian", required_argument, NULL, 6},
+		{"exponential", required_argument, NULL, 7},
 		{"rate", required_argument, NULL, 'R'},
 		{NULL, 0, NULL, 0}
 	};
@@ -2769,6 +2827,25 @@ main(int argc, char **argv)
 				}
 #endif
 				break;
+			case 6:
+				use_gaussian = true;
+				threshold = atof(optarg);
+				if(threshold < MIN_GAUSSIAN_THRESHOLD)
+				{
+					fprintf(stderr, "--gaussian=NUM must be more than %f: %f\n",
+							MIN_GAUSSIAN_THRESHOLD, threshold);
+					exit(1);
+				}
+				break;
+			case 7:
+				use_exponential = true;
+				threshold = atof(optarg);
+				if(threshold <= 0.0)
+				{
+					fprintf(stderr, "--exponential=NUM must be more than 0.0\n");
+					exit(1);
+				}
+				break;
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 				exit(1);
@@ -2966,6 +3043,17 @@ main(int argc, char **argv)
 		}
 	}
 
+	/* set :threshold variable */
+	if(getVariable(&state[0], "threshold") == NULL)
+	{
+		snprintf(val, sizeof(val), "%lf", threshold);
+		for (i = 0; i < nclients; i++)
+		{
+			if (!putVariable(&state[i], "startup", "threshold", val))
+				exit(1);
+		}
+	}
+
 	if (!is_no_vacuum)
 	{
 		fprintf(stderr, "starting vacuum...");
@@ -2988,25 +3076,24 @@ main(int argc, char **argv)
 	srandom((unsigned int) INSTR_TIME_GET_MICROSEC(start_time));
 
 	/* process builtin SQL scripts */
-	switch (ttype)
-	{
-		case 0:
-			sql_files[0] = process_builtin(tpc_b);
-			num_files = 1;
-			break;
-
-		case 1:
-			sql_files[0] = process_builtin(select_only);
-			num_files = 1;
-			break;
-
-		case 2:
-			sql_files[0] = process_builtin(simple_update);
-			num_files = 1;
-			break;
-
-		default:
-			break;
+	if (ttype < 3)
+	{
+		char *fmt, *distribution, *queries;
+		int ret;
+		fmt = (ttype == 0)? tpc_b_fmt:
+			  (ttype == 1)? select_only_fmt:
+			  (ttype == 2)? simple_update_fmt: NULL;
+		assert(fmt != NULL);
+		distribution =
+			use_gaussian? " gaussian :threshold":
+			use_exponential? " exponential :threshold":
+			"" /* default uniform case */ ;
+		queries = pg_malloc(strlen(fmt) + strlen(distribution) + 1);
+		ret = sprintf(queries, fmt, distribution);
+		assert(ret >= 0);
+		sql_files[0] = process_builtin(queries);
+		num_files = 1;
+		pg_free(queries);
 	}
 
 	/* set up thread data structures */
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index d6c49d4..d217f90 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -307,6 +307,49 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </varlistentry>
 
      <varlistentry>
+      <term><option>--exponential=</option><replaceable>threshold</></term>
+      <listitem>
+       <para>
+         Run exponential distribution pgbench test using this threshold parameter.
+         The threshold controls the distribution of access frequency on the
+         <structname>pgbench_accounts</> table.
+         See the <literal>\setrandom</> documentation below for details about
+         the impact of the threshold value.
+         When set, this option applies to all test variants (<option>-N</> for
+         skipping updates, or <option>-S</> for selects).
+       </para>
+
+       <para>
+         When run, the output is expanded to show the distribution
+         depending on the <replaceable>threshold</> value:
+
+<screen>
+...
+pgbench_account's aid selected with a truncated exponential distribution
+exponential threshold: 5.00000
+decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%
+probability of first/last percent of the range: 4.9% 0.0%
+...
+</screen>
+
+         The figures are to be interpreted as follows.
+         If the scaling factor is 10, there are 1,000,000 accounts in
+         <literal>pgbench_accounts</>.
+         The first decile, with <literal>aid</> from 1 to 100,000, is
+         drawn 39.6% of the time, that is about 4 times more than average.
+         The second decile, from 100,001 to 200,000, is drawn 24.0% of the time,
+         that is 2.4 times more than average.
+         This continues up to the last decile, from 900,001 to 1,000,000, which
+         is drawn 0.4% of the time, well below average.
+         Moreover, the first percent of the range, that is <literal>aid</>
+         from 1 to 10,000, is drawn 4.9% of the time, that is 4.9 times more
+         than average, and the last percent, with <literal>aid</>
+         from 990,001 to 1,000,000, is drawn less than 0.1% of the time.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-f</option> <replaceable>filename</></term>
       <term><option>--file=</option><replaceable>filename</></term>
       <listitem>
@@ -320,6 +363,44 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </varlistentry>
 
      <varlistentry>
+      <term><option>--gaussian=</option><replaceable>threshold</></term>
+      <listitem>
+       <para>
+         Run gaussian distribution pgbench test using this threshold parameter.
+         The threshold controls the distribution of access frequency on the
+         <structname>pgbench_accounts</> table.
+         See the <literal>\setrandom</> documentation below for details about
+         the impact of the threshold value.
+         When set, this option applies to all test variants (<option>-N</> for
+         skipping updates, or <option>-S</> for selects).
+       </para>
+
+       <para>
+         When run, the output is expanded to show the distribution
+         depending on the <replaceable>threshold</> value:
+
+<screen>
+...
+pgbench_account's aid selected with a truncated gaussian distribution
+standard deviation threshold: 5.00000
+decile percents: 0.0% 0.1% 2.1% 13.6% 34.1% 34.1% 13.6% 2.1% 0.1% 0.0%
+...
+</screen>
+
+         The figures are to be interpreted as follows.
+         If the scaling factor is 10, there are 1,000,000 accounts in
+         <literal>pgbench_accounts</>.
+         The first decile, with <literal>aid</> from 1 to 100,000, is
+         drawn less than 0.1% of the time.
+         The second, from 100,001 to 200,000, is drawn about 0.1% of the time,
+         and so on up to the fifth decile, from 400,001 to 500,000, which
+         is drawn 34.1% of the time, about 3.4 times more than average;
+         the gaussian curve is then symmetric for the last five deciles.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-j</option> <replaceable>threads</></term>
       <term><option>--jobs=</option><replaceable>threads</></term>
       <listitem>

#21

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Fabien COELHO (#14)

Re: gaussian distribution pgbench -- part 1/2

On Thu, Jul 17, 2014 at 12:09 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

pgbench with gaussian & exponential, part 1 of 2.

This patch is a subset of the previous patch which only adds the two
new \setrandom gaussian and exponential variants, but not the
adapted pgbench test cases, as suggested by Fujii Masao.
There is no new code nor code changes.

The corresponding documentation has been yet again extended with respect
to the initial patch, so that what is achieved is hopefully unambiguous
(there are two mathematical formulas, tasty!), in answer to Andres Freund's
comments, and partly to Robert Haas's comments as well.

This patch also provides several sql/pgbench scripts and a README, so
that the feature can be tested. I do not know whether these scripts
should make it to postgresql. I would say yes, otherwise there is no way
to test...

Part 2, which provides adapted pgbench test cases, will come later.

Some review comments:

1. I suggest that getExponentialrand and getGaussianrand be renamed to
getExponentialRand and getGaussianRand.

2. I suggest that the code be changed so that the branch currently
labeled as /* uniform with extra argument */ become a hard error
instead of a warning.

3. Similarly, I suggest that the use of gaussian or uniform be an
error when argc < 6 OR argc > 6. I also suggest that the
parenthesized distribution type be dropped from the error message in
all cases.

4. This question mark seems like it should be a period:

+ * value fails the test? To be on the safe side, let us try over.

5. With regards to the following paragraph:

      <para>
+      The default random distribution is uniform, that is all values in the
+      range are drawn with equal probability. The gaussian and exponential
+      options allow to change this default. The mandatory
+      <replaceable>threshold</> double value controls the actual distribution
+      with gaussian or exponential.
+     </para>

This paragraph needs a bit of copy-editing. Here's an attempt: "By
default, all values in the range are drawn with equal probability.
The <literal>gaussian</> and <literal>exponential</> options modify
this behavior; each requires a mandatory threshold which determines
the precise shape of the distribution." The following paragraph
should be changed to begin with "For a Gaussian distribution" and the
one after "For an exponential distribution".

6. Overall, I think the documentation here looks much better now, but
I suggest adding one or two examples to the Gaussian section. Like
this: for example, if threshold is 2.0, 68% of the values will fall in
the middle third of the interval; with a threshold of 3.0, 99.7% of
the values will fall in the middle third of the interval. These
numbers are fabricated, and the middle third of the interval might not
be the best part to talk about, but you get the idea (I hope).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#22

Fabien COELHO

coelho@cri.ensmp.fr

over 11 years ago

In reply to: Robert Haas (#21)

2 attachment(s)

Re: gaussian distribution pgbench -- splits v4

Hello Robert,

Some review comments:

Thanks a lot for your feedback.

Please find attached two new parts of the patch (A for setrandom
extension, B for pgbench embedded test case extension).

1. I suggest that getExponentialrand and getGaussianrand be renamed to
getExponentialRand and getGaussianRand.

Done.

It was named like that because "getrand" was used for the uniform case.

2. I suggest that the code be changed so that the branch currently
labeled as /* uniform with extra argument */ become a hard error
instead of a warning.

3. Similarly, I suggest that the use of gaussian or uniform be an
error when argc < 6 OR argc > 6. I also suggest that the
parenthesized distribution type be dropped from the error message in
all cases.

I wish I could agree, but my reading of the previous code is that extra
arguments were ignored with only a warning, so ISTM that we are stuck with
keeping the same unfortunate behavior.

4. This question mark seems like it should be a period:

+ * value fails the test? To be on the safe side, let us try over.

Indeed.

5. With regards to the following paragraph:
<para>
+      The default random distribution is uniform, that is all values in the
+      range are drawn with equal probability. The gaussian and exponential
+      options allow to change this default. The mandatory
+      <replaceable>threshold</> double value controls the actual distribution
+      with gaussian or exponential.
+     </para>
This paragraph needs a bit of copy-editing. Here's an attempt: "By
default, all values in the range are drawn with equal probability.
The <literal>gaussian</> and <literal>exponential</> options modify
this behavior; each requires a mandatory threshold which determines
the precise shape of the distribution." The following paragraph
should be changed to begin with "For a Gaussian distribution" and the
one after "For an exponential distribution".

Ok. I've kept "uniform" in the first sentence, because it is both
an option name and significant in terms of probabilities.

6. Overall, I think the documentation here looks much better now, but
I suggest adding one or two examples to the Gaussian section. Like
this: for example, if threshold is 2.0, 68% of the values will fall in
the middle third of the interval; with a threshold of 3.0, 99.7% of
the values will fall in the middle third of the interval. These
numbers are fabricated, and the middle third of the interval might not
be the best part to talk about, but you get the idea (I hope).

Done with threshold value 4.0 so I have a "middle quarter" and a "middle
half".

--
Fabien.

Attachments:

gauss_A_4.patch (text/x-diff)

diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 0000000..6881256
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as "expo" or "gauss"
+psql test < test-init.sql
+./pgbench -M prepared -f test-XXX-run.sql -t 1000000 -P 1 -n test
+psql test < test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..e07206a 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -98,6 +98,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -471,6 +473,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-threshold).
+ */
+static int64
+getExponentialRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double cut, uniform, rand;
+	Assert(threshold > 0.0);
+	cut = exp(-threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread->random_state);
+	/*
+	 * inner expression in (cut, 1] (if threshold > 0),
+	 * rand in [0, 1)
+	 */
+	Assert((1.0 - cut) != 0.0);
+	rand = - log(cut + (1.0 - cut) * uniform) / threshold;
+	/* return int64 random number between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -threshold < stdev <= threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) <= 2 => r >= e^{-2} ~ 0.135, then when taking the average
+	 * sine multiplier as 2/pi, we have an 8.6% looping probability in the
+	 * worst case. For a 5.0 threshold value, the looping probability
+	 * is about e^{-5} * 2 / pi ~ 0.43%.
+	 */
+	do
+	{
+		/*
+		 * pg_erand48 generates [0,1), but for the basic version of the
+		 * Box-Muller transform the two uniformly distributed random numbers
+		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+		 */
+		double rand1 = 1.0 - pg_erand48(thread->random_state);
+		double rand2 = 1.0 - pg_erand48(thread->random_state);
+
+		/* Box-Muller basic form transform */
+		double var_sqrt = sqrt(-2.0 * log(rand1));
+		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+		/*
+		 * we may try with cos, but there may be a bias induced if the previous
+		 * value fails the test. To be on the safe side, let us try over.
+		 */
+	}
+	while (stdev < -threshold || stdev >= threshold);
+
+	/* stdev is in [-threshold, threshold), normalization to [0,1) */
+	rand = (stdev + threshold) / (threshold * 2.0);
+
+	/* return int64 random number between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
 /* call PQexec() and exit() on failure */
 static void
 executeStatement(PGconn *con, const char *sql)
@@ -1319,6 +1391,7 @@ top:
 			char	   *var;
 			int64		min,
 						max;
+			double		threshold = 0;
 			char		res[64];
 
 			if (*argv[2] == ':')
@@ -1364,11 +1437,11 @@ top:
 			}
 
 			/*
-			 * getrand() needs to be able to subtract max from min and add one
-			 * to the result without overflowing.  Since we know max > min, we
-			 * can detect overflow just by checking for a negative result. But
-			 * we must check both that the subtraction doesn't overflow, and
-			 * that adding one to the result doesn't overflow either.
+			 * Generate random number functions need to be able to subtract
+			 * max from min and add one to the result without overflowing.
+			 * Since we know max > min, we can detect overflow just by checking
+			 * for a negative result. But we must check both that the subtraction
+			 * doesn't overflow, and that adding one to the result doesn't overflow either.
 			 */
 			if (max - min < 0 || (max - min) + 1 < 0)
 			{
@@ -1377,10 +1450,63 @@ top:
 				return true;
 			}
 
+			if (argc == 4) /* uniform */
+			{
 #ifdef DEBUG
-			printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
+				printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
 #endif
-			snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+				snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+			}
+			else if ((pg_strcasecmp(argv[4], "gaussian") == 0) ||
+				 (pg_strcasecmp(argv[4], "exponential") == 0))
+			{
+				if (*argv[5] == ':')
+				{
+					if ((var = getVariable(st, argv[5] + 1)) == NULL)
+					{
+						fprintf(stderr, "%s: invalid threshold number %s\n", argv[0], argv[5]);
+						st->ecnt++;
+						return true;
+					}
+					threshold = strtod(var, NULL);
+				}
+				else
+					threshold = strtod(argv[5], NULL);
+
+				if (pg_strcasecmp(argv[4], "gaussian") == 0)
+				{
+					if (threshold < MIN_GAUSSIAN_THRESHOLD)
+					{
+						fprintf(stderr, "%s: gaussian threshold must be at least %f\n", argv[5], MIN_GAUSSIAN_THRESHOLD);
+						st->ecnt++;
+						return true;
+					}
+#ifdef DEBUG
+					printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getGaussianRand(thread, min, max, threshold));
+#endif
+					snprintf(res, sizeof(res), INT64_FORMAT, getGaussianRand(thread, min, max, threshold));
+				}
+				else if (pg_strcasecmp(argv[4], "exponential") == 0)
+				{
+					if (threshold <= 0.0)
+					{
+						fprintf(stderr, "%s: exponential threshold must be strictly positive\n", argv[5]);
+						st->ecnt++;
+						return true;
+					}
+#ifdef DEBUG
+					printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getExponentialRand(thread, min, max, threshold));
+#endif
+					snprintf(res, sizeof(res), INT64_FORMAT, getExponentialRand(thread, min, max, threshold));
+				}
+			}
+			else /* uniform with extra arguments */
+			{
+#ifdef DEBUG
+				printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
+#endif
+				snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+			}
 
 			if (!putVariable(st, argv[0], argv[1], res))
 			{
@@ -1920,9 +2046,34 @@ process_commands(char *buf)
 				exit(1);
 			}
 
-			for (j = 4; j < my_commands->argc; j++)
-				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
-						my_commands->argv[0], my_commands->argv[j]);
+			if (my_commands->argc == 4 ) /* uniform */
+			{
+				/* nothing to do */
+			}
+			else if ((pg_strcasecmp(my_commands->argv[4], "gaussian") == 0) ||
+				 (pg_strcasecmp(my_commands->argv[4], "exponential") == 0))
+			{
+				if (my_commands->argc < 6)
+				{
+					fprintf(stderr, "%s(%s): missing argument\n", my_commands->argv[0], my_commands->argv[4]);
+					exit(1);
+				}
+
+				for (j = 6; j < my_commands->argc; j++)
+					fprintf(stderr, "%s(%s): extra argument \"%s\" ignored\n",
+							my_commands->argv[0], my_commands->argv[4], my_commands->argv[j]);
+			}
+			else /* uniform with extra argument */
+			{
+				int arg_pos = 4;
+
+				if (pg_strcasecmp(my_commands->argv[4], "uniform") == 0)
+					arg_pos++;
+
+				for (j = arg_pos; j < my_commands->argc; j++)
+					fprintf(stderr, "%s(uniform): extra argument \"%s\" ignored\n",
+							my_commands->argv[0], my_commands->argv[j]);
+			}
 		}
 		else if (pg_strcasecmp(my_commands->argv[0], "set") == 0)
 		{
diff --git a/contrib/pgbench/test-expo-check.sql b/contrib/pgbench/test-expo-check.sql
new file mode 100644
index 0000000..fbf35fd
--- /dev/null
+++ b/contrib/pgbench/test-expo-check.sql
@@ -0,0 +1,14 @@
+-- val, min, max, threshold
+CREATE OR REPLACE FUNCTION
+expoProba(INTEGER, INTEGER, INTEGER, DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+  SELECT (exp(-$4*($1-$2)/($3-$2+1)) - exp(-$4*($1-$2+1)/($3-$2+1))) /
+         (1.0 - exp(-$4));
+$$ LANGUAGE SQL;
+
+SELECT SUM(cnt) FROM pgbench_dist;
+
+SELECT id, 1.0*cnt/SUM(cnt) OVER(), expoProba(id, 0, 99, 10.0)
+FROM pgbench_dist
+ORDER BY id;
+
diff --git a/contrib/pgbench/test-expo-run.sql b/contrib/pgbench/test-expo-run.sql
new file mode 100644
index 0000000..1d476bc
--- /dev/null
+++ b/contrib/pgbench/test-expo-run.sql
@@ -0,0 +1,2 @@
+\setrandom id 0 99 exponential 10.0
+UPDATE pgbench_dist SET cnt=cnt+1 WHERE id = :id;
diff --git a/contrib/pgbench/test-gauss-check.sql b/contrib/pgbench/test-gauss-check.sql
new file mode 100644
index 0000000..7d56117
--- /dev/null
+++ b/contrib/pgbench/test-gauss-check.sql
@@ -0,0 +1,57 @@
+-- approximation with maximal error of 1.2E-07, as told from
+-- https://en.wikipedia.org/wiki/Error_function#Numerical_approximation
+CREATE OR REPLACE FUNCTION erf(x DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+DECLARE
+  t DOUBLE PRECISION := 1.0 / ( 1.0 + 0.5 * ABS(x));
+  tau DOUBLE PRECISION;
+BEGIN
+  IF ABS(x) >= 6.0 THEN
+    -- avoid underflow error
+    tau := 0.0;
+  ELSE
+    -- use approximation
+    tau := t * exp(-x*x - 1.26551223
+         + t * (1.00002368
+         + t * (0.37409196
+         + t * (0.09678418
+         + t * (-0.18628806
+         + t * (0.27886807
+         + t * (-1.13520398
+         + t * (1.48851587
+         + t * (-0.82215223
+         + t *  0.17087277)))))))));
+  END IF;
+  IF x >= 0 THEN
+    RETURN 1.0 - tau;
+  ELSE
+    RETURN tau - 1.0;
+  END IF;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE OR REPLACE FUNCTION PHI(DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+  SELECT 0.5 * ( 1.0 + erf( $1 / SQRT(2.0) ) );
+$$ LANGUAGE SQL;
+
+CREATE OR REPLACE FUNCTION
+gaussianProba(i INTEGER, mini INTEGER, maxi INTEGER, threshold DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+DECLARE
+  extent DOUBLE PRECISION;
+  mu DOUBLE PRECISION;
+BEGIN
+  extent := maxi - mini + 1.0;
+  mu := 0.5 * (maxi + mini);
+  RETURN (PHI(2.0 * threshold * (i - mini - mu + 0.5) / extent) -
+          PHI(2.0 * threshold * (i - mini - mu - 0.5) / extent))
+         -- truncated gaussian
+	 / ( 2.0 * PHI(threshold) - 1.0 );
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT SUM(cnt) FROM pgbench_dist;
+SELECT id, 1.0*cnt/SUM(cnt) OVER(), gaussianProba(id, 0, 99, 2.0)
+FROM pgbench_dist
+ORDER BY id;
diff --git a/contrib/pgbench/test-gauss-run.sql b/contrib/pgbench/test-gauss-run.sql
new file mode 100644
index 0000000..984a3b4
--- /dev/null
+++ b/contrib/pgbench/test-gauss-run.sql
@@ -0,0 +1,2 @@
+\setrandom id 0 99 gaussian 2.0
+UPDATE pgbench_dist SET cnt=cnt+1 WHERE id = :id;
diff --git a/contrib/pgbench/test-init.sql b/contrib/pgbench/test-init.sql
new file mode 100644
index 0000000..84f7cc9
--- /dev/null
+++ b/contrib/pgbench/test-init.sql
@@ -0,0 +1,4 @@
+DROP TABLE IF EXISTS pgbench_dist;
+CREATE UNLOGGED TABLE pgbench_dist(id SERIAL PRIMARY KEY, cnt INTEGER NOT NULL DEFAULT 0);
+INSERT INTO pgbench_dist(id, cnt) 
+  SELECT i, 0 FROM generate_series(0, 99) AS i;
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index f264c24..276476a 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -748,8 +748,8 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
 
    <varlistentry>
     <term>
-     <literal>\setrandom <replaceable>varname</> <replaceable>min</> <replaceable>max</></literal>
-    </term>
+     <literal>\setrandom <replaceable>varname</> <replaceable>min</> <replaceable>max</> [ uniform | [ { gaussian | exponential } <replaceable>threshold</> ] ]</literal>
+     </term>
 
     <listitem>
      <para>
@@ -761,9 +761,76 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </para>
 
      <para>
+      By default, all values in the range are drawn with equal probability,
+      that is the distribution is <literal>uniform</>.
+      The <literal>gaussian</> and <literal>exponential</> options modify
+      this behavior; each requires a mandatory threshold which determines
+      the precise shape of the distribution.
+     </para>
+
+     <para>
+      <!-- introduction -->
+      For a Gaussian distribution, the interval is mapped onto a standard
+      <ulink url="http://en.wikipedia.org/wiki/Normal_distribution">normal distribution</ulink>
+      (the classical bell-shaped Gaussian curve) truncated at
+      <literal>-threshold</> on the left and <literal>+threshold</>
+      on the right.
+      <!-- formula -->
+      To be precise, if <literal>PHI(x)</> is the cumulative distribution
+      function of the standard normal distribution, with mean <literal>mu</>
+      defined as <literal>(max + min) / 2.0</>, then value <replaceable>i</>
+      between <replaceable>min</> and <replaceable>max</> inclusive is drawn
+      with probability:
+      <literal>
+        (PHI(2.0 * threshold * (i - min - mu + 0.5) / (max - min + 1)) -
+         PHI(2.0 * threshold * (i - min - mu - 0.5) / (max - min + 1))) /
+         (2.0 * PHI(threshold) - 1.0)
+      </>
+      <!-- intuition -->
+      The larger the <replaceable>threshold</>, the more frequently values
+      close to the middle of the interval are drawn, and the less frequently
+      values close to the <replaceable>min</> and <replaceable>max</> bounds.
+      <!-- rule of thumb -->
+      For a Gaussian distribution, about 67% of values are drawn from the middle
+      <literal>1.0 / threshold</> and 95% in the middle <literal>2.0 / threshold</>,
+      thus if <replaceable>threshold</> is 4.0, 67% of values are drawn from the middle
+      quarter and 95% from the middle half of the interval.
+      <!-- constraint -->
+      The minimum <replaceable>threshold</> is 2.0 for performance of
+      the Box-Muller transform.
+     </para>
+
+     <para>
+      <!-- introduction -->
+      For an exponential distribution, the <replaceable>threshold</>
+      parameter controls the distribution by truncating a quickly-decreasing
+      <ulink url="http://en.wikipedia.org/wiki/Exponential_distribution">exponential distribution</ulink>
+      at <replaceable>threshold</>, and then projecting onto integers between
+      the bounds.
+      <!-- formula -->
+      To be precise, value <replaceable>i</> between <replaceable>min</> and
+      <replaceable>max</> inclusive is drawn with probability:
+      <literal>(exp(-threshold*(i-min)/(max+1-min)) -
+       exp(-threshold*(i+1-min)/(max+1-min))) / (1.0 - exp(-threshold))</>.
+      <!-- intuition -->
+      Intuitively, the larger the <replaceable>threshold</>, the more
+      frequently values close to <replaceable>min</> are accessed, and the
+      less frequently values close to <replaceable>max</> are accessed.
+      The closer to 0 the threshold, the flatter (more uniform) the access
+      distribution.
+      <!-- rule of thumb -->
+      A crude approximation of the distribution is that the most frequent 1%
+      values in the range, close to <replaceable>min</>, are drawn
+      <replaceable>threshold</>%  of the time.
+      <!-- constraint -->
+      The <replaceable>threshold</> value must be strictly positive with the
+      <literal>exponential</> option.
+     </para>
+
+     <para>
       Example:
 <programlisting>
-\setrandom aid 1 :naccounts
+\setrandom aid 1 :naccounts gaussian 5.0
 </programlisting></para>
     </listitem>
    </varlistentry>

gauss_B_4.patch (text/x-diff)

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index e07206a..e8c1888 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -173,6 +173,11 @@ bool		is_connect;			/* establish connection for each transaction */
 bool		is_latencies;		/* report per-command latencies */
 int			main_pid;			/* main process id used in log filename */
 
+/* gaussian/exponential distribution tests */
+double		threshold;          /* threshold for gaussian or exponential */
+bool        use_gaussian = false;
+bool		use_exponential = false;
+
 char	   *pghost = "";
 char	   *pgport = "";
 char	   *login = NULL;
@@ -294,11 +299,11 @@ static int	num_commands = 0;	/* total number of Command structs */
 static int	debug = 0;			/* debug flag */
 
 /* default scenario */
-static char *tpc_b = {
+static char *tpc_b_fmt = {
 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"\\setrandom bid 1 :nbranches\n"
 	"\\setrandom tid 1 :ntellers\n"
 	"\\setrandom delta -5000 5000\n"
@@ -312,11 +317,11 @@ static char *tpc_b = {
 };
 
 /* -N case */
-static char *simple_update = {
+static char *simple_update_fmt = {
 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"\\setrandom bid 1 :nbranches\n"
 	"\\setrandom tid 1 :ntellers\n"
 	"\\setrandom delta -5000 5000\n"
@@ -328,9 +333,9 @@ static char *simple_update = {
 };
 
 /* -S case */
-static char *select_only = {
+static char *select_only_fmt = {
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
 };
 
@@ -377,6 +382,8 @@ usage(void)
 		   "  -v, --vacuum-all         vacuum all four standard tables before tests\n"
 		   "  --aggregate-interval=NUM aggregate data over NUM seconds\n"
 		   "  --sampling-rate=NUM      fraction of transactions to log (e.g. 0.01 for 1%%)\n"
+		   "  --exponential=NUM        exponential distribution with NUM threshold parameter\n"
+		   "  --gaussian=NUM           gaussian distribution with NUM threshold parameter\n"
 		   "\nCommon options:\n"
 		   "  -d, --debug              print debugging output\n"
 	  "  -h, --host=HOSTNAME      database server host or socket directory\n"
@@ -2329,6 +2336,18 @@ process_builtin(char *tb)
 	return my_commands;
 }
 
+/*
+ * compute the probability of the truncated exponential random generation
+ * to draw values in the i-th slot of the range.
+ */
+static double exponentialProbability(int i, int slots, double threshold)
+{
+	assert(1 <= i && i <= slots);
+	return (exp(- threshold * (i - 1) / slots) - exp(- threshold * i / slots)) /
+		(1.0 - exp(- threshold));
+}
+
+
 /* print out results */
 static void
 printResults(int ttype, int64 normal_xacts, int nclients,
@@ -2340,7 +2359,7 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 	double		time_include,
 				tps_include,
 				tps_exclude;
-	char	   *s;
+	char	   *s, *d;
 
 	time_include = INSTR_TIME_GET_DOUBLE(total_time);
 	tps_include = normal_xacts / time_include;
@@ -2356,8 +2375,45 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 	else
 		s = "Custom query";
 
-	printf("transaction type: %s\n", s);
+	if (use_gaussian)
+		d = "Gaussian distribution ";
+	else if (use_exponential)
+		d = "Exponential distribution ";
+	else
+		d = ""; /* default uniform case */
+
+	printf("transaction type: %s%s\n", d, s);
 	printf("scaling factor: %d\n", scale);
+
+	/* output in gaussian distribution benchmark */
+	if (use_gaussian)
+	{
+		int i;
+		printf("pgbench_account's aid selected with a truncated gaussian distribution\n");
+		printf("standard deviation threshold: %.5f\n", threshold);
+		printf("decile percents:");
+		for (i = 2; i <= 20; i = i + 2)
+			printf(" %.1f%%", (double) 50 * (erf (threshold * (1 - 0.1 * (i - 2)) / sqrt(2.0)) -
+				erf (threshold * (1 - 0.1 * i) / sqrt(2.0))) /
+				erf (threshold / sqrt(2.0)));
+		printf("\n");
+	}
+	/* output in exponential distribution benchmark */
+	else if (use_exponential)
+	{
+		int i;
+		printf("pgbench_account's aid selected with a truncated exponential distribution\n");
+		printf("exponential threshold: %.5f\n", threshold);
+		printf("decile percents:");
+		for (i = 1; i <= 10; i++)
+			printf(" %.1f%%",
+				   100.0 * exponentialProbability(i, 10, threshold));
+		printf("\n");
+		printf("probability of first/last percent of the range: %.1f%% %.1f%%\n",
+			   100.0 * exponentialProbability(1, 100, threshold),
+			   100.0 * exponentialProbability(100, 100, threshold));
+	}
+
 	printf("query mode: %s\n", QUERYMODE[querymode]);
 	printf("number of clients: %d\n", nclients);
 	printf("number of threads: %d\n", nthreads);
@@ -2488,6 +2544,8 @@ main(int argc, char **argv)
 		{"unlogged-tables", no_argument, &unlogged_tables, 1},
 		{"sampling-rate", required_argument, NULL, 4},
 		{"aggregate-interval", required_argument, NULL, 5},
+		{"gaussian", required_argument, NULL, 6},
+		{"exponential", required_argument, NULL, 7},
 		{"rate", required_argument, NULL, 'R'},
 		{NULL, 0, NULL, 0}
 	};
@@ -2768,6 +2826,25 @@ main(int argc, char **argv)
 				}
 #endif
 				break;
+			case 6:
+				use_gaussian = true;
+				threshold = atof(optarg);
+				if(threshold < MIN_GAUSSIAN_THRESHOLD)
+				{
+					fprintf(stderr, "--gaussian=NUM must be more than %f: %f\n",
+							MIN_GAUSSIAN_THRESHOLD, threshold);
+					exit(1);
+				}
+				break;
+			case 7:
+				use_exponential = true;
+				threshold = atof(optarg);
+				if(threshold <= 0.0)
+				{
+					fprintf(stderr, "--exponential=NUM must be more than 0.0\n");
+					exit(1);
+				}
+				break;
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 				exit(1);
@@ -2965,6 +3042,17 @@ main(int argc, char **argv)
 		}
 	}
 
+	/* set :threshold variable */
+	if(getVariable(&state[0], "threshold") == NULL)
+	{
+		snprintf(val, sizeof(val), "%lf", threshold);
+		for (i = 0; i < nclients; i++)
+		{
+			if (!putVariable(&state[i], "startup", "threshold", val))
+				exit(1);
+		}
+	}
+
 	if (!is_no_vacuum)
 	{
 		fprintf(stderr, "starting vacuum...");
@@ -2987,25 +3075,24 @@ main(int argc, char **argv)
 	srandom((unsigned int) INSTR_TIME_GET_MICROSEC(start_time));
 
 	/* process builtin SQL scripts */
-	switch (ttype)
-	{
-		case 0:
-			sql_files[0] = process_builtin(tpc_b);
-			num_files = 1;
-			break;
-
-		case 1:
-			sql_files[0] = process_builtin(select_only);
-			num_files = 1;
-			break;
-
-		case 2:
-			sql_files[0] = process_builtin(simple_update);
-			num_files = 1;
-			break;
-
-		default:
-			break;
+	if (ttype < 3)
+	{
+		char *fmt, *distribution, *queries;
+		int ret;
+		fmt = (ttype == 0)? tpc_b_fmt:
+			  (ttype == 1)? select_only_fmt:
+			  (ttype == 2)? simple_update_fmt: NULL;
+		assert(fmt != NULL);
+		distribution =
+			use_gaussian? " gaussian :threshold":
+			use_exponential? " exponential :threshold":
+			"" /* default uniform case */ ;
+		queries = pg_malloc(strlen(fmt) + strlen(distribution) + 1);
+		ret = sprintf(queries, fmt, distribution);
+		assert(ret >= 0);
+		sql_files[0] = process_builtin(queries);
+		num_files = 1;
+		pg_free(queries);
 	}
 
 	/* set up thread data structures */
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index 276476a..fcdde32 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -307,6 +307,49 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </varlistentry>
 
      <varlistentry>
+      <term><option>--exponential=</option><replaceable>threshold</></term>
+      <listitem>
+       <para>
+         Run exponential distribution pgbench test using this threshold parameter.
+         The threshold controls the distribution of access frequency on the
+         <structname>pgbench_accounts</> table.
+         See the <literal>\setrandom</> documentation below for details about
+         the impact of the threshold value.
+         When set, this option applies to all test variants (<option>-N</> for
+         skipping updates, or <option>-S</> for selects).
+       </para>
+
+       <para>
+         When run, the output is expanded to show the distribution
+         depending on the <replaceable>threshold</> value:
+
+<screen>
+...
+pgbench_account's aid selected with a truncated exponential distribution
+exponential threshold: 5.00000
+decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%
+probability of first/last percent of the range: 4.9% 0.0%
+...
+</screen>
+
+         The figures are to be interpreted as follows.
+         If the scaling factor is 10, there are 1,000,000 accounts in
+         <literal>pgbench_accounts</>.
+         The first decile, with <literal>aid</> from 1 to 100,000, is
+         drawn 39.6% of the time, that is about 4 times more than average.
+         The second decile, from 100,001 to 200,000 is drawn 24.0% of the time,
+         that is 2.4 times more than average.
+         Up to the last decile, from 900,001 to 1,000,000, which is drawn
+         0.4% of the time, well below average.
+         Moreover, the first percent of the range, that is <literal>aid</>
+         from 1 to 10,000, is drawn 4.9% of the time, thus 4.9 times more
+         than average, and the last percent, with <literal>aid</>
+         from 990,001 to 1,000,000, is drawn less than 0.1% of the time.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-f</option> <replaceable>filename</></term>
       <term><option>--file=</option><replaceable>filename</></term>
       <listitem>
@@ -320,6 +363,44 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </varlistentry>
 
      <varlistentry>
+      <term><option>--gaussian=</option><replaceable>threshold</></term>
+      <listitem>
+       <para>
+         Run gaussian distribution pgbench test using this threshold parameter.
+         The threshold controls the distribution of access frequency on the
+         <structname>pgbench_accounts</> table.
+         See the <literal>\setrandom</> documentation below for details about
+         the impact of the threshold value.
+         When set, this option applies to all test variants (<option>-N</> for
+         skipping updates, or <option>-S</> for selects).
+       </para>
+
+       <para>
+         When run, the output is expanded to show the distribution
+         depending on the <replaceable>threshold</> value:
+
+<screen>
+...
+pgbench_account's aid selected with a truncated gaussian distribution
+standard deviation threshold: 5.00000
+decile percents: 0.0% 0.1% 2.1% 13.6% 34.1% 34.1% 13.6% 2.1% 0.1% 0.0%
+...
+</screen>
+
+         The figures are to be interpreted as follows.
+         If the scaling factor is 10, there are 1,000,000 accounts in
+         <literal>pgbench_accounts</>.
+         The first decile, with <literal>aid</> from 1 to 100,000, is
+         drawn less than 0.1% of the time.
+         The second, from 100,001 to 200,000 is drawn about 0.1% of the time...
+         up to the fifth decile, from 400,001 to 500,000, which
+         is drawn 34.1% of the time, about 3.4 times more than average,
+         and then the gaussian curve is symmetric for the last five deciles.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-j</option> <replaceable>threads</></term>
       <term><option>--jobs=</option><replaceable>threads</></term>
       <listitem>

#23

Mitsumasa KONDO

kondo.mitsumasa@gmail.com

over 11 years ago

In reply to: Fabien COELHO (#22)

1 attachment(s)

Re: gaussian distribution pgbench -- splits v4

Hi,

Thank you for your great documentation and fixes!
They are very helpful for understanding our feature.

I added two features in gauss_B_4.patch.

1) Add gaussianProbability() function
It plays the same role as exponentialProbability(), and the behavior of the
feature is the same as before.

2) Add result of "max/min percent of the range"
It is almost the same as the --exponential option's result. However, the max
percent of the range is at the center of the distribution,
and the min percent of the range is at the far edge of the distribution.
Here is an output example:

+ pgbench_account's aid selected with a truncated gaussian distribution

+ standard deviation threshold: 5.00000

+ decile percents: 0.0% 0.1% 2.1% 13.6% 34.1% 34.1% 13.6% 2.1% 0.1% 0.0%

+ probability of max/min percent of the range: 4.0% 0.0%

I also added an explanation of this to the documentation.

I really appreciate your work!

Best regards,

Mitsumasa KONDO

Attachments:

gauss_B_5.patch (application/octet-stream)

*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
***************
*** 41,46 ****
--- 41,47 ----
  #include <math.h>
  #include <signal.h>
  #include <sys/time.h>
+ #include <assert.h>
  #ifdef HAVE_SYS_SELECT_H
  #include <sys/select.h>
  #endif
***************
*** 157,162 **** char	   *index_tablespace = NULL;
--- 158,164 ----
   * document and remember, and isn't that far away from the real threshold.
   */
  #define SCALE_32BIT_THRESHOLD 20000
+ #define MIN_GAUSSIAN_THRESHOLD 2.0
  
  bool		use_log;			/* log transaction latencies to a file */
  bool		use_quiet;			/* quiet logging onto stderr */
***************
*** 171,176 **** bool		is_connect;			/* establish connection for each transaction */
--- 173,183 ----
  bool		is_latencies;		/* report per-command latencies */
  int			main_pid;			/* main process id used in log filename */
  
+ /* gaussian/exponential distribution tests */
+ double		threshold;          /* threshold for gaussian or exponential */
+ bool        use_gaussian = false;
+ bool		use_exponential = false;
+ 
  char	   *pghost = "";
  char	   *pgport = "";
  char	   *login = NULL;
***************
*** 292,302 **** static int	num_commands = 0;	/* total number of Command structs */
  static int	debug = 0;			/* debug flag */
  
  /* default scenario */
! static char *tpc_b = {
  	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
  	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
  	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
! 	"\\setrandom aid 1 :naccounts\n"
  	"\\setrandom bid 1 :nbranches\n"
  	"\\setrandom tid 1 :ntellers\n"
  	"\\setrandom delta -5000 5000\n"
--- 299,309 ----
  static int	debug = 0;			/* debug flag */
  
  /* default scenario */
! static char *tpc_b_fmt = {
  	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
  	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
  	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
! 	"\\setrandom aid 1 :naccounts%s\n"
  	"\\setrandom bid 1 :nbranches\n"
  	"\\setrandom tid 1 :ntellers\n"
  	"\\setrandom delta -5000 5000\n"
***************
*** 310,320 **** static char *tpc_b = {
  };
  
  /* -N case */
! static char *simple_update = {
  	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
  	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
  	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
! 	"\\setrandom aid 1 :naccounts\n"
  	"\\setrandom bid 1 :nbranches\n"
  	"\\setrandom tid 1 :ntellers\n"
  	"\\setrandom delta -5000 5000\n"
--- 317,327 ----
  };
  
  /* -N case */
! static char *simple_update_fmt = {
  	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
  	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
  	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
! 	"\\setrandom aid 1 :naccounts%s\n"
  	"\\setrandom bid 1 :nbranches\n"
  	"\\setrandom tid 1 :ntellers\n"
  	"\\setrandom delta -5000 5000\n"
***************
*** 326,334 **** static char *simple_update = {
  };
  
  /* -S case */
! static char *select_only = {
  	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
! 	"\\setrandom aid 1 :naccounts\n"
  	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
  };
  
--- 333,341 ----
  };
  
  /* -S case */
! static char *select_only_fmt = {
  	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
! 	"\\setrandom aid 1 :naccounts%s\n"
  	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
  };
  
***************
*** 375,380 **** usage(void)
--- 382,389 ----
  		   "  -v, --vacuum-all         vacuum all four standard tables before tests\n"
  		   "  --aggregate-interval=NUM aggregate data over NUM seconds\n"
  		   "  --sampling-rate=NUM      fraction of transactions to log (e.g. 0.01 for 1%%)\n"
+ 		   "  --exponential=NUM        exponential distribution with NUM threshold parameter\n"
+ 		   "  --gaussian=NUM           gaussian distribution with NUM threshold parameter\n"
  		   "\nCommon options:\n"
  		   "  -d, --debug              print debugging output\n"
  	  "  -h, --host=HOSTNAME      database server host or socket directory\n"
***************
*** 2178,2183 **** process_builtin(char *tb)
--- 2187,2216 ----
  	return my_commands;
  }
  
+ /*
+  * compute the probability of the truncated gaussian random generation
+  * to draw values in the i-th slot of the range.
+  */
+ static double gaussianProbability(int i, int slots, double threshold)
+ {
+ 	assert(1 <= i && i <= slots);
+ 	return (0.50 * (erf (threshold * (1.0 - 1.0 / slots * (2.0 * i - 2.0)) / sqrt(2.0)) -
+ 		erf (threshold * (1.0 - 1.0 / slots * 2.0 * i) / sqrt(2.0))) /
+ 		erf (threshold / sqrt(2.0)));
+ }
+ 
+ /*
+  * compute the probability of the truncated exponential random generation
+  * to draw values in the i-th slot of the range.
+  */
+ static double exponentialProbability(int i, int slots, double threshold)
+ {
+ 	assert(1 <= i && i <= slots);
+ 	return (exp(- threshold * (i - 1) / slots) - exp(- threshold * i / slots)) /
+ 		(1.0 - exp(- threshold));
+ }
+ 
+ 
  /* print out results */
  static void
  printResults(int ttype, int64 normal_xacts, int nclients,
***************
*** 2189,2195 **** printResults(int ttype, int64 normal_xacts, int nclients,
  	double		time_include,
  				tps_include,
  				tps_exclude;
! 	char	   *s;
  
  	time_include = INSTR_TIME_GET_DOUBLE(total_time);
  	tps_include = normal_xacts / time_include;
--- 2222,2228 ----
  	double		time_include,
  				tps_include,
  				tps_exclude;
! 	char	   *s, *d;
  
  	time_include = INSTR_TIME_GET_DOUBLE(total_time);
  	tps_include = normal_xacts / time_include;
***************
*** 2205,2212 **** printResults(int ttype, int64 normal_xacts, int nclients,
  	else
  		s = "Custom query";
  
! 	printf("transaction type: %s\n", s);
  	printf("scaling factor: %d\n", scale);
  	printf("query mode: %s\n", QUERYMODE[querymode]);
  	printf("number of clients: %d\n", nclients);
  	printf("number of threads: %d\n", nthreads);
--- 2238,2283 ----
  	else
  		s = "Custom query";
  
! 	if (use_gaussian)
! 		d = "Gaussian distribution ";
! 	else if (use_exponential)
! 		d = "Exponential distribution ";
! 	else
! 		d = ""; /* default uniform case */
! 
! 	printf("transaction type: %s%s\n", d, s);
  	printf("scaling factor: %d\n", scale);
+ 
+ 	/* output in gaussian distribution benchmark */
+ 	if (use_gaussian)
+ 	{
+ 		int i;
+ 		printf("pgbench_account's aid selected with a truncated gaussian distribution\n");
+ 		printf("standard deviation threshold: %.5f\n", threshold);
+ 		printf("decile percents:");
+ 		for (i = 1; i <= 10; i++)
+ 			printf(" %.1f%%", 100.0 * gaussianProbability(i, 10, threshold));
+ 		printf("\n");
+ 		printf("probability of max/min percent of the range: %.1f%% %.1f%%\n",
+ 			  100.0 * gaussianProbability(50, 100, threshold),
+ 			  100.0 * gaussianProbability(100, 100, threshold));
+ 	}
+ 	/* output in exponential distribution benchmark */
+ 	else if (use_exponential)
+ 	{
+ 		int i;
+ 		printf("pgbench_account's aid selected with a truncated exponential distribution\n");
+ 		printf("exponential threshold: %.5f\n", threshold);
+ 		printf("decile percents:");
+ 		for (i = 1; i <= 10; i++)
+ 			printf(" %.1f%%",
+ 				   100.0 * exponentialProbability(i, 10, threshold));
+ 		printf("\n");
+ 		printf("probability of first/last percent of the range: %.1f%% %.1f%%\n",
+ 			   100.0 * exponentialProbability(1, 100, threshold),
+ 			   100.0 * exponentialProbability(100, 100, threshold));
+ 	}
+ 
  	printf("query mode: %s\n", QUERYMODE[querymode]);
  	printf("number of clients: %d\n", nclients);
  	printf("number of threads: %d\n", nthreads);
***************
*** 2337,2342 **** main(int argc, char **argv)
--- 2408,2415 ----
  		{"unlogged-tables", no_argument, &unlogged_tables, 1},
  		{"sampling-rate", required_argument, NULL, 4},
  		{"aggregate-interval", required_argument, NULL, 5},
+ 		{"gaussian", required_argument, NULL, 6},
+ 		{"exponential", required_argument, NULL, 7},
  		{"rate", required_argument, NULL, 'R'},
  		{NULL, 0, NULL, 0}
  	};
***************
*** 2617,2622 **** main(int argc, char **argv)
--- 2690,2714 ----
  				}
  #endif
  				break;
+ 			case 6:
+ 				use_gaussian = true;
+ 				threshold = atof(optarg);
+ 				if(threshold < MIN_GAUSSIAN_THRESHOLD)
+ 				{
+ 					fprintf(stderr, "--gaussian=NUM must be at least %f: %f\n",
+ 							MIN_GAUSSIAN_THRESHOLD, threshold);
+ 					exit(1);
+ 				}
+ 				break;
+ 			case 7:
+ 				use_exponential = true;
+ 				threshold = atof(optarg);
+ 				if(threshold <= 0.0)
+ 				{
+ 					fprintf(stderr, "--exponential=NUM must be more than 0.0\n");
+ 					exit(1);
+ 				}
+ 				break;
  			default:
  				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
  				exit(1);
***************
*** 2814,2819 **** main(int argc, char **argv)
--- 2906,2922 ----
  		}
  	}
  
+ 	/* set :threshold variable */
+ 	if(getVariable(&state[0], "threshold") == NULL)
+ 	{
+ 		snprintf(val, sizeof(val), "%lf", threshold);
+ 		for (i = 0; i < nclients; i++)
+ 		{
+ 			if (!putVariable(&state[i], "startup", "threshold", val))
+ 				exit(1);
+ 		}
+ 	}
+ 
  	if (!is_no_vacuum)
  	{
  		fprintf(stderr, "starting vacuum...");
***************
*** 2836,2860 **** main(int argc, char **argv)
  	srandom((unsigned int) INSTR_TIME_GET_MICROSEC(start_time));
  
  	/* process builtin SQL scripts */
! 	switch (ttype)
  	{
! 		case 0:
! 			sql_files[0] = process_builtin(tpc_b);
! 			num_files = 1;
! 			break;
! 
! 		case 1:
! 			sql_files[0] = process_builtin(select_only);
! 			num_files = 1;
! 			break;
! 
! 		case 2:
! 			sql_files[0] = process_builtin(simple_update);
! 			num_files = 1;
! 			break;
! 
! 		default:
! 			break;
  	}
  
  	/* set up thread data structures */
--- 2939,2962 ----
  	srandom((unsigned int) INSTR_TIME_GET_MICROSEC(start_time));
  
  	/* process builtin SQL scripts */
! 	if (ttype < 3)
  	{
! 		char *fmt, *distribution, *queries;
! 		int ret;
! 		fmt = (ttype == 0)? tpc_b_fmt:
! 			  (ttype == 1)? select_only_fmt:
! 			  (ttype == 2)? simple_update_fmt: NULL;
! 		assert(fmt != NULL);
! 		distribution =
! 			use_gaussian? " gaussian :threshold":
! 			use_exponential? " exponential :threshold":
! 			"" /* default uniform case */ ;
! 		queries = pg_malloc(strlen(fmt) + strlen(distribution) + 1);
! 		ret = sprintf(queries, fmt, distribution);
! 		assert(ret >= 0);
! 		sql_files[0] = process_builtin(queries);
! 		num_files = 1;
! 		pg_free(queries);
  	}
  
  	/* set up thread data structures */
*** a/doc/src/sgml/pgbench.sgml
--- b/doc/src/sgml/pgbench.sgml
***************
*** 307,312 **** pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
--- 307,355 ----
       </varlistentry>
  
       <varlistentry>
+       <term><option>--exponential=</option><replaceable>threshold</></term>
+       <listitem>
+        <para>
+          Run exponential distribution pgbench test using this threshold parameter.
+          The threshold controls the distribution of access frequency on the
+          <structname>pgbench_accounts</> table.
+          See the <literal>\setrandom</> documentation below for details about
+          the impact of the threshold value.
+          When set, this option applies to all test variants (<option>-N</> for
+          skipping updates, or <option>-S</> for selects).
+        </para>
+ 
+        <para>
+          When run, the output is expanded to show the distribution
+          depending on the <replaceable>threshold</> value:
+ 
+ <screen>
+ ...
+ pgbench_account's aid selected with a truncated exponential distribution
+ exponential threshold: 5.00000
+ decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%
+ probability of first/last percent of the range: 4.9% 0.0%
+ ...
+ </screen>
+ 
+          The figures are to be interpreted as follows.
+          If the scaling factor is 10, there are 1,000,000 accounts in
+          <literal>pgbench_accounts</>.
+          The first decile, with <literal>aid</> from 1 to 100,000, is
+          drawn 39.6% of the time, that is about 4 times more than average.
+          The second decile, from 100,001 to 200,000 is drawn 24.0% of the time,
+          that is 2.4 times more than average.
+          Up to the last decile, from 900,001 to 1,000,000, which is drawn
+          0.4% of the time, well below average.
+          Moreover, the first percent of the range, that is <literal>aid</>
+          from 1 to 10,000, is drawn 4.9% of the time, that is 4.9 times more
+          than average, and the last percent, with <literal>aid</>
+          from 990,001 to 1,000,000, is drawn less than 0.1% of the time.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
+      <varlistentry>
        <term><option>-f</option> <replaceable>filename</></term>
        <term><option>--file=</option><replaceable>filename</></term>
        <listitem>
***************
*** 320,325 **** pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
--- 363,411 ----
       </varlistentry>
  
       <varlistentry>
+       <term><option>--gaussian=</option><replaceable>threshold</></term>
+       <listitem>
+        <para>
+          Run gaussian distribution pgbench test using this threshold parameter.
+          The threshold controls the distribution of access frequency on the
+          <structname>pgbench_accounts</> table.
+          See the <literal>\setrandom</> documentation below for details about
+          the impact of the threshold value.
+          When set, this option applies to all test variants (<option>-N</> for
+          skipping updates, or <option>-S</> for selects).
+        </para>
+ 
+        <para>
+          When run, the output is expanded to show the distribution
+          depending on the <replaceable>threshold</> value:
+ 
+ <screen>
+ ...
+ pgbench_account's aid selected with a truncated gaussian distribution
+ standard deviation threshold: 5.00000
+ decile percents: 0.0% 0.1% 2.1% 13.6% 34.1% 34.1% 13.6% 2.1% 0.1% 0.0%
+ probability of max/min percent of the range: 4.0% 0.0%
+ ...
+ </screen>
+ 
+          The figures are to be interpreted as follows.
+          If the scaling factor is 10, there are 1,000,000 accounts in
+          <literal>pgbench_accounts</>.
+          The first decile, with <literal>aid</> from 1 to 100,000, is
+          drawn less than 0.1% of the time.
+          The second, from 100,001 to 200,000 is drawn about 0.1% of the time...
+          up to the fifth decile, from 400,001 to 500,000, which
+          is drawn 34.1% of the time, about 3.4 times more than average,
+          and then the gaussian curve is symmetric for the last five deciles.
+          Moreover, the central percent of the range, that is <literal>aid</>
+          from 490,001 to 500,000, is drawn 4.0% of the time, that is 4.0 times more
+          than average, and the last percent, with <literal>aid</>
+          from 990,001 to 1,000,000, is drawn less than 0.1% of the time.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
+      <varlistentry>
        <term><option>-j</option> <replaceable>threads</></term>
        <term><option>--jobs=</option><replaceable>threads</></term>
        <listitem>

#24

Fabien COELHO

coelho@cri.ensmp.fr

over 11 years ago

In reply to: Mitsumasa KONDO (#23)

1 attachment(s)

Re: gaussian distribution pgbench -- splits Bv6

Thank you for your great documentation and fixes!
They are very helpful for understanding our feature.

Hopefully it will help make it, or part of it, pass through.

I added two features in gauss_B_4.patch.

1) Add gaussianProbability() function
It plays the same role as exponentialProbability(), and the behavior of the
feature is the same as before.

Ok, that is better for readability and easy reuse.

2) Add result of "max/min percent of the range"
It is almost the same as the --exponential option's result. However, the max
percent of the range is at the center of the distribution,
and the min percent of the range is at the far edge of the distribution.
Here is an output example:

Ok, good, that makes it homogeneous with the exponential case.

+ pgbench_account's aid selected with a truncated gaussian distribution
+ standard deviation threshold: 5.00000
+ decile percents: 0.0% 0.1% 2.1% 13.6% 34.1% 34.1% 13.6% 2.1% 0.1% 0.0%
+ probability of max/min percent of the range: 4.0% 0.0%

I also added an explanation of this to the documentation.

This is a definite improvement. I tested these minor changes and
everything seems ok.

Attached is a very small update. One word removed from the doc, and one
redundant declaration removed from the code.

I also have a problem with assert & Assert. I finally figured out that
Assert is not compiled in by default, thus it is generally ignored. So it
is more for debugging purposes when activated than for guarding against
some unexpected user errors.

--
Fabien.

Attachments:

gauss_B_6.patch (text/x-diff)

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index e07206a..0247a05 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -41,6 +41,7 @@
 #include <math.h>
 #include <signal.h>
 #include <sys/time.h>
+#include <assert.h>
 #ifdef HAVE_SYS_SELECT_H
 #include <sys/select.h>
 #endif
@@ -173,6 +174,11 @@ bool		is_connect;			/* establish connection for each transaction */
 bool		is_latencies;		/* report per-command latencies */
 int			main_pid;			/* main process id used in log filename */
 
+/* gaussian/exponential distribution tests */
+double		threshold;          /* threshold for gaussian or exponential */
+bool        use_gaussian = false;
+bool		use_exponential = false;
+
 char	   *pghost = "";
 char	   *pgport = "";
 char	   *login = NULL;
@@ -294,11 +300,11 @@ static int	num_commands = 0;	/* total number of Command structs */
 static int	debug = 0;			/* debug flag */
 
 /* default scenario */
-static char *tpc_b = {
+static char *tpc_b_fmt = {
 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"\\setrandom bid 1 :nbranches\n"
 	"\\setrandom tid 1 :ntellers\n"
 	"\\setrandom delta -5000 5000\n"
@@ -312,11 +318,11 @@ static char *tpc_b = {
 };
 
 /* -N case */
-static char *simple_update = {
+static char *simple_update_fmt = {
 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"\\setrandom bid 1 :nbranches\n"
 	"\\setrandom tid 1 :ntellers\n"
 	"\\setrandom delta -5000 5000\n"
@@ -328,9 +334,9 @@ static char *simple_update = {
 };
 
 /* -S case */
-static char *select_only = {
+static char *select_only_fmt = {
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
 };
 
@@ -377,6 +383,8 @@ usage(void)
 		   "  -v, --vacuum-all         vacuum all four standard tables before tests\n"
 		   "  --aggregate-interval=NUM aggregate data over NUM seconds\n"
 		   "  --sampling-rate=NUM      fraction of transactions to log (e.g. 0.01 for 1%%)\n"
+		   "  --exponential=NUM        exponential distribution with NUM threshold parameter\n"
+		   "  --gaussian=NUM           gaussian distribution with NUM threshold parameter\n"
 		   "\nCommon options:\n"
 		   "  -d, --debug              print debugging output\n"
 	  "  -h, --host=HOSTNAME      database server host or socket directory\n"
@@ -2329,6 +2337,30 @@ process_builtin(char *tb)
 	return my_commands;
 }
 
+/*
+ * compute the probability of the truncated gaussian random generation
+ * to draw values in the i-th slot of the range.
+ */
+static double gaussianProbability(int i, int slots, double threshold)
+{
+	assert(1 <= i && i <= slots);
+	return (0.50 * (erf (threshold * (1.0 - 1.0 / slots * (2.0 * i - 2.0)) / sqrt(2.0)) -
+		erf (threshold * (1.0 - 1.0 / slots * 2.0 * i) / sqrt(2.0))) /
+		erf (threshold / sqrt(2.0)));
+}
+
+/*
+ * compute the probability of the truncated exponential random generation
+ * to draw values in the i-th slot of the range.
+ */
+static double exponentialProbability(int i, int slots, double threshold)
+{
+	assert(1 <= i && i <= slots);
+	return (exp(- threshold * (i - 1) / slots) - exp(- threshold * i / slots)) /
+		(1.0 - exp(- threshold));
+}
+
+
 /* print out results */
 static void
 printResults(int ttype, int64 normal_xacts, int nclients,
@@ -2340,7 +2372,7 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 	double		time_include,
 				tps_include,
 				tps_exclude;
-	char	   *s;
+	char	   *s, *d;
 
 	time_include = INSTR_TIME_GET_DOUBLE(total_time);
 	tps_include = normal_xacts / time_include;
@@ -2356,8 +2388,46 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 	else
 		s = "Custom query";
 
-	printf("transaction type: %s\n", s);
+	if (use_gaussian)
+		d = "Gaussian distribution ";
+	else if (use_exponential)
+		d = "Exponential distribution ";
+	else
+		d = ""; /* default uniform case */
+
+	printf("transaction type: %s%s\n", d, s);
 	printf("scaling factor: %d\n", scale);
+
+	/* output in gaussian distribution benchmark */
+	if (use_gaussian)
+	{
+		int i;
+		printf("pgbench_account's aid selected with a truncated gaussian distribution\n");
+		printf("standard deviation threshold: %.5f\n", threshold);
+		printf("decile percents:");
+		for (i = 1; i <= 10; i++)
+			printf(" %.1f%%", 100.0 * gaussianProbability(i, 10, threshold));
+		printf("\n");
+		printf("probability of max/min percent of the range: %.1f%% %.1f%%\n",
+			  100.0 * gaussianProbability(50, 100, threshold),
+			  100.0 * gaussianProbability(100, 100, threshold));
+	}
+	/* output in exponential distribution benchmark */
+	else if (use_exponential)
+	{
+		int i;
+		printf("pgbench_account's aid selected with a truncated exponential distribution\n");
+		printf("exponential threshold: %.5f\n", threshold);
+		printf("decile percents:");
+		for (i = 1; i <= 10; i++)
+			printf(" %.1f%%",
+				   100.0 * exponentialProbability(i, 10, threshold));
+		printf("\n");
+		printf("probability of first/last percent of the range: %.1f%% %.1f%%\n",
+			   100.0 * exponentialProbability(1, 100, threshold),
+			   100.0 * exponentialProbability(100, 100, threshold));
+	}
+
 	printf("query mode: %s\n", QUERYMODE[querymode]);
 	printf("number of clients: %d\n", nclients);
 	printf("number of threads: %d\n", nthreads);
@@ -2488,6 +2558,8 @@ main(int argc, char **argv)
 		{"unlogged-tables", no_argument, &unlogged_tables, 1},
 		{"sampling-rate", required_argument, NULL, 4},
 		{"aggregate-interval", required_argument, NULL, 5},
+		{"gaussian", required_argument, NULL, 6},
+		{"exponential", required_argument, NULL, 7},
 		{"rate", required_argument, NULL, 'R'},
 		{NULL, 0, NULL, 0}
 	};
@@ -2768,6 +2840,25 @@ main(int argc, char **argv)
 				}
 #endif
 				break;
+			case 6:
+				use_gaussian = true;
+				threshold = atof(optarg);
+				if(threshold < MIN_GAUSSIAN_THRESHOLD)
+				{
+					fprintf(stderr, "--gaussian=NUM must be at least %f: %f\n",
+							MIN_GAUSSIAN_THRESHOLD, threshold);
+					exit(1);
+				}
+				break;
+			case 7:
+				use_exponential = true;
+				threshold = atof(optarg);
+				if(threshold <= 0.0)
+				{
+					fprintf(stderr, "--exponential=NUM must be more than 0.0\n");
+					exit(1);
+				}
+				break;
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 				exit(1);
@@ -2965,6 +3056,17 @@ main(int argc, char **argv)
 		}
 	}
 
+	/* set :threshold variable */
+	if(getVariable(&state[0], "threshold") == NULL)
+	{
+		snprintf(val, sizeof(val), "%lf", threshold);
+		for (i = 0; i < nclients; i++)
+		{
+			if (!putVariable(&state[i], "startup", "threshold", val))
+				exit(1);
+		}
+	}
+
 	if (!is_no_vacuum)
 	{
 		fprintf(stderr, "starting vacuum...");
@@ -2987,25 +3089,24 @@ main(int argc, char **argv)
 	srandom((unsigned int) INSTR_TIME_GET_MICROSEC(start_time));
 
 	/* process builtin SQL scripts */
-	switch (ttype)
-	{
-		case 0:
-			sql_files[0] = process_builtin(tpc_b);
-			num_files = 1;
-			break;
-
-		case 1:
-			sql_files[0] = process_builtin(select_only);
-			num_files = 1;
-			break;
-
-		case 2:
-			sql_files[0] = process_builtin(simple_update);
-			num_files = 1;
-			break;
-
-		default:
-			break;
+	if (ttype < 3)
+	{
+		char *fmt, *distribution, *queries;
+		int ret;
+		fmt = (ttype == 0)? tpc_b_fmt:
+			  (ttype == 1)? select_only_fmt:
+			  (ttype == 2)? simple_update_fmt: NULL;
+		assert(fmt != NULL);
+		distribution =
+			use_gaussian? " gaussian :threshold":
+			use_exponential? " exponential :threshold":
+			"" /* default uniform case */ ;
+		queries = pg_malloc(strlen(fmt) + strlen(distribution) + 1);
+		ret = sprintf(queries, fmt, distribution);
+		assert(ret >= 0);
+		sql_files[0] = process_builtin(queries);
+		num_files = 1;
+		pg_free(queries);
 	}
 
 	/* set up thread data structures */
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index 276476a..b718dd3 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -307,6 +307,49 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </varlistentry>
 
      <varlistentry>
+      <term><option>--exponential=</option><replaceable>threshold</></term>
+      <listitem>
+       <para>
+         Run exponential distribution pgbench test using this threshold parameter.
+         The threshold controls the distribution of access frequency on the
+         <structname>pgbench_accounts</> table.
+         See the <literal>\setrandom</> documentation below for details about
+         the impact of the threshold value.
+         When set, this option applies to all test variants (<option>-N</> for
+         skipping updates, or <option>-S</> for selects).
+       </para>
+
+       <para>
+         When run, the output is expanded to show the distribution
+         depending on the <replaceable>threshold</> value:
+
+<screen>
+...
+pgbench_account's aid selected with a truncated exponential distribution
+exponential threshold: 5.00000
+decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%
+probability of first/last percent of the range: 4.9% 0.0%
+...
+</screen>
+
+         The figures are to be interpreted as follows.
+         If the scaling factor is 10, there are 1,000,000 accounts in
+         <literal>pgbench_accounts</>.
+         The first decile, with <literal>aid</> from 1 to 100,000, is
+         drawn 39.6% of the time, that is about 4 times more than average.
+         The second decile, from 100,001 to 200,000 is drawn 24.0% of the time,
+         that is 2.4 times more than average.
+         Up to the last decile, from 900,001 to 1,000,000, which is drawn
+         0.4% of the time, well below average.
+         Moreover, the first percent of the range, that is <literal>aid</>
+         from 1 to 10,000, is drawn 4.9% of the time, that is 4.9 times more
+         than average, and the last percent, with <literal>aid</>
+         from 990,001 to 1,000,000, is drawn less than 0.1% of the time.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-f</option> <replaceable>filename</></term>
       <term><option>--file=</option><replaceable>filename</></term>
       <listitem>
@@ -320,6 +363,49 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </varlistentry>
 
      <varlistentry>
+      <term><option>--gaussian=</option><replaceable>threshold</></term>
+      <listitem>
+       <para>
+         Run gaussian distribution pgbench test using this threshold parameter.
+         The threshold controls the distribution of access frequency on the
+         <structname>pgbench_accounts</> table.
+         See the <literal>\setrandom</> documentation below for details about
+         the impact of the threshold value.
+         When set, this option applies to all test variants (<option>-N</> for
+         skipping updates, or <option>-S</> for selects).
+       </para>
+
+       <para>
+         When run, the output is expanded to show the distribution
+         depending on the <replaceable>threshold</> value:
+
+<screen>
+...
+pgbench_account's aid selected with a truncated gaussian distribution
+standard deviation threshold: 5.00000
+decile percents: 0.0% 0.1% 2.1% 13.6% 34.1% 34.1% 13.6% 2.1% 0.1% 0.0%
+probability of max/min percent of the range: 4.0% 0.0%
+...
+</screen>
+
+         The figures are to be interpreted as follows.
+         If the scaling factor is 10, there are 1,000,000 accounts in
+         <literal>pgbench_accounts</>.
+         The first decile, with <literal>aid</> from 1 to 100,000, is
+         drawn less than 0.1% of the time.
+         The second, from 100,001 to 200,000 is drawn about 0.1% of the time...
+         up to the fifth decile, from 400,001 to 500,000, which
+         is drawn 34.1% of the time, about 3.4 times more than average,
+         and then the gaussian curve is symmetric for the last five deciles.
+         Moreover, the central percent of the range, that is <literal>aid</>
+         from 490,001 to 500,000, is drawn 4.0% of the time, 4.0 times more
+         than average, and the last percent, with <literal>aid</>
+         from 990,001 to 1,000,000, is drawn less than 0.1% of the time.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-j</option> <replaceable>threads</></term>
       <term><option>--jobs=</option><replaceable>threads</></term>
       <listitem>

#25

Alvaro Herrera

alvherre@2ndquadrant.com

over 11 years ago

In reply to: Fabien COELHO (#24)

Re: gaussian distribution pgbench -- splits Bv6

Fabien COELHO wrote:

I also have a problem with assert & Assert. I finally figured out
that Assert is not compiled in by default, thus it is generally
ignored. So it is more for debugging purposes when activated than
for guarding against some unexpected user errors.

Yes, Assert() is for debugging during development. If you need to deal
with user error, use regular if () and exit() as appropriate (ereport()
in the backend). We mostly avoid assert() in our own code.
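
The distinction can be sketched as follows. This is only an illustration, not code from the patch: parse_gaussian_threshold is a hypothetical helper, and the standard assert() stands in for PostgreSQL's Assert() macro so that the example compiles on its own.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define MIN_GAUSSIAN_THRESHOLD 2.0	/* same constant name as in the patch */

/*
 * Parse a user-supplied threshold.  User mistakes are reported with an
 * ordinary run-time check.  Returns 0 on success and -1 on error.
 */
int
parse_gaussian_threshold(const char *text, double *result)
{
	char	   *end;
	double		value;

	/* A NULL pointer here would be a bug in the caller, not a user error. */
	assert(text != NULL && result != NULL);

	value = strtod(text, &end);
	if (end == text || *end != '\0')
	{
		fprintf(stderr, "invalid threshold: \"%s\"\n", text);
		return -1;
	}

	/* written as !(a >= b) so that NaN is rejected as well */
	if (!(value >= MIN_GAUSSIAN_THRESHOLD))
	{
		fprintf(stderr, "gaussian threshold must be at least %f: %f\n",
				MIN_GAUSSIAN_THRESHOLD, value);
		return -1;
	}

	*result = value;
	return 0;
}
```

A frontend program such as pgbench would typically call exit(1) when the helper returns -1. The check keeps working in builds where assertions are compiled out.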

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#26

Mitsumasa KONDO

kondo.mitsumasa@gmail.com

over 11 years ago

In reply to: Fabien COELHO (#24)

Re: gaussian distribution pgbench -- splits Bv6

Thanks for modifying the patch! I confirmed that it seems to be fine.

I think that our latest patch addresses all the community comments.
So it is really ready for committer now.

Best regards,
--
Mitsumasa KONDO

#27

Heikki Linnakangas

hlinnakangas@vmware.com

over 11 years ago

In reply to: Fabien COELHO (#16)

On 07/17/2014 11:13 PM, Fabien COELHO wrote:

However, ISTM that it is not the purpose of pgbench documentation to be a
primer about what is an exponential or gaussian distribution, so the idea
would yet be to have a relatively compact explanation, and that the
interested but clueless reader would inform themselves from Wikipedia or a
text book or a friend or a math teacher (who could be a friend as well:-).

Well, I think it's a balance. I agree that the pgbench documentation
shouldn't try to substitute for a text book or a math teacher, but I
also think that you shouldn't necessarily need to refer to a text book
or a math teacher in order to figure out how to use pgbench. Saying
"it's complicated, so we don't have to explain it" would be a cop out;
we need to *make* it simple. And if there's no way to do that, then
IMHO we should reject the patch in favor of some future patch that
implements something that will be easy for users to understand.

[nttcom@localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.00000

decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
highest/lowest percent of the range: 9.5% 0.0%

I don't have a clue what that means. None.

Maybe we could add in front of the decile/percent

"distribution of increasing account key values selected by pgbench:"

I still wouldn't know what that meant. And it misses the point
anyway: if the documentation is good, this will be unnecessary. If
the documentation is bad, a printout that tries to illustrate it by
example is not an acceptable substitute.

The decile description is quite classic when discussing statistics.

IMHO we should include a diagram for each distribution. A diagram would
be much easier to understand than a decile or verbal explanation.

The only problem is that the build infrastructure doesn't currently
support including images in the docs. That's been discussed before, and
I think we even used to have a couple of images there a long time ago.
Now would be a good time to bite the bullet and add the support.
We got fairly close to a consensus on how to do it in this thread:
www.postgresql.org/message-id/flat/20120712181636.GC11063@momjian.us.
The biggest problem was choosing an editor that has a fairly stable file
format, so that we don't get huge diffs every time someone moves a line
in a diagram. One work-around for that is to use graphviz and/or gnuplot
as the source format, instead of a graphical editor.

- Heikki


#28

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Fabien COELHO (#22)

Re: gaussian distribution pgbench -- splits v4

On Wed, Jul 23, 2014 at 12:39 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

3. Similarly, I suggest that the use of gaussian or uniform be an
error when argc < 6 OR argc > 6. I also suggest that the
parenthesized distribution type be dropped from the error message in
all cases.

I wish to agree, but my interpretation of the previous code is that they
were ignored before, so ISTM that we are stuck with keeping the same
unfortunate behavior.

I don't agree. I'm not in a huge hurry to fix all the places where
pgbench currently lacks error checks just because I don't have enough
to do (hint: I do have enough to do), but when we're adding more
complicated syntax in one particular place, bringing the error checks
in that portion of the code up to scratch is an eminently sensible
thing to do, and we should do it.
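
A strict argument-count check of the kind suggested here might look like the sketch below. It is an assumption-laden illustration rather than the patch's code: the enum, the function name check_setrandom_args and the use of POSIX strcasecmp() in place of pg_strcasecmp() are all choices made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>			/* POSIX strcasecmp() */

enum setrandom_check
{
	SETRANDOM_OK = 0,
	SETRANDOM_MISSING,			/* too few arguments */
	SETRANDOM_EXTRA,			/* too many arguments */
	SETRANDOM_UNKNOWN			/* unrecognized distribution keyword */
};

/*
 * argv[0] is "setrandom", argv[1] the variable name, argv[2] and argv[3]
 * the bounds, argv[4] an optional distribution, argv[5] its threshold.
 */
enum setrandom_check
check_setrandom_args(int argc, const char *const argv[])
{
	assert(argv != NULL);

	if (argc < 4)
		return SETRANDOM_MISSING;
	if (argc == 4)
		return SETRANDOM_OK;	/* implicit uniform */

	if (strcasecmp(argv[4], "uniform") == 0)
		return (argc == 5) ? SETRANDOM_OK : SETRANDOM_EXTRA;

	if (strcasecmp(argv[4], "gaussian") == 0 ||
		strcasecmp(argv[4], "exponential") == 0)
	{
		if (argc < 6)
			return SETRANDOM_MISSING;
		if (argc > 6)
			return SETRANDOM_EXTRA;
		return SETRANDOM_OK;
	}

	return SETRANDOM_UNKNOWN;
}
```

Returning a status code instead of printing keeps the sketch testable; a caller would map each non-OK value to an error message and exit.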

Also, please stop changing the title of this thread every other post.
It breaks threading for me (and anyone else using gmail), and that
makes the thread hard to follow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#29

Fabien COELHO

coelho@cri.ensmp.fr

over 11 years ago

In reply to: Robert Haas (#28)

Re: gaussian distribution pgbench -- splits v4

Hello Robert,

I wish to agree, but my interpretation of the previous code is that
they were ignored before, so ISTM that we are stuck with keeping the
same unfortunate behavior.

I don't agree. I'm not in a huge hurry to fix all the places where
pgbench currently lacks error checks just because I don't have enough to
do (hint: I do have enough to do), but when we're adding more
complicated syntax in one particular place, bringing the error checks in
that portion of the code up to scratch is an eminently sensible thing to
do, and we should do it.

Ok. I'm in favor of that anyway. It is just that I was afraid that changing
behavior, however poor the said behavior, could be a blocker.

Also, please stop changing the title of this thread every other post.
It breaks threading for me (and anyone else using gmail), and that
makes the thread hard to follow.

Sorry. It does not break my mailer, which relies on internal headers, but
I'll try to be compatible with this gmail "feature" in the future.

--
Fabien.


#30

Fabien COELHO

coelho@cri.ensmp.fr

over 11 years ago

In reply to: Robert Haas (#28)

2 attachment(s)

Re: gaussian distribution pgbench -- splits v4

Hello Robert,

3. Similarly, I suggest that the use of gaussian or uniform be an
error when argc < 6 OR argc > 6. I also suggest that the
parenthesized distribution type be dropped from the error message in
all cases.

I wish to agree, but my interpretation of the previous code is that they
were ignored before, so ISTM that we are stuck with keeping the same
unfortunate behavior.

I don't agree.

Attached B patch does turn incorrect setrandom syntax into errors instead
of ignoring extra parameters.

First A patch is repeated to help commitfest references.

--
Fabien.

Attachments:

gauss_A_4.patch (text/x-diff)

diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 0000000..6881256
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as "expo" or "gauss"
+psql test < test-init.sql
+./pgbench -M prepared -f test-XXX-run.sql -t 1000000 -P 1 -n test
+psql test < test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..e07206a 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -98,6 +98,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -471,6 +473,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-threshold).
+ */
+static int64
+getExponentialRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double cut, uniform, rand;
+	Assert(threshold > 0.0);
+	cut = exp(-threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread->random_state);
+	/*
+	 * inner expression in (cut, 1] (if threshold > 0),
+	 * rand in [0, 1)
+	 */
+	Assert((1.0 - cut) != 0.0);
+	rand = - log(cut + (1.0 - cut) * uniform) / threshold;
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -threshold < stdev <= threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) <= 2 => r >= e^{-2} ~ 0.135, then when taking the average
+	 * sinus multiplier as 2/pi, we have a 8.6% looping probability in the
+	 * worst case. For a 5.0 threshold value, the looping probability
+	 * is about e^{-5} * 2 / pi ~ 0.43%.
+	 */
+	do
+	{
+		/*
+		 * pg_erand48 generates [0,1), but for the basic version of the
+		 * Box-Muller transform the two uniformly distributed random numbers
+		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+		 */
+		double rand1 = 1.0 - pg_erand48(thread->random_state);
+		double rand2 = 1.0 - pg_erand48(thread->random_state);
+
+		/* Box-Muller basic form transform */
+		double var_sqrt = sqrt(-2.0 * log(rand1));
+		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+		/*
+		 * we may try with cos, but there may be a bias induced if the previous
+		 * value fails the test. To be on the safe side, let us try over.
+		 */
+	}
+	while (stdev < -threshold || stdev >= threshold);
+
+	/* stdev is in [-threshold, threshold), normalization to [0,1) */
+	rand = (stdev + threshold) / (threshold * 2.0);
+
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
 /* call PQexec() and exit() on failure */
 static void
 executeStatement(PGconn *con, const char *sql)
@@ -1319,6 +1391,7 @@ top:
 			char	   *var;
 			int64		min,
 						max;
+			double		threshold = 0;
 			char		res[64];
 
 			if (*argv[2] == ':')
@@ -1364,11 +1437,11 @@ top:
 			}
 
 			/*
-			 * getrand() needs to be able to subtract max from min and add one
-			 * to the result without overflowing.  Since we know max > min, we
-			 * can detect overflow just by checking for a negative result. But
-			 * we must check both that the subtraction doesn't overflow, and
-			 * that adding one to the result doesn't overflow either.
+			 * Generate random number functions need to be able to subtract
+			 * max from min and add one to the result without overflowing.
+			 * Since we know max > min, we can detect overflow just by checking
+			 * for a negative result. But we must check both that the subtraction
+			 * doesn't overflow, and that adding one to the result doesn't overflow either.
 			 */
 			if (max - min < 0 || (max - min) + 1 < 0)
 			{
@@ -1377,10 +1450,63 @@ top:
 				return true;
 			}
 
+			if (argc == 4) /* uniform */
+			{
 #ifdef DEBUG
-			printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
+				printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
 #endif
-			snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+				snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+			}
+			else if ((pg_strcasecmp(argv[4], "gaussian") == 0) ||
+				 (pg_strcasecmp(argv[4], "exponential") == 0))
+			{
+				if (*argv[5] == ':')
+				{
+					if ((var = getVariable(st, argv[5] + 1)) == NULL)
+					{
+						fprintf(stderr, "%s: invalid threshold number %s\n", argv[0], argv[5]);
+						st->ecnt++;
+						return true;
+					}
+					threshold = strtod(var, NULL);
+				}
+				else
+					threshold = strtod(argv[5], NULL);
+
+				if (pg_strcasecmp(argv[4], "gaussian") == 0)
+				{
+					if (threshold < MIN_GAUSSIAN_THRESHOLD)
+					{
+						fprintf(stderr, "%s: gaussian threshold must be more than %f\n", argv[5], MIN_GAUSSIAN_THRESHOLD);
+						st->ecnt++;
+						return true;
+					}
+#ifdef DEBUG
+					printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getGaussianRand(thread, min, max, threshold));
+#endif
+					snprintf(res, sizeof(res), INT64_FORMAT, getGaussianRand(thread, min, max, threshold));
+				}
+				else if (pg_strcasecmp(argv[4], "exponential") == 0)
+				{
+					if (threshold <= 0.0)
+					{
+						fprintf(stderr, "%s: exponential threshold must be strictly positive\n", argv[5]);
+						st->ecnt++;
+						return true;
+					}
+#ifdef DEBUG
+					printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getExponentialRand(thread, min, max, threshold));
+#endif
+					snprintf(res, sizeof(res), INT64_FORMAT, getExponentialRand(thread, min, max, threshold));
+				}
+			}
+			else /* uniform with extra arguments */
+			{
+#ifdef DEBUG
+				printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
+#endif
+				snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+			}
 
 			if (!putVariable(st, argv[0], argv[1], res))
 			{
@@ -1920,9 +2046,34 @@ process_commands(char *buf)
 				exit(1);
 			}
 
-			for (j = 4; j < my_commands->argc; j++)
-				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
-						my_commands->argv[0], my_commands->argv[j]);
+			if (my_commands->argc == 4 ) /* uniform */
+			{
+				/* nothing to do */
+			}
+			else if ((pg_strcasecmp(my_commands->argv[4], "gaussian") == 0) ||
+				 (pg_strcasecmp(my_commands->argv[4], "exponential") == 0))
+			{
+				if (my_commands->argc < 6)
+				{
+					fprintf(stderr, "%s(%s): missing argument\n", my_commands->argv[0], my_commands->argv[4]);
+					exit(1);
+				}
+
+				for (j = 6; j < my_commands->argc; j++)
+					fprintf(stderr, "%s(%s): extra argument \"%s\" ignored\n",
+							my_commands->argv[0], my_commands->argv[4], my_commands->argv[j]);
+			}
+			else /* uniform with extra argument */
+			{
+				int arg_pos = 4;
+
+				if (pg_strcasecmp(my_commands->argv[4], "uniform") == 0)
+					arg_pos++;
+
+				for (j = arg_pos; j < my_commands->argc; j++)
+					fprintf(stderr, "%s(uniform): extra argument \"%s\" ignored\n",
+							my_commands->argv[0], my_commands->argv[j]);
+			}
 		}
 		else if (pg_strcasecmp(my_commands->argv[0], "set") == 0)
 		{
diff --git a/contrib/pgbench/test-expo-check.sql b/contrib/pgbench/test-expo-check.sql
new file mode 100644
index 0000000..fbf35fd
--- /dev/null
+++ b/contrib/pgbench/test-expo-check.sql
@@ -0,0 +1,14 @@
+-- val, min, max, threshold
+CREATE OR REPLACE FUNCTION
+expoProba(INTEGER, INTEGER, INTEGER, DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+  SELECT (exp(-$4*($1-$2)/($3-$2+1)) - exp(-$4*($1-$2+1)/($3-$2+1))) /
+         (1.0 - exp(-$4));
+$$ LANGUAGE SQL;
+
+SELECT SUM(cnt) FROM pgbench_dist;
+
+SELECT id, 1.0*cnt/SUM(cnt) OVER(), expoProba(id, 0, 99, 10.0)
+FROM pgbench_dist
+ORDER BY id;
+
diff --git a/contrib/pgbench/test-expo-run.sql b/contrib/pgbench/test-expo-run.sql
new file mode 100644
index 0000000..1d476bc
--- /dev/null
+++ b/contrib/pgbench/test-expo-run.sql
@@ -0,0 +1,2 @@
+\setrandom id 0 99 exponential 10.0
+UPDATE pgbench_dist SET cnt=cnt+1 WHERE id = :id;
diff --git a/contrib/pgbench/test-gauss-check.sql b/contrib/pgbench/test-gauss-check.sql
new file mode 100644
index 0000000..7d56117
--- /dev/null
+++ b/contrib/pgbench/test-gauss-check.sql
@@ -0,0 +1,57 @@
+-- approximation with maximal error of 1.2 10E-07, as told from
+-- https://en.wikipedia.org/wiki/Error_function#Numerical_approximation
+CREATE OR REPLACE FUNCTION erf(x DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+DECLARE
+  t DOUBLE PRECISION := 1.0 / ( 1.0 + 0.5 * ABS(x));
+  tau DOUBLE PRECISION;
+BEGIN
+  IF ABS(x) >= 6.0 THEN
+    -- avoid underflow error
+    tau := 0.0;
+  ELSE
+    -- use approximation
+    tau := t * exp(-x*x - 1.26551223
+         + t * (1.00002368
+         + t * (0.37409196
+         + t * (0.09678418
+         + t * (-0.18628806
+         + t * (0.27886807
+         + t * (-1.13520398
+         + t * (1.48851587
+         + t * (-0.82215223
+         + t *  0.17087277)))))))));
+  END IF;
+  IF x >= 0 THEN
+    RETURN 1.0 - tau;
+  ELSE
+    RETURN tau - 1.0;
+  END IF;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE OR REPLACE FUNCTION PHI(DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+  SELECT 0.5 * ( 1.0 + erf( $1 / SQRT(2.0) ) );
+$$ LANGUAGE SQL;
+
+CREATE OR REPLACE FUNCTION
+gaussianProba(i INTEGER, mini INTEGER, maxi INTEGER, threshold DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+DECLARE
+  extent DOUBLE PRECISION;
+  mu DOUBLE PRECISION;
+BEGIN
+  extent := maxi - mini + 1.0;
+  mu := 0.5 * (maxi + mini);
+  RETURN (PHI(2.0 * threshold * (i - mini - mu + 0.5) / extent) -
+          PHI(2.0 * threshold * (i - mini - mu - 0.5) / extent))
+         -- truncated gaussian
+	 / ( 2.0 * PHI(threshold) - 1.0 );
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT SUM(cnt) FROM pgbench_dist;
+SELECT id, 1.0*cnt/SUM(cnt) OVER(), gaussianProba(id, 0, 99, 2.0)
+FROM pgbench_dist
+ORDER BY id;
diff --git a/contrib/pgbench/test-gauss-run.sql b/contrib/pgbench/test-gauss-run.sql
new file mode 100644
index 0000000..984a3b4
--- /dev/null
+++ b/contrib/pgbench/test-gauss-run.sql
@@ -0,0 +1,2 @@
+\setrandom id 0 99 gaussian 2.0
+UPDATE pgbench_dist SET cnt=cnt+1 WHERE id = :id;
diff --git a/contrib/pgbench/test-init.sql b/contrib/pgbench/test-init.sql
new file mode 100644
index 0000000..84f7cc9
--- /dev/null
+++ b/contrib/pgbench/test-init.sql
@@ -0,0 +1,4 @@
+DROP TABLE IF EXISTS pgbench_dist;
+CREATE UNLOGGED TABLE pgbench_dist(id SERIAL PRIMARY KEY, cnt INTEGER NOT NULL DEFAULT 0);
+INSERT INTO pgbench_dist(id, cnt) 
+  SELECT i, 0 FROM generate_series(0, 99) AS i;
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index f264c24..276476a 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -748,8 +748,8 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
 
    <varlistentry>
     <term>
-     <literal>\setrandom <replaceable>varname</> <replaceable>min</> <replaceable>max</></literal>
-    </term>
+     <literal>\setrandom <replaceable>varname</> <replaceable>min</> <replaceable>max</> [ uniform | [ { gaussian | exponential } <replaceable>threshold</> ] ]</literal>
+     </term>
 
     <listitem>
      <para>
@@ -761,9 +761,76 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </para>
 
      <para>
+      By default, all values in the range are drawn with equal probability,
+      that is the distribution is <literal>uniform</>.
+      The <literal>gaussian</> and <literal>exponential</> options modify
+      this behavior; each requires a mandatory threshold which determines
+      the precise shape of the distribution.
+     </para>
+
+     <para>
+      <!-- introduction -->
+      For a Gaussian distribution, the interval is mapped onto a standard
+      <ulink url="http://en.wikipedia.org/wiki/Normal_distribution">normal distribution</ulink>
+      (the classical bell-shaped Gaussian curve) truncated at
+      <literal>-threshold</> on the left and <literal>+threshold</>
+      on the right.
+      <!-- formula -->
+      To be precise, if <literal>PHI(x)</> is the cumulative distribution
+      function of the standard normal distribution, with mean <literal>mu</>
+      defined as <literal>(max + min) / 2.0</>, then value <replaceable>i</>
+      between <replaceable>min</> and <replaceable>max</> inclusive is drawn
+      with probability:
+      <literal>
+        (PHI(2.0 * threshold * (i - min - mu + 0.5) / (max - min + 1)) -
+         PHI(2.0 * threshold * (i - min - mu - 0.5) / (max - min + 1))) /
+         (2.0 * PHI(threshold) - 1.0)
+      </>
+      <!-- intuition -->
+      The larger the <replaceable>threshold</>, the more frequently values
+      close to the middle of the interval are drawn, and the less frequently
+      values close to the <replaceable>min</> and <replaceable>max</> bounds.
+      <!-- rule of thumb -->
+      For a Gaussian distribution, about 67% of values are drawn from the middle
+      <literal>1.0 / threshold</> and 95% in the middle <literal>2.0 / threshold</>,
+      thus if <replaceable>threshold</> is 4.0, 67% of values are drawn from the middle
+      quarter and 95% from the middle half of the interval.
+      <!-- constraint -->
+      The minimum <replaceable>threshold</> is 2.0 for performance of
+      the Box-Muller transform.
+     </para>
+
+     <para>
+      <!-- introduction -->
+      For an exponential distribution, the <replaceable>threshold</>
+      parameter controls the distribution by truncating a quickly-decreasing
+      <ulink url="http://en.wikipedia.org/wiki/Exponential_distribution">exponential distribution</ulink>
+      at <replaceable>threshold</>, and then projecting onto integers between
+      the bounds.
+      <!-- formula -->
+      To be precise, value <replaceable>i</> between <replaceable>min</> and
+      <replaceable>max</> inclusive is drawn with probability:
+      <literal>(exp(-threshold*(i-min)/(max+1-min)) -
+       exp(-threshold*(i+1-min)/(max+1-min))) / (1.0 - exp(-threshold))</>.
+      <!-- intuition -->
+      Intuitively, the larger the <replaceable>threshold</>, the more
+      frequently values close to <replaceable>min</> are accessed, and the
+      less frequently values close to <replaceable>max</> are accessed.
+      The closer to 0 the threshold, the flatter (more uniform) the access
+      distribution.
+      <!-- rule of thumb -->
+      A crude approximation of the distribution is that the most frequent 1%
+      values in the range, close to <replaceable>min</>, are drawn
+      <replaceable>threshold</>%  of the time.
+      <!-- constraint -->
+      The <replaceable>threshold</> value must be strictly positive with the
+      <literal>exponential</> option.
+     </para>
+
+     <para>
       Example:
 <programlisting>
-\setrandom aid 1 :naccounts
+\setrandom aid 1 :naccounts gaussian 5.0
 </programlisting></para>
     </listitem>
    </varlistentry>

gauss_B_7.patch (text/x-diff)

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index e07206a..685ff03 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -41,6 +41,7 @@
 #include <math.h>
 #include <signal.h>
 #include <sys/time.h>
+#include <assert.h>
 #ifdef HAVE_SYS_SELECT_H
 #include <sys/select.h>
 #endif
@@ -173,6 +174,11 @@ bool		is_connect;			/* establish connection for each transaction */
 bool		is_latencies;		/* report per-command latencies */
 int			main_pid;			/* main process id used in log filename */
 
+/* gaussian/exponential distribution tests */
+double		threshold;          /* threshold for gaussian or exponential */
+bool        use_gaussian = false;
+bool		use_exponential = false;
+
 char	   *pghost = "";
 char	   *pgport = "";
 char	   *login = NULL;
@@ -294,11 +300,11 @@ static int	num_commands = 0;	/* total number of Command structs */
 static int	debug = 0;			/* debug flag */
 
 /* default scenario */
-static char *tpc_b = {
+static char *tpc_b_fmt = {
 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"\\setrandom bid 1 :nbranches\n"
 	"\\setrandom tid 1 :ntellers\n"
 	"\\setrandom delta -5000 5000\n"
@@ -312,11 +318,11 @@ static char *tpc_b = {
 };
 
 /* -N case */
-static char *simple_update = {
+static char *simple_update_fmt = {
 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"\\setrandom bid 1 :nbranches\n"
 	"\\setrandom tid 1 :ntellers\n"
 	"\\setrandom delta -5000 5000\n"
@@ -328,9 +334,9 @@ static char *simple_update = {
 };
 
 /* -S case */
-static char *select_only = {
+static char *select_only_fmt = {
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
 };
 
@@ -377,6 +383,8 @@ usage(void)
 		   "  -v, --vacuum-all         vacuum all four standard tables before tests\n"
 		   "  --aggregate-interval=NUM aggregate data over NUM seconds\n"
 		   "  --sampling-rate=NUM      fraction of transactions to log (e.g. 0.01 for 1%%)\n"
+		   "  --exponential=NUM        exponential distribution with NUM threshold parameter\n"
+		   "  --gaussian=NUM           gaussian distribution with NUM threshold parameter\n"
 		   "\nCommon options:\n"
 		   "  -d, --debug              print debugging output\n"
 	  "  -h, --host=HOSTNAME      database server host or socket directory\n"
@@ -1450,15 +1458,17 @@ top:
 				return true;
 			}
 
-			if (argc == 4) /* uniform */
+			if (argc == 4 || /* uniform without or with "uniform" keyword */
+				(argc == 5 && pg_strcasecmp(argv[4], "uniform") == 0))
 			{
 #ifdef DEBUG
 				printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
 #endif
 				snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
 			}
-			else if ((pg_strcasecmp(argv[4], "gaussian") == 0) ||
-				 (pg_strcasecmp(argv[4], "exponential") == 0))
+			else if (argc == 6 &&
+					 ((pg_strcasecmp(argv[4], "gaussian") == 0) ||
+					  (pg_strcasecmp(argv[4], "exponential") == 0)))
 			{
 				if (*argv[5] == ':')
 				{
@@ -1500,12 +1510,11 @@ top:
 					snprintf(res, sizeof(res), INT64_FORMAT, getExponentialRand(thread, min, max, threshold));
 				}
 			}
-			else /* uniform with extra arguments */
+			else /* this means an error somewhere in the parsing phase... */
 			{
-#ifdef DEBUG
-				printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
-#endif
-				snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+				fprintf(stderr, "%s: unexpected arguments\n", argv[0]);
+				st->ecnt++;
+				return true;
 			}
 
 			if (!putVariable(st, argv[0], argv[1], res))
@@ -2040,39 +2049,50 @@ process_commands(char *buf)
 
 		if (pg_strcasecmp(my_commands->argv[0], "setrandom") == 0)
 		{
+			/* parsing:
+			 * \setrandom variable min max [uniform]
+			 * \setrandom variable min max (gaussian|exponential) threshold
+			 */
+
 			if (my_commands->argc < 4)
 			{
 				fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
 				exit(1);
 			}
+			/* argc >= 4 */
 
-			if (my_commands->argc == 4 ) /* uniform */
+			if (my_commands->argc == 4 || /* uniform without/with "uniform" keyword */
+				(my_commands->argc == 5 &&
+				 pg_strcasecmp(my_commands->argv[4], "uniform") == 0))
 			{
 				/* nothing to do */
 			}
-			else if ((pg_strcasecmp(my_commands->argv[4], "gaussian") == 0) ||
-				 (pg_strcasecmp(my_commands->argv[4], "exponential") == 0))
+			else if (/* argc >= 5 */
+					 (pg_strcasecmp(my_commands->argv[4], "gaussian") == 0) ||
+					 (pg_strcasecmp(my_commands->argv[4], "exponential") == 0))
 			{
 				if (my_commands->argc < 6)
 				{
-					fprintf(stderr, "%s(%s): missing argument\n", my_commands->argv[0], my_commands->argv[4]);
+					fprintf(stderr, "%s(%s): missing threshold argument\n", my_commands->argv[0], my_commands->argv[4]);
+					exit(1);
+				}
+				else if (my_commands->argc > 6)
+				{
+					fprintf(stderr, "%s(%s): too many arguments (extra:",
+							my_commands->argv[0], my_commands->argv[4]);
+					for (j = 6; j < my_commands->argc; j++)
+						fprintf(stderr, " %s", my_commands->argv[j]);
+					fprintf(stderr, ")\n");
 					exit(1);
 				}
-
-				for (j = 6; j < my_commands->argc; j++)
-					fprintf(stderr, "%s(%s): extra argument \"%s\" ignored\n",
-							my_commands->argv[0], my_commands->argv[4], my_commands->argv[j]);
 			}
-			else /* uniform with extra argument */
+			else /* cannot parse, unexpected arguments */
 			{
-				int arg_pos = 4;
-
-				if (pg_strcasecmp(my_commands->argv[4], "uniform") == 0)
-					arg_pos++;
-
-				for (j = arg_pos; j < my_commands->argc; j++)
-					fprintf(stderr, "%s(uniform): extra argument \"%s\" ignored\n",
-							my_commands->argv[0], my_commands->argv[j]);
+				fprintf(stderr, "%s: unexpected arguments (bad:", my_commands->argv[0]);
+				for (j = 4; j < my_commands->argc; j++)
+					fprintf(stderr, " %s", my_commands->argv[j]);
+				fprintf(stderr, ")\n");
+				exit(1);
 			}
 		}
 		else if (pg_strcasecmp(my_commands->argv[0], "set") == 0)
@@ -2329,6 +2349,30 @@ process_builtin(char *tb)
 	return my_commands;
 }
 
+/*
+ * compute the probability of the truncated gaussian random generation
+ * to draw values in the i-th slot of the range.
+ */
+static double gaussianProbability(int i, int slots, double threshold)
+{
+	assert(1 <= i && i <= slots);
+	return (0.50 * (erf (threshold * (1.0 - 1.0 / slots * (2.0 * i - 2.0)) / sqrt(2.0)) -
+		erf (threshold * (1.0 - 1.0 / slots * 2.0 * i) / sqrt(2.0))) /
+		erf (threshold / sqrt(2.0)));
+}
+
+/*
+ * compute the probability of the truncated exponential random generation
+ * to draw values in the i-th slot of the range.
+ */
+static double exponentialProbability(int i, int slots, double threshold)
+{
+	assert(1 <= i && i <= slots);
+	return (exp(- threshold * (i - 1) / slots) - exp(- threshold * i / slots)) /
+		(1.0 - exp(- threshold));
+}
+
+
 /* print out results */
 static void
 printResults(int ttype, int64 normal_xacts, int nclients,
@@ -2340,7 +2384,7 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 	double		time_include,
 				tps_include,
 				tps_exclude;
-	char	   *s;
+	char	   *s, *d;
 
 	time_include = INSTR_TIME_GET_DOUBLE(total_time);
 	tps_include = normal_xacts / time_include;
@@ -2356,8 +2400,46 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 	else
 		s = "Custom query";
 
-	printf("transaction type: %s\n", s);
+	if (use_gaussian)
+		d = "Gaussian distribution ";
+	else if (use_exponential)
+		d = "Exponential distribution ";
+	else
+		d = ""; /* default uniform case */
+
+	printf("transaction type: %s%s\n", d, s);
 	printf("scaling factor: %d\n", scale);
+
+	/* output in gaussian distribution benchmark */
+	if (use_gaussian)
+	{
+		int i;
+		printf("pgbench_account's aid selected with a truncated gaussian distribution\n");
+		printf("standard deviation threshold: %.5f\n", threshold);
+		printf("decile percents:");
+		for (i = 1; i <= 10; i++)
+			printf(" %.1f%%", 100.0 * gaussianProbability(i, 10, threshold));
+		printf("\n");
+		printf("probability of max/min percent of the range: %.1f%% %.1f%%\n",
+			  100.0 * gaussianProbability(50, 100, threshold),
+			  100.0 * gaussianProbability(100, 100, threshold));
+	}
+	/* output in exponential distribution benchmark */
+	else if (use_exponential)
+	{
+		int i;
+		printf("pgbench_account's aid selected with a truncated exponential distribution\n");
+		printf("exponential threshold: %.5f\n", threshold);
+		printf("decile percents:");
+		for (i = 1; i <= 10; i++)
+			printf(" %.1f%%",
+				   100.0 * exponentialProbability(i, 10, threshold));
+		printf("\n");
+		printf("probability of first/last percent of the range: %.1f%% %.1f%%\n",
+			   100.0 * exponentialProbability(1, 100, threshold),
+			   100.0 * exponentialProbability(100, 100, threshold));
+	}
+
 	printf("query mode: %s\n", QUERYMODE[querymode]);
 	printf("number of clients: %d\n", nclients);
 	printf("number of threads: %d\n", nthreads);
@@ -2488,6 +2570,8 @@ main(int argc, char **argv)
 		{"unlogged-tables", no_argument, &unlogged_tables, 1},
 		{"sampling-rate", required_argument, NULL, 4},
 		{"aggregate-interval", required_argument, NULL, 5},
+		{"gaussian", required_argument, NULL, 6},
+		{"exponential", required_argument, NULL, 7},
 		{"rate", required_argument, NULL, 'R'},
 		{NULL, 0, NULL, 0}
 	};
@@ -2768,6 +2852,25 @@ main(int argc, char **argv)
 				}
 #endif
 				break;
+			case 6:
+				use_gaussian = true;
+				threshold = atof(optarg);
+				if(threshold < MIN_GAUSSIAN_THRESHOLD)
+				{
+					fprintf(stderr, "--gaussian=NUM must be at least %f: %f\n",
+							MIN_GAUSSIAN_THRESHOLD, threshold);
+					exit(1);
+				}
+				break;
+			case 7:
+				use_exponential = true;
+				threshold = atof(optarg);
+				if(threshold <= 0.0)
+				{
+					fprintf(stderr, "--exponential=NUM must be more than 0.0\n");
+					exit(1);
+				}
+				break;
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 				exit(1);
@@ -2965,6 +3068,17 @@ main(int argc, char **argv)
 		}
 	}
 
+	/* set :threshold variable */
+	if(getVariable(&state[0], "threshold") == NULL)
+	{
+		snprintf(val, sizeof(val), "%lf", threshold);
+		for (i = 0; i < nclients; i++)
+		{
+			if (!putVariable(&state[i], "startup", "threshold", val))
+				exit(1);
+		}
+	}
+
 	if (!is_no_vacuum)
 	{
 		fprintf(stderr, "starting vacuum...");
@@ -2987,25 +3101,24 @@ main(int argc, char **argv)
 	srandom((unsigned int) INSTR_TIME_GET_MICROSEC(start_time));
 
 	/* process builtin SQL scripts */
-	switch (ttype)
-	{
-		case 0:
-			sql_files[0] = process_builtin(tpc_b);
-			num_files = 1;
-			break;
-
-		case 1:
-			sql_files[0] = process_builtin(select_only);
-			num_files = 1;
-			break;
-
-		case 2:
-			sql_files[0] = process_builtin(simple_update);
-			num_files = 1;
-			break;
-
-		default:
-			break;
+	if (ttype < 3)
+	{
+		char *fmt, *distribution, *queries;
+		int ret;
+		fmt = (ttype == 0)? tpc_b_fmt:
+			  (ttype == 1)? select_only_fmt:
+			  (ttype == 2)? simple_update_fmt: NULL;
+		assert(fmt != NULL);
+		distribution =
+			use_gaussian? " gaussian :threshold":
+			use_exponential? " exponential :threshold":
+			"" /* default uniform case */ ;
+		queries = pg_malloc(strlen(fmt) + strlen(distribution) + 1);
+		ret = sprintf(queries, fmt, distribution);
+		assert(ret >= 0);
+		sql_files[0] = process_builtin(queries);
+		num_files = 1;
+		pg_free(queries);
 	}
 
 	/* set up thread data structures */
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index 276476a..b718dd3 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -307,6 +307,49 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </varlistentry>
 
      <varlistentry>
+      <term><option>--exponential=</option><replaceable>threshold</></term>
+      <listitem>
+       <para>
+         Run exponential distribution pgbench test using this threshold parameter.
+         The threshold controls the distribution of access frequency on the
+         <structname>pgbench_accounts</> table.
+         See the <literal>\setrandom</> documentation below for details about
+         the impact of the threshold value.
+         When set, this option applies to all test variants (<option>-N</> for
+         skipping updates, or <option>-S</> for selects).
+       </para>
+
+       <para>
+         When run, the output is expanded to show the distribution
+         depending on the <replaceable>threshold</> value:
+
+<screen>
+...
+pgbench_account's aid selected with a truncated exponential distribution
+exponential threshold: 5.00000
+decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%
+probability of first/last percent of the range: 4.9% 0.0%
+...
+</screen>
+
+         The figures are to be interpreted as follows.
+         If the scaling factor is 10, there are 1,000,000 accounts in
+         <literal>pgbench_accounts</>.
+         The first decile, with <literal>aid</> from 1 to 100,000, is
+         drawn 39.6% of the time, that is about 4 times more than average.
+         The second decile, from 100,001 to 200,000, is drawn 24.0% of the time,
+         that is 2.4 times more than average, and so on
+         up to the last decile, from 900,001 to 1,000,000, which is drawn
+         0.4% of the time, well below average.
+         Moreover, the first percent of the range, that is <literal>aid</>
+         from 1 to 10,000, is drawn 4.9% of the time, that is 4.9 times more
+         than average, and the last percent, with <literal>aid</>
+         from 990,001 to 1,000,000, is drawn less than 0.1% of the time.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-f</option> <replaceable>filename</></term>
       <term><option>--file=</option><replaceable>filename</></term>
       <listitem>
@@ -320,6 +363,49 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </varlistentry>
 
      <varlistentry>
+      <term><option>--gaussian=</option><replaceable>threshold</></term>
+      <listitem>
+       <para>
+         Run gaussian distribution pgbench test using this threshold parameter.
+         The threshold controls the distribution of access frequency on the
+         <structname>pgbench_accounts</> table.
+         See the <literal>\setrandom</> documentation below for details about
+         the impact of the threshold value.
+         When set, this option applies to all test variants (<option>-N</> for
+         skipping updates, or <option>-S</> for selects).
+       </para>
+
+       <para>
+         When run, the output is expanded to show the distribution
+         depending on the <replaceable>threshold</> value:
+
+<screen>
+...
+pgbench_account's aid selected with a truncated gaussian distribution
+standard deviation threshold: 5.00000
+decile percents: 0.0% 0.1% 2.1% 13.6% 34.1% 34.1% 13.6% 2.1% 0.1% 0.0%
+probability of max/min percent of the range: 4.0% 0.0%
+...
+</screen>
+
+         The figures are to be interpreted as follows.
+         If the scaling factor is 10, there are 1,000,000 accounts in
+         <literal>pgbench_accounts</>.
+         The first decile, with <literal>aid</> from 1 to 100,000, is
+         drawn less than 0.1% of the time.
+         The second, from 100,001 to 200,000, is drawn about 0.1% of the time,
+         and so on up to the fifth decile, from 400,001 to 500,000, which
+         is drawn 34.1% of the time, about 3.4 times more than average;
+         the gaussian curve is then symmetric for the last five deciles.
+         Moreover, the most frequently drawn percent of the range, that is <literal>aid</>
+         from 490,001 to 500,000, is drawn 4.0% of the time, 4.0 times more
+         than average, and the least frequently drawn percent, with <literal>aid</>
+         from 990,001 to 1,000,000, is drawn less than 0.1% of the time.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-j</option> <replaceable>threads</></term>
       <term><option>--jobs=</option><replaceable>threads</></term>
       <listitem>

#31

Fabien COELHO

coelho@cri.ensmp.fr

over 11 years ago

In reply to: Fabien COELHO (#30)

2 attachment(s)

Re: gaussian distribution pgbench -- splits v4

The attached B patch turns incorrect setrandom syntax into errors instead of
ignoring extra parameters.

The A patch is repeated first to help commitfest references.

Oops, I applied the change to the wrong part :-(

Here is the change in part A, which checks the setrandom syntax, with B
attached for completeness.
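
For readers skimming the thread, the argument-shape check that part A adds for `\setrandom` can be summarized as a small standalone function. This is only a sketch of the decision order, assuming the grammar given in the patch comment; `check_setrandom`, `same_word` and the `SR_*` codes are hypothetical names, and the real code prints a message and exits instead of returning a code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result codes, one per branch of the syntax check. */
enum setrandom_syntax
{
    SR_OK,
    SR_MISSING_ARGUMENT,
    SR_MISSING_THRESHOLD,
    SR_TOO_MANY_ARGUMENTS,
    SR_UNEXPECTED_ARGUMENTS
};

/* Case-insensitive equality; a stand-in for pg_strcasecmp(a, b) == 0. */
static int
same_word(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0')
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/*
 * Accepted forms:
 *   setrandom variable min max [uniform]
 *   setrandom variable min max (gaussian|exponential) threshold
 */
static enum setrandom_syntax
check_setrandom(int argc, const char **argv)
{
    assert(argv != NULL);

    if (argc < 4)
        return SR_MISSING_ARGUMENT;
    if (argc == 4 || (argc == 5 && same_word(argv[4], "uniform")))
        return SR_OK;
    if (same_word(argv[4], "gaussian") || same_word(argv[4], "exponential"))
    {
        if (argc < 6)
            return SR_MISSING_THRESHOLD;
        if (argc > 6)
            return SR_TOO_MANY_ARGUMENTS;
        return SR_OK;
    }
    return SR_UNEXPECTED_ARGUMENTS;
}
```

Anything that is neither a bare range, the `uniform` keyword, nor a distribution keyword followed by exactly one threshold is rejected rather than silently ignored.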

--
Fabien.

Attachments:

gauss_A_5.patch (text/x-diff)

diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 0000000..6881256
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as "expo" or "gauss"
+psql test < test-init.sql
+./pgbench -M prepared -f test-XXX-run.sql -t 1000000 -P 1 -n test
+psql test < test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..16e44bd 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -98,6 +98,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -471,6 +473,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-threshold).
+ */
+static int64
+getExponentialRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double cut, uniform, rand;
+	Assert(threshold > 0.0);
+	cut = exp(-threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread->random_state);
+	/*
+	 * inner expression in (cut, 1] (if threshold > 0),
+	 * rand in [0, 1)
+	 */
+	Assert((1.0 - cut) != 0.0);
+	rand = - log(cut + (1.0 - cut) * uniform) / threshold;
+	/* return an int64 random number between min and max inclusive */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Draw a standard normal deviate in this loop, with
+	 * -threshold <= stdev < threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) <= 2 => r >= e^{-2} ~ 0.135, then when taking the average
+	 * sine multiplier as 2/pi, we have a 8.6% looping probability in the
+	 * worst case. For a 5.0 threshold value, the looping probability
+	 * is about e^{-5} * 2 / pi ~ 0.43%.
+	 */
+	do
+	{
+		/*
+		 * pg_erand48 generates [0,1), but for the basic version of the
+		 * Box-Muller transform the two uniformly distributed random numbers
+		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+		 */
+		double rand1 = 1.0 - pg_erand48(thread->random_state);
+		double rand2 = 1.0 - pg_erand48(thread->random_state);
+
+		/* Box-Muller basic form transform */
+		double var_sqrt = sqrt(-2.0 * log(rand1));
+		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+		/*
+		 * we may try with cos, but there may be a bias induced if the previous
+		 * value fails the test. To be on the safe side, let us try over.
+		 */
+	}
+	while (stdev < -threshold || stdev >= threshold);
+
+	/* stdev is in [-threshold, threshold), normalization to [0,1) */
+	rand = (stdev + threshold) / (threshold * 2.0);
+
+	/* return an int64 random number between min and max inclusive */
+	return min + (int64)((max - min + 1) * rand);
+}
+
 /* call PQexec() and exit() on failure */
 static void
 executeStatement(PGconn *con, const char *sql)
@@ -1319,6 +1391,7 @@ top:
 			char	   *var;
 			int64		min,
 						max;
+			double		threshold = 0;
 			char		res[64];
 
 			if (*argv[2] == ':')
@@ -1364,11 +1437,11 @@ top:
 			}
 
 			/*
-			 * getrand() needs to be able to subtract max from min and add one
-			 * to the result without overflowing.  Since we know max > min, we
-			 * can detect overflow just by checking for a negative result. But
-			 * we must check both that the subtraction doesn't overflow, and
-			 * that adding one to the result doesn't overflow either.
+			 * The random number generation functions need to be able to
+			 * subtract min from max and add one to the result without overflowing.
+			 * Since we know max > min, we can detect overflow just by checking
+			 * for a negative result. But we must check both that the subtraction
+			 * doesn't overflow, and that adding one to the result doesn't overflow either.
 			 */
 			if (max - min < 0 || (max - min) + 1 < 0)
 			{
@@ -1377,10 +1450,64 @@ top:
 				return true;
 			}
 
+			if (argc == 4 || /* uniform without or with "uniform" keyword */
+				(argc == 5 && pg_strcasecmp(argv[4], "uniform") == 0))
+			{
 #ifdef DEBUG
-			printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
+				printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getrand(thread, min, max));
 #endif
-			snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+				snprintf(res, sizeof(res), INT64_FORMAT, getrand(thread, min, max));
+			}
+			else if (argc == 6 &&
+					 ((pg_strcasecmp(argv[4], "gaussian") == 0) ||
+					  (pg_strcasecmp(argv[4], "exponential") == 0)))
+			{
+				if (*argv[5] == ':')
+				{
+					if ((var = getVariable(st, argv[5] + 1)) == NULL)
+					{
+						fprintf(stderr, "%s: invalid threshold number %s\n", argv[0], argv[5]);
+						st->ecnt++;
+						return true;
+					}
+					threshold = strtod(var, NULL);
+				}
+				else
+					threshold = strtod(argv[5], NULL);
+
+				if (pg_strcasecmp(argv[4], "gaussian") == 0)
+				{
+					if (threshold < MIN_GAUSSIAN_THRESHOLD)
+					{
+						fprintf(stderr, "%s: gaussian threshold must be at least %f\n", argv[5], MIN_GAUSSIAN_THRESHOLD);
+						st->ecnt++;
+						return true;
+					}
+#ifdef DEBUG
+					printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getGaussianRand(thread, min, max, threshold));
+#endif
+					snprintf(res, sizeof(res), INT64_FORMAT, getGaussianRand(thread, min, max, threshold));
+				}
+				else if (pg_strcasecmp(argv[4], "exponential") == 0)
+				{
+					if (threshold <= 0.0)
+					{
+						fprintf(stderr, "%s: exponential threshold must be strictly positive\n", argv[5]);
+						st->ecnt++;
+						return true;
+					}
+#ifdef DEBUG
+					printf("min: " INT64_FORMAT " max: " INT64_FORMAT " random: " INT64_FORMAT "\n", min, max, getExponentialRand(thread, min, max, threshold));
+#endif
+					snprintf(res, sizeof(res), INT64_FORMAT, getExponentialRand(thread, min, max, threshold));
+				}
+			}
+			else /* this means an error somewhere in the parsing phase... */
+			{
+				fprintf(stderr, "%s: unexpected arguments\n", argv[0]);
+				st->ecnt++;
+				return true;
+			}
 
 			if (!putVariable(st, argv[0], argv[1], res))
 			{
@@ -1914,15 +2041,51 @@ process_commands(char *buf)
 
 		if (pg_strcasecmp(my_commands->argv[0], "setrandom") == 0)
 		{
+			/* parsing:
+			 * \setrandom variable min max [uniform]
+			 * \setrandom variable min max (gaussian|exponential) threshold
+			 */
+
 			if (my_commands->argc < 4)
 			{
 				fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
 				exit(1);
 			}
+			/* argc >= 4 */
 
-			for (j = 4; j < my_commands->argc; j++)
-				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
-						my_commands->argv[0], my_commands->argv[j]);
+			if (my_commands->argc == 4 || /* uniform without/with "uniform" keyword */
+				(my_commands->argc == 5 &&
+				 pg_strcasecmp(my_commands->argv[4], "uniform") == 0))
+			{
+				/* nothing to do */
+			}
+			else if (/* argc >= 5 */
+					 (pg_strcasecmp(my_commands->argv[4], "gaussian") == 0) ||
+					 (pg_strcasecmp(my_commands->argv[4], "exponential") == 0))
+			{
+				if (my_commands->argc < 6)
+				{
+					fprintf(stderr, "%s(%s): missing threshold argument\n", my_commands->argv[0], my_commands->argv[4]);
+					exit(1);
+				}
+				else if (my_commands->argc > 6)
+				{
+					fprintf(stderr, "%s(%s): too many arguments (extra:",
+							my_commands->argv[0], my_commands->argv[4]);
+					for (j = 6; j < my_commands->argc; j++)
+						fprintf(stderr, " %s", my_commands->argv[j]);
+					fprintf(stderr, ")\n");
+					exit(1);
+				}
+			}
+			else /* cannot parse, unexpected arguments */
+			{
+				fprintf(stderr, "%s: unexpected arguments (bad:", my_commands->argv[0]);
+				for (j = 4; j < my_commands->argc; j++)
+					fprintf(stderr, " %s", my_commands->argv[j]);
+				fprintf(stderr, ")\n");
+				exit(1);
+			}
 		}
 		else if (pg_strcasecmp(my_commands->argv[0], "set") == 0)
 		{
diff --git a/contrib/pgbench/test-expo-check.sql b/contrib/pgbench/test-expo-check.sql
new file mode 100644
index 0000000..fbf35fd
--- /dev/null
+++ b/contrib/pgbench/test-expo-check.sql
@@ -0,0 +1,14 @@
+-- val, min, max, threshold
+CREATE OR REPLACE FUNCTION
+expoProba(INTEGER, INTEGER, INTEGER, DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+  SELECT (exp(-$4*($1-$2)/($3-$2+1)) - exp(-$4*($1-$2+1)/($3-$2+1))) /
+         (1.0 - exp(-$4));
+$$ LANGUAGE SQL;
+
+SELECT SUM(cnt) FROM pgbench_dist;
+
+SELECT id, 1.0*cnt/SUM(cnt) OVER(), expoProba(id, 0, 99, 10.0)
+FROM pgbench_dist
+ORDER BY id;
+
diff --git a/contrib/pgbench/test-expo-run.sql b/contrib/pgbench/test-expo-run.sql
new file mode 100644
index 0000000..1d476bc
--- /dev/null
+++ b/contrib/pgbench/test-expo-run.sql
@@ -0,0 +1,2 @@
+\setrandom id 0 99 exponential 10.0
+UPDATE pgbench_dist SET cnt=cnt+1 WHERE id = :id;
diff --git a/contrib/pgbench/test-gauss-check.sql b/contrib/pgbench/test-gauss-check.sql
new file mode 100644
index 0000000..7d56117
--- /dev/null
+++ b/contrib/pgbench/test-gauss-check.sql
@@ -0,0 +1,57 @@
+-- approximation with maximal error of 1.2e-07, as told from
+-- https://en.wikipedia.org/wiki/Error_function#Numerical_approximation
+CREATE OR REPLACE FUNCTION erf(x DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+DECLARE
+  t DOUBLE PRECISION := 1.0 / ( 1.0 + 0.5 * ABS(x));
+  tau DOUBLE PRECISION;
+BEGIN
+  IF ABS(x) >= 6.0 THEN
+    -- avoid underflow error
+    tau := 0.0;
+  ELSE
+    -- use approximation
+    tau := t * exp(-x*x - 1.26551223
+         + t * (1.00002368
+         + t * (0.37409196
+         + t * (0.09678418
+         + t * (-0.18628806
+         + t * (0.27886807
+         + t * (-1.13520398
+         + t * (1.48851587
+         + t * (-0.82215223
+         + t *  0.17087277)))))))));
+  END IF;
+  IF x >= 0 THEN
+    RETURN 1.0 - tau;
+  ELSE
+    RETURN tau - 1.0;
+  END IF;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE OR REPLACE FUNCTION PHI(DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+  SELECT 0.5 * ( 1.0 + erf( $1 / SQRT(2.0) ) );
+$$ LANGUAGE SQL;
+
+CREATE OR REPLACE FUNCTION
+gaussianProba(i INTEGER, mini INTEGER, maxi INTEGER, threshold DOUBLE PRECISION)
+RETURNS DOUBLE PRECISION IMMUTABLE STRICT AS $$
+DECLARE
+  extent DOUBLE PRECISION;
+  mu DOUBLE PRECISION;
+BEGIN
+  extent := maxi - mini + 1.0;
+  mu := 0.5 * (maxi + mini);
+  RETURN (PHI(2.0 * threshold * (i - mini - mu + 0.5) / extent) -
+          PHI(2.0 * threshold * (i - mini - mu - 0.5) / extent))
+         -- truncated gaussian
+	 / ( 2.0 * PHI(threshold) - 1.0 );
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT SUM(cnt) FROM pgbench_dist;
+SELECT id, 1.0*cnt/SUM(cnt) OVER(), gaussianProba(id, 0, 99, 2.0)
+FROM pgbench_dist
+ORDER BY id;
diff --git a/contrib/pgbench/test-gauss-run.sql b/contrib/pgbench/test-gauss-run.sql
new file mode 100644
index 0000000..984a3b4
--- /dev/null
+++ b/contrib/pgbench/test-gauss-run.sql
@@ -0,0 +1,2 @@
+\setrandom id 0 99 gaussian 2.0
+UPDATE pgbench_dist SET cnt=cnt+1 WHERE id = :id;
diff --git a/contrib/pgbench/test-init.sql b/contrib/pgbench/test-init.sql
new file mode 100644
index 0000000..84f7cc9
--- /dev/null
+++ b/contrib/pgbench/test-init.sql
@@ -0,0 +1,4 @@
+DROP TABLE IF EXISTS pgbench_dist;
+CREATE UNLOGGED TABLE pgbench_dist(id SERIAL PRIMARY KEY, cnt INTEGER NOT NULL DEFAULT 0);
+INSERT INTO pgbench_dist(id, cnt) 
+  SELECT i, 0 FROM generate_series(0, 99) AS i;
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index f264c24..276476a 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -748,8 +748,8 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
 
    <varlistentry>
     <term>
-     <literal>\setrandom <replaceable>varname</> <replaceable>min</> <replaceable>max</></literal>
-    </term>
+     <literal>\setrandom <replaceable>varname</> <replaceable>min</> <replaceable>max</> [ uniform | [ { gaussian | exponential } <replaceable>threshold</> ] ]</literal>
+     </term>
 
     <listitem>
      <para>
@@ -761,9 +761,76 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </para>
 
      <para>
+      By default, all values in the range are drawn with equal probability,
+      that is the distribution is <literal>uniform</>.
+      The <literal>gaussian</> and <literal>exponential</> options modify
+      this behavior; each requires a mandatory threshold which determines
+      the precise shape of the distribution.
+     </para>
+
+     <para>
+      <!-- introduction -->
+      For a Gaussian distribution, the interval is mapped onto a standard
+      <ulink url="http://en.wikipedia.org/wiki/Normal_distribution">normal distribution</ulink>
+      (the classical bell-shaped Gaussian curve) truncated at
+      <literal>-threshold</> on the left and <literal>+threshold</>
+      on the right.
+      <!-- formula -->
+      To be precise, if <literal>PHI(x)</> is the cumulative distribution
+      function of the standard normal distribution, with mean <literal>mu</>
+      defined as <literal>(max + min) / 2.0</>, then value <replaceable>i</>
+      between <replaceable>min</> and <replaceable>max</> inclusive is drawn
+      with probability:
+      <literal>
+        (PHI(2.0 * threshold * (i - min - mu + 0.5) / (max - min + 1)) -
+         PHI(2.0 * threshold * (i - min - mu - 0.5) / (max - min + 1))) /
+         (2.0 * PHI(threshold) - 1.0)
+      </>
+      <!-- intuition -->
+      The larger the <replaceable>threshold</>, the more frequently values
+      close to the middle of the interval are drawn, and the less frequently
+      values close to the <replaceable>min</> and <replaceable>max</> bounds.
+      <!-- rule of thumb -->
+      For a Gaussian distribution, about 68% of values are drawn from the middle
+      <literal>1.0 / threshold</> and 95% in the middle <literal>2.0 / threshold</>,
+      thus if <replaceable>threshold</> is 4.0, 68% of values are drawn from the middle
+      quarter and 95% from the middle half of the interval.
+      <!-- constraint -->
+      The minimum <replaceable>threshold</> is 2.0 for performance of
+      the Box-Muller transform.
+     </para>
+
+     <para>
+      <!-- introduction -->
+      For an exponential distribution, the <replaceable>threshold</>
+      parameter controls the distribution by truncating a quickly-decreasing
+      <ulink url="http://en.wikipedia.org/wiki/Exponential_distribution">exponential distribution</ulink>
+      at <replaceable>threshold</>, and then projecting onto integers between
+      the bounds.
+      <!-- formula -->
+      To be precise, value <replaceable>i</> between <replaceable>min</> and
+      <replaceable>max</> inclusive is drawn with probability:
+      <literal>(exp(-threshold*(i-min)/(max+1-min)) -
+       exp(-threshold*(i+1-min)/(max+1-min))) / (1.0 - exp(-threshold))</>.
+      <!-- intuition -->
+      Intuitively, the larger the <replaceable>threshold</>, the more
+      frequently values close to <replaceable>min</> are accessed, and the
+      less frequently values close to <replaceable>max</> are accessed.
+      The closer to 0 the threshold, the flatter (more uniform) the access
+      distribution.
+      <!-- rule of thumb -->
+      A crude approximation of the distribution is that the most frequent 1%
+      of values in the range, close to <replaceable>min</>, are drawn
+      <replaceable>threshold</>% of the time.
+      <!-- constraint -->
+      The <replaceable>threshold</> value must be strictly positive with the
+      <literal>exponential</> option.
+     </para>
+
+     <para>
       Example:
 <programlisting>
-\setrandom aid 1 :naccounts
+\setrandom aid 1 :naccounts gaussian 5.0
 </programlisting></para>
     </listitem>
    </varlistentry>

gauss_B_8.patch (text/x-diff)

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 16e44bd..685ff03 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -41,6 +41,7 @@
 #include <math.h>
 #include <signal.h>
 #include <sys/time.h>
+#include <assert.h>
 #ifdef HAVE_SYS_SELECT_H
 #include <sys/select.h>
 #endif
@@ -173,6 +174,11 @@ bool		is_connect;			/* establish connection for each transaction */
 bool		is_latencies;		/* report per-command latencies */
 int			main_pid;			/* main process id used in log filename */
 
+/* gaussian/exponential distribution tests */
+double		threshold;          /* threshold for gaussian or exponential */
+bool        use_gaussian = false;
+bool		use_exponential = false;
+
 char	   *pghost = "";
 char	   *pgport = "";
 char	   *login = NULL;
@@ -294,11 +300,11 @@ static int	num_commands = 0;	/* total number of Command structs */
 static int	debug = 0;			/* debug flag */
 
 /* default scenario */
-static char *tpc_b = {
+static char *tpc_b_fmt = {
 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"\\setrandom bid 1 :nbranches\n"
 	"\\setrandom tid 1 :ntellers\n"
 	"\\setrandom delta -5000 5000\n"
@@ -312,11 +318,11 @@ static char *tpc_b = {
 };
 
 /* -N case */
-static char *simple_update = {
+static char *simple_update_fmt = {
 	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
 	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"\\setrandom bid 1 :nbranches\n"
 	"\\setrandom tid 1 :ntellers\n"
 	"\\setrandom delta -5000 5000\n"
@@ -328,9 +334,9 @@ static char *simple_update = {
 };
 
 /* -S case */
-static char *select_only = {
+static char *select_only_fmt = {
 	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
-	"\\setrandom aid 1 :naccounts\n"
+	"\\setrandom aid 1 :naccounts%s\n"
 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
 };
 
@@ -377,6 +383,8 @@ usage(void)
 		   "  -v, --vacuum-all         vacuum all four standard tables before tests\n"
 		   "  --aggregate-interval=NUM aggregate data over NUM seconds\n"
 		   "  --sampling-rate=NUM      fraction of transactions to log (e.g. 0.01 for 1%%)\n"
+		   "  --exponential=NUM        exponential distribution with NUM threshold parameter\n"
+		   "  --gaussian=NUM           gaussian distribution with NUM threshold parameter\n"
 		   "\nCommon options:\n"
 		   "  -d, --debug              print debugging output\n"
 	  "  -h, --host=HOSTNAME      database server host or socket directory\n"
@@ -2341,6 +2349,30 @@ process_builtin(char *tb)
 	return my_commands;
 }
 
+/*
+ * compute the probability of the truncated gaussian random generation
+ * to draw values in the i-th slot of the range.
+ */
+static double gaussianProbability(int i, int slots, double threshold)
+{
+	assert(1 <= i && i <= slots);
+	return (0.50 * (erf (threshold * (1.0 - 1.0 / slots * (2.0 * i - 2.0)) / sqrt(2.0)) -
+		erf (threshold * (1.0 - 1.0 / slots * 2.0 * i) / sqrt(2.0))) /
+		erf (threshold / sqrt(2.0)));
+}
+
+/*
+ * compute the probability of the truncated exponential random generation
+ * to draw values in the i-th slot of the range.
+ */
+static double exponentialProbability(int i, int slots, double threshold)
+{
+	assert(1 <= i && i <= slots);
+	return (exp(- threshold * (i - 1) / slots) - exp(- threshold * i / slots)) /
+		(1.0 - exp(- threshold));
+}
+
+
 /* print out results */
 static void
 printResults(int ttype, int64 normal_xacts, int nclients,
@@ -2352,7 +2384,7 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 	double		time_include,
 				tps_include,
 				tps_exclude;
-	char	   *s;
+	char	   *s, *d;
 
 	time_include = INSTR_TIME_GET_DOUBLE(total_time);
 	tps_include = normal_xacts / time_include;
@@ -2368,8 +2400,46 @@ printResults(int ttype, int64 normal_xacts, int nclients,
 	else
 		s = "Custom query";
 
-	printf("transaction type: %s\n", s);
+	if (use_gaussian)
+		d = "Gaussian distribution ";
+	else if (use_exponential)
+		d = "Exponential distribution ";
+	else
+		d = ""; /* default uniform case */
+
+	printf("transaction type: %s%s\n", d, s);
 	printf("scaling factor: %d\n", scale);
+
+	/* output in gaussian distribution benchmark */
+	if (use_gaussian)
+	{
+		int i;
+		printf("pgbench_account's aid selected with a truncated gaussian distribution\n");
+		printf("standard deviation threshold: %.5f\n", threshold);
+		printf("decile percents:");
+		for (i = 1; i <= 10; i++)
+			printf(" %.1f%%", 100.0 * gaussianProbability(i, 10, threshold));
+		printf("\n");
+		printf("probability of max/min percent of the range: %.1f%% %.1f%%\n",
+			  100.0 * gaussianProbability(50, 100, threshold),
+			  100.0 * gaussianProbability(100, 100, threshold));
+	}
+	/* output in exponential distribution benchmark */
+	else if (use_exponential)
+	{
+		int i;
+		printf("pgbench_account's aid selected with a truncated exponential distribution\n");
+		printf("exponential threshold: %.5f\n", threshold);
+		printf("decile percents:");
+		for (i = 1; i <= 10; i++)
+			printf(" %.1f%%",
+				   100.0 * exponentialProbability(i, 10, threshold));
+		printf("\n");
+		printf("probability of first/last percent of the range: %.1f%% %.1f%%\n",
+			   100.0 * exponentialProbability(1, 100, threshold),
+			   100.0 * exponentialProbability(100, 100, threshold));
+	}
+
 	printf("query mode: %s\n", QUERYMODE[querymode]);
 	printf("number of clients: %d\n", nclients);
 	printf("number of threads: %d\n", nthreads);
@@ -2500,6 +2570,8 @@ main(int argc, char **argv)
 		{"unlogged-tables", no_argument, &unlogged_tables, 1},
 		{"sampling-rate", required_argument, NULL, 4},
 		{"aggregate-interval", required_argument, NULL, 5},
+		{"gaussian", required_argument, NULL, 6},
+		{"exponential", required_argument, NULL, 7},
 		{"rate", required_argument, NULL, 'R'},
 		{NULL, 0, NULL, 0}
 	};
@@ -2780,6 +2852,25 @@ main(int argc, char **argv)
 				}
 #endif
 				break;
+			case 6:
+				use_gaussian = true;
+				threshold = atof(optarg);
+				if(threshold < MIN_GAUSSIAN_THRESHOLD)
+				{
+					fprintf(stderr, "--gaussian=NUM must be at least %f: %f\n",
+							MIN_GAUSSIAN_THRESHOLD, threshold);
+					exit(1);
+				}
+				break;
+			case 7:
+				use_exponential = true;
+				threshold = atof(optarg);
+				if(threshold <= 0.0)
+				{
+					fprintf(stderr, "--exponential=NUM must be more than 0.0\n");
+					exit(1);
+				}
+				break;
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 				exit(1);
@@ -2977,6 +3068,17 @@ main(int argc, char **argv)
 		}
 	}
 
+	/* set :threshold variable */
+	if(getVariable(&state[0], "threshold") == NULL)
+	{
+		snprintf(val, sizeof(val), "%lf", threshold);
+		for (i = 0; i < nclients; i++)
+		{
+			if (!putVariable(&state[i], "startup", "threshold", val))
+				exit(1);
+		}
+	}
+
 	if (!is_no_vacuum)
 	{
 		fprintf(stderr, "starting vacuum...");
@@ -2999,25 +3101,24 @@ main(int argc, char **argv)
 	srandom((unsigned int) INSTR_TIME_GET_MICROSEC(start_time));
 
 	/* process builtin SQL scripts */
-	switch (ttype)
+	if (ttype < 3)
 	{
-		case 0:
-			sql_files[0] = process_builtin(tpc_b);
-			num_files = 1;
-			break;
-
-		case 1:
-			sql_files[0] = process_builtin(select_only);
-			num_files = 1;
-			break;
-
-		case 2:
-			sql_files[0] = process_builtin(simple_update);
-			num_files = 1;
-			break;
-
-		default:
-			break;
+		char *fmt, *distribution, *queries;
+		int ret;
+		fmt = (ttype == 0)? tpc_b_fmt:
+			  (ttype == 1)? select_only_fmt:
+			  (ttype == 2)? simple_update_fmt: NULL;
+		assert(fmt != NULL);
+		distribution =
+			use_gaussian? " gaussian :threshold":
+			use_exponential? " exponential :threshold":
+			"" /* default uniform case */ ;
+		queries = pg_malloc(strlen(fmt) + strlen(distribution) + 1);
+		ret = sprintf(queries, fmt, distribution);
+		assert(ret >= 0);
+		sql_files[0] = process_builtin(queries);
+		num_files = 1;
+		pg_free(queries);
 	}
 
 	/* set up thread data structures */
diff --git a/doc/src/sgml/pgbench.sgml b/doc/src/sgml/pgbench.sgml
index 276476a..b718dd3 100644
--- a/doc/src/sgml/pgbench.sgml
+++ b/doc/src/sgml/pgbench.sgml
@@ -307,6 +307,49 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </varlistentry>
 
      <varlistentry>
+      <term><option>--exponential=</option><replaceable>threshold</></term>
+      <listitem>
+       <para>
+         Run exponential distribution pgbench test using this threshold parameter.
+         The threshold controls the distribution of access frequency on the
+         <structname>pgbench_accounts</> table.
+         See the <literal>\setrandom</> documentation below for details about
+         the impact of the threshold value.
+         When set, this option applies to all test variants (<option>-N</> for
+         skipping updates, or <option>-S</> for selects).
+       </para>
+
+       <para>
+         When run, the output is expanded to show the distribution
+         depending on the <replaceable>threshold</> value:
+
+<screen>
+...
+pgbench_account's aid selected with a truncated exponential distribution
+exponential threshold: 5.00000
+decile percents: 39.6% 24.0% 14.6% 8.8% 5.4% 3.3% 2.0% 1.2% 0.7% 0.4%
+probability of first/last percent of the range: 4.9% 0.0%
+...
+</screen>
+
+         The figures are to be interpreted as follows.
+         If the scaling factor is 10, there are 1,000,000 accounts in
+         <literal>pgbench_accounts</>.
+         The first decile, with <literal>aid</> from 1 to 100,000, is
+         drawn 39.6% of the time, that is about 4 times the average.
+         The second decile, from 100,001 to 200,000, is drawn 24.0% of the time,
+         that is 2.4 times the average.
+         The share keeps shrinking down to the last decile, from 900,001 to
+         1,000,000, which is drawn 0.4% of the time, well below average.
+         Moreover, the first percent of the range, that is <literal>aid</>
+         from 1 to 10,000, is drawn 4.9% of the time, thus 4.9 times the
+         average, and the last percent, with <literal>aid</>
+         from 990,001 to 1,000,000, is drawn less than 0.1% of the time.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-f</option> <replaceable>filename</></term>
       <term><option>--file=</option><replaceable>filename</></term>
       <listitem>
@@ -320,6 +363,49 @@ pgbench <optional> <replaceable>options</> </optional> <replaceable>dbname</>
      </varlistentry>
 
      <varlistentry>
+      <term><option>--gaussian=</option><replaceable>threshold</></term>
+      <listitem>
+       <para>
+         Run gaussian distribution pgbench test using this threshold parameter.
+         The threshold controls the distribution of access frequency on the
+         <structname>pgbench_accounts</> table.
+         See the <literal>\setrandom</> documentation below for details about
+         the impact of the threshold value.
+         When set, this option applies to all test variants (<option>-N</> for
+         skipping updates, or <option>-S</> for selects).
+       </para>
+
+       <para>
+         When run, the output is expanded to show the distribution
+         depending on the <replaceable>threshold</> value:
+
+<screen>
+...
+pgbench_account's aid selected with a truncated gaussian distribution
+standard deviation threshold: 5.00000
+decile percents: 0.0% 0.1% 2.1% 13.6% 34.1% 34.1% 13.6% 2.1% 0.1% 0.0%
+probability of max/min percent of the range: 4.0% 0.0%
+...
+</screen>
+
+         The figures are to be interpreted as follows.
+         If the scaling factor is 10, there are 1,000,000 accounts in
+         <literal>pgbench_accounts</>.
+         The first decile, with <literal>aid</> from 1 to 100,000, is
+         drawn less than 0.1% of the time.
+         The second, from 100,001 to 200,000, is drawn about 0.1% of the time,
+         and so on up to the fifth decile, from 400,001 to 500,000, which
+         is drawn 34.1% of the time, about 3.4 times the average;
+         the gaussian curve is symmetric, so the last five deciles mirror the first five.
+         Moreover, the most frequently drawn percent of the range, that is <literal>aid</>
+         from 490,001 to 500,000, is drawn 4.0% of the time, 4.0 times the
+         average, and the last percent, with <literal>aid</>
+         from 990,001 to 1,000,000, is drawn less than 0.1% of the time.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
       <term><option>-j</option> <replaceable>threads</></term>
       <term><option>--jobs=</option><replaceable>threads</></term>
       <listitem>

#32

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Fabien COELHO (#31)

Re: gaussian distribution pgbench -- splits v4

On Tue, Jul 29, 2014 at 4:41 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Attached B patch does turn incorrect setrandom syntax into errors instead
of ignoring extra parameters.

First A patch is repeated to help commitfest references.

Oops, I applied the change on the wrong part:-(

Here is the change on part A which checks setrandom syntax, and B for
completeness.

I've committed the changes to pgbench.c and the documentation changes
with some further wordsmithing. I don't think including the other
changes in patch A is a good idea, nor am I in favor of patch B. But
thanks for your and Kondo-san's hard work on this; I think this will
be quite useful.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#33

Fabien COELHO

coelho@cri.ensmp.fr

over 11 years ago

In reply to: Robert Haas (#32)

Re: gaussian distribution pgbench -- splits v4

Hello Robert,

I've committed the changes to pgbench.c and the documentation changes
with some further wordsmithing.

Ok, thanks a lot for your reviews and your help with improving the
documentation.

I don't think including the other changes in patch A is a good idea,

Fine. It was mostly for testing and checking purposes.

nor am I in favor of patch B.

Yep. Would providing these as additional contrib files be more acceptable?
Something like "tpc-b-gauss.sql"... Otherwise there is no example
available to show the feature.

Thanks again,

--
Fabien.


#34

Mitsumasa KONDO

kondo.mitsumasa@gmail.com

over 11 years ago

In reply to: Fabien COELHO (#33)

Re: gaussian distribution pgbench -- splits v4

Hi,

2014-07-31 5:18 GMT+09:00 Fabien COELHO <coelho@cri.ensmp.fr>:

I've committed the changes to pgbench.c and the documentation changes
with some further wordsmithing.

Ok, thanks a lot for your reviews and your help with improving the
documentation.

Yes, thanks to everyone involved.

I don't think including the other changes in patch A is a good idea,

Fine. It was mostly for testing and checking purposes.

Hmm... It does no harm to the pgbench source code. And, in general,
a checking script is useful for avoiding bugs.

nor am I in favor of patch B.

Yep.

No, patch B is still needed. Please tell me the reason. I don't like
decisions made on someone's feeling; this needs a logical reason.
Our documentation is better than before, and I think it makes the
decile probabilities easy to understand.
This part of the discussion needs to continue...

Would providing these as additional contrib files be more acceptable?
Something like "tpc-b-gauss.sql"... Otherwise there is no example available
to show the feature.

I agree with including the test script and the command-line options. It does
no harm, and it's useful.

Best regards,
--
Mitsumasa KONDO

#35

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Fabien COELHO (#33)

Re: gaussian distribution pgbench -- splits v4

On Wed, Jul 30, 2014 at 4:18 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

nor am I in favor of patch B.

Yep. Would providing these as additional contrib files be more acceptable?
Something like "tpc-b-gauss.sql"... Otherwise there is no example available
to show the feature.

To be honest, it just feels like clutter to me. If we added examples
for every feature that is as significant as this one is, we'd end up
with twice the installation footprint, and most of it would be stuff
nobody ever looked at. I think the documentation is good enough that
people will be able to understand how to use this feature, which is
good enough for me.

One thing that might still be worth doing is including the standard
pgbench scripts in the pgbench documentation. Then we could say
things like "and you could also modify these". Right now I tend to
end up cut-and-pasting from the source code, which is fine if you're a
hacker but not so user-friendly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#36

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Mitsumasa KONDO (#34)

Re: gaussian distribution pgbench -- splits v4

On Wed, Jul 30, 2014 at 9:00 PM, Mitsumasa KONDO
<kondo.mitsumasa@gmail.com> wrote:

Hmm... It does no harm to the pgbench source code. And, in general,
a checking script is useful for avoiding bugs.

Not if nobody runs it, or if people run it but don't know what the
output should look like. I think anyone who knows enough to find bugs
by running these scripts probably doesn't need the scripts.

No, patch B is still needed. Please tell me the reason. I don't like
decisions made on someone's feeling; this needs a logical reason.
Our documentation is better than before, and I think it makes the
decile probabilities easy to understand.
This part of the discussion needs to continue...

Would providing these as additional contrib files be more acceptable?
Something like "tpc-b-gauss.sql"... Otherwise there is no example available
to show the feature.

I agree with including the test script and the command-line options. It does
no harm, and it's useful.

As to all of this, I simply don't agree that the stuff has enough
value to justify including it. Now, of course, that is subjective:
one person may think it has enough value, while another person may
think that it does not have enough value. So it just comes down to a
question of opinion, and we make those judgements of opinion all the
time. If we included everything that everyone who works on the code
wants included, we'd end up with a bloated mess of stuff that nobody
cares about; indeed, we have a significant amount of stuff in the
source code that IMHO looks like somebody's debugging leftovers that
should have been removed before commit. I don't want to add more
unless there is clear and convincing evidence that a significant
number of people want it, and that is not the case here.

Now, if we get a few reports from people saying, hey, I was doing some
benchmarking with pgbench, and I found the new gaussian feature to be
really excellent, but it sucked that there was no command-line option
for it, we can go back and add one. No problem! But in the meantime,
we've added the core of the feature without cluttering up the list of
command-line options with things that may or may not prove to be
useful.

One of the concerns that I have about the proposal of simply slapping
a gaussian or exponential modifier onto \setrandom aid 1 :naccounts is
that, while it will allow you to make part of the relation hot and
another part of the relation cold, you really can't get any more
fine-grained than that. If you use exponential, all the hot accounts
will be near the beginning of the relation, and if you use gaussian,
they'll all be in the middle. I'm not sure exactly what will happen after
some updating has happened; I'm guessing some of the keys will still
be in their original location and others will have been pushed to the
end of the relation following relation-extension. But there's no way,
with those command line options, to for example have 5 hot spots
distributed uniformly through the relation; or even to have the end of
the relation rather than the beginning or the middle as the hot spot.
You can do those things with the newly-enhanced \setrandom *and a custom
script* but not with just a command-line option. So that makes me
think that people who find these new facilities useful might not get
all that much use out of the command-line option anyway; and we can't
have a command-line option for every behavior anyone ever wants.
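
To make the custom-script point a little more concrete, here is a sketch of the kind of remapping such a script would have to express. It is written in C purely for illustration; mirror_key and spread_key are hypothetical names, nothing like them exists in pgbench, and the thread does not propose these particular formulas. Both take a key k drawn in 1..n whose hot spot sits near 1, as with the exponential distribution.

```c
#include <assert.h>
#include <stdint.h>

/* Move the hot spot from the beginning of the range to the end. */
int64_t
mirror_key(int64_t k, int64_t n)
{
	assert(n > 0 && 1 <= k && k <= n);
	return n + 1 - k;
}

/*
 * Spread the hot keys over "spots" equally spaced hot spots.
 * Consecutive hot keys 1, 2, 3, ... are dealt out round-robin to the
 * start of each segment.  Assumes n is a multiple of spots, in which
 * case the mapping is a bijection on 1..n.
 */
int64_t
spread_key(int64_t k, int64_t n, int64_t spots)
{
	int64_t		segment;
	int64_t		offset;

	assert(n > 0 && 1 <= k && k <= n);
	assert(spots > 0 && n % spots == 0);
	segment = (k - 1) % spots;	/* which hot spot this key goes to */
	offset = (k - 1) / spots;	/* distance from the start of that spot */
	return 1 + segment * (n / spots) + offset;
}
```

For example, with n = 10 and 5 spots, the keys 1 to 5 land on 1, 3, 5, 7 and 9, so the five most frequently drawn keys end up at the start of five different segments.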

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#37

Fabien COELHO

coelho@cri.ensmp.fr

over 11 years ago

In reply to: Robert Haas (#36)

Re: gaussian distribution pgbench -- splits v4

Hello Robert,

[...]

One of the concerns that I have about the proposal of simply slapping a
gaussian or exponential modifier onto \setrandom aid 1 :naccounts is
that, while it will allow you to make part of the relation hot and
another part of the relation cold, you really can't get any more
fine-grained than that. If you use exponential, all the hot accounts
will be near the beginning of the relation, and if you use gaussian,
they'll all be in the middle.

That is a very good remark. Although I thought of it, I do not have a very
good solution yet:-)

From a testing perspective, if we assume that keys have no semantics, a
reasonable assumption is that the distribution of access for actual
realistic workloads is probably exponential (or gaussian, anyway hardly
uniform), but without direct correlation between key values.

In order to simulate that, we would have to apply a fixed (pseudo-)random
permutation to the exponential-drawn key values. This is a non trivial
problem. The version zero of solving it is to do nothing... it is the
current status;-) Version one is "k' = 1 + (a * k + b) modulo n" with "a"
coprime with "n", "n" being the number of keys. This is nearly
possible, but for the modulo operator which is currently missing, and that
I'm planning to submit for this very reason, but probably another time.
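
As a rough sketch of the "version one" formula above, written in C rather than as a pgbench script: permute_key and gcd64 are hypothetical names, and the overflow assumption is mine. The gcd check is there because the mapping is only a permutation of 1..n when "a" and "n" share no common factor.

```c
#include <assert.h>
#include <stdint.h>

/* Greatest common divisor, used to check that a and n are coprime. */
static int64_t
gcd64(int64_t x, int64_t y)
{
	while (y != 0)
	{
		int64_t		t = x % y;

		x = y;
		y = t;
	}
	return x;
}

/*
 * Map key k in [1, n] to k' = 1 + (a * k + b) mod n, also in [1, n].
 * Assumes a > 0, b >= 0 and that a * k + b does not overflow int64_t.
 */
int64_t
permute_key(int64_t k, int64_t n, int64_t a, int64_t b)
{
	assert(n > 0 && 1 <= k && k <= n);
	assert(a > 0 && b >= 0 && gcd64(a, n) == 1);
	return 1 + (a * k + b) % n;
}
```

With n = 10, a = 3 and b = 7, the keys 1, 2, 3, 4 map to 1, 4, 7, 10, so neighbouring hot keys are moved apart, although the stride between them stays regular.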

I'm not sure exactly what will happen after some updating has happened; I'm
guessing some of the keys will still be in their original location and
others will have been pushed to the end of the relation following
relation-extension.

This side effect is not too bad. What matters most in the long term is not the
key value correlation, but the actual storage correlation, i.e. whether
two tuples required are in the same page or not. At the beginning of a
simulation, with close key numbers being picked up with an exponential
distribution, the correlation is more than what would be expected.
However, once a significant amount of the table has been updated, this
initial artificial correlation is going to fade, and the test should
become more realistic.

--
Fabien.


#38

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Fabien COELHO (#37)

Re: gaussian distribution pgbench -- splits v4

On Thu, Jul 31, 2014 at 10:01 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

One of the concerns that I have about the proposal of simply slapping a
gaussian or exponential modifier onto \setrandom aid 1 :naccounts is that,
while it will allow you to make part of the relation hot and another part of
the relation cold, you really can't get any more fine-grained than that. If
you use exponential, all the hot accounts will be near the beginning of the
relation, and if you use gaussian, they'll all be in the middle.

That is a very good remark. Although I thought of it, I do not have a very
good solution yet:-)

From a testing perspective, if we assume that keys have no semantics, a
reasonable assumption is that the distribution of access for actual
realistic workloads is probably exponential (or gaussian, anyway hardly
uniform), but without direct correlation between key values.

In order to simulate that, we would have to apply a fixed (pseudo-)random
permutation to the exponential-drawn key values. This is a non trivial
problem. The version zero of solving it is to do nothing... it is the
current status;-) Version one is "k' = 1 + (a * k + b) modulo n" with "a"
coprime with "n", "n" being the number of keys. This is nearly
possible, but for the modulo operator which is currently missing, and that
I'm planning to submit for this very reason, but probably another time.

That's pretty crude, although I don't object to a modulo operator. It
would be nice to be able to use a truly random permutation, which is
not hard to generate but probably requires O(n) storage, likely a
problem for large scale factors. Maybe somebody who knows more math
than I do (like you, probably!) can come up with something more
clever.
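
For reference, the O(n) approach mentioned above could look roughly like the following Fisher-Yates shuffle. This is a sketch, not something from the patch: fill_permutation and next_random are hypothetical names, the xorshift64 generator is only a stand-in so that a given seed always yields the same permutation, and the slight modulo bias is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* xorshift64 step; the state must never be zero. */
static uint64_t
next_random(uint64_t *state)
{
	uint64_t	x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	*state = x;
	return x;
}

/*
 * Fill perm[0..n-1] with a pseudo-random permutation of 1..n.
 * Storage is O(n): 8 bytes per key, e.g. about 8 MB for the
 * 1,000,000 accounts of scale factor 10.
 */
void
fill_permutation(int64_t *perm, int64_t n, uint64_t seed)
{
	uint64_t	state = seed ? seed : 1;
	int64_t		i;

	assert(perm != NULL && n > 0);
	for (i = 0; i < n; i++)
		perm[i] = i + 1;
	/* Fisher-Yates: swap each slot with a random slot at or below it. */
	for (i = n - 1; i > 0; i--)
	{
		int64_t		j = (int64_t) (next_random(&state) % (uint64_t) (i + 1));
		int64_t		tmp = perm[i];

		perm[i] = perm[j];
		perm[j] = tmp;
	}
}
```

A drawn key k would then be replaced by perm[k - 1]. The table has to be built once and kept in memory for the whole run, which is the storage cost discussed above.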

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#39

Fabien COELHO

coelho@cri.ensmp.fr

over 11 years ago

In reply to: Robert Haas (#38)

Re: gaussian distribution pgbench -- splits v4

Hello,

Version one is "k' = 1 + (a * k + b) modulo n" with "a" coprime with
"n", "n" being the number of keys. This is nearly possible,
but for the modulo operator which is currently missing, and that I'm
planning to submit for this very reason, but probably another time.

That's pretty crude,

Yep. It is very simple, it is much better than nothing, and for a database
test it may be "good enough".

although I don't object to a modulo operator. It would be nice to be
able to use a truly random permutation, which is not hard to generate
but probably requires O(n) storage, likely a problem for large scale
factors.

That is indeed the actual issue in my mind. I was thinking of permutations
with a formula, which are not so easy to find and may end up looking like
"(a*k+b)%n" anyway. I had the same issue for generating random data for a
schema (see http://www.coelho.net/datafiller.html).

Maybe somebody who knows more math than I do (like you, probably!) can
come up with something more clever.

I can certainly suggest other formulas, but that does not mean beautiful
code, thus would probably be rejected. I'll see.

An alternative to this whole process may be to hash/modulo a non uniform
random value.

id = 1 + hash(some-random()) % n

But the hashing changes the distribution as it adds collisions, so I have
to think about how to be able to control the distribution in that case,
and what hash function to use.
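
The collision effect can be seen with a small experiment. The sketch below is an assumption-laden illustration: the thread does not pick a hash function, so the splitmix64 finalizer is borrowed here as mix64, and hashed_key and count_distinct_ids are hypothetical names. It hashes the inputs 1..n into 1..n and counts how many distinct ids come out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* splitmix64 finalizer, used here only as an example mixing function. */
static uint64_t
mix64(uint64_t x)
{
	x += UINT64_C(0x9e3779b97f4a7c15);
	x = (x ^ (x >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
	x = (x ^ (x >> 27)) * UINT64_C(0x94d049bb133111eb);
	return x ^ (x >> 31);
}

/* id = 1 + hash(r) % n, as in the formula above. */
int64_t
hashed_key(uint64_t r, int64_t n)
{
	assert(n > 0);
	return 1 + (int64_t) (mix64(r) % (uint64_t) n);
}

/*
 * Hash the inputs 1..n and count how many distinct ids are produced.
 * Returns -1 if memory cannot be allocated.
 */
int64_t
count_distinct_ids(int64_t n)
{
	unsigned char *seen;
	int64_t		distinct = 0;
	int64_t		k;

	assert(n > 0);
	seen = calloc((size_t) n + 1, 1);
	if (seen == NULL)
		return -1;
	for (k = 1; k <= n; k++)
	{
		int64_t		id = hashed_key((uint64_t) k, n);

		if (!seen[id])
		{
			seen[id] = 1;
			distinct++;
		}
	}
	free(seen);
	return distinct;
}
```

For a hash that mixes well, only about 63% of the ids (a fraction of 1 - 1/e) are expected to be reachable when n inputs are hashed into n slots; the other ids are never drawn, and several inputs share each collided id. That is how the hashing reshapes the distribution, unlike a true permutation.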

--
Fabien.


#40

Mitsumasa KONDO

kondo.mitsumasa@gmail.com

over 11 years ago

In reply to: Fabien COELHO (#39)

Re: gaussian distribution pgbench -- splits v4

Hi,

2014-08-01 16:26 GMT+09:00 Fabien COELHO <coelho@cri.ensmp.fr>

Maybe somebody who knows more math than I do (like you, probably!) can
come up with something more clever.

I can certainly suggest other formulas, but that does not mean beautiful
code, thus would probably be rejected. I'll see.

An alternative to this whole process may be to hash/modulo a non uniform
random value.

id = 1 + hash(some-random()) % n

But the hashing changes the distribution as it adds collisions, so I have
to think about how to be able to control the distribution in that case, and
what hash function to use.

I think we have to choose a reproducible method, because a benchmark
always needs robust and reproducible results. And if we implement this
idea, we might need a more accurate random generator, such as the
Mersenne Twister algorithm. The erand48 algorithm is slow and not very
accurate.

By the way, I don't see how this topic relates to the command-line
option... Well, whatever...

Regards,
--
Mitsumasa KONDO