Date: 2012-11-06 22:59:41 +0100
From: Yin <>
To: MonetDB5 devs <>
Version: 11.13.9 (Oct2012-SP3)
CC: @yzchang
Last updated: 2013-02-19 13:17:56 +0100
Comment 17881
Date: 2012-11-06 22:59:41 +0100
From: Yin <>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:2.0) Gecko/20100101 Firefox/4.0
Build Identifier:
The current implementation takes two scans of the BAT:
First scan to compute the sum (average).
Second scan to compute the stdev.
By using a well-known mathematical property of the stdev, one scan is enough:
during the scan, the sum of the values and the sum of their squares can be computed together, and the stdev can then be computed from them.
Reproducible: Always
Yin's one-scan implementation is available as the following diff:
diff -r bcdb312657ff monetdb5/modules/kernel/algebra.mx
--- a/monetdb5/modules/kernel/algebra.mx Tue Nov 06 12:03:33 2012 -0800
+++ b/monetdb5/modules/kernel/algebra.mx Tue Nov 06 13:40:09 2012 -0800
@@ -4149,47 +4149,64 @@
}
/*
*/
@= stdev_impl
str ALGstdev_@1(dbl *res, int *bid) {
BAT *b;
throw(MAL, "algebra.stdev", RUNTIME_OBJECT_MISSING);
}
*res = dbl_nil;
}
Comment 17941
Date: 2012-11-13 21:50:08 +0100
From: Yin <>
Created attachment 158
compute stdev with one scan of BAT
Comment 17960
Date: 2012-11-20 13:36:41 +0100
From: @grobian
Hi Yin,
This is nice! However, I'm wondering what the error bound of this approach is. Agreed this would be faster, but if it is less "correct" we might want to introduce a variant of stddev that uses your implementation. If this is a very broadly accepted way of computing stddev, we might also drop our current "exact" implementation, or make it available under another name, such that the user can choose from the two.
Comment 17967
Date: 2012-11-20 20:26:54 +0100
From: Yin <>
Thank you very much for your comments, Fabian.
I had the same concern about the accumulation error of the one-scan approach as you do, even though mathematically the two approaches should be exactly the same. I remember that about 10 years ago I ran some experiments on both approaches and did not find any noticeable differences between the results, but the sample sizes I tried then were not huge.
For sample sizes on the order of 1 trillion, I have the same concern even with the two-scan approach. For example, after 500 billion entries have been added to the sum of squares, the rest may not be added up at all anymore, since they may well be within the range of the accumulation error! For a huge BAT, if precision is the top priority, we may need to use a hierarchical approach: divide the huge sample into manageable blocks, compute the statistics for each block first, and then aggregate the results.
Comment 17968
Date: 2012-11-20 23:19:15 +0100
From: @sjoerdmullender
The sentiment of doing a single scan calculation is good, but I don't think this is the right way to do it. The problem is with the (large) potential for overflow and for large errors in the result. The overflow can arise easily because you calculate the sum of the squares (note that this is not different from the current implementation). When calculating this sum (and also the normal sum), the order in which the calculation is done can cause errors. If the first values are large and later values are small, the small ones may get ignored because they don't register. Try e.g. a table that starts with one large number (1e8 will probably do) that is followed by millions of small (but not zero, e.g. 1) numbers.
I am currently investigating a different single-scan approach (see http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm).
Comment 17977
Date: 2012-11-21 19:05:42 +0100
From: Yin <>
Definitely, there is a lot of room for improvement over the naive implementation.
If the data come approximately from a normal distribution where the absolute value of the mean is not too big compared with its stddev, then the naive implementation will yield reasonable results. For data sampled from a Cauchy distribution, the mean and stddev are not well defined anyway; no matter what algorithm you use, the computed value of the stddev does not have much meaning. The situation is similar for other long-tailed distributions such as power-law distributions, where the stddev is not a good description of the dataset.
Comment 18035
Date: 2012-11-27 11:43:48 +0100
From: @yzchang
Test added in monetdb5/tests/BugTracker/Tests/stdev.Bug-3178.mal
test results are different from spreadsheet results
Comment 18162
Date: 2012-11-27 17:42:30 +0100
From: Yin <>
Adding a test is definitely very helpful.
In the spreadsheet you used for the result comparison, which kind of estimator is used, biased or unbiased? The original implementation in MonetDB (like the proposed one-scan implementation) computes the population variance of a finite population of size N, which differs from the unbiased sample variance. Please see this link for the definitions:
http://en.wikipedia.org/wiki/Variance
For a large sample size (a large value of N), the difference should not be significant.
Comment 18187
Date: 2012-11-28 13:44:59 +0100
From: @yzchang
Changeset 19a439c07ec9 made by Jennie Zhang y.zhang@cwi.nl in the MonetDB repo, refers to this bug.
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=19a439c07ec9
Changeset description:
Comment 18219
Date: 2012-11-29 10:54:04 +0100
From: @sjoerdmullender
Changeset bfb1f607de02 made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=bfb1f607de02
Changeset description:
Comment 18450
Date: 2013-01-29 14:47:50 +0100
From: @sjoerdmullender
Since changeset 498738535dfe, the single-scan implementation is used for the stddev_samp and stddev_pop (and var_samp and var_pop) aggregates in SQL.
Comment 18507
Date: 2013-02-19 13:17:56 +0100
From: @sjoerdmullender
Feb2013 has been released.