
Bayes and certainty

From Information Rating System Wiki

Main article: Technical overview of the ratings system

Introduction – Bayes pulls toward certainty

We’ve already discussed the fact that Bayes pulls in favor of certainty. If we combine a 99% opinion with a 10% opinion we get 91.7%. But if we increase the 99% to 99.9% the combined opinion rises to 99.1%. If we increase yet again to 99.99% the combined opinion rises to 99.91%. We summarize this below to make it easy to see:

99% combined with 10% ==> 91.7%

99.9% combined with 10% ==> 99.1%

99.99% combined with 10% ==> 99.91%

See what’s happening here? The combined opinion gets pulled very strongly toward the first opinion as that opinion becomes more certain. It seems a little strange that such apparently small differences in the first opinion should have such a pronounced effect.
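To make this pull easy to reproduce, here is a minimal Python sketch of the combination rule used on this page (two independent opinions p and q about the same claim); the function name combine is just an illustrative choice:

```python
def combine(p, q):
    # Bayes combination of two independent opinions p and q about the same claim
    return p * q / (p * q + (1 - p) * (1 - q))

for p in (0.99, 0.999, 0.9999):
    print(p, "combined with 0.10 ->", round(combine(p, 0.10), 4))

# 0.99 combined with 0.10 -> 0.9167
# 0.999 combined with 0.10 -> 0.9911
# 0.9999 combined with 0.10 -> 0.9991
```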

A more intuitive explanation

The Bayes equation will confirm the truth of this, but the situation begs for a more intuitive explanation. Although 99% and 99.9% don’t seem like much of a difference, they represent a huge difference in sample sizes. To be able to say 99%, one must perform at least 100 experiments, 99 of which succeeded and 1 of which failed. To be able to say 99.9%, one must perform at least 1000 experiments, where 999 succeeded and 1 failed. And so on.

The evidential difference between 99% and 99.9% is now clearly huge. By performing 1000 experiments, a scientist has built an almost unassailable advantage over someone who would challenge his results with just 100 experiments. The challenger would have to perform 1000 experiments just to compete, and many more to show that the first scientist’s results were conclusively wrong.

This also explains why any opinion, except 0%, combined with 100% yields 100% in Bayes. The one exception, the 0% opinion, combined with 100% is not computable and yields no result. In the sense outlined here, a 100% opinion doesn’t really exist because it implies that the scientist must perform an infinite number of experiments to attain it. The same is true for 0%. Anyone who performs, say, 1000 experiments and claims that all of them succeeded (thus yielding a “100%” success rate) can be challenged as simply not having performed enough experiments.

Keep in mind that a probability refers to something that can happen, or not, through random chance. We are not talking about logical or mathematical statements, for example. You cannot say that 2+2=4 100% of the time and call this a probabilistic statement. Such a statement is always true, and no random event can alter it. And, needless to say, statements of judgement are not probabilities either. If someone says they are 100% sure it will rain tomorrow, they are either rounding, estimating, exaggerating, or using 100% as an English synonym for “very certain”. Whenever a probability is used, it must include the random chance that the opposite event will take place, however small it might be.

Example

Let’s relate this to Bayes using a concrete example. A test for a disease is 99% accurate. So, as we’ve just seen, the experiment behind that 99% must have involved at least 100 tests. Let’s suppose that a second test is 10% accurate. So we have the following situation:

100 healthy subjects ==> Test 1 is 99% accurate ==> 1 tests positive (false positive) ==> Test 2 is 10% accurate ==> 0.9 test positive again.

100 sick subjects ==> Test 1 is 99% accurate ==> 99 test positive (true positives) ==> Test 2 is 10% accurate ==> 9.9 test positive again.

The number of people who tested positive twice = 0.9 + 9.9 = 10.8

Of these, the number of people who are actually sick: 9.9

P = Number of sick people who tested positive twice / Number of people who tested positive twice = 9.9/(9.9 + 0.9) = 0.917

We can confirm that this is equivalent to the Bayes equation as follows:

9.9/10.8 = 9.9 / (9.9 + 0.9)

= 99(0.1) / (99(0.1) + 1(1 - 0.1))

= 100(0.99)(0.1) / (100(0.99)(0.1) + (100 - 100(0.99))(1 - 0.1))

= 0.99(0.1) / (0.99(0.1) + (1 - 0.99)(1 - 0.1))

We recognize this result as the Bayes equation for this case.
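As a quick check, here is a small Python sketch of the two-test cascade above, using the population sizes and accuracies from the example (the variable names are illustrative):

```python
healthy, sick = 100, 100  # subjects in each group, as in the example
acc1, acc2 = 0.99, 0.10   # accuracies of Test 1 and Test 2

healthy_twice = healthy * (1 - acc1) * (1 - acc2)  # healthy subjects who test positive twice
sick_twice = sick * acc1 * acc2                    # sick subjects who test positive twice

p_sick = sick_twice / (sick_twice + healthy_twice)
print(round(healthy_twice, 1), round(sick_twice, 1), round(p_sick, 3))  # 0.9 9.9 0.917

# The same number straight from the Bayes equation:
print(round(0.99 * 0.10 / (0.99 * 0.10 + (1 - 0.99) * (1 - 0.10)), 3))  # 0.917
```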

Now let’s do the same with a test that is 99.9% accurate. We need 1000 subjects in this case:

1000 healthy subjects ==> Test 1 is 99.9% accurate ==> 1 tests positive (false positive) ==> Test 2 is 10% accurate ==> 0.9 test positive again.

1000 sick subjects ==> Test 1 is 99.9% accurate ==> 999 test positive (true positives) ==> Test 2 is 10% accurate ==> 99.9 test positive again.

The number of people who tested positive twice = 0.9 + 99.9 = 100.8

Of these, the number of people who are actually sick: 99.9

P = Number of sick who tested positive twice / Number of people who tested positive twice = 99.9/(99.9 + 0.9) = 0.991

Let’s juxtapose the two relevant calculations:

9.9/(9.9 + 0.9) = 0.917

99.9/(99.9 + 0.9) = 0.991

We see here that the numerator increases by a factor of 10, corresponding to the tenfold increase in sample size; it represents the number of people who test positive twice and are actually sick. The denominator is this same number plus a constant, the number of people who falsely test positive twice. This term stays constant because, as the test becomes more accurate, ever fewer results are bad in percentage terms: for every 9 added after the decimal point, it becomes about 10 times harder to get a bad result. This is the basic reason why certainty pulls so hard in its favor.
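The same scaling can be seen in a short sketch that assumes the minimum sample size implied by each accuracy (1/(1 - accuracy)): the count of sick people who test positive twice grows tenfold with each added 9, while the count of double false positives stays fixed at 0.9.

```python
def combine(p, q):
    return p * q / (p * q + (1 - p) * (1 - q))

for acc in (0.99, 0.999, 0.9999):
    n = round(1 / (1 - acc))              # minimum sample size implied by the accuracy
    sick_twice = n * acc * 0.10           # sick subjects who test positive twice
    healthy_twice = n * (1 - acc) * 0.90  # healthy subjects who falsely test positive twice
    print(n, round(sick_twice, 1), round(healthy_twice, 1), round(combine(acc, 0.10), 4))

# 100 9.9 0.9 0.9167
# 1000 99.9 0.9 0.9911
# 10000 999.9 0.9 0.9991
```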

Counteracting certainty and OOM

Indeed, to counteract the effect of certainty we would need an equal and opposite level of certainty for the opposite opinion. If 99% represents our probability of a True result, then a 1% opinion for True is its opposite (99% for False). When combined, these two yield, via Bayes,

0.99(0.01) / (0.99(0.01) + (1 - 0.99)(1 - 0.01)) = 0.5

and thus cancel themselves out. 99.9% would require a 0.1% counterbalance to cancel itself, and so on. In terms of the evidentiary weight argument we’ve been using, the 99.9% requires 1000 experiments to establish but the 0.1% also requires 1000 experiments. They are “equal” in that sense.
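A quick numerical check of this cancellation, using the same combination rule as before:

```python
def combine(p, q):
    return p * q / (p * q + (1 - p) * (1 - q))

print(round(combine(0.99, 0.01), 6))    # 0.5: the 1% opinion exactly cancels the 99% one
print(round(combine(0.999, 0.001), 6))  # 0.5
```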

We can understand this numerically in terms of order of magnitude (OOM). The counterbalance against certainty (e.g. 99.9%) is a small number (e.g. 0.1%), which causes the failures to be ever smaller compared to the successes, as we’ve seen. For this case, 0.1% equates to 0.001, which has an order of magnitude of -3. 99.99% (counterbalance 0.01%) would have an order of magnitude of -4 and 10 times the evidence in favor of it. Any time we see an order of magnitude difference, we know we are dealing with experiments that have large discrepancies in their evidence, and their Bayesian combination will greatly favor the experiment with the more negative OOM.

To an extent we can also understand this in terms of decimal places. The more decimal places an opinion has, the more tests were done to confirm it, with each decimal place adding an order of magnitude to the number of tests.

This only works at the certainty end (99%, 99.9%, 99.99%, etc.) and the uncertainty end (1%, 0.1%, 0.01%, etc.) of the probability spectrum, which is why the OOM view is the more accurate way to look at it. If someone says they are 10.001% certain of something, they must have done at least 100,000 experiments to confirm that (10,001 succeeded, 89,999 failed). But in Bayes, this is not very different from someone saying 10% where only 10 experiments were conducted for confirmation (1 succeeded, 9 failed). When combined with a 99.999% opinion, which has the same number of decimal places, the result will be 99.991%, very close to 99.999%. The “decimal place heuristic” clearly only works at the ends of the spectrum (where the smaller opinions exhibit a difference in OOM).
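A quick check of this mid-spectrum behavior, again with the same combination rule:

```python
def combine(p, q):
    return p * q / (p * q + (1 - p) * (1 - q))

# 10.001% (at least 100,000 experiments) vs. a plain 10% (as few as 10 experiments):
print(round(combine(0.10001, 0.99999), 5))  # 0.99991
print(round(combine(0.10000, 0.99999), 5))  # 0.99991, essentially the same result
```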

Using decimal places to combine and evaluate experimental evidence

Nevertheless, this idea asks us to consider what happens when two opinions differ in the evidentiary weight behind them. Clearly an experiment with 1000 tests is not equivalent to one with 10. The Bayes equation, however, has no mechanism to judge the quality of the probabilities inserted into it except when those probabilities clearly differ in terms of their OOM.

So one idea to handle this is to weigh the opinions as follows:

  1. Use the regular Bayes equation to combine the two probabilities; this combined value is one limit.
  2. Use the experiment with the most evidence behind it (the most decimal places) to establish the other limit: its own reported probability.
  3. Decide on a factor for the difference in evidence between the two experiments using the decimal places of the reported probabilities. An 80.3% probability vs. 73% has one more decimal place and so gives the 80.3% opinion 10 times the weight of the 73% opinion; in general the factor is 10 raised to the difference in decimal places. Of course, if the probability comes with the sample size used to calculate it, we can just divide the sample sizes to obtain the factor.
  4. Use the factor just calculated as a weighting factor in placing the combined opinion between the lower and upper limit (see the sketch after the example below).

For example, using 80.3% and 73%, the plain Bayes combination is about 91.7%, while the weighted result comes out near 81%, much closer to the better-evidenced 80.3%.

Given how close the result is to the better-evidenced opinion, we might just consider skipping the calculation and using that opinion directly in situations where an order of magnitude or more separates the two experiments. The calculation might be more useful in cases where we can’t see any difference in decimal place accuracy but know that significantly different sample sizes were used.
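As a rough sketch of how steps 1 to 4 might be carried out, the code below assumes one plausible reading: a straight weighted average between the plain Bayes combination and the better-evidenced opinion, with a weight of 10 for each extra decimal place. The function names and the exact weighted-average form are illustrative assumptions, not a fixed prescription.

```python
def combine(p, q):
    return p * q / (p * q + (1 - p) * (1 - q))

def weighted_opinion(p_strong, p_weak, extra_decimal_places):
    # Assumed reading of steps 1 to 4: weight the better-evidenced opinion by
    # f = 10 per extra decimal place and average it with the Bayes combination.
    f = 10 ** extra_decimal_places
    p_bayes = combine(p_strong, p_weak)        # one limit (step 1)
    return (f * p_strong + p_bayes) / (f + 1)  # value between the two limits (step 4)

# 80.3% (one extra decimal place) vs. 73%:
print(round(combine(0.803, 0.73), 3))              # 0.917, the plain Bayes combination
print(round(weighted_opinion(0.803, 0.73, 1), 3))  # 0.813, pulled back toward 80.3%
```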

This idea is also useful for evaluating the relative merits of two experiments without necessarily combining them via Bayes. If two scientists disagree and one has significantly more experimental evidence, then we could use the above idea to perform a weighted average of their opinions.

Reporting results correctly using decimal places

This discussion assumes that probabilities will be reported to the correct number of decimal places. If we do 11 experiments and 9 succeed, we could claim

P = 9/11 = 0.818181818181

but this would represent the result with far too many decimal places, implying that many more experiments had been done to confirm it. The correct decimal representation is 0.8, reflecting that only about 10 experiments have taken place and so only one decimal place is justified.
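A minimal sketch of this reporting rule, assuming roughly one justified decimal place per order of magnitude in the number of experiments (the helper name report is an illustrative choice):

```python
def report(successes, trials):
    # Roughly one justified decimal place per order of magnitude of trials:
    # about 10 trials -> 1 place, about 100 -> 2 places, and so on.
    places = max(1, len(str(trials)) - 1)
    return round(successes / trials, places)

print(report(9, 11))      # 0.8 rather than 0.818181...
print(report(999, 1000))  # 0.999
```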

Therefore we need to watch for reports made with a suspiciously large number of decimal places. Sources should be encouraged to report their experimental results as fractions, where the denominator shows openly how many experiments were conducted.