In many discussions I'm pulled back to the distinction between not-guilty and innocent as a way to demonstrate how the burden of proof works and what the true default position should be in any given argument. A lot of people have no problem seeing the distinction, but many otherwise intelligent people for some reason don't.
In this article I explain why the distinction exists and why it matters, particularly in real-life scenarios where people try to shift the burden of proof.
Essentially, in my view the universe we are talking about is {uncertain, guilty, innocent}, therefore not-guilty is guilty′ (the complement of guilty), which is {uncertain, innocent}. Therefore innocent ⇒ not-guilty, but not-guilty ⇏ innocent.
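The same point as a trivial set computation (a minimal sketch using only the sets defined above, nothing extra):

```python
# The article's claim spelled out as set operations.
universe = {"uncertain", "guilty", "innocent"}
guilty = {"guilty"}
not_guilty = universe - guilty           # guilty' = {"uncertain", "innocent"}

innocent = {"innocent"}
print(innocent <= not_guilty)            # True:  innocent implies not-guilty
print(not_guilty <= innocent)            # False: not-guilty does not imply innocent
```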
When O. J. Simpson was acquitted, that doesn't mean he was found innocent; it means the prosecution could not prove his guilt beyond reasonable doubt. He was found not-guilty, which is not the same as innocent. It very well could be that the jury found the truth of the matter uncertain.
This notion has implications in many real-life scenarios: people try to shift the burden of proof when you reject a claim that hasn't been substantiated. They wrongly assume you are claiming their claim is false (equivalent to innocent), when in truth all you are doing is staying in the default position (uncertain).
Rejecting the claim that a god exists is not the same as claiming a god doesn't exist: it doesn't require a burden of proof because it's the default position. Agnosticism is the default position. The burden of proof is on the people making the claim.
Notes -
But that's not Bayesian. That's the whole point. And you accepted they use a single number to answer.
You: They use a single number for probabilities. They should use two, like 50% ± 20%.
Me: Yes, they use a single number. No, they shouldn't use two when they interpret probability as meaning subjective uncertainty. They should if they interpret it to mean something objective.
You: They don't learn from multiple coin tosses, they would need more than one number for that.
Me: They do learn. They use many numbers to compute.
You: They don't take uncertainty into account.
Me: They do, the probability is the uncertainty of the event.
You: 50% ± 20% is analogous to saying "blue" whereas saying 50% is analogous to saying "sky blue".
Me: Not if probability means uncertainty. Then 50% maps to "blue", and 50% ± 20% maps to nonsense.
You: My answer is correct.
Me: It depends on the question.
I'm not sure what's left here to discuss. I didn't get this follow up.
Right. Which is what that very sentence you half quoted explains.
They don't. The probability that the next coin flip is going to land heads is the same for 0/0, 50/50, and 5000/5000: it is 0.5. It does not get updated.

No. It's not. p=0.5 is not the uncertainty.

I didn't say that.

Which is not the case.

There is no other question. I am asking a specific question, and the answer is p=0.5, there's no "it depends". p=0.5 is the probability the next coin flip is going to land heads. Period.

I'm going to attempt to calculate the values for n heads and n tails with 95% confidence so there's no confusion about "the question":

0/0: 0.5±∞
5/5: 0.5±0.034
50/50: 0.5±0.003
5000/5000: 0.5±0.000

It should be clear now that there's no "the question". The answer for Bayesians is p=0.5, and they don't encode uncertainty at all.
If I ask for the probability that Putin is dead tomorrow, I'd say that fixes the date. You don't move "tomorrow" along with you so it never arrives. After the next coin flip happened, it either was heads or it wasn't, there's nothing left.
There is that word "probability" in the question, so of course how one interprets that word changes the question. If you disagree, give an argument. Instead, you are just repeating that your way of interpreting the word is the only way. I'd ask you to rephrase the question without using the words "the probability/chances/odds" or any such synonym. Then ask how a Bayesian would answer that version of the question, and see if the disagreement persists.
I know the definitions of probability, I know what probability is according to a Bayesian, I know what a likelihood function is, and I know what the actual probability of this example is, because I wrote a computer simulation with the actual probability embedded in it.
You are just avoiding the facts.
You know what probability is according to a Bayesian, and you think they are factually wrong. The rest of the problems stem from that. I'd suggest you at least focus your arguments on why you think they are objectively wrong. Instead, you inject your understanding of probability into their statements and conclude factually wrong things, like that they don't consider uncertainty when they do.
Then a Bayesian would be willing to answer the question of what that parameter you embedded in your simulation is, with answers like beta(51,51).
That is not what I'm saying.
False. You know what they answer, and it's a single number.
Uff, I even told you how it's done. It's like I just pressed "new chat" on ChatGPT. Re-read, or go Google "Bayesian inference coin flipping". It doesn't get more basic than that. I'm moving on, there's no progress to be made.
Show me step by step, I'll show you where you are wrong.
Say p ∼ beta(1, 1). Got 50/50 heads? Apply Bayes' rule, get posterior p ∼ beta(51, 1), so the next-toss probability of heads went from 50% to 51/52 ≈ 98%.
Wrong. It's beta(51,51). It's beta(heads+1, tails+1).

I understood 50/50 to mean 50 heads out of 50 attempts.
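As a minimal sketch of that beta(heads+1, tails+1) update (SciPy is assumed here only as a convenient way to represent the distribution; nothing in the thread depends on it):

```python
# Conjugate Beta-Binomial update: beta(a, b) + data -> beta(a + heads, b + tails).
from scipy import stats

def update(a, b, heads, tails):
    return a + heads, b + tails

a, b = 1, 1                      # uniform prior, beta(1, 1)
a, b = update(a, b, 50, 50)      # observe 50 heads and 50 tails
posterior = stats.beta(a, b)

print(a, b)                      # 51 51
print(posterior.mean())          # 0.5 -> P(next flip is heads)
print(posterior.interval(0.95))  # 95% credible interval around 0.5
```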
You said: "it's not just about the answer is given, it's about how the answer is encoded in your brain."
Good. If it's about their brains, it went from beta(1, 1) --> ... -> beta(51, 51). They learned.
If it was just about the answer (it's not), then even your improbable hypothetical of 50 heads out of 100 tosses fails, since after every odd number of tosses, the answer is not 50%. But hey, you can always cherry pick it further and establish they clone the coin and throw it 100 times at once. And you'll have shown that they are able to not learn for a weird definition of learning that only cares about changes in the answer to the specific set of different but similar questions (1st toss outcome vs 100th toss outcome).
No. A Bayesian doesn't answer beta(51, 51), he answers 0.5.
This is false. Bayesian calculations are quite capable of differentiating between epistemic and aleatory uncertainty. See the first Google result for "Bayes theorem biased coin" for an example.
(edit to add: not a mathematically perfect example; the real calculations here treat a bias as a continuous probability space, where a Bayesian update turns into an integral equation; discretizing into 101 bins instead, so you can use basic algebra, is in the grey area between "numerical analysis" and "cheating".)
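A rough sketch of that kind of 101-bin update, assuming a uniform grid prior and made-up data (the linked example's own numbers aren't reproduced here):

```python
# Discretized Bayes: restrict the unknown bias to 101 grid points and
# update bin by bin, then marginalize to get the "single value".
import numpy as np

thetas = np.linspace(0.0, 1.0, 101)    # candidate biases 0.00, 0.01, ..., 1.00
prior = np.full(101, 1 / 101)          # uniform prior over the grid

def update(prior, heads, tails):
    likelihood = thetas**heads * (1 - thetas)**tails
    posterior = prior * likelihood
    return posterior / posterior.sum()

posterior = update(prior, heads=50, tails=50)
p_heads_next = (posterior * thetas).sum()                 # ~0.5, the marginalized value
spread = np.sqrt((posterior * (thetas - p_heads_next)**2).sum())
print(p_heads_next, spread)            # the spread is the epistemic part 0.5 alone doesn't show
```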
Did you actually read that? It clearly says:
It's a single value.
I looked at the code of the simulation:
It's a single value.
I printed the variable at the end of the simulation:
It's a single value.
I read, and I understood, and I also looked at the graphs discretized over hundreds of values, and I'm able to understand that when a probability distribution has a mean value in ℝ, that does not mean that the probability distribution itself lies in ℝ, no matter how many times you repeat "single value" after you finish downvoting me. You seem to believe that any space on which a functional can give a single value is a one-dimensional space? This is again false.
Let's try going through an example more slowly:
A uniform prior on an unknown coin bias θ∈[0,1] (p₀≡1) will marginalize to a probability of 1/2 (∫p₀(θ)θdθ) for the next coin flip being heads, for example, and that mean is the exact same "single value" as a delta function (pᴅ≡δ(θ-1/2)) at p=1/2, for example (∫δ(θ-1/2)θdθ), but the delta function will give p=1/4 for the next two flips in a row being heads (∫δ(θ-1/2)θ²dθ) and the uniform prior will give p=1/3 (∫p₀(θ)θ²dθ).
Do a Bayesian update on the uniform prior after one flip of heads (pʜ(θ) = θp₀(θ)/∫φp₀(φ)dφ = 2θ), and it'll now say that p(heads next) is 2/3 (∫pʜ(θ)θdθ) and p(2 heads in the next 2 flips) is 1/2 (∫pʜ(θ)θ²dθ); do the same update on the delta function prior and it'll say that the next p(heads) is still 1/2 and the next p(2 heads) is still 1/4, because updating the delta function to θpᴅ(θ)/∫φpᴅ(φ)dφ just gives back the exact same delta function. They started with the exact same "single value", but they updated in different ways because that single value was just an integral of an entire probability distribution function, and Bayesian analysis saves the whole distribution to bring to an update.
Do another Bayesian update after a flip of tails following the flip of heads, and the originally-uniform prior will be back to p(heads next)=1/2 (for a third time, the exact same single value!)... but it won't be back to uniform (pʜᴛ(θ) = (1-θ)pʜ(θ)/∫(1-φ)pʜ(φ)dφ = 6θ(1-θ)); ask the posterior here for p(2 heads on the 3rd and 4th flip) and it'll be at 3/10; it's now squeezing probability mass much closer to fair than the uniform prior did, though it's not the same output as the delta function either.
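Those numbers can be checked mechanically; here is a small sketch using SymPy purely as a calculator for the integrals written above:

```python
# Verify the uniform-prior updates: 1/2, 1/3 -> 2/3, 1/2 -> 1/2, 3/10.
import sympy as sp

theta = sp.symbols('theta')
p0 = sp.Integer(1)                          # uniform prior on [0, 1]

def marginal(p, k):
    """P(next k flips are all heads) = integral of p(theta) * theta**k over [0, 1]."""
    return sp.integrate(p * theta**k, (theta, 0, 1))

print(marginal(p0, 1), marginal(p0, 2))     # 1/2, 1/3

p_h = theta * p0 / marginal(p0, 1)          # posterior after one heads: 2*theta
print(marginal(p_h, 1), marginal(p_h, 2))   # 2/3, 1/2

p_ht = (1 - theta) * p_h / sp.integrate((1 - theta) * p_h, (theta, 0, 1))
print(sp.expand(p_ht))                      # equivalent to 6*theta*(1 - theta)
print(marginal(p_ht, 1), marginal(p_ht, 2)) # 1/2, 3/10
```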
The difference between a probability distribution and a marginalization of a probability distribution is really important. There's actually a kernel of truth to complaining here: doing Bayes on "the entire universe" is intractable, so in practice we always marginalize away "anything we don't expect to matter" (notice even when representing a uniform prior over coin biases I still haven't accounted for "the coin might land on its edge" or "someone might grab it in midair", or...) when updating, and we have to watch out for accidentally ignoring too much. But just because you have to marginalize away a lot doesn't mean you're required to marginalize away everything!
You're the guy from "2+2≠4" who was having trouble with equivalence classes and Shannon information, right? I was feeling bad about how many people were just downvoting you and moving on, and at that point I wasn't one of them, but now I kind of get it: downvoting falsehoods and moving on is fast, correcting falsehoods is slow, and I fear sometimes dissuading further falsehoods is impossible. I'd love to find out that you can now compute a posterior pʜᴛʜ, apologize to everyone you've been rude to, and start asking questions rather than making false assertions when there's something you don't understand, but if not, well, I guess that is what the downvote button is for after all.
I don't understand what would make you think I believe that.
Yes, so this confirms what I'm saying: the result is a single value. I understand what you are saying: the function that was used to generate this single value is not the same as the original function, but I don't think you are understanding what I am saying.
You are operating from a bottom-up approach: you claim you have a perfectly rational method to arrive at the value of p=0.5. This method integrates all the observed evidence (e.g. 0/0, or 1/1 heads/tails), so any new evidence is going to adjust the value correctly; for example, one head is going to adjust the value differently if there have been 1/1 heads/tails versus 10/10 heads/tails. I do not disagree that this method arrives at a mathematically correct p=0.5.

But I'm operating from the other side: the top-down approach. If I need to make a decision, p=0.5 does not tell me anything. I don't care if the process that was used to generate p=0.5 is "mathematically correct"; I want to know how confident I should be in concluding that p is indeed close to 0.5.

If in order to make that decision I have to look at the function that was used to generate p=0.5, then that shows that p=0.5 is not useful: it's the function I should care about, not the result. My conclusion is the same: p=0.5 is useless.

Now, you can say "that's what Bayesians do" (look at the function, not the final value), but as I believe I already explained: I debated Scott Alexander, and his defense for making a shitty decision was essentially "I arrived at it in a Bayesian way". So, according to him, if p=0.9, that was a "good" decision, even if it turned out to be wrong.

So maybe Scott Alexander is a bad Bayesian, and maybe Bayesians don't use this single value to make decisions (I hope so), but I've seen enough evidence otherwise, so I'm not convinced Bayesians completely disregard this final single value. Especially when even the Stanford Encyclopedia of Philosophy explicitly says that's what Bayesians do.
I was not the one having trouble, they were. Or do you disagree that in computing you can't put infinite information in a variable of a specific type?
You are making the assumption that they are falsehoods, when you have zero justification for believing so, especially if you misinterpret what is actually being said.
It's the straightforward interpretation of
If you wanted to say "they don't encode uncertainty-about-uncertainty in the number 0.5", and not falsely imply that they don't encode uncertainty at all (0.5 is aleatory uncertainty integrated over epistemic uncertainty!) or that they don't encode all their uncertainty anywhere, you should have made that true claim instead.
You said of "They use many numbers to compute",
This is flatly false. I just gave you two examples, still at the "toy problem" level even, the first discretizing an infinite-dimensional probability measure and using 101 numbers to compute, the second using polynomials from an infinite-dimensional subspace to compute!
You said,
Which is false, because it ignores that the uncertainty is retained for further updates. That uncertainty is also typically published; that posterior probability measure is found in all sorts of papers, not just those graphs you ignored in the link I gave you. I'm sorry if not everybody calling themselves "Bayesian" always does that (though since you just ignored a link which did do that, you'll have to forgive me for not taking your word about it in other cases).
You said,
This is false. p=0.5 is what you need to combine with utilities to decide how to make an optimal decision without further data. If you have one binary outcome (like the coin flip case) then a single scalar probability does it, you're done. If you have a finite set of outcomes then you need |S|-1 scalars, and if you have an infinite set of outcomes (and/or conditions, if you're allowed to affect the outcome) you need a probability measure, but these are not things that Bayesians never do, they're things that Bayesians invented centuries ago.
This is trivially true in the end with any decision-making system over a finite set of possible decisions. You eventually get to "Do X1" or "Do X2". If you couldn't come up with that index as your result then you didn't make a decision!
If maximizing expected utility, you get that value from plugging marginalized probabilities times utilities and finding a maximum, so you need those probabilities to be scalar values, so scalar values is usually what you publish for decision-makers, in the common case where you're only estimating uncertainties and you're expecting others to come up with their own estimates of utilities. If you expect to get further data and not just make one decision with the data you have, you incorporate that data via a Bayesian update, so you need to retain probabilities as values over a full measure space, and so what you publish for people doing further research is some representation of a probability distribution.
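As a minimal sketch of that decision step, with made-up actions and utility numbers (none of these figures come from the thread):

```python
# Expected-utility decision from a marginalized probability: sum probability
# times utility per action, then take the maximum.
p_heads = 0.5                                   # the marginalized single value
actions = {
    "take_the_bet": {"heads": 3.0, "tails": -1.0},   # hypothetical 3:1 payout
    "decline":      {"heads": 0.0, "tails": 0.0},
}

def expected_utility(action):
    u = actions[action]
    return p_heads * u["heads"] + (1 - p_heads) * u["tails"]

best = max(actions, key=expected_utility)
print(best, expected_utility(best))             # take_the_bet 1.0
```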
Your title was literally "2 + 2 is not what you think", and as an example you used [2]+[2]=[4] in ℤ/4ℤ (with lazier notation), except you didn't know that there [0]=[4] so you just assumed it was definitively "0", then you wasted a bunch of people's time arguing against undergrad group theory.
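For concreteness, the modular arithmetic in question:

```python
# In Z/4Z the classes [4] and [0] are the same element, so "2 + 2 = 4"
# and "2 + 2 = 0" name one and the same fact.
print((2 + 2) % 4)            # 0
print((2 + 2) % 4 == 4 % 4)   # True: [4] == [0]
```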
What I disagreed with was
And I disagreed with it because it was flatly false! The foundation of information theory is I = -log(P); this is only I=1 (bit) if P=1/2, i.e. equal probability in the case of 1 bit of data. I gave you a case where I=1/6, and a more complicated case where I=0.58 or I=1.58, and you flatly refuted the latter case with "it cannot be more". It can. -log₂(P) does exceed 1 for P<1/2. If I ask you "are you 26 years old", and it's a lucky guess so you say "yes", you've just given me one bit of data encoding about 6 bits of information. The expected information in 1 bit can't exceed 1 (you're probably going to say "no" and give me like .006 bits of information), but that's not the same claim; you can't even calculate the expected information without a weighted sum including the greater-than-1 term(s) in the potential information.
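Assuming a round-number prior of 1/64 on a correct guess (an illustration, not the thread's exact figure), the arithmetic is:

```python
# Self-information of each answer vs the expected information of the question.
import math

p_yes = 1 / 64
p_no = 1 - p_yes

info_yes = -math.log2(p_yes)                    # 6.0 bits from a lucky "yes"
info_no = -math.log2(p_no)                      # ~0.023 bits from the likely "no"
expected = p_yes * info_yes + p_no * info_no    # ~0.12 bits; the expectation never exceeds 1
print(info_yes, info_no, expected)
```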
Distinctions are important! If you want to talk like a mathematician because you want to say accurate things then you need to say them accurately; if you want to be hand-wavy then just wave your hands and stop trying to rob mathematics for more credible-sounding phrasing. The credibility does not rub off, especially if instead of understanding the corrections you come up with a bunch of specious rationalizations for why they have "zero justification".
It's not.
No, you didn't. You showed how they could do it, that doesn't necessarily imply that's what they actually do in the real world.
Wrong. I'm not talking about mathematics or statistics, I'm talking about epistemology.
You don't seem to know what subject we are even talking about: Bayesian epistemology. How one particular rationalist makes a decision is not published.
Even more proof that people don't even listen to what is being actually said.