Exploring a new dataset of confidence calibration assessments
It’s really difficult to create models that can characterize the randomness of any interesting system.
We can’t even begin to write down all of the events relevant to predicting who’s going to win the World Series, the price at which the stock market will open tomorrow, or whether it will rain on Saturday.
When faced with complexity on this scale, we’re forced to make subjective probability judgements.
The probabilities we come up with are the product of our opinions and experience rather than formal mathematical models.
It’s possible, and often useful, to use statistical models to help derive subjective probabilities (as in Nate Silver’s election predictions) but these models and their results are ultimately the products of human judgement.
Setting subjective probabilities that accurately estimate the likelihood of events is critical for success in a wide range of fields with imperfect information, including investing, medical diagnosis, and political judgments. Subjective probabilities can even be at the heart of life-or-death decisions, such as the reliability of eyewitness testimony in criminal trials (“I’m 95% sure that was the guy”) and forecasts from the intelligence community (“Iraq definitely possesses weapons of mass destruction”).
Because making well-calibrated judgments can be critically important, there has been a lot of research on how to improve them.
One type of exercise researchers have used to test subjective probability judgements is a “calibrated probability assessment,” which involves asking subjects to answer questions and to indicate how confident they are in their answers.
Three years ago, Michael Mauboussin and I created a web-based probability calibration assessment inspired by the work of Philip Tetlock, a psychologist at the University of Pennsylvania who is a leading researcher on judgment. The assessment involves answering 50 true/false questions and expressing confidence in each answer.
The confidence values range from 50% (“I have no idea what the answer is”) to 100% (“I got this”), rounded to the nearest 10 percentage points (so the options are 50%, 60%, 70%, 80%, 90%, and 100%). The goal of the assessment is not necessarily to get every question right, even though it’s great if you do.
The goal is to express confidence that accurately reflects the probability that you will get each question correct. It’s hard to conclude much from the accuracy of any individual prediction, but a well-calibrated forecaster will be correct 60% of the time on the questions that she was 60% confident on, correct 70% on the questions she was 70% confident on, and so on. The quiz gives you an accuracy report once you finish it.
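To make the definition of calibration concrete, here is a minimal sketch of how a calibration report could be computed from a list of (confidence, correct) pairs. The function name and the sample answers are made up for illustration; this is not the code behind the assessment's own accuracy report.

```python
from collections import defaultdict

def calibration_table(responses):
    """Group (confidence, correct) pairs by confidence; return accuracy per level."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for confidence, correct in responses:
        totals[confidence] += 1
        hits[confidence] += int(correct)
    return {c: hits[c] / totals[c] for c in sorted(totals)}

# Made-up answers from one hypothetical quiz-taker:
# (stated confidence, was the answer right?)
responses = [
    (0.6, True), (0.6, False), (0.6, True), (0.6, False), (0.6, True),
    (0.9, True), (0.9, True), (0.9, False), (0.9, True), (0.9, False),
]

table = calibration_table(responses)
for confidence, accuracy in table.items():
    gap = confidence - accuracy  # a positive gap means overconfident at this level
    print(f"stated {confidence:.0%}  observed {accuracy:.0%}  gap {gap:+.0%}")
```

In this toy example the quiz-taker is well calibrated at 60% (three of five correct) but overconfident at 90%, where they are also right only three times out of five.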
Since its release, 11,000 people have answered at least 48 of the 50 questions on the assessment and submitted their results. So we have lots of data on how well calibrated these people are. This post goes through those results. Take the assessment now if you haven’t already so that you avoid biasing your answers!
The main takeaway is that people tend to be overconfident. The mean confidence was 70%, but the mean correctness was only 60%. The mean Brier score was .42.
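For readers unfamiliar with the Brier score, the sketch below shows one way it can be computed for true/false answers. There are two common conventions: the squared error on the chosen answer alone, and Brier's original formulation, which sums the squared error over both possible answers and is therefore exactly twice as large for binary questions. The post does not say which convention produced the figure above, so the `two_sided` flag is an assumption added for illustration, and the sample data is invented.

```python
def brier_score(responses, two_sided=False):
    """Mean squared error between stated confidence and the 0/1 outcome.

    responses: list of (confidence, correct) pairs, confidence in [0.5, 1.0].
    two_sided: if True, sum the error over both answers (original Brier form),
               which doubles the score for binary questions.
    """
    total = 0.0
    for confidence, correct in responses:
        outcome = 1.0 if correct else 0.0
        squared_error = (confidence - outcome) ** 2
        total += 2 * squared_error if two_sided else squared_error
    return total / len(responses)

# Made-up quiz: ten answers at 70% confidence, six of them correct.
toy = [(0.7, True)] * 6 + [(0.7, False)] * 4

# A respondent who says 50% on everything, right half the time.
coin_flipper = [(0.5, True)] * 5 + [(0.5, False)] * 5

print(brier_score(toy), brier_score(toy, two_sided=True))
print(brier_score(coin_flipper), brier_score(coin_flipper, two_sided=True))
```

Lower is better under either convention. In this toy case, being 70% confident and 60% correct scores the same as answering 50% on every question.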
This chart shows the average correctness for each confidence value. The grey line represents perfect calibration.
The same data in a table:
n_users ~ 11,000
n_responses ~ 550,000
These results are consistent with academic work on confidence calibration. For example, Stanovich, West, and Toplak gave a similar assessment to 200 Amazon Mechanical Turk users as part of their work on rational thinking. They found an average overconfidence of 9.1%, very close to our 10% result (see chapter 8 of their book, The Rationality Quotient, for more details).
What’s behind our overconfidence? This assessment is really testing two things. First, are you overestimating your knowledge? In other words, is your gut feeling of whether or not your answer is correct itself correct? If you are *sure* you remember your high school physics teacher telling you a neutron has a positive charge, you’re making a mistake here (or had a really bad physics teacher).
Second, can you translate the state of your knowledge into a probability? The assessment requires mapping your feeling of confidence onto a continuum of probabilities from 50–100%. It’s possible that subjects can consistently estimate how much they know, but that they make mistakes in translating those estimates into numerical values. Note that these errors will lead to probability judgements that are directionally correct (e.g., questions answered with 70% confidence will be correct more often than those answered with 60% confidence) despite being inaccurate (e.g., respondents are only correct 60% of the time they are 70% confident).
In statistics jargon, the first error is a source of variance and the second is a bias. They both play a role in probability judgements that are miscalibrated. We’ll now look at some of the theories that psychologists have come up with to explain these sources of overconfidence.
Why do we overestimate our knowledge?
The most popular explanation is a variant of confirmation bias. Confirmation bias is the tendency to reduce our mental costs by focusing on information that confirms our existing hypothesis at the expense of information that might provide evidence against that hypothesis.
Our default mode for estimating the validity of a hypothesis is seeing how many facts we can think of in support of our view, rather than carefully considering alternative views.
Researchers can exploit this disposition to guide their subjects into logical contradictions.
For instance, in one experiment people tended to say they were happier when asked, “Are you happy with your social life?” (this question brings to mind positive examples, such as the fun party you went to last weekend) than when they were asked the complement, “Are you unhappy with your social life?” (which directs your attention toward less cheerful thoughts) (Kunda et al. 1993).
In the case of true/false questions, we fall prey to confirmation bias by taking “ownership” of the first answer that comes to mind in a split second and then favoring evidence that supports it (page 164 of Stanovich, West, and Toplak 2016).
Here’s an example. One of the questions on the assessment asks whether Milan, Italy has a higher latitude (is further north) than Toronto, Canada. What’s the very first thing that comes to your mind after you read the question? If you’re like most people it’s something like this:
Canada: cold, snow, mountains.
Italy: Mediterranean, vineyards, river taxis.
If you don’t know very much about either Toronto or Milan, your split-second impression could come from their respective countries as depicted above. An automatic, and logical, rule of thumb for determining “which is further north” involves picking the city that would seem colder. Toronto is the more intuitive answer.
This fast judgement creates a sense of “ownership” over the “Toronto is further north” hypothesis and biases the slow cognition that follows. A fact supporting the intuitive hypothesis (“Canada is the furthest north place in North America”) is more likely to play a role in your judgement than information that cuts against it (“Toronto isn’t actually that cold, so maybe it’s not that far north”). The result is that we place too much confidence in the intuitive answer.
You know what’s coming now. Milan *is* further north than Toronto, and the data supports the idea that the intuitive judgements of our subjects were often wrong for this question. In this case, there is actually a negative correlation between reported confidence and correctness. In other words, the more conviction a respondent had, the more likely that they were wrong:
This chart shows all ~11,000 responses to this question grouped by their confidence values. Respondents who reported having no relevant information about the question (50% confidence) were correct 49% of the time, whereas the group that was 100% confident was correct only 20% of the time.
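One way to check for this kind of negative relationship on a single question is to compute the Pearson correlation between stated confidence and a 0/1 correctness flag. The sketch below does this on a handful of invented responses that loosely mimic the pattern described above; it is an illustration of the calculation, not the analysis that produced the chart.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented responses to one question: the 50% group is right half the time,
# the 100% group is right one time in five.
confidence = [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0]
correct = [1, 0, 1, 0, 1, 0, 0, 0, 0]

r = pearson(confidence, correct)
print(f"correlation between confidence and correctness: {r:.2f}")
```

A value below zero means that higher stated confidence goes with a lower chance of being right on that question.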
This Milan-Toronto question created a particularly large amount of overconfidence because the intuitive answer is wrong. Most of the time, the average person’s intuition isn’t as misleading. For example, when answering the question “Australia is larger in area than Brazil,” people seem to take ownership of each answer around 50% of the time, and are subsequently correct between 45–55% of the time at every level of confidence:
For other questions, the most common intuitive answer is correct and confidence has a strong correlation with correctness:
The data also show a tendency to guess “true” for questions that subjects didn’t know the answers to. When respondents reported having little to no information (50% confidence), they guessed “true” around 55% of the time (n>200,000).
Since the answers are roughly balanced between true and false, this means that subjects answered questions with the answer “true” correctly more often. This effect may also be caused by confirmation bias: lacking any strong intuitive response, quiz-takers could be “taking ownership” of the question statement at a higher rate than its complement.
It’s also possible, though, that people gravitate toward “true” for some other reason. Maybe true/false questions are true more often than not, and people have learned to tilt their answers in that direction.
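A 55% rate of “true” guesses may not sound far from an even split, but with this many responses it is very unlikely to be noise. The back-of-the-envelope check below plugs in the rounded figures from the text (55% and 200,000) and treats every response as independent, which is a simplifying assumption since each person contributed many answers.

```python
from math import sqrt

def z_score_vs_half(p_hat, n):
    """z statistic for an observed proportion against a null proportion of 0.5."""
    standard_error = sqrt(0.5 * 0.5 / n)
    return (p_hat - 0.5) / standard_error

# Rounded figures from the text; independence between responses is assumed.
z = z_score_vs_half(0.55, 200_000)
print(f"z = {z:.1f}")
```

The statistic comes out at roughly 45 standard errors from an even split, so even a heavy discount for correlated answers within each respondent would leave a clear tilt toward “true.”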
Even the most self-aware decision-makers are susceptible to confirmation bias, but there are some ways to counteract it.
One strategy is to explicitly “take ownership” of both sides before answering the question. In the case of the calibration assessment, this would involve gauging how you feel about both sides before settling on a confidence level.
This strategy requires extra self-awareness and time, but it can lead to more accurate predictions (Chapter 5 of Tetlock’s book, Superforecasting, describes this forecasting strategy in more detail).
Why can’t we come up with accurate subjective probabilities?
Some people have a grasp of frequentist statistics and are able to reason about the likelihood of events that recur many times by looking at past outcomes.
But both frequentist statistics and our intuition break down when it comes to the probability of single events. Amos Tversky, in conversation with Philip Tetlock, half-joked that most people only have three settings when thinking about single event probabilities: “gonna happen”, “not gonna happen”, and “maybe” (Superforecasting chapter 6).
This “three setting” mindset is very common in discourse that involves probabilities. For instance, pundits often treat Nate Silver’s predictions as binary judgements rather than probabilistic forecasts. So he was hailed for his accuracy in “correctly calling” all 50 states in the 2012 presidential election, and then lambasted for giving Hillary Clinton a 70% chance of winning in 2016.
In a disturbing example, Donald Trump exhibited the “three setting” bias when discussing the U.S.’s ability to defend against a nuclear weapon launched from North Korea. Referring to a countermeasure purported to be effective 97% of the time, he remarked: “if you send two of them, it’s going to get knocked out.”
For most of life, the three setting approach works perfectly fine. It doesn’t really matter if you truly understand the difference between a 70% or 80% chance of rain tomorrow. You should bring an umbrella either way. But for some activities it’s crucially important — including poker, sports betting, major national security decisions, and confidence calibration assessments.
People answering true/false questions fall into a “two setting” mindset when deciding on their confidence level: they either think that they know the answer (and select 100%) or that they have no idea (and select 50%). And since people have less intuition about values in the 60–90% range, performance doesn’t vary as much between them.
For example, people were correct 55% of the time when they were 60% confident, which isn’t too bad. But at 70% confidence, they were right only 57% of the time. The 10-point bump in confidence came with only a 2-point bump in correctness.
The “two setting” theory is corroborated by the distribution of confidence values. Respondents expressed 100% confidence much more frequently than anything in the 60–90% range, even though they were only right 78% of the time they did so:
(One caveat: the confidence selector on the website defaulted to 50%, which may have skewed the results.)
It’s possible to learn the difference between 60% and 70% events, but it takes a lot of practice. Simply making predictions isn’t enough to become calibrated; you also need to receive timely feedback on those predictions. Professional poker players play hundreds of thousands of hands over their careers and receive feedback immediately after each one, so over time they become very well calibrated on their win probability assessments. But this isn’t typical. Experts in most fields don’t produce precise predictions or get timely feedback, and so never become calibrated. There is excellent evidence, though, that such feedback improves calibration. In one study, subjects forecasted the probability of different weather events occurring over a six-hour period. After a year of receiving feedback on their forecasts, their forecasts were substantially more accurate:
Part 3 of Daniel Kahneman’s Thinking, Fast and Slow has a more detailed discussion on the many pitfalls of subjective predictions.
As an aside, one of the reasons that statistical methods such as logistic regression are so effective is that they receive feedback and update their parameters on each prediction they make during training, similar to the poker player. They are therefore able to output extremely well calibrated predictions (as long as they aren’t given any problems too different from what they were trained on).
Ideally never absorb information without predicting it first. Then you can update both 1) your knowledge but also 2) your generative model.
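The logistic regression aside can be illustrated with a small sketch. The code below fits a one-feature logistic regression by gradient descent on invented data, then checks one simple calibration property: on the training data, the average predicted probability ends up matching the observed rate of positive outcomes. For simplicity it updates the parameters once per pass over all examples rather than after each individual prediction, and everything here (data, learning rate, step count) is an assumption chosen for illustration.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def fit_logistic(xs, ys, lr=0.5, steps=5000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by full-batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Feedback step: compare each prediction with its outcome,
        # then nudge the parameters to shrink the average error.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

# Invented, deliberately noisy data: y mostly follows the sign of x,
# but every fifth label is flipped so the classes overlap.
xs = [(i - 20) / 10 for i in range(41)]
ys = [(1 if x > 0 else 0) ^ (1 if i % 5 == 0 else 0) for i, x in enumerate(xs)]

w, b = fit_logistic(xs, ys)
predictions = [sigmoid(w * x + b) for x in xs]
mean_prediction = sum(predictions) / len(predictions)
base_rate = sum(ys) / len(ys)

print(f"mean predicted probability: {mean_prediction:.3f}")
print(f"observed rate of positives: {base_rate:.3f}")
```

This only shows agreement on average over the training set, which is a weaker property than being well calibrated at every confidence level or on new data.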
The difficulty of estimating confidence is affected by knowledge
I measured overconfidence by taking each respondent’s average confidence and subtracting their average correctness. Unfortunately, this means that the same overconfidence value can mean different things depending on how good you are at answering true/false questions. To see why, imagine someone who had no idea what any of the answers were. It’s easy for them to achieve perfect calibration: they can express 50% confidence for every question. Perfect calibration is similarly easy for people with perfect knowledge, who can always select 100% confidence.
Pretty much everyone falls between these two extremes, but the more questions for which you have complete knowledge (or zero knowledge) the easier it is to estimate your confidence. An ideal experiment would therefore control for subjects’ knowledge on each of the questions (Parker and Stone 2014 describe the critique in more detail).
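The two extremes are easy to see in code. This minimal sketch computes overconfidence as average confidence minus average correctness for three invented respondents: one with no knowledge, one with perfect knowledge, and one in between. The respondents and their answers are hypothetical, and the no-knowledge guesser is assumed to land on exactly half of the answers by luck.

```python
def overconfidence(responses):
    """Average stated confidence minus fraction of answers that were correct."""
    mean_confidence = sum(c for c, _ in responses) / len(responses)
    mean_correct = sum(1 for _, ok in responses if ok) / len(responses)
    return mean_confidence - mean_correct

# No knowledge: 50% confidence everywhere, right on exactly half by luck.
no_knowledge = [(0.5, i % 2 == 0) for i in range(50)]

# Perfect knowledge: 100% confidence everywhere, always right.
perfect_knowledge = [(1.0, True)] * 50

# In between: 80% confidence everywhere, right on 6 of every 10 questions.
partial_knowledge = [(0.8, i % 10 < 6) for i in range(50)]

for name, responses in [
    ("no knowledge", no_knowledge),
    ("perfect knowledge", perfect_knowledge),
    ("partial knowledge", partial_knowledge),
]:
    print(f"{name}: {overconfidence(responses):+.2f}")
```

Both extremes score zero without any real skill at estimating confidence, while the respondent in the middle is the only one who has to do the hard part.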
People may be cheating/not taking the assessment seriously
I filtered out people who didn’t complete at least 48 of the 50 questions, so each submission reflects a relatively substantial investment of time. I also removed the quizzes where all of the answers were correct (since this indicates cheating) or that never expressed more than 50% confidence.
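The three filtering rules can be written down as a single predicate. The sketch below assumes each submission is a list of answered questions, with hypothetical keys `confidence` and `correct`, and that unanswered questions are simply missing from the list; the real data layout may differ.

```python
def keep_submission(answers, min_answered=48):
    """Return True if a submission passes the three filters described above.

    answers: list of dicts with hypothetical keys
             'confidence' (0.5 to 1.0) and 'correct' (bool),
             one per answered question.
    """
    if len(answers) < min_answered:
        return False  # too few questions answered
    if all(a["correct"] for a in answers):
        return False  # a perfect score is treated as likely cheating
    if all(a["confidence"] <= 0.5 for a in answers):
        return False  # never expressed more than 50% confidence
    return True

# Invented submissions to exercise each rule.
typical = [{"confidence": 0.7, "correct": i % 2 == 0} for i in range(50)]
too_short = typical[:40]
all_correct = [{"confidence": 1.0, "correct": True} for _ in range(50)]
never_confident = [{"confidence": 0.5, "correct": i % 2 == 0} for i in range(50)]

for name, submission in [
    ("typical", typical),
    ("too short", too_short),
    ("all correct", all_correct),
    ("never confident", never_confident),
]:
    print(name, keep_submission(submission))
```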
Not everyone is overconfident
The story above is of systematic overconfidence. However, it’s important to note that all of the analysis reflects distributions and averages rather than individuals. Confirmation bias doesn’t always make people choose the more intuitive answer; it just makes it more likely that they will. And although the average response was overconfident, 15% of the people who took the assessment were actually under-confident. In other words, their correctness exceeded their confidence. The histogram below shows the distribution of average confidence minus average correctness for all users:
And despite the general inclination toward “two setting” probabilistic thinking, around 25% of people got every question right when they were 100% confident.
There is also some research exploring correlations between confidence calibration assessment scores and other traits. Narcissistic people tend to be more overconfident and depressed people tend to be less overconfident. The only personal data I have is from Google Analytics so I just looked at that. First, mean overconfidence by operating system:
OS          Mean Overconfidence   N
Linux       7.69%                 157
Windows     9.46%                 4979
Macintosh   9.95%                 1658
iOS         10.12%                2842
Android     11.13%                1287
Mobile users are more overconfident, possibly because they take less time to answer the questions and are less careful (unfortunately I don’t have data on completion times). Next, I look at mean overconfidence by country (N > 100):
Country          Mean Overconfidence   N
United States    9.33%                 6235
Canada           9.39%                 828
Germany          9.43%                 145
Brazil           10.10%                207
Australia        10.20%                609
United Kingdom   10.58%                788
Sweden           11.18%                176
Switzerland      11.57%                115
Netherlands      12.11%                372
India            12.94%                257
The United States is the least overconfident when the results are broken down by country, but this effect may be because some of the questions are somewhat US-centric (e.g. “A U.S. quart is equivalent to 24 fluid ounces,” “The capital of New Mexico is Albuquerque”) and it is easier to avoid overconfidence if you have more knowledge of the questions.
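The mobile versus desktop comparison can be made explicit by pooling the rows of the operating system table, weighting each mean by its sample size. The figures below are copied from that table; treating Linux, Windows, and Macintosh as desktop and iOS and Android as mobile is an assumption made for this sketch.

```python
# (operating system, mean overconfidence in %, number of users), from the table above
os_rows = [
    ("Linux", 7.69, 157),
    ("Windows", 9.46, 4979),
    ("Macintosh", 9.95, 1658),
    ("iOS", 10.12, 2842),
    ("Android", 11.13, 1287),
]

# Assumed grouping: these two are mobile, everything else is desktop.
MOBILE = {"iOS", "Android"}

def pooled_mean(rows):
    """Sample-size-weighted mean of the per-group means."""
    total_n = sum(n for _, _, n in rows)
    return sum(mean * n for _, mean, n in rows) / total_n

desktop = pooled_mean([row for row in os_rows if row[0] not in MOBILE])
mobile = pooled_mean([row for row in os_rows if row[0] in MOBILE])
overall = pooled_mean(os_rows)

print(f"desktop: {desktop:.2f}%")
print(f"mobile:  {mobile:.2f}%")
print(f"overall: {overall:.2f}%")
```

Pooled this way, the gap between mobile and desktop is a little under one percentage point, and the overall weighted mean lands close to the 10% average overconfidence reported earlier.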
We collected the results of 11,000 confidence calibration exercises. The results show an average overconfidence of 10% and are consistent with other research on confidence calibration. Please reach out at firstname.lastname@example.org if you are interested in the data set or in collaborating on further analysis!