Stanford Encyclopedia of Philosophy
Simpson’s Paradox
First published Wed Mar 24, 2021; substantive revision Sat Jun 6, 2026
Simpson’s Paradox is a statistical phenomenon where an
association between two variables in a population emerges, disappears
or reverses when the population is divided into subpopulations. For
instance, two variables may be positively associated in a population,
but be independent or even negatively associated in all
subpopulations. Cases exhibiting the paradox are unproblematic from
the perspective of mathematics and probability theory, but
nevertheless strike many people as surprising. Additionally, the
paradox has implications for a range of areas that rely on
probabilities, including decision theory, causal inference, and
evolutionary biology. Finally, there are many instances of the
paradox, including in epidemiology and in studies of discrimination,
where understanding the paradox is essential for drawing the correct
conclusions from the data.
The following article provides a mathematical analysis of the paradox,
explains its role in causal reasoning and inference, compares theories
of what makes the paradox seem paradoxical, and surveys its
applications in different domains.
1. Introduction
2. Definition and Mathematical Characterization
2.1 Varieties of Simpson’s Paradox
2.2 Necessary and Sufficient Conditions
3. Simpson’s Paradox and Causal Inference
3.1 Probabilistic Causality and Simpson’s Paradox
3.2 Specific Debates: Causal Interaction, Average Effects, Mediators
3.3 DAGs and Causal Identifiability
3.4 Confounding and Pearl’s Analysis of the Paradox
3.5 Implications
4. What Makes Simpson’s Paradox Paradoxical?
5. Applications
5.1 Non-Categorical Data and Linear Regression
5.2 Epidemiology and Meta-Analysis
5.3 Decision Theory and the Sure-Thing Principle
5.4 Philosophy of Biology and Natural Selection
5.5 Policy Questions: Interpreting Data on Discrimination
5.6 Using Statistics to Evaluate Task Performance
6. Conclusions
Bibliography
Academic Tools
Other Internet Resources
Related Entries
1. Introduction
We begin with an illustration of the paradox with concrete data. The
numbers in
Table 1
summarize the effect of a medical treatment for the overall
population (N = 52), and separately for men and women:
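Groups: Full Population (\(N=52\)); Men (\(\r{M}\), \(N=20\)); Women (\(\neg \r{M}\), \(N=32\)). Outcomes: Success (\(\r{S}\)), Failure (\(\neg \r{S}\)).

| Condition | Full: Success | Full: Failure | Full: Success Rate | Men: Success | Men: Failure | Men: Success Rate | Women: Success | Women: Failure | Women: Success Rate |
|---|---|---|---|---|---|---|---|---|---|
| Treatment (T) | 20 | 20 | 50% | 8 | 5 | ≈ 61% | 12 | 15 | ≈ 44% |
| Control (¬T) | 6 | 6 | 50% | 4 | 3 | ≈ 57% | 2 | 3 | ≈ 40% |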
Table 1: Simpson’s Paradox: the
type of association at the population level (positive, negative,
independent) changes at the level of subpopulations. Numbers taken
from Simpson’s original example (1951).
For matters of exposition, we assume that these frequencies are
unbiased estimates of the underlying probabilities. The treatment
looks ineffective at the level of the overall population, but it leads
to higher success percentages than the control both for men and for
women (61% vs. 57% for men and 44% vs. 40% for women). Writing these
proportions as conditional probabilities, with \(\r{T}\)=treatment,
\(\r{S}\)=success/recovery, and \(\r{M}\)=male subpopulation, we
obtain
\[ p(\r{S}\mid \r{T}) = p(\r{S}\mid \neg \r{T}) \]
but at the same time,
\[\begin{align*}
p(\r{S}\mid \r{T}, \r{M}) & \gt p(\r{S}\mid \neg \r{T}, \r{M} ) \\
p(\r{S}\mid \r{T}, \neg \r{M}) &\gt p(\r{S}\mid \neg \r{T}, \neg \r{M})
\end{align*}\]
Should we use the treatment or not? When we know the gender of the
patient, we would presumably administer the treatment, whereas it does
not look like the right thing to do when we don’t know the
patient’s gender—although we know that the patient is
either male or female!
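The figures in Table 1 can be checked with a few lines of Python. The following is a minimal sketch using exact fractions; the dictionary layout and the helper names `success_rate` and `pooled_counts` are illustrative choices, not part of the original presentation.

```python
from fractions import Fraction

# Counts from Table 1 as (successes, failures) per group and condition.
# The dictionary layout and helper names are illustrative assumptions.
TABLE_1 = {
    "men": {"treatment": (8, 5), "control": (4, 3)},
    "women": {"treatment": (12, 15), "control": (2, 3)},
}


def success_rate(successes, failures):
    """Exact success proportion for one (successes, failures) pair."""
    return Fraction(successes, successes + failures)


def pooled_counts(condition):
    """Sum successes and failures over both subpopulations."""
    successes = sum(group[condition][0] for group in TABLE_1.values())
    failures = sum(group[condition][1] for group in TABLE_1.values())
    return successes, failures


# Success rates in the full population and within each subpopulation.
overall = {c: success_rate(*pooled_counts(c)) for c in ("treatment", "control")}
by_group = {
    g: {c: success_rate(*cells) for c, cells in conditions.items()}
    for g, conditions in TABLE_1.items()
}

print("overall:", overall)
print("by group:", by_group)
```

The pooled rates coincide while the treatment rate exceeds the control rate within each gender, which is the pattern expressed by the equation and the two inequalities above.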
This phenomenon was first pointed out in papers by Karl Pearson
(1899) and George U. Yule (1903), but it was Simpson’s short
paper “The interpretation of interaction in contingency
tables” (1951), discussing the interpretation of such
association reversals, that led to the phenomenon being labeled as
“Simpson’s Paradox”. The phenomenon is, however,
broader than independence in the overall population and positive
association in the subpopulations; for example, the associations may
also be reversed. Nagel and Cohen (1934: ch. 16) provide an example of
such a reversal as part of an exercise for logic students.
Understanding the paradox is essential for drawing the proper
conclusions from statistical data. To give a recent example involving
the paradox (Kügelgen, Gresele, & Schölkopf 2021), early
data revealed that the case fatality rate for Covid-19 was higher in
Italy than in China overall. Yet within every age group the fatality
rate was higher in China than in Italy. One thus appears to get
opposite conclusions about the comparative severity of the virus in
the countries depending on whether one compares the whole populations
or the age-partitioned populations. Having a proper analysis of what
is going on in such cases is thus crucial for using statistics to
inform policy.
In what follows,
Section 2
explains different varieties of the paradox, clarifies the logical
relationships between them, and identifies precise conditions for when
the paradox can occur. While that section focuses on the mathematical
characterization of the paradox,
Section 3
focuses on its role in causal inference, its implications for
probabilistic theories of causality, and its analysis by means of
causal models based on directed acyclic graphs (DAGs: Spirtes,
Glymour, & Scheines 2000; Pearl 2000 [2009]).
Based on these different approaches,
Section 4
discusses different analyses of what makes Simpson’s Paradox
look paradoxical, and what kind of error it reveals in human
reasoning. This section also reports empirical findings on the
prevalence of the paradox in reasoning and inference.
Section 5
surveys the occurrence and interpretation of the paradox in applied
statistics (regression models), philosophy of biology, decision theory
and public policy. For example, Simpson’s Paradox is relevant
when analyzing data to test for race or gender discrimination (Bickel,
Hammel, & O’Connell 1975).
Section 6
wraps up our findings and concludes.
2. Definition and Mathematical Characterization
This section shows how Simpson’s Paradox can be characterized
mathematically, under which conditions it occurs, and how it can be
avoided. We begin by further considering the concrete example from the
introduction in order to build intuitions that will guide us through
the more technical results.
The data in
Table 1
can be translated into success or recovery rates, showing that
treated men have a higher recovery rate than untreated men (roughly
61% vs. 57%), and the same for women (44% vs. 40%). Two observations
are key to understanding why this positive association vanishes in the
aggregate data. First, the recovery rate of untreated men is still
higher than the recovery rate of women who receive treatment (57% vs.
44%), suggesting that not only treatment, but also gender is a
relevant predictor of recovery. Second, while the treatment group is
majority female (27 vs. 13), the control group is majority male (7 vs.
5). Speaking informally, the lack of population-level correlation
between treatment and recovery results from men being both (i) more
likely to recover from the treatment, and (ii) less likely to be in
the treatment group.
This becomes evident when we use conditional probabilities to
represent recovery rates given treatment and/or subpopulation. The
overall recovery rates given treatment and control can, by the Law of
Total Probability, be written as the weighted average of recovery
rates in the subpopulations:
\[\begin{align*}
p(\r{S}\mid \r{T}) &= p(\r{S}\mid \r{T},\r{M}) p(\r{M}\mid \r{T}) + p(\r{S}\mid \r{T}, \neg \r{M}) p(\neg \r{M}\mid \r{T}) \\
p(\r{S}\mid \neg \r{T}) &= p(\r{S}\mid \neg \r{T},\r{M}) p(\r{M}\mid \neg \r{T}) + p(\r{S}\mid \neg \r{T}, \neg \r{M}) p(\neg \r{M}\mid \neg \r{T})\end{align*}\]
Plugging in the numbers from
Table 1
to calculate the overall recovery rates via these equations, we see
that the first line is a weighted average of success rates for treated
men and women (61% and 44%) while the second line is a weighted
average of success rates of the two control groups (57% and 40%).
These averages are weighted by the percentage of males and females in
each group, and in the present case the gender disparity between the
groups results in both averages being 50%. Since these weights can be
different, the treatment may raise the probability of success among
males and females without doing so in the combined population.
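To make the weighted-average reading concrete, the two equations can be evaluated with the numbers from Table 1. This is a small sketch with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

# Conditional probabilities read off Table 1 (variable names are illustrative).
p_s_t_m = Fraction(8, 13)    # p(S | T, M)
p_s_t_w = Fraction(12, 27)   # p(S | T, not M)
p_m_t = Fraction(13, 40)     # p(M | T)

p_s_c_m = Fraction(4, 7)     # p(S | not T, M)
p_s_c_w = Fraction(2, 5)     # p(S | not T, not M)
p_m_c = Fraction(7, 12)      # p(M | not T)

# Law of Total Probability: weighted average over the gender partition.
p_s_t = p_s_t_m * p_m_t + p_s_t_w * (1 - p_m_t)
p_s_c = p_s_c_m * p_m_c + p_s_c_w * (1 - p_m_c)

print(p_s_t, p_s_c)
```

Both averages come out at exactly one half, even though each treated subgroup has the higher success rate, because the weight placed on men differs between the treatment group (13/40) and the control group (7/12).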
Later we will show that the positive association in the subpopulations
cannot vanish if the correlation of treatment with gender is broken
(e.g., by balancing gender rates in both conditions). The weights in
each line are then identical—\(p(\r{M}\mid \r{T}) = p(\r{M}\mid
\neg \r{T})\)—and associations in subpopulations are preserved
for the aggregate data
(Theorem 1 in Section 2.2).
In fact, the absence of such a correlation rules out Simpson’s
Paradox.
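The effect of equalizing the weights can be illustrated numerically. The sketch below keeps the subpopulation success rates from Table 1 but imposes the same hypothetical share of men in both conditions; the function name `reweighted_rate` and the grid of shares are assumptions made for illustration only.

```python
from fractions import Fraction

# Subpopulation success rates from Table 1.
RATES = {
    "treatment": {"men": Fraction(8, 13), "women": Fraction(12, 27)},
    "control": {"men": Fraction(4, 7), "women": Fraction(2, 5)},
}


def reweighted_rate(condition, share_men):
    """Success rate if a fraction `share_men` of the condition were male."""
    rates = RATES[condition]
    return rates["men"] * share_men + rates["women"] * (1 - share_men)


# Apply the same (hypothetical) gender share to both conditions.
shares = [Fraction(k, 10) for k in range(11)]
gaps = [
    reweighted_rate("treatment", w) - reweighted_rate("control", w)
    for w in shares
]

print([float(g) for g in gaps])
```

For every common share of men, the treatment rate stays above the control rate: with identical weights, a weighted average of larger numbers is larger.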
In what follows, we interpret Simpson’s paradox as a property of
association between variables, expressed by conditional probabilities.
This perspective is not uncontentious. For Spanos (2021), it amounts
to a deductive take on the paradox, contrary to the original
understanding of the paradox as a challenge for inductive,
statistical learning. In Pearson’s and Yule’s original
papers (and in applications involving linear regression models, see
Section 5.1),
the paradox is about models and data: estimating a model parameter
yields spurious correlations that vanish when one refines the model
and includes further variables. For Spanos, the paradox emerges as a
consequence of statistical model misspecification. We get back to this
statistics-centered perspective on Simpson’s paradox at the end
of
Section 4.
2.1 Varieties of Simpson’s Paradox
Simpson’s Paradox can occur for various types of data, but
classically, it is formulated with respect to \(2\times2\) contingency
tables. Let \(D_i = (a_i, b_i, c_i, d_i)\) be a four-dimensional
vector of real numbers representing the \(2\times2\) contingency table
for treatment and success in the i-th subpopulation, and
let
\[D = \sum_{i=1}^N D_i = \left(\sum a_i, \sum b_i, \sum c_i, \sum d_i\right)\]
be the aggregate data set over \(N\) subpopulations. These data should
be read as shown in
Table 2.
Columns are grouped by Population \(D = D_1 + D_2\), Subpopulation \(D_1\), and Subpopulation \(D_2\); \(S\) = success, \(\neg S\) = failure.

| | \(D\): Success (\(S\)) | \(D\): Failure (\(\neg S\)) | \(D_1\): Success (\(S\)) | \(D_1\): Failure (\(\neg S\)) | \(D_2\): Success (\(S\)) | \(D_2\): Failure (\(\neg S\)) |
|---|---|---|---|---|---|---|
| Treatment (\(T\)) | \(a_1 + a_2\) | \(b_1 + b_2\) | \(a_1\) | \(b_1\) | \(a_2\) | \(b_2\) |
| No Treatment (\(\neg T\)) | \(c_1 + c_2\) | \(d_1 + d_2\) | \(c_1\) | \(d_1\) | \(c_2\) | \(d_2\) |
Table 2: Abstract representation of a
\(2 \times 2\) contingency table with subpopulations \(D_1\) and
\(D_2\).
Let \(\alpha (D_i)\) be a measure of the strength of the probabilistic
association between \(T\) and \(S\) in population
\(D_i\).[1]
By convention, \(\alpha (D_i) = 0\) corresponds to no association
between the variables, \(\alpha (D_i) \gt 0\) indicates a positive
association, and \(\alpha (D_i) < 0\) a negative one. This can best
be translated into the condition
\[\begin{align*}
\tag{1}
\alpha (D_i) &
\begin{cases}
> 0 & \qquad \text{if and only if} \qquad a_i \, d_i > b_i \, c_i; \\
= 0 & \qquad \text{if and only if} \qquad a_i \, d_i = b_i \, c_i; \\
< 0 & \qquad \text{if and only if} \qquad a_i \, d_i < b_i \, c_i.
\end{cases}\end{align*}\]
The condition \(a_i \, d_i > b_i \, c_i\) is equivalent to saying
that the success rate in the first row (“treatment
condition”) is higher than the success rate in the second row
(“control condition”):
\[ a_i/(a_i+b_i) > c_i/(c_i+d_i).\]
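The equivalence follows from clearing denominators. A short derivation, assuming that both row totals are positive (\(a_i + b_i > 0\) and \(c_i + d_i > 0\)):

\[\begin{align*}
\frac{a_i}{a_i+b_i} > \frac{c_i}{c_i+d_i}
&\iff a_i \, (c_i + d_i) > c_i \, (a_i + b_i) \\
&\iff a_i \, c_i + a_i \, d_i > a_i \, c_i + b_i \, c_i \\
&\iff a_i \, d_i > b_i \, c_i.
\end{align*}\]

The same steps with \(=\) or \(<\) in place of \(>\) cover the cases of no association and negative association.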
Applying all this to our dataset in
Table 1,
we see that \(\alpha(D) = 0\) although \(\alpha(D_1) > 0\) and
\(\alpha(D_2) > 0\). This is a special case of what Samuels (1993)
calls Association Reversal (AR). Association reversal
occurs if and only if there is a population such that the association
in all partitioned subpopulations is either (i) positive, (ii)
negative, or (iii) zero, and the type of association in the population
does not match that of the subpopulations. Writing this out
mathematically, this means for a dataset \(D = \sum_{i=1}^N D_i\) that
one of the following two conditions holds,
\[\begin{align*}
\alpha(D) &\le 0 \qquad \text{and} & \alpha(D_i) &\ge 0 \qquad \forall \; 1 \le i \le N \tag{AR1}\\
\alpha(D) &\ge 0 \qquad \text{and} & \alpha(D_i) &\le 0 \qquad \forall \; 1 \le i \le N \tag{AR2}\end{align*}\]
where at least one of the inequalities has to be strict. Association
reversal is the standard variety of Simpson’s Paradox
(Bandyopadhyay et al. 2011; Blyth 1972, 1973) and also the one that is
most frequently investigated in the psychology of reasoning, or by
philosophers analyzing the paradox (e.g., Cartwright 1979; Eells 1991;
Malinas 2001).
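Conditions (AR1) and (AR2) translate directly into a test on the tables. The sketch below assumes the cross-product difference \(ad - bc\) as the association measure \(\alpha\), which has the sign behavior required by condition (1); the function names are illustrative.

```python
def alpha(table):
    """Cross-product difference a*d - b*c for a table (a, b, c, d)."""
    a, b, c, d = table
    return a * d - b * c


def aggregate(tables):
    """Cell-wise sum of the subpopulation tables."""
    return tuple(sum(cells) for cells in zip(*tables))


def association_reversal(tables):
    """True if (AR1) or (AR2) holds with at least one strict inequality."""
    total = alpha(aggregate(tables))
    parts = [alpha(t) for t in tables]
    ar1 = total <= 0 and all(p >= 0 for p in parts)
    ar2 = total >= 0 and all(p <= 0 for p in parts)
    some_strict = total != 0 or any(p != 0 for p in parts)
    return (ar1 or ar2) and some_strict


# Table 1 as (a, b, c, d): men, then women.
table_1 = [(8, 5, 4, 3), (12, 15, 2, 3)]
print(association_reversal(table_1))
```

For Table 1 the check succeeds via (AR1): both subpopulation associations are strictly positive while the aggregate association is zero.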
An important special case of AR occurs when there is no association in
the subpopulations, but an association emerges in the overall
dataset:
\[\begin{align*}
\alpha(D_i) &= 0 \qquad \forall \; 1 \le i \le N \qquad \text{but} & \alpha(D) &\ne 0 \tag{YAP}\end{align*}\]
Referring to the pioneering work of the statistician George U. Yule
(1903: 132–134), Mittal (1991) calls this Yule’s
Association Paradox (YAP). It is typical of spurious
correlations between variables with a common cause, that is, variables
that are dependent unconditionally (\(\alpha(D) \ne 0\)) but
independent given the values of the common cause (\(\alpha(D_i) =
0\)). For example, sleeping in one’s clothes is correlated with
having a headache the next morning. However, once we stratify the data
according to the levels of alcohol intake on the previous night, the
association vanishes: given the same level of drunkenness, people who
undress before going to bed will have the same headache, ceteris
paribus, as those who kept their clothes on.
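A toy dataset shows how such a common-cause structure produces YAP. The counts below are invented purely for illustration (they are not empirical data), with \(T\) = slept in clothes and \(S\) = headache the next morning, and with the cross-product difference \(ad - bc\) assumed as the association measure.

```python
def alpha(table):
    """Cross-product difference a*d - b*c for a table (a, b, c, d)."""
    a, b, c, d = table
    return a * d - b * c


# Invented counts (a, b, c, d) = (T&S, T&not-S, not-T&S, not-T&not-S).
sober = (1, 9, 10, 90)   # headaches are rare, few sleep in their clothes
drunk = (72, 8, 9, 1)    # headaches are common, most sleep in their clothes

pooled = tuple(x + y for x, y in zip(sober, drunk))

print(alpha(sober), alpha(drunk), alpha(pooled))
```

Within each alcohol level the association is exactly zero, yet the pooled table shows a strong positive association, as (YAP) requires.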
Finally, the most general version of Simpson’s Paradox is the
Amalgamation Paradox (AMP) identified by Good and
Mittal (1987). This paradox occurs when the overall degree of
association is bigger (or smaller) than each degree of association in
the subpopulations, or mathematically,
\[\begin{align*}
\alpha(D) &> \max_{1 \le i \le N} \alpha(D_i) \qquad \text{or} & \alpha(D) &< \min_{1 \le i \le N} \alpha(D_i). \tag{AMP}
\end{align*}\]
AMP challenges the intuition that the degree of association in the
general population, in virtue of being “the sum” of the
individual subpopulations, has to fall in between the minimal and the
maximal degree of association observed on that level. The logical
strength of the paradoxes is inversely related to their generality and
frequency of occurrence: \(\text{YAP} \Rightarrow \text{AR}
\Rightarrow \text{AMP}\). Variations of the paradox for
non-categorical data (e.g., bivariate real-valued data) will be
discussed in
Section 5.1.
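The (AMP) condition can be checked in the same style once a graded measure is fixed. The sketch below assumes the difference in success rates as \(\alpha\) and applies it to the data of Table 1; the function names are illustrative.

```python
from fractions import Fraction


def rate_difference(table):
    """Success rate under treatment minus success rate under control."""
    a, b, c, d = table
    return Fraction(a, a + b) - Fraction(c, c + d)


def amalgamation_paradox(tables, measure):
    """True if the pooled association lies outside the subpopulation range."""
    pooled = tuple(sum(cells) for cells in zip(*tables))
    parts = [measure(t) for t in tables]
    overall = measure(pooled)
    return overall > max(parts) or overall < min(parts)


# Table 1 as (a, b, c, d): men, then women.
table_1 = [(8, 5, 4, 3), (12, 15, 2, 3)]
print(amalgamation_paradox(table_1, rate_difference))
```

Here the pooled difference is zero, below both subpopulation differences, so the association reversal in Table 1 is also an instance of AMP.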
2.2 Necessary and Sufficient Conditions
We proceed to characterizing the mathematical conditions under which
Simpson’s Paradox occurs. We have already suggested that the
paradox arises in the medical example due to correlations between the
treatment variable and the partitioning variable, and we can now make
this more precise:
Theorem 1 (Lindley & Novick 1981; Mittal 1991):
If \(\alpha(D) > 0\) and association reversal occurs for the
subpopulations characterized by attribute \(\r{M}\) and \(\neg\r{M}\)
(i.e., \(\alpha(D_1), \alpha(D_2) \le 0\)), then either
(i) \(\r{M}\) is positively related to \(\r{S}\) and \(\r{T}\); or
(ii) \(\r{M}\) is positively related to \(\neg\r{S}\) and \(\neg\r{T}\).
As Theorem 1 makes clear, the lack of correlation between \(\r{M}\)
and \(\r{T}\) is sufficient to rule out association reversals (and
thus YAP as well). Does it also rule out the more general amalgamation
paradox? The answer to this depends on which measure of
association one chooses for \(\alpha\). Discussions of
Simpson’s Paradox commonly treat association as the
difference in the success rate between the treated and the
untreated, but this is only one of many possibilities (Fitelson 1999).
While the lack of association between \(M\) and \(T\) is sufficient to
rule out AMP for most measures (including the difference measure) it
does not rule it out for all measures, as we will now explain. Readers
not interested in the specific details may skip to the following
section.
Here are some widely used association measures for a dataset \((a, b,
c, d)\):
\[\begin{align*}
\pi_{D} &= \frac{a}{a+b} - \frac{c}{c+d} & \pi_{Y} &= \frac{ad -bc}{N^2}\\
\pi_{R} &= \log \left(\frac{a}{a+b} \cdot \frac{c+d}{c} \right) & \pi_{W} &= \log \left(\frac{a}{a+c} \cdot \frac{b+d}{b} \right) \\
\pi_{O} &= \log \frac{ad}{bc} & \pi_{C} &= \log \left(\frac{d}{c+d} \cdot \frac{a+b}{b} \right) \end{align*}\]
Some of these measures can be formulated probabilistically and have
been suggested as measures of causal strength and outcome measures for
clinical trials (Edwards 1963; Eells 1991; Fitelson & Hitchcock
2011; Greenland 1987; Peirce 1884; Sprenger 2018; Sprenger &
Stegenga 2017). For example, \(\pi_{D} = p(\r{S}\mid \r{T}) -
p(\r{S}\mid \neg \r{T})\) represents the difference and \(\pi_R =
\log [p(\r{S}\mid \r{T}) / p(\r{S}\mid \neg \r{T})]\) the log-ratio of success
rates in treatment and control conditions. \(\pi_W\) can be
interpreted as the prognostic weight of evidence that treatment
provides for success (i.e., as the log-Bayes factor), \(\pi_{Y}\) is
Yule’s (1903) measure of association (where \(N = a+b+c+d\) is the
size of the dataset, not the number of subpopulations), \(\pi_{O}\) is the
log-odds ratio familiar from epidemiological data analysis, and
\(\pi_C\) is I.J. Good’s (1960) measure of causal
strength.
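Several of these measures are easy to compute from the four cell counts. The following sketch implements \(\pi_D\), \(\pi_R\), \(\pi_O\), \(\pi_Y\) and \(\pi_W\) as displayed above, reading \(N\) in \(\pi_Y\) as the table total \(a+b+c+d\); the Python function names are illustrative, and all cells are assumed to be positive so that the logarithms are defined.

```python
import math


def pi_d(a, b, c, d):
    """Difference of success rates."""
    return a / (a + b) - c / (c + d)


def pi_r(a, b, c, d):
    """Log-ratio of success rates."""
    return math.log((a / (a + b)) * ((c + d) / c))


def pi_o(a, b, c, d):
    """Log-odds ratio."""
    return math.log((a * d) / (b * c))


def pi_y(a, b, c, d):
    """Yule's measure, with N taken to be the table total."""
    n = a + b + c + d
    return (a * d - b * c) / n**2


def pi_w(a, b, c, d):
    """Log-Bayes factor that treatment provides for success."""
    return math.log((a / (a + c)) * ((b + d) / b))


MEASURES = {"pi_D": pi_d, "pi_R": pi_r, "pi_O": pi_o, "pi_Y": pi_y, "pi_W": pi_w}

# Table 1: pooled data, men, women.
TABLES = {"pooled": (20, 20, 6, 6), "men": (8, 5, 4, 3), "women": (12, 15, 2, 3)}

for label, table in TABLES.items():
    print(label, {name: round(f(*table), 4) for name, f in MEASURES.items()})
```

On the data of Table 1 all five measures agree in sign: positive for men and for women, and zero (up to floating-point rounding) for the pooled data.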
We now consider the extent to which AMP for different measures is
ruled out by different experimental designs. Suppose that individuals
are uniformly assigned to the treatment and control condition across
subpopulations. In such a case, where the ratio of persons assigned to
the treatment and control condition is equal for each subpopulation,
the experimental design is called row-uniform.
Specifically, there has to be a \(\lambda > 0\) such that for any
subpopulation i
\[ a_i + b_i = \lambda (c_i+d_i) \tag{Row Uniformity} \]
In particular, row uniformity holds approximately if our sample is
large and we sample at random from the population.
Row-uniform design of a trial ensures independence between a potential
confounder \(M\) and the treatment variable \(T\). Accordingly, by
Theorem 1,
it rules out association reversals. Additionally, row-uniform design
is sufficient to rule out the AMP for a wide class of association
measures:
Theorem 2 (Good & Mittal 1987): If a dataset \(D
= \sum D_{i}\) satisfies row uniformity, then the Amalgamation Paradox
is avoided for the measures \(\pi_{D}\), \(\pi_{R}\), \(\pi_{Y}\),
\(\pi_{W}\) and \(\pi_{C}\). It is not avoided for the
log-odds ratio \(\pi_{O}\).
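A small constructed example illustrates both halves of this result. The counts below are hypothetical and chosen only so that the design is row-uniform with \(\lambda = 1\); the sketch compares odds ratios directly, which is equivalent to comparing \(\pi_O\) because the logarithm is monotone. The function names are illustrative.

```python
from fractions import Fraction


def is_row_uniform(tables):
    """True if (a_i + b_i) / (c_i + d_i) is the same for every table."""
    ratios = {Fraction(a + b, c + d) for a, b, c, d in tables}
    return len(ratios) == 1


def rate_difference(table):
    """pi_D: success rate under treatment minus rate under control."""
    a, b, c, d = table
    return Fraction(a, a + b) - Fraction(c, c + d)


def odds_ratio(table):
    """exp(pi_O): the odds ratio ad / bc."""
    a, b, c, d = table
    return Fraction(a * d, b * c)


# Hypothetical row-uniform data: 100 treated and 100 controls per stratum.
tables = [(90, 10, 50, 50), (50, 50, 10, 90)]
pooled = tuple(sum(cells) for cells in zip(*tables))

print(is_row_uniform(tables))
print([rate_difference(t) for t in tables], rate_difference(pooled))
print([odds_ratio(t) for t in tables], odds_ratio(pooled))
```

The rate difference is 2/5 in each stratum and in the pooled data, so no amalgamation paradox arises for \(\pi_D\). The odds ratio, by contrast, is 9 in each stratum but 49/9 in the pooled data, which falls below the minimum over the strata and thus instantiates AMP for \(\pi_O\).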
Some studies also exhibit column-uniform design where
the proportion of successes and failures is constant across all
subpopulations:
\[ a_i + c_i = \lambda (b_i+d_i) \tag{Column Uniformity} \]
\(\r{M}\) is then also independent of \(\r{S}\). Column uniformity can
occur in case-control studies with various subpopulations (e.g.,
different hospitals) where one does not match the number of persons
with the explanatory attribute, as one would in an RCT. Instead, for each
person with a certain attribute (e.g., a specific form of cancer), one
selects a number of persons who do not have this attribute.
Column-uniform design avoids AR as well, but among the presented
association measures, it suffices to rule out AMP only for
\(\pi_Y\).
Avoids AMP?

| Association Measure | \(\pi_{D}\) | \(\pi_{R}\) | \(\pi_{O}\) | \(\pi_{Y}\) | \(\pi_{W}\) | \(\pi_{C}\) |
|---|---|---|---|---|---|---|
| Row-uniform design | yes | yes | no | yes | yes | yes |
| Column-uniform design | no | no | no | yes | no | no |
| Both | yes | yes | yes | yes | yes | yes |
Table 3: An overview of how row- and
column-uniform design avoid the amalgamation paradox for various
association measures.
Table 3
summarizes the properties of all association measures with respect to
the AMP and the different forms of experimental design. The behavior
of the log-odds measure \(\pi_O\), where neither row- nor
column-uniform design suffices to rule out the AMP, will be discussed
in
Section 5.2.
We now identify one last fundamental condition for when data exhibit
association reversal. Have a look at
Figure 1
which displays the success proportions for treatment and control
graphically.
Figure 1: A geometrical representation
of a necessary condition for the occurrence of Association Reversal.
The paradox can occur if the proportions are ordered like in the left
graph; it cannot occur if they are ordered like in the right graph.
[An
extended description of figure 1
is in the supplement.]
In both examples, the treatment success rate is greater than the
control success rate in both subpopulations. When will this
order be preserved at the overall level? We know that the overall
success rate for each condition (treatment/control) is
constrained by the success rates in the subpopulations:
Fact 1: Suppose \(a_i, b_i > 0\) for all \(1 \le i
\le N\). Then also
\[\begin{align*}\tag{2}
\min \frac{a_i}{a_i+b_i} \le \frac{\sum_{j=1}^N a_j}{\sum_{j=1}^N (a_j+b_j)} \le \max \frac{a_i}{a_i+b_i}
\end{align*}\]
This fact follows directly from the Law of Total Probability (proof
omitted) and it gives us a simple necessary condition for the
occurrence of Association Reversal (AR): turning to
Figure 1
again, it implies that the overall success rate per condition has to
be on the solid lines. Thus AR cannot occur in the right part
of
Figure 1,
but it can occur if the proportions are ordered as in the left part
of
Figure 1.
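Fact 1 can be verified on the data of Table 1. The pooled success rate is a weighted average of the subpopulation rates, with weights proportional to the subpopulation sizes \(a_i + b_i\), and so must lie between the smallest and the largest of them. The sketch below uses an illustrative helper name.

```python
from fractions import Fraction


def pooled_rate_within_bounds(rows):
    """Check inequality (2) for rows given as (a_i, b_i) with a_i, b_i > 0."""
    rates = [Fraction(a, a + b) for a, b in rows]
    pooled = Fraction(sum(a for a, _ in rows), sum(a + b for a, b in rows))
    return min(rates) <= pooled <= max(rates)


# (successes, failures) for men and women in Table 1.
treatment_rows = [(8, 5), (12, 15)]
control_rows = [(4, 3), (2, 3)]

print(pooled_rate_within_bounds(treatment_rows))
print(pooled_rate_within_bounds(control_rows))
```

In Table 1 the pooled rate of 1/2 lies between 4/9 and 8/13 for the treatment condition, and between 2/5 and 4/7 for the control condition.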
Generally, AR is avoided when the following condition holds:
\[\tag{RH}
\begin{align*}
\max_{1 \le i \le N} \frac{a_i}{a_i+b_i} & < \min_{1 \le i \le N} \frac{c_i}{c_i+d_i} \\
\text{ or } \hspace{5.5em}\\
\min_{1 \le i \le N} \frac{a_i}{a_i+b_i} & > \max_{1 \le i \le N} \frac{c_i}{c_i+d_i}
\end{align*}
\]
Any dataset that satisfies
(RH)
will be called row-homogenous. By contrast, for any
given set of proportions violating condition
(RH),
we can find datasets exhibiting these very same proportions such that
AR indeed occurs (by fiddling with the size of the subpopulations;
Lemma 3.1 in Mittal 1991). However, neither row homogeneity, nor the
analogous condition of column homogeneity, nor their conjunction is
sufficient for avoiding the amalgamation paradox AMP.
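Condition (RH) amounts to asking whether the ranges of treatment and control success rates are disjoint. A minimal sketch, with an illustrative function name and one hypothetical dataset added for contrast:

```python
from fractions import Fraction


def is_row_homogeneous(tables):
    """Check (RH) for tables (a, b, c, d): the rate ranges do not overlap."""
    treated = [Fraction(a, a + b) for a, b, c, d in tables]
    control = [Fraction(c, c + d) for a, b, c, d in tables]
    return max(treated) < min(control) or min(treated) > max(control)


# Table 1: men, then women.
table_1 = [(8, 5, 4, 3), (12, 15, 2, 3)]

# Hypothetical data in which every treated rate exceeds every control rate.
separated = [(9, 1, 5, 5), (8, 2, 4, 6)]

print(is_row_homogeneous(table_1))
print(is_row_homogeneous(separated))
```

Table 1 violates (RH) because the success rate of untreated men (4/7) exceeds that of treated women (4/9), which is why association reversal is possible there. In the hypothetical second dataset the ranges are disjoint, so no choice of subpopulation sizes could produce AR.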
Finally, one might be interested in how frequently the paradox arises.
Simulations by Pavlides and Perlman (2009) suggest that it should not
occur frequently: the confidence interval for the probability of AR is
a subset of the interval \([0;0.03]\) for both the uniform prior and
the (objective) Jeffreys prior. Of course, the practical value of this
diagnosis depends on whether the sampling assumptions are sensible,
and whether the entire approach makes sense for real-life datasets
where researchers can group the data into subpopulations along
numerous dimensions.
3. Simpson’s Paradox and Causal Inference
Within the philosophical literature, Simpson’s Paradox received
sustained attention due to its implications for accounts of causality
that posit systematic connections between causal relationships and
probability-raising. Specifically, the paradox reveals that facts
about probability-raising will not necessarily be preserved when one
partitions a population into subpopulations. This poses a number of
important challenges to philosophical accounts of causal inference
based on the concept of probability:
What is the appropriate set of background factors for determining
when a probabilistic relationship is causal?
What do association reversals imply for causal inference?
Does Simpson’s Paradox threaten the objectivity of causal
relationships?
Strategies for treating the paradox and answering these questions have
contributed substantially to the development of theories of
probabilistic causality (Cartwright 1979; Eells 1991). A different set
of answers is provided by more recent work on the paradox in the
framework of graphical causal models (e.g., Pearl 1988, 2000 [2009];
Spirtes et al. 2000), and we will discuss both accounts in turn. In
particular, we will explain how Simpson’s Paradox can be
analyzed through the notions of confounding and the identifiability of
a causal effect.
3.1 Probabilistic Causality and Simpson’s Paradox
Early accounts of probabilistic causation (e.g., Reichenbach 1956;
Suppes 1970) sought to explicate causal claims purely in terms of
probabilistic and temporal facts. On Suppes’ (1970) account,
event \(\r{C}\) is a prima facie cause of \(\r{E}\) if and
only if (i) \(\r{C}\) occurs before \(\r{E}\) and (ii) \(\r{C}\)
raises the probability of
\(\r{E}\).[2]
As we have already seen in
Section 2.1,
not all prima facie causes are genuine causes. If I drink a
strong blond Belgian beer now, I will probably be happy during the
day, but also have a headache tomorrow. However, being happy would not
thereby be the cause of the headache: the correlation is explained by
the common cause—the beer drinking. The variable for drinking
the beer screens off the probabilistic relationship between
its effects, meaning that the effects will be uncorrelated when one
conditions on it. The crux of Suppes’ account is that a
prima facie causal relationship between \(\r{C}\) and
\(\r{E}\) is a genuine causal relationship iff there is no factor \(\r{F}\)
prior to \(\r{C}\) that screens off \(\r{C}\) from
\(\r{E}\).[3]
Later theorists such as Cartwright (1979) and Eells (1991) developed
this condition by making causal claims relative to a causally
homogenous background context, which is specified by a set of
variables \(\b{K}\). Consider the following example of association
reversal presented by Cartwright. Supposing that smoking \((\r{S})\)
is a cause of heart disease \((\r{H})\), one might expect that smoking
would raise the probability of heart disease. Yet this might not be
the case. Suppose that in a population there is a strong correlation
between smoking and exercising \((X)\), and that exercise lowers the
probability of heart disease by more than smoking raises its
probability. In such a case, smoking might lower the probability of
heart disease although conditional on either \(X\) or \(\neg X\),
\(\r{S}\) raises \(\r{H}\)’s probability.
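Numbers of this kind are easy to construct. The counts below are invented solely to illustrate the structure of Cartwright's example and carry no empirical weight; each table lists (smokers with heart disease, smokers without, non-smokers with, non-smokers without).

```python
from fractions import Fraction


def disease_rates(table):
    """Return (rate among smokers, rate among non-smokers) for one table."""
    a, b, c, d = table
    return Fraction(a, a + b), Fraction(c, c + d)


# Invented counts: most smokers exercise, most non-smokers do not.
exercisers = (16, 64, 2, 18)
non_exercisers = (12, 8, 40, 40)
pooled = tuple(x + y for x, y in zip(exercisers, non_exercisers))

for label, table in [
    ("exercisers", exercisers),
    ("non-exercisers", non_exercisers),
    ("pooled", pooled),
]:
    smokers, non_smokers = disease_rates(table)
    print(label, float(smokers), float(non_smokers))
```

Within each exercise level, smokers have the higher rate of heart disease (20% vs. 10% and 60% vs. 50%), yet in the pooled data smokers have the lower rate (28% vs. 42%), because smoking is strongly correlated with the protective factor.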
Cartwright interprets this case as follows: causes always raise the
probability of their effects, but this can be “concealed”
by the correlation between the cause and some other variable (here,
\(X\)). In order to isolate the genuine probabilistic relationship
between \(\r{C}\) and \(\r{E}\), one needs to consider it in a context
where such correlations cannot occur:
Probabilistic Causality (Cartwright) Let \(\b{K}\)
denote all and only the causes of \(\r{E}\) other than \(\r{C}\)
and effects of \(\r{C}\). Then \(\r{C}\) causes \(\r{E}\) if and
only if relative to all combinations of values of the variables in \(\b{K}\),
\(\r{C}\) raises the probability of \(\r{E}\): \(p(\r{E}\mid
\r{C},\b{K}) > p(\r{E}\mid \neg\r{C},\b{K})\).
While Suppes defends a reductive account of probabilistic
causality, where the elements of \(\b{K}\) are determined without
appeal to causal assumptions, Cartwright presents a
non-reductive account where \(\b{K}\) must include all and
only the causes of \(\r{E}\), excluding \(C\) itself and any variables
that are causally intermediate between \(\r{C}\) and \(\r{E}\). The
current consensus is that it is impossible to give a probabilistic
account of causation without relying on causal concepts, and thus
that no reductive account is feasible (though see Spohn 2012 for a
dissenting view).
Although non-reductive accounts could not be used to explain causation
to someone with no prior causal knowledge, they can nevertheless
clarify how causal claims are tested, and illuminate the relationship
between causation and probability (see also Woodward 2003:
20–22). Moreover, Cartwright argues that her general criterion
for inclusion of background factors in \(\b{K}\) avoids the reference
class problem for purely statistical accounts of causal explanation,
which arises when probabilistic facts arbitrarily depend on the way
one partitions a population into subpopulations. Through specifying
the relevant populations for evaluating causal claims, she aims to
eliminate a threat to the objectivity of causal explanation. More
detail is provided in the
entry on probabilistic causality.
3.2 Specific Debates: Causal Interaction, Average Effects, Mediators
Cartwright’s innovations for probabilistic accounts of causality
have triggered various debates related to Simpson’s Paradox. We
highlight three of them here:
Debate 1: Causal Interaction
Cartwright claims that causes raise the probabilities of their effects
across all background
contexts,[4]
but many purported causes only raise the probabilities of their
effects in some contexts. In the latter cases, causes
interact with background factors in producing their effects.
To give Cartwright’s own example (1979: 428), ingesting an acid
poison generally causes death, except in contexts where one also
ingests an alkali poison (in which case the two cancel one another
out). The problem of such interactive causes for probabilistic
accounts is that they threaten Cartwright’s picture on which t
Link preview
Friends of the SEP Society - Preview of Simpson's Paradox PDF
Stanford Encyclopedia of Philosophy Browse Table of Contents What's New Random Entry Chronological Archives About Editorial Information About the SEP Editorial Board How to Cite the SEP Special Characters Advanced Tools Contact Support SEP Support the SEP PDFs for SEP Friends Make a Donation SEPIA for Libraries Entry Contents Bibliography Academic Tools Friends PDF Preview Author and Citation Info Back to Top Simpson’s ParadoxFirst published Wed Mar 24, 2021; substantive revision Sat Jun 6, 2026 Simpson’s Paradox is a statistical phenomenon where an association between two variables in a population emerges, disappears or reverses when the population is divided into subpopulations. For instance, two variables may be positively associated in a population, but be independent or even negatively associated in all subpopulations. Cases exhibiting the paradox are unproblematic from the perspective of mathematics and probability theory, but nevertheless strike many people as surprising. Additionally, the paradox has implications for a range of areas that rely on probabilities, including decision theory, causal inference, and evolutionary biology. Finally, there are many instances of the paradox, including in epidemiology and in studies of discrimination, where understanding the paradox is essential for drawing the correct conclusions from the data. The following article provides a mathematical analysis of the paradox, explains its role in causal reasoning and inference, compares theories of what makes the paradox seem paradoxical, and surveys its applications in different domains. 1. Introduction 2. Definition and Mathematical Characterization 2.1 Varieties of Simpson’s Paradox 2.2 Necessary and Sufficient Conditions 3. 
Simpson’s Paradox and Causal Inference 3.1 Probabilistic Causality and Simpson’s Paradox 3.2 Specific Debates: Causal Interaction, Average Effects, Mediators 3.3 DAGs and Causal Identifiability 3.4 Confounding and Pearl’s Analysis of the Paradox 3.5 Implications 4. What Makes Simpson’s Paradox Paradoxical? 5. Applications 5.1 Non-Categorical Data and Linear Regression 5.2 Epidemiology and Meta-Analysis 5.3 Decision Theory and the Sure-Thing Principle 5.4 Philosophy of Biology and Natural Selection 5.5 Policy Questions: Interpreting Data on Discrimination 5.6 Using Statistics to Evaluate Task Performance 6. Conclusions Bibliography Academic Tools Other Internet Resources Related Entries 1. Introduction We begin with an illustration of the paradox with concrete data. The numbers in Table 1 summarize the effect of a medical treatment for the overall population (N = 52), and separately for men and women: Full Population, \(\bf N=52\) Men \(\bf(\r{M})\), \(\bf N=20\) Women \(\bf(\neg \r{M})\), \(\bf N=32\) Success \(\bf(\r{S})\) Failure \(\bf(\neg \r{S})\) Success Rate Success Failure Success Rate Success Failure Success Rate Treatment (T) 20 20 50% 8 5 ≈ 61% 12 15 ≈ 44% Control (¬T) 6 6 50% 4 3 ≈ 57% 2 3 ≈ 40% Table 1: Simpson’s Paradox: the type of association at the population level (positive, negative, independent) changes at the level of subpopulations. Numbers taken from Simpson’s original example (1951). For matters of exposition, we assume that these frequencies are unbiased estimates of the underlying probabilities. The treatment looks ineffective at the level of the overall population, but it leads to higher success percentages than the control both for men and for women (61% vs. 57% for men and 44% vs. 40% for women). 
Writing these proportions as conditional probabilities, with \(\r{T}\) = treatment, \(\r{S}\) = success/recovery, and \(\r{M}\) = male subpopulation, we obtain

\[ p(\r{S}\mid \r{T}) = p(\r{S}\mid \neg \r{T}) \]

but at the same time,

\[\begin{align*}
p(\r{S}\mid \r{T}, \r{M}) & \gt p(\r{S}\mid \neg \r{T}, \r{M} ) \\
p(\r{S}\mid \r{T}, \neg \r{M}) &\gt p(\r{S}\mid \neg \r{T}, \neg \r{M})
\end{align*}\]

Should we use the treatment or not? When we know the gender of the patient, we would presumably administer the treatment, whereas it does not look like the right thing to do when we don’t know the patient’s gender, even though we know that the patient is either male or female!

This phenomenon was first pointed out in papers by Karl Pearson (1899) and George U. Yule (1903), but it was Simpson’s short paper “The interpretation of interaction in contingency tables” (1951), discussing the interpretation of such association reversals, that led to the phenomenon being labeled as “Simpson’s Paradox”. The phenomenon is, however, broader than independence in the overall population and positive association in the subpopulations; for example, the associations may also be reversed. Cohen and Nagel (1934: ch. 16) provide an example of such a reversal as part of an exercise for logic students.

Understanding the paradox is essential for drawing the proper conclusions from statistical data. To give a recent example involving the paradox (Kügelgen, Gresele, & Schölkopf 2021), early data revealed that the case fatality rate for Covid-19 was higher in Italy than in China overall. Yet within every age group the fatality rate was higher in China than in Italy. One thus appears to get opposite conclusions about the comparative severity of the virus in the countries depending on whether one compares the whole populations or the age-partitioned populations. Having a proper analysis of what is going on in such cases is thus crucial for using statistics to inform policy.

In what follows, Section 2 explains different varieties of the paradox, clarifies the logical relationships between them, and identifies precise conditions for when the paradox can occur. While that section focuses on the mathematical characterization of the paradox, Section 3 focuses on its role in causal inference, its implications for probabilistic theories of causality, and its analysis by means of causal models based on directed acyclic graphs (DAGs: Spirtes, Glymour, & Scheines 2000; Pearl 2000 [2009]). Based on these different approaches, Section 4 discusses different analyses of what makes Simpson’s Paradox look paradoxical, and what kind of error it reveals in human reasoning. This section also reports empirical findings on the prevalence of the paradox in reasoning and inference. Section 5 surveys the occurrence and interpretation of the paradox in applied statistics (regression models), philosophy of biology, decision theory and public policy. For example, Simpson’s Paradox is relevant when analyzing data to test for race or gender discrimination (Bickel, Hammel, & O’Connell 1975). Section 6 wraps up our findings and concludes.

2. Definition and Mathematical Characterization

This section shows how Simpson’s Paradox can be characterized mathematically, under which conditions it occurs, and how it can be avoided. We begin by further considering the concrete example from the introduction in order to build intuitions that will guide us through the more technical results.

The data in Table 1 can be translated into success or recovery rates, showing that treated men have a higher recovery rate than untreated men (roughly 61% vs. 57%), and the same for women (44% vs. 40%). Two observations are key to understanding why this positive association vanishes in the aggregate data. First, the recovery rate of untreated men is still higher than the recovery rate of women who receive treatment (57% vs.
44%), suggesting that not only treatment, but also gender is a relevant predictor of recovery. Second, while the treatment group is majority female (27 vs. 13), the control group is majority male (7 vs. 5). Speaking informally, the lack of population-level correlation between treatment and recovery results from men being both (i) more likely to recover from the treatment, and (ii) less likely to be in the treatment group.

This becomes evident when we use conditional probabilities to represent recovery rates given treatment and/or subpopulation. The overall recovery rates given treatment and control can, by the Law of Total Probability, be written as the weighted average of recovery rates in the subpopulations:

\[\begin{align*}
p(\r{S}\mid \r{T}) &= p(\r{S}\mid \r{T},\r{M}) p(\r{M}\mid \r{T}) + p(\r{S}\mid \r{T}, \neg \r{M}) p(\neg \r{M}\mid \r{T}) \\
p(\r{S}\mid \neg \r{T}) &= p(\r{S}\mid \neg \r{T},\r{M}) p(\r{M}\mid \neg \r{T}) + p(\r{S}\mid \neg \r{T}, \neg \r{M}) p(\neg \r{M}\mid \neg \r{T})
\end{align*}\]

Plugging in the numbers from Table 1 to calculate the overall recovery rates via these equations, we see that the first line is a weighted average of success rates for treated men and women (61% and 44%) while the second line is a weighted average of success rates of the two control groups (57% and 40%). These averages are weighted by the percentage of males and females in each group, and in the present case the gender disparity between the groups results in both averages being 50%. Since these weights can be different, the treatment may raise the probability of success among males and females without doing so in the combined population.

Later we will show that the positive association in the subpopulations cannot vanish if the correlation of treatment with gender is broken (e.g., by balancing gender rates in both conditions).
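As a numerical check of the weighted-average decomposition, the Python sketch below evaluates both lines with the Table 1 frequencies. It then repeats the calculation with one common male share for both arms. The names `p_success`, `p_male`, `overall_rate` and `balanced_share` are our own, and the balanced share of 20/52 (the proportion of men in the full sample) is an assumption chosen purely for illustration; any common share would serve.

```python
from fractions import Fraction

# p(S | arm, sex) from Table 1; "T" = treatment, "C" = control,
# "M" = men, "W" = women.
p_success = {
    ("T", "M"): Fraction(8, 13),
    ("T", "W"): Fraction(12, 27),
    ("C", "M"): Fraction(4, 7),
    ("C", "W"): Fraction(2, 5),
}

# p(M | arm): 13 of the 40 treated patients and 7 of the 12 controls are men.
p_male = {"T": Fraction(13, 40), "C": Fraction(7, 12)}


def overall_rate(arm, male_share):
    """Law of Total Probability: average the two subgroup rates,
    weighted by the male and female shares of the given arm."""
    return (p_success[(arm, "M")] * male_share
            + p_success[(arm, "W")] * (1 - male_share))


# Observed design: the male share differs between the arms.
observed = {arm: overall_rate(arm, p_male[arm]) for arm in ("T", "C")}

# Hypothetical balanced design (assumption): same male share in both arms.
balanced_share = Fraction(20, 52)
balanced = {arm: overall_rate(arm, balanced_share) for arm in ("T", "C")}

print("observed:", {arm: float(rate) for arm, rate in observed.items()})
print("balanced:", {arm: float(rate) for arm, rate in balanced.items()})
```

With the observed weights both averages are exactly 1/2. The `balanced` entries use one common male share for both arms, which is the hypothetical balanced design just mentioned, and there the treatment average is strictly larger than the control average.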
The weights in each line are then identical, that is, \(p(\r{M}\mid \r{T}) = p(\r{M}\mid \neg \r{T})\), and associations in subpopulations are preserved for the aggregate data (Theorem 1 in Section 2.2). In fact, the absence of such a correlation rules out Simpson’s Paradox.

In what follows, we interpret Simpson’s Paradox as a property of association between variables, expressed by conditional probabilities. This perspective is not uncontentious. For Spanos (2021), it amounts to a deductive take on the paradox, contrary to the original understanding of the paradox as a challenge for inductive, statistical learning. In Pearson’s and Yule’s original papers (and in applications involving linear regression models, see Section 5.1), the paradox i…