Creative Commons License
This pyAgrum notebook is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Authors: Aymen Merrouche and Pierre-Henri Wuillemin.

**Smoking**

This notebook follows the example from "The Book Of Why" (Pearl, 2018) chapter 5.

In [1]:
from IPython.display import display, Math, Latex, HTML

import pyAgrum as gum
import pyAgrum.lib.notebook as gnb
import pyAgrum.causal as csl
import pyAgrum.causal.notebook as cslnb
import os

In the 1950s, the strong association between smoking and lung cancer provoked a debate: does smoking cause lung cancer?

The corresponding causal diagram is the following:

In [2]:
sc = gum.fastBN("Smoking->Lung Cancer")
sc
Out[2]:
[Causal diagram: Smoking -> Lung Cancer]

Constitutional Hypothesis:

The tobacco industry and some skeptical statisticians advanced the theory that smokers are genetically different from nonsmokers. A smoking gene could be a confounder that would explain the observed association.

In [3]:
msc = csl.CausalModel(sc, [("Smoking Gene", ["Smoking","Lung Cancer"])])
cslnb.showCausalImpact(msc, "Lung Cancer", doing="Smoking",values={})
Causal Model: [Causal diagram: Smoking Gene -> Smoking; Smoking Gene -> Lung Cancer]
Explanation : No causal effect of X on Y, because they are d-separated (conditioning on the observed variables if any).
$$\begin{equation}P( Lung Cancer \mid \hookrightarrow\mkern-6.5muSmoking) = P\left(Lung Cancer\right)\end{equation}$$
Impact : $P( Lung Cancer \mid \hookrightarrow\mkern-6.5muSmoking)$

| Lung Cancer = 0 | Lung Cancer = 1 |
|---|---|
| 0.0836 | 0.9164 |

This constitutional hypothesis was untestable at the time, because the human genome could not yet be sequenced.
However, it was not plausible either: the observed association was far too strong to be explained by such a gene alone.

To explain this association, another hypothesis was that a smoking gene could be a confounder, but that there was still a direct causal effect of smoking on lung cancer:

In [4]:
msc = csl.CausalModel(sc, [("Smoking Gene", ["Smoking","Lung Cancer"])], True)
cslnb.showCausalImpact(msc, "Lung Cancer", doing="Smoking",values={})
Causal Model: [Causal diagram: Smoking -> Lung Cancer; Smoking Gene -> Smoking; Smoking Gene -> Lung Cancer]
Explanation : Hedge Error: G={'Lung Cancer', 'Smoking'}, G[S]={'Lung Cancer'}
Impact : $?$ (No result)
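
The hedge error means that, in this diagram, the causal effect cannot be identified from observational data alone. The following is a minimal sketch of why, in plain Python rather than pyAgrum. All numbers, both parameterizations, and the helper names are made-up assumptions for illustration. Two models compatible with the same diagram produce the same observed distribution over (Smoking, Lung Cancer) but disagree on the effect of intervening on Smoking.

```python
from math import isclose

# Both hypothetical models share the structure Gene -> Smoking, Gene -> Cancer,
# Smoking -> Cancer, with binary variables. Only the parameters differ.
# p_gene[g] = P(Gene=g); p_smoke[g][s] = P(Smoking=s | Gene=g);
# p_cancer[(s, g)] = P(Cancer=1 | Smoking=s, Gene=g).

# Model A (assumption): the gene is irrelevant, smoking alone raises the risk.
model_a = {
    "p_gene": [0.5, 0.5],
    "p_smoke": [[0.5, 0.5], [0.5, 0.5]],
    "p_cancer": {(s, g): [0.1, 0.4][s] for s in (0, 1) for g in (0, 1)},
}

# Model B (assumption): the gene determines smoking and alone raises the risk.
model_b = {
    "p_gene": [0.5, 0.5],
    "p_smoke": [[1.0, 0.0], [0.0, 1.0]],
    "p_cancer": {(s, g): [0.1, 0.4][g] for s in (0, 1) for g in (0, 1)},
}


def observed_joint(model, s, c):
    """P(Smoking=s, Cancer=c), with the unobserved gene summed out."""
    total = 0.0
    for g in (0, 1):
        p_c1 = model["p_cancer"][(s, g)]
        p_c = p_c1 if c == 1 else 1.0 - p_c1
        total += model["p_gene"][g] * model["p_smoke"][g][s] * p_c
    return total


def cancer_given_do_smoking(model, s):
    """P(Cancer=1 | do(Smoking=s)): the gene keeps its own distribution."""
    return sum(model["p_gene"][g] * model["p_cancer"][(s, g)] for g in (0, 1))


# Compare what an observer sees with what an intervention would do.
same_observations = all(
    isclose(observed_joint(model_a, s, c), observed_joint(model_b, s, c))
    for s in (0, 1) for c in (0, 1)
)
effect_a = cancer_given_do_smoking(model_a, 1) - cancer_given_do_smoking(model_a, 0)
effect_b = cancer_given_do_smoking(model_b, 1) - cancer_given_do_smoking(model_b, 0)
print(same_observations, effect_a, effect_b)
```

Under these assumptions the observed distributions coincide while the two interventional effects differ, so no formula that uses only observed quantities can return the right answer for both models.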

Front door criterion:

Let's now suppose that smoking causes cancer only through tar deposits, and that these deposits are fully due to the physical action of cigarettes. The causal diagram becomes:

In [5]:
sct = gum.fastBN("Smoking->Tar->Lung Cancer")
sct
Out[5]:
[Causal diagram: Smoking -> Tar; Tar -> Lung Cancer]
In [6]:
msct = csl.CausalModel(sct, [("Smoking Gene", ["Smoking","Lung Cancer"])], True)
cslnb.showCausalImpact(msct, "Lung Cancer", doing="Smoking",values={})
Causal Model: [Causal diagram: Smoking -> Tar; Tar -> Lung Cancer; Smoking Gene -> Smoking; Smoking Gene -> Lung Cancer]
Explanation : frontdoor ['Tar'] found.
$$\begin{equation}P( Lung Cancer \mid \hookrightarrow\mkern-6.5muSmoking) = \sum_{Tar}{P\left(Tar\mid Smoking\right) \cdot \left(\sum_{Smoking'}{P\left(Lung Cancer\mid Smoking',Tar\right) \cdot P\left(Smoking'\right)}\right)}\end{equation}$$
Impact : $P( Lung Cancer \mid \hookrightarrow\mkern-6.5muSmoking)$

| Smoking | Lung Cancer = 0 | Lung Cancer = 1 |
|---|---|---|
| 0 | 0.7112 | 0.2888 |
| 1 | 0.7362 | 0.2638 |

Even if the smoking gene is unobservable, we can assess the causal effect of Smoking on Lung Cancer using the front-door method. In this case, the front-door is: $$Smoking \rightarrow \color{red}{Tar} \rightarrow LungCancer$$ It consists of variables that we have observed:

  • We can measure the causal effect of $Smoking$ on $Tar$ because there is no open back-door path between the two: the path $Smoking \leftarrow SmokingGene \rightarrow LungCancer \leftarrow Tar$ is blocked by the collider $LungCancer$. Hence $$P(Tar \mid do(Smoking)) = P(Tar \mid Smoking)$$
In [7]:
formula, adj, exp = csl.causalImpact(msct, on="Tar", doing="Smoking", values={})
display(Math(formula.toLatex()))
$\displaystyle P( Tar \mid \hookrightarrow\mkern-6.5muSmoking) = P\left(Tar\mid Smoking\right)$
  • We can measure the causal effect of $Tar$ on $LungCancer$: we only need to adjust for $Smoking$ to block the back-door path $Tar \leftarrow Smoking \leftarrow SmokingGene \rightarrow LungCancer$. Hence $$P(LungCancer \mid do(Tar)) = \sum_{Smoking}{P(LungCancer \mid Tar, Smoking) \times P(Smoking)}$$
In [8]:
formula, adj, exp = csl.causalImpact(msct, on="Lung Cancer", doing="Tar", values={})
display(Math(formula.toLatex()))
$\displaystyle P( Lung Cancer \mid \hookrightarrow\mkern-6.5muTar) = \sum_{Smoking}{P\left(Lung Cancer\mid Smoking,Tar\right) \cdot P\left(Smoking\right)}$

We can now combine these two results to obtain the causal effect of $Smoking$ on $LungCancer$, reducing $P(LungCancer \mid do(Smoking))$ to quantities that we can observe: $$P(LungCancer \mid do(Smoking)) = \sum_{Tar}{\left(P(Tar \mid Smoking) \times \sum_{Smoking'}{P(LungCancer \mid Tar, Smoking') \times P(Smoking')}\right)}$$
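
The front-door formula can be checked numerically. The sketch below is plain Python, not pyAgrum, and every probability in it is a made-up assumption. It builds a full model that includes the smoking gene, hides the gene to obtain the observed joint distribution, and compares the front-door estimate computed from that observed joint with the interventional distribution computed from the full model.

```python
from itertools import product

# Hypothetical parameters for binary variables (not taken from the notebook).
P_GENE = [0.7, 0.3]                      # P(Gene=g)
P_SMOKE = [[0.8, 0.2], [0.3, 0.7]]       # P(Smoking=s | Gene=g), indexed [g][s]
P_TAR = [[0.95, 0.05], [0.1, 0.9]]       # P(Tar=t | Smoking=s), indexed [s][t]
P_CANCER = {(0, 0): 0.05, (0, 1): 0.2,   # P(Cancer=1 | Tar=t, Gene=g), keyed (t, g)
            (1, 0): 0.3, (1, 1): 0.6}


def p_cancer(c, t, g):
    """P(Cancer=c | Tar=t, Gene=g)."""
    return P_CANCER[(t, g)] if c == 1 else 1.0 - P_CANCER[(t, g)]


# Observed joint P(Smoking, Tar, Cancer): the gene is summed out, as if never measured.
OBSERVED = {
    (s, t, c): sum(P_GENE[g] * P_SMOKE[g][s] * P_TAR[s][t] * p_cancer(c, t, g)
                   for g in (0, 1))
    for s, t, c in product((0, 1), repeat=3)
}


def marginal(**fixed):
    """Sum the observed joint over every variable not fixed by keyword (s, t, c)."""
    return sum(p for (s, t, c), p in OBSERVED.items()
               if all({"s": s, "t": t, "c": c}[k] == v for k, v in fixed.items()))


def front_door(c, s):
    """Front-door estimate of P(Cancer=c | do(Smoking=s)) from observed data only."""
    total = 0.0
    for t in (0, 1):
        p_t_given_s = marginal(s=s, t=t) / marginal(s=s)
        inner = sum(marginal(s=s2, t=t, c=c) / marginal(s=s2, t=t) * marginal(s=s2)
                    for s2 in (0, 1))
        total += p_t_given_s * inner
    return total


def ground_truth(c, s):
    """P(Cancer=c | do(Smoking=s)) computed from the full model, gene included."""
    return sum(P_TAR[s][t] * P_GENE[g] * p_cancer(c, t, g)
               for t in (0, 1) for g in (0, 1))


def naive(c, s):
    """Plain conditional P(Cancer=c | Smoking=s), which the gene confounds."""
    return marginal(s=s, c=c) / marginal(s=s)


for s in (0, 1):
    print(s, front_door(1, s), ground_truth(1, s), naive(1, s))
```

With these assumed parameters, the front-door estimate matches the interventional value up to floating-point error, while the plain conditional probability does not.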

Birth-weight paradox:

Studies have shown that babies of smoking mothers tend to weigh less than average. Other studies have shown that low-birth-weight babies have a higher mortality rate than normal-birth-weight babies. The corresponding causal diagram is the following:

In [9]:
bwp = gum.fastBN("Smoking->Low Birth Weight->Mortality")
bwp
Out[9]:
[Causal diagram: Smoking -> Low Birth Weight; Low Birth Weight -> Mortality]
In [10]:
# Causal effect of Smoking on neo-natal mortality
bwpModele = csl.CausalModel(bwp)
cslnb.showCausalImpact(bwpModele, "Mortality", doing="Smoking",values={})
Causal Model: [Causal diagram: Smoking -> Low Birth Weight; Low Birth Weight -> Mortality]
Explanation : frontdoor ['Low Birth Weight'] found.
$$\begin{equation}P( Mortality \mid \hookrightarrow\mkern-6.5muSmoking) = \sum_{Low Birth Weight}{P\left(Low Birth Weight\mid Smoking\right) \cdot \left(\sum_{Smoking'}{P\left(Mortality\mid Low Birth Weight\right) \cdot P\left(Smoking'\right)}\right)}\end{equation}$$
Impact : $P( Mortality \mid \hookrightarrow\mkern-6.5muSmoking)$

| Smoking | Mortality = 0 | Mortality = 1 |
|---|---|---|
| 0 | 0.3411 | 0.6589 |
| 1 | 0.3593 | 0.6407 |

However, the data also showed that low-birth-weight babies of smoking mothers had lower mortality rates than low-birth-weight babies of non-smoking mothers.
An explanation for this paradoxical situation is that low birth weight is due either to a smoking mother or to a birth defect, and the latter is much more threatening to the baby's health. The causal diagram becomes:

In [11]:
bwpe = gum.fastBN("Smoking->Low Birth Weight->Mortality<-Smoking;Birth defect->Low Birth Weight;Mortality<-Birth defect")
bwpe
Out[11]:
[Causal diagram: Smoking -> Low Birth Weight; Smoking -> Mortality; Low Birth Weight -> Mortality; Birth defect -> Low Birth Weight; Birth defect -> Mortality]

This causal diagram makes the source of the paradox easy to pinpoint: collider bias. "Low Birth Weight" is a collider! The data only concerned low-birth-weight babies, which amounts to conditioning on "Low Birth Weight". Knowing that the mother doesn't smoke then increases our belief that a birth defect caused the low birth weight, and a birth defect is more threatening to the baby's health. Conditioning on the collider opens a formerly blocked non-causal path and lets information flow from Smoking to Mortality ($Smoking \rightarrow LowBirthWeight \leftarrow BirthDefect \rightarrow Mortality$), introducing a bias.
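
The collider effect described here can be reproduced with a small numerical sketch. It is plain Python, not pyAgrum, and all probabilities and function names are made-up assumptions chosen only to follow the structure of the diagram. Smoking raises mortality overall, yet among low-birth-weight babies the smokers' babies fare better, because low birth weight without smoking points to a birth defect.

```python
# Hypothetical parameters for binary variables (not taken from the notebook).
# Smoking and Birth defect are independent root causes, as in the diagram.
P_DEFECT = [0.95, 0.05]   # P(Birth defect=d)
# P(Low Birth Weight=1 | Smoking=s, Birth defect=d), keyed (s, d)
P_LBW = {(0, 0): 0.02, (1, 0): 0.30, (0, 1): 0.90, (1, 1): 0.95}


def p_death(s, lbw, d):
    """P(Mortality=1 | s, lbw, d): smoking and low weight add a little, a defect a lot."""
    return 0.01 + 0.01 * s + 0.02 * lbw + 0.40 * d


def p_lbw(lbw, s, d):
    """P(Low Birth Weight=lbw | Smoking=s, Birth defect=d)."""
    return P_LBW[(s, d)] if lbw == 1 else 1.0 - P_LBW[(s, d)]


def mortality_given_smoking(s):
    """P(Mortality=1 | do(Smoking=s)); it equals P(Mortality=1 | Smoking=s) here
    because Smoking has no parents in this diagram."""
    return sum(P_DEFECT[d] * p_lbw(lbw, s, d) * p_death(s, lbw, d)
               for d in (0, 1) for lbw in (0, 1))


def mortality_among_low_weight(s):
    """P(Mortality=1 | Smoking=s, Low Birth Weight=1): conditioning on the collider."""
    # P(d | s, lbw=1) is proportional to P(d) * P(lbw=1 | s, d).
    weights = [P_DEFECT[d] * p_lbw(1, s, d) for d in (0, 1)]
    return sum(w * p_death(s, 1, d) for d, w in enumerate(weights)) / sum(weights)


print("whole population:", mortality_given_smoking(0), mortality_given_smoking(1))
print("low birth weight:", mortality_among_low_weight(0), mortality_among_low_weight(1))
```

Under these assumptions the first line shows a higher mortality for smokers' babies, while the second line, restricted to low-birth-weight babies, shows the reverse ordering.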

In [12]:
bwpeModele = csl.CausalModel(bwpe)
cslnb.showCausalImpact(bwpeModele, "Mortality", doing="Smoking",values={})
Causal Model: [Causal diagram: Smoking -> Low Birth Weight; Smoking -> Mortality; Low Birth Weight -> Mortality; Birth defect -> Low Birth Weight; Birth defect -> Mortality]
Explanation : backdoor ['Birth defect'] found.
$$\begin{equation}P( Mortality \mid \hookrightarrow\mkern-6.5muSmoking) = \sum_{Birth defect}{P\left(Mortality\mid Birth defect,Smoking\right) \cdot P\left(Birth defect\right)}\end{equation}$$
Impact : $P( Mortality \mid \hookrightarrow\mkern-6.5muSmoking)$

| Smoking | Mortality = 0 | Mortality = 1 |
|---|---|---|
| 0 | 0.5381 | 0.4619 |
| 1 | 0.4633 | 0.5367 |