Causality_Tobacco
pyAgrum 0.15.2   
generation: 2019-07-22 10:34  

This pyAgrum notebook is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Correlation between Smoking and Cancer

This notebook follows the famous example from Causality (Pearl, 2009).

A correlation has been observed between Smoking and Cancer. It is represented by the following Bayesian network:

In [1]:
from IPython.display import display, Math, Latex, HTML

import pyAgrum as gum
import pyAgrum.lib.notebook as gnb
import pyAgrum.causal as csl
import pyAgrum.causal.notebook as cslnb

# observational model: a two-node Bayesian network with binary variables
obs1 = gum.fastBN("Smoking->Cancer")

# P(Smoking)
obs1.cpt("Smoking")[:] = [0.6, 0.4]
# P(Cancer | Smoking), one assignment per value of Smoking
obs1.cpt("Cancer")[{"Smoking": 0}] = [0.9, 0.1]
obs1.cpt("Cancer")[{"Smoking": 1}] = [0.7, 0.3]

# show the graph, the joint P(Smoking, Cancer) and the two CPTs side by side
gnb.sideBySide(obs1, obs1.cpt("Smoking") * obs1.cpt("Cancer"), obs1.cpt("Smoking"), obs1.cpt("Cancer"),
               captions=["the BN", "the joint distribution", "the marginal for $smoking$", "the CPT for $cancer$"])
the BN
[graph: Smoking -> Cancer]

the joint distribution
         Smoking
Cancer   0        1
0        0.5400   0.2800
1        0.0600   0.1200

the marginal for $smoking$
Smoking  0        1
         0.6000   0.4000

the CPT for $cancer$
         Cancer
Smoking  0        1
0        0.9000   0.1000
1        0.7000   0.3000

Direct causality between Smoking and Cancer

The very strong observed correlation between smoking and lung cancer suggests a causal relationship, as the Surgeon General asserted in 1964. The proposed model is therefore as follows:

In [2]:
# the Bayesian network itself is taken as the causal model (no latent variable)
modele1 = csl.CausalModel(obs1)

# causal impact of do(Smoking=1) on Cancer
cslnb.showCausalImpact(modele1, "Cancer", "Smoking", values={"Smoking": 1})
Causal Model
[graph: Smoking -> Cancer]

Explanation : Do-calculus computations
$$P( Cancer \mid \hookrightarrow Smoking) = P\left(Cancer\mid Smoking\right)$$

Impact : $P( Cancer \mid \hookrightarrow Smoking=1)$
Cancer   0        1
         0.7000   0.3000

Latent confounder between Smoking and Cancer

The tobacco industry strongly contested this model and answered with a different one, in which Smoking and Cancer are both caused by a common factor, the Genotype (or some other latent variable):

In [3]:
# a latent variable (Genotype) is a common cause of Smoking and Cancer in the causal model;
# as the graph in the output shows, the arc Smoking->Cancer of obs1 is not kept here
modele2 = csl.CausalModel(obs1, [("Genotype", ["Smoking", "Cancer"])])

cslnb.showCausalImpact(modele2, "Cancer", "Smoking", values={"Smoking": 1})
Causal Model
[graph: Genotype -> Smoking, Genotype -> Cancer; Genotype is latent]

Explanation : No causal effect of X on Y, because they are d-separated (conditioning on the observed variables if any).
$$P( Cancer \mid \hookrightarrow Smoking) = P\left(Cancer\right)$$

Impact : $P( Cancer \mid \hookrightarrow Smoking=1)$
Cancer   0        1
         0.8200   0.1800
In [4]:
# just check P(Cancer) in the bn `obs1`
(obs1.cpt("Smoking")*obs1.cpt("Cancer")).margSumIn(["Cancer"])
Out[4]:
Cancer   0        1
         0.8200   0.1800
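
The same marginal can be checked by hand, without pyAgrum. The sketch below is plain Python; the variable names and the dictionary layout are illustrative choices, not part of the notebook. It multiplies the two CPTs of `obs1` into the joint distribution and then sums Smoking out.

```python
# Plain-Python check of P(Cancer) for the network `obs1` (no pyAgrum needed).
# The numbers are the CPT entries set in cell In [1]; the names are illustrative.
p_smoking = [0.6, 0.4]                                   # P(Smoking)
p_cancer_given_smoking = {0: [0.9, 0.1], 1: [0.7, 0.3]}  # P(Cancer | Smoking)

# Joint distribution: P(Smoking=s, Cancer=c) = P(s) * P(c | s)
joint = {
    (s, c): p_smoking[s] * p_cancer_given_smoking[s][c]
    for s in (0, 1)
    for c in (0, 1)
}

# Marginal: sum the joint over Smoking
p_cancer = [sum(joint[(s, c)] for s in (0, 1)) for c in (0, 1)]

print([round(p, 4) for p in p_cancer])  # [0.82, 0.18]
```

This is the distribution that `modele2` returns for the intervention: if Smoking has no causal effect on Cancer, forcing Smoking=1 leaves Cancer at its marginal.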

Confounder and direct causality

In a diplomatic effort, both parties agree that there must be some truth in both models:

In [5]:
# a latent variable exists between Smoking and Cancer, and the direct causal relation
# Smoking->Cancer is kept as well (third argument set to True)
modele3 = csl.CausalModel(obs1, [("Genotype", ["Smoking", "Cancer"])], True)

cslnb.showCausalImpact(modele3, "Cancer", "Smoking", values={"Smoking": 1})
Causal Model
[graph: Smoking -> Cancer, Genotype -> Smoking, Genotype -> Cancer; Genotype is latent]

Explanation : Hedge Error: G={'Smoking', 'Cancer'},G[S]={'Cancer'}
?

Impact : $?$
No result

The causal effect of Smoking on Cancer is no longer computable in such a model: the observations alone cannot separate the impact of the direct cause from that of the latent confounder.
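
To see why the observations are not enough, here is a minimal sketch in plain Python. The two parametrisations `model_a` and `model_b`, and the binary Genotype with the probabilities given to it, are hypothetical: the notebook never assigns numbers to the latent variable. Both fit the graph with the confounder and the direct arc, both reproduce the joint distribution of `obs1`, and yet they give different answers for the intervention.

```python
# Two hypothetical parametrisations of the graph
#   Genotype -> Smoking, Genotype -> Cancer, Smoking -> Cancer
# that agree on the observed P(Smoking, Cancer) but not on P(Cancer | do(Smoking=1)).

def observed_joint(p_g, p_s_given_g, p_c_given_gs):
    """P(Smoking, Cancer), obtained by summing the latent Genotype out."""
    return {
        (s, c): sum(
            p_g[g] * p_s_given_g[g][s] * p_c_given_gs[(g, s)][c] for g in (0, 1)
        )
        for s in (0, 1)
        for c in (0, 1)
    }

def do_smoking(p_g, p_c_given_gs, s):
    """P(Cancer | do(Smoking=s)): cut Genotype -> Smoking, then average over Genotype."""
    return [sum(p_g[g] * p_c_given_gs[(g, s)][c] for g in (0, 1)) for c in (0, 1)]

rows = [[0.9, 0.1], [0.7, 0.3]]  # the two rows of P(Cancer | Smoking) in obs1

# Model A: Genotype is irrelevant, Smoking alone drives Cancer.
model_a = dict(
    p_g=[0.5, 0.5],
    p_s_given_g={0: [0.6, 0.4], 1: [0.6, 0.4]},
    p_c_given_gs={(g, s): rows[s] for g in (0, 1) for s in (0, 1)},
)

# Model B: Genotype fully determines Smoking, and Cancer depends on Genotype only.
model_b = dict(
    p_g=[0.6, 0.4],
    p_s_given_g={0: [1.0, 0.0], 1: [0.0, 1.0]},
    p_c_given_gs={(g, s): rows[g] for g in (0, 1) for s in (0, 1)},
)

for name, m in (("A", model_a), ("B", model_b)):
    joint = observed_joint(m["p_g"], m["p_s_given_g"], m["p_c_given_gs"])
    effect = do_smoking(m["p_g"], m["p_c_given_gs"], s=1)
    print(name, {k: round(v, 4) for k, v in joint.items()}, [round(p, 4) for p in effect])
```

Both models print the same joint (0.54, 0.06, 0.28, 0.12), but model A gives P(Cancer=1 | do(Smoking=1)) = 0.3 and model B gives 0.18. Since no amount of observational data can tell them apart, no formula over P(Smoking, Cancer) can return the causal effect.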

An intermediate observed variable

We introduce an auxiliary factor between Smoking and Cancer: tobacco causes cancer because of the tar deposits in the lungs.

In [6]:
obs2 = gum.fastBN("Smoking->Tar->Cancer;Smoking->Cancer")

obs2.cpt("Smoking")[:] = [0.6, 0.4]
obs2.cpt("Tar")[{"Smoking": 0}] = [0.9, 0.1]
obs2.cpt("Tar")[{"Smoking": 1}] = [0.7, 0.3]
obs2.cpt("Cancer")[{"Tar": 0, "Smoking": 0}] = [0.9, 0.1]
obs2.cpt("Cancer")[{"Tar": 1, "Smoking": 0}] = [0.8, 0.2]
obs2.cpt("Cancer")[{"Tar": 0, "Smoking": 1}] = [0.7, 0.3]
obs2.cpt("Cancer")[{"Tar": 1, "Smoking": 1}] = [0.6, 0.4]

gnb.sideBySide(obs2,obs2.cpt("Smoking"),obs2.cpt("Tar"),obs2.cpt("Cancer"),
               captions=["","$P(Smoking)$","$P(Tar|Smoking)$","$P(Cancer|Tar,Smoking)$"])
[graph: Smoking -> Tar, Smoking -> Cancer, Tar -> Cancer]

$P(Smoking)$
Smoking  0        1
         0.6000   0.4000

$P(Tar|Smoking)$
         Tar
Smoking  0        1
0        0.9000   0.1000
1        0.7000   0.3000

$P(Cancer|Tar,Smoking)$
              Cancer
Smoking  Tar  0        1
0        0    0.9000   0.1000
0        1    0.8000   0.2000
1        0    0.7000   0.3000
1        1    0.6000   0.4000
In [7]:
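# same latent confounder as before, now on top of obs2;
# as the graph in the output shows, the direct arc Smoking->Cancer is not kept
modele4 = csl.CausalModel(obs2, [("Genotype", ["Smoking", "Cancer"])])

cslnb.showCausalImpact(modele4, "Cancer", "Smoking", values={"Smoking": 1})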
Causal Model
[graph: Smoking -> Tar, Tar -> Cancer, Genotype -> Smoking, Genotype -> Cancer; Genotype is latent]

Explanation : frontdoor ['Tar'] found.
$$P( Cancer \mid \hookrightarrow Smoking) = \sum_{Tar}{P\left(Tar\mid Smoking\right) \cdot \left(\sum_{Smoking'}{P\left(Cancer\mid Smoking',Tar\right) \cdot P\left(Smoking'\right)}\right)}$$

Impact : $P( Cancer \mid \hookrightarrow Smoking=1)$
Cancer   0        1
         0.7900   0.2100

In this model we can again compute the causal impact of Smoking on Cancer, because Tar satisfies the front-door criterion relative to the pair (Smoking, Cancer).

In [8]:
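# just check P(Cancer|do(Smoking)) by applying the front-door formula to the CPTs of the bn `obs2`
# inner sum over Smoking': a potential over (Cancer, Tar)
inner = (obs2.cpt("Cancer") * obs2.cpt("Smoking")).margSumOut(["Smoking"])
# outer sum over Tar, weighted by P(Tar|Smoking)
(inner * obs2.cpt("Tar")).margSumOut(["Tar"]).putFirst("Cancer")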
Out[8]:
         Cancer
Smoking  0        1
0        0.8100   0.1900
1        0.7900   0.2100
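
The front-door formula reported for `modele4` can also be unrolled term by term. The following is a minimal sketch in plain Python, with illustrative names and a hypothetical helper `front_door`; it uses only the CPT values set for `obs2` in cell In [6].

```python
# Front-door adjustment computed by hand from the CPTs of `obs2` (cell In [6]).
p_smoking = [0.6, 0.4]                          # P(Smoking)
p_tar_given_smoking = {0: [0.9, 0.1], 1: [0.7, 0.3]}  # P(Tar | Smoking), keyed by Smoking
p_cancer_given = {                               # P(Cancer | Tar, Smoking), keyed by (Tar, Smoking)
    (0, 0): [0.9, 0.1],
    (1, 0): [0.8, 0.2],
    (0, 1): [0.7, 0.3],
    (1, 1): [0.6, 0.4],
}

def front_door(s):
    """P(Cancer | do(Smoking=s)) = sum_t P(t | s) * sum_s' P(Cancer | s', t) * P(s')."""
    result = []
    for c in (0, 1):
        total = 0.0
        for t in (0, 1):
            # inner sum over Smoking', the "primed" copy of Smoking in the formula
            inner = sum(p_cancer_given[(t, s2)][c] * p_smoking[s2] for s2 in (0, 1))
            total += p_tar_given_smoking[s][t] * inner
        result.append(total)
    return result

for s in (0, 1):
    print(s, [round(p, 4) for p in front_door(s)])
# 0 [0.81, 0.19]
# 1 [0.79, 0.21]
```

For Smoking=1 the inner sums are (0.82, 0.18) for Tar=0 and (0.72, 0.28) for Tar=1, so the result is 0.7 × (0.82, 0.18) + 0.3 × (0.72, 0.28) = (0.79, 0.21), which matches both the pyAgrum output and the check above.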

Other causal impacts for this last model

In [9]:
# impact of do(Cancer=1) on Smoking, knowing Tar=1
cslnb.showCausalImpact(modele4, "Smoking", doing="Cancer", knowing={"Tar"}, values={"Cancer": 1, "Tar": 1})
Causal Model
[graph: Smoking -> Tar, Tar -> Cancer, Genotype -> Smoking, Genotype -> Cancer; Genotype is latent]

Explanation : No causal effect of X on Y, because they are d-separated (conditioning on the observed variables if any).
$$P( Smoking \mid \hookrightarrow Cancer, Tar) = P\left(Smoking\mid Tar\right)$$

Impact : $P( Smoking \mid \hookrightarrow Cancer=1, Tar=1)$
Smoking  0        1
         0.3333   0.6667
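
Since the intervention on Cancer drops out, the result is the ordinary conditional $P(Smoking \mid Tar=1)$, which follows from Bayes' rule. A minimal plain-Python sketch, with illustrative variable names and the CPT values of `obs2`:

```python
# P(Smoking | Tar=1) by Bayes' rule, from the CPTs of `obs2` (cell In [6]).
p_smoking = [0.6, 0.4]                                # P(Smoking)
p_tar_given_smoking = {0: [0.9, 0.1], 1: [0.7, 0.3]}  # P(Tar | Smoking), keyed by Smoking

tar = 1
# numerator of Bayes' rule: P(Smoking=s) * P(Tar=tar | Smoking=s)
unnormalised = [p_smoking[s] * p_tar_given_smoking[s][tar] for s in (0, 1)]
# denominator: P(Tar=tar)
evidence = sum(unnormalised)
posterior = [u / evidence for u in unnormalised]

print([round(p, 4) for p in posterior])  # [0.3333, 0.6667]
```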
In [10]:
# impact of do(Cancer=1) on Smoking, without any observation
cslnb.showCausalImpact(modele4, "Smoking", doing="Cancer", values={"Cancer": 1})
Causal Model
[graph: Smoking -> Tar, Tar -> Cancer, Genotype -> Smoking, Genotype -> Cancer; Genotype is latent]

Explanation : Do-calculus computations
$$P( Smoking \mid \hookrightarrow Cancer) = P\left(Smoking\right)$$

Impact : $P( Smoking \mid \hookrightarrow Cancer=1)$
Smoking  0        1
         0.6000   0.4000
In [11]:
# impact of the joint intervention do(Cancer=1, Tar=1) on Smoking
cslnb.showCausalImpact(modele4, "Smoking", doing={"Cancer", "Tar"}, values={"Cancer": 1, "Tar": 1})
Causal Model
[graph: Smoking -> Tar, Tar -> Cancer, Genotype -> Smoking, Genotype -> Cancer; Genotype is latent]

Explanation : Do-calculus computations
$$P( Smoking \mid \hookrightarrow Tar, \hookrightarrow Cancer) = P\left(Smoking\right)$$

Impact : $P( Smoking \mid \hookrightarrow Tar=1, \hookrightarrow Cancer=1)$
Smoking  0        1
         0.6000   0.4000
In [12]:
# impact of the joint intervention do(Cancer=1, Smoking=1) on Tar
cslnb.showCausalImpact(modele4, "Tar", doing={"Cancer", "Smoking"}, values={"Cancer": 1, "Smoking": 1})
Causal Model
[graph: Smoking -> Tar, Tar -> Cancer, Genotype -> Smoking, Genotype -> Cancer; Genotype is latent]

Explanation : Do-calculus computations
$$P( Tar \mid \hookrightarrow Smoking, \hookrightarrow Cancer) = P\left(Tar\mid Smoking\right)$$

Impact : $P( Tar \mid \hookrightarrow Smoking=1, \hookrightarrow Cancer=1)$
Tar      0        1
         0.7000   0.3000