Creative Commons License
This pyAgrum notebook is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Authors: Aymen Merrouche and Pierre-Henri Wuillemin.

**The Effect of Education and Experience on Salary**

This notebook follows the example from "The Book of Why" (Pearl, 2018), Chapter 8, page 251.

Counterfactuals

In [1]:
from IPython.display import display, Math, Latex,HTML

import pyAgrum as gum
import pyAgrum.lib.notebook as gnb
import pyAgrum.causal as csl
import pyAgrum.causal.notebook as cslnb
import os
import math
import numpy as np
import scipy.stats

In this example we are interested in the effect of experience and education on the salary of an employee. We have the following data:

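| Employee | $EX(u)$ | $ED(u)$ | $S_{0}(u)$ | $S_{1}(u)$ | $S_{2}(u)$ |
|---|---|---|---|---|---|
| Alice | 8 | 0 | 86,000 | ? | ? |
| Bert | 9 | 1 | ? | 92,500 | ? |
| Caroline | 9 | 2 | ? | ? | 97,000 |
| David | 8 | 1 | ? | 91,000 | ? |
| Ernest | 12 | 1 | ? | 100,000 | ? |
| Frances | 13 | 0 | 97,000 | ? | ? |
| etc. | | | | | |
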
  • $EX(u)$: years of experience of employee $u$, between 0 and 20.
  • $ED(u)$: level of education of employee $u$, between 0 and 2 (0: high school degree (low), 1: college degree (medium), 2: graduate degree (high)).
  • $S_{i}(u)$, between 65k and 150k:
    • the observable salary of employee $u$ if $i = ED(u)$,
    • an unobservable potential outcome if $i \neq ED(u)$: the salary employee $u$ would have with education level $i$.
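
As a minimal sketch, the table can be held in a plain Python dictionary where `None` stands for a potential outcome that was not observed. The names `employees`, `observed_salary` and `missing_outcomes` are illustrative only and are not part of pyAgrum.

```python
# Salaries are in dollars; None marks an unobserved potential outcome.
# "S" lists S_0(u), S_1(u), S_2(u) for each employee u.
employees = {
    "Alice":    {"EX": 8,  "ED": 0, "S": [86_000, None, None]},
    "Bert":     {"EX": 9,  "ED": 1, "S": [None, 92_500, None]},
    "Caroline": {"EX": 9,  "ED": 2, "S": [None, None, 97_000]},
    "David":    {"EX": 8,  "ED": 1, "S": [None, 91_000, None]},
    "Ernest":   {"EX": 12, "ED": 1, "S": [None, 100_000, None]},
    "Frances":  {"EX": 13, "ED": 0, "S": [97_000, None, None]},
}


def observed_salary(name):
    """Return S_i(u) for i = ED(u), the only salary that is observed."""
    row = employees[name]
    return row["S"][row["ED"]]


def missing_outcomes(name):
    """Return the education levels i whose potential outcome S_i(u) is unknown."""
    row = employees[name]
    return [i for i, s in enumerate(row["S"]) if s is None]


print(observed_salary("Alice"))   # 86000
print(missing_outcomes("Alice"))  # [1, 2]
```

Each employee has exactly one observed entry; a counterfactual query asks for one of the `None` cells.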

Given only this data, we want to answer the counterfactual question: what would Alice's salary be if she had attended college (i.e. what is $S_{1}(Alice)$)?

We create the causal diagram

As in BoW-c8p251-educationAndExperience.ipynb, we create a BN for this problem. However, here we want to take into account some imprecision in the equations (salaries are expressed in thousands of dollars):

$$Ex = 10 - 4 \times Ed + U_x$$

$$S = 65 + 2.5 \times Ex + 5 \times Ed + U_s$$
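
To get a feel for the two equations, here is a small sketch that evaluates them without any imprecision. Setting both noise terms to 0 is a hypothetical baseline chosen for illustration; it is not the prior used later in the notebook, and the function names are not pyAgrum calls.

```python
def experience(education, ux=0.0):
    """Ex = 10 - 4 * Ed + Ux (years)."""
    return 10 - 4 * education + ux


def salary(exp, education, us=0.0):
    """S = 65 + 2.5 * Ex + 5 * Ed + Us (thousands of dollars)."""
    return 65 + 2.5 * exp + 5 * education + us


# Baseline (Ux = Us = 0) for each education level: (experience, salary)
baseline = {}
for ed in (0, 1, 2):
    ex = experience(ed)
    baseline[ed] = (ex, salary(ex, ed))

print(baseline)  # {0: (10.0, 90.0), 1: (6.0, 85.0), 2: (2.0, 80.0)}
```

In this baseline each extra education level adds 5 to the salary directly but removes 4 years of experience, worth 10, so the net effect on salary is negative.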

In [2]:
# Model for the imprecisions in the equations
x_min = 0.0
x_max = 4.0

mean = 2.0
std = 0.65

x = np.linspace(x_min, x_max, 5)
y = scipy.stats.norm.pdf(x,mean,std)
print("We'll use the following distribution to model imprecision \n",y)
imprecision=list(y)
We'll use the following distribution to model imprecision 
 [0.00539715 0.18794845 0.61375735 0.18794845 0.00539715]
In [3]:
edex = gum.fastBN("Ux[-2,10]->experience[0,20]<-education{low|medium|high}->salary[65,150]<-Us[0,25];experience->salary")
# no prior information about the individual (datapoint)
edex.cpt("Us").fillWith(1).normalize()
edex.cpt("Ux").fillWith(1).normalize()
# prior over education level (assumed)
edex.cpt("education")[:] = [0.4, 0.4, 0.2]
edex.cpt("experience").fillWithFunction("10-4*education+Ux",noise=imprecision)
edex.cpt("salary").fillWithFunction("round(65+2.5*experience+5*education+Us)",noise=imprecision)

edex.cpt("experience")
Out[3]:
P(experience | education, Ux); columns are the values of experience (0 to 20).

education  Ux       0      1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16     17     18     19     20
low        -2  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
low        -1  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
low         0  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
low         1  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
low         2  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
low         3  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000
low         4  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000
low         5  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000
low         6  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000
low         7  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000
low         8  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054
low         9  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1889 0.6168 0.1889
low        10  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0067 0.2329 0.7604
medium     -2  0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
medium     -1  0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
medium      0  0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
medium      1  0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
medium      2  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
medium      3  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
medium      4  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
medium      5  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
medium      6  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
medium      7  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000
medium      8  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000
medium      9  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000
medium     10  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000
high       -2  0.7604 0.2329 0.0067 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
high       -1  0.1889 0.6168 0.1889 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
high        0  0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
high        1  0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
high        2  0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
high        3  0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
high        4  0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
high        5  0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
high        6  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
high        7  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
high        8  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
high        9  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
high       10  0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0054 0.1879 0.6135 0.1879 0.0054 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
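
The numbers in the table above are consistent with the five noise weights being centred on the value given by the equation, with any weight that falls outside [0,20] dropped and the rest rescaled to sum to 1. This reading is inferred from the printed values, not from the pyAgrum documentation; the sketch below uses only the standard library, and `noisy_row` is a hypothetical helper.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution, written out to avoid extra dependencies."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Same five weights as the notebook: pdf at 0..4 with mean 2.0 and std 0.65.
# They play the role of offsets -2..+2 around the value given by the equation.
weights = [normal_pdf(x, 2.0, 0.65) for x in range(5)]


def noisy_row(center, lo=0, hi=20):
    """Spread the weights around `center`, drop out-of-range mass, renormalize."""
    row = [0.0] * (hi - lo + 1)
    for offset, w in zip(range(-2, 3), weights):
        value = center + offset
        if lo <= value <= hi:
            row[value - lo] += w
    total = sum(row)
    return [p / total for p in row]


interior = noisy_row(8)   # education=low, Ux=-2: 10 - 4*0 - 2 = 8
edge = noisy_row(20)      # education=low, Ux=10: 10 - 4*0 + 10 = 20

print([round(p, 4) for p in interior[6:11]])
print([round(p, 4) for p in edge[18:21]])
```

The two printed lists can be compared with the non-zero entries of the rows (low, -2) and (low, 10).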
In [4]:
gnb.showInference(edex)
[Figure: the Bayesian network with the marginal of each variable after inference (1.44 ms). Arcs: Ux → experience, education → experience, education → salary, experience → salary, Us → salary.]

Counterfactual in pyAgrum

In [5]:
pot=csl.counterfactual(cm = csl.CausalModel(edex), 
                       profile = {'experience':8, 'education': "low", 'salary' : "86"},
                       whatif={"education"},
                       on={"salary"}, 
                       values = {"education" : "medium"})
gnb.showProba(pot)
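
For comparison, the query above can be worked by hand with the noise-free equations, following the usual three steps of a counterfactual (abduction, action, prediction). This is a sketch of the deterministic reasoning, not of what `csl.counterfactual` does internally; the function names are illustrative.

```python
def experience(education, ux):
    """Ex = 10 - 4 * Ed + Ux (years)."""
    return 10 - 4 * education + ux


def salary(exp, education, us):
    """S = 65 + 2.5 * Ex + 5 * Ed + Us (thousands of dollars)."""
    return 65 + 2.5 * exp + 5 * education + us


# Step 1, abduction: recover Alice's own noise terms from her observed profile.
ex_obs, ed_obs, s_obs = 8, 0, 86
ux = ex_obs - (10 - 4 * ed_obs)                  # -2
us = s_obs - (65 + 2.5 * ex_obs + 5 * ed_obs)    # 1.0


def counterfactual_salary(new_education):
    # Step 2, action: set education to its new value, keeping ux and us.
    # Step 3, prediction: recompute experience, then salary.
    new_ex = experience(new_education, ux)
    return salary(new_ex, new_education, us)


outcomes = {ed: counterfactual_salary(ed) for ed in (0, 1, 2)}
print(outcomes)  # {0: 86.0, 1: 81.0, 2: 76.0}
```

Because the network adds imprecision to both equations, pyAgrum returns a distribution over salaries rather than a single number.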

If we omit the values argument, we get every potential outcome:

In [6]:
csl.counterfactual(cm = csl.CausalModel(edex), 
                   profile = {'experience':8, 'education': "low", 'salary' : "86"},
                   whatif={"education"},
                   on={"salary"}).putFirst("salary")
Out[6]:
Counterfactual distribution of salary for each value of education. The table is shown transposed (one row per salary value); every salary from 65 to 72 and from 96 to 150 has probability 0.0000 for all three education levels and is omitted.

salary     low  medium    high
73      0.0000  0.0000  0.0006
74      0.0000  0.0002  0.0242
75      0.0000  0.0010  0.1449
76      0.0000  0.0025  0.2811
77      0.0000  0.0228  0.1942
78      0.0000  0.0721  0.1858
79      0.0002  0.0469  0.0638
80      0.0011  0.1369  0.0522
81      0.0070  0.2448  0.0357
82      0.0183  0.1540  0.0131
83      0.0276  0.0752  0.0027
84      0.0912  0.1452  0.0014
85      0.1407  0.0558  0.0004
86      0.2317  0.0256  0.0000
87      0.1629  0.0128  0.0000
88      0.1637  0.0034  0.0000
89      0.0570  0.0006  0.0000
90      0.0476  0.0003  0.0000
91      0.0339  0.0001  0.0000
92      0.0127  0.0000  0.0000
93      0.0027  0.0000  0.0000
94      0.0014  0.0000  0.0000
95      0.0004  0.0000  0.0000
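
To read the table at a glance, one can summarize each row by its most probable salary and its mean. The sketch below copies the non-zero probabilities from the output above; the `table` layout and the `summarize` helper are illustrative and not part of pyAgrum.

```python
# For each education level: (first salary with non-zero probability, probabilities).
# Values are copied from the output above; salaries are in thousands of dollars.
table = {
    "low": (79, [0.0002, 0.0011, 0.0070, 0.0183, 0.0276, 0.0912, 0.1407, 0.2317,
                 0.1629, 0.1637, 0.0570, 0.0476, 0.0339, 0.0127, 0.0027, 0.0014,
                 0.0004]),
    "medium": (74, [0.0002, 0.0010, 0.0025, 0.0228, 0.0721, 0.0469, 0.1369, 0.2448,
                    0.1540, 0.0752, 0.1452, 0.0558, 0.0256, 0.0128, 0.0034, 0.0006,
                    0.0003, 0.0001]),
    "high": (73, [0.0006, 0.0242, 0.1449, 0.2811, 0.1942, 0.1858, 0.0638, 0.0522,
                  0.0357, 0.0131, 0.0027, 0.0014, 0.0004]),
}


def summarize(first_salary, probs):
    """Return (mode, mean) of a distribution listed from `first_salary` upward."""
    salaries = range(first_salary, first_salary + len(probs))
    total = sum(probs)  # close to 1, up to the rounding of the printed values
    mode = max(zip(probs, salaries))[1]
    mean = sum(s * p for s, p in zip(salaries, probs)) / total
    return mode, mean


summary = {level: summarize(*entry) for level, entry in table.items()}
for level, (mode, mean) in summary.items():
    print(f"{level:>6}: most probable salary = {mode}k, mean = {mean:.1f}k")
```

The most probable salaries are 86k, 81k and 76k for low, medium and high: each extra education level shifts the peak down by 5k, because it costs 4 years of experience.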

What would Alice's salary be if she had attended college and still had 8 years of experience?

In [7]:
pot=csl.counterfactual(cm = csl.CausalModel(edex), 
                       profile = {'experience':8, 'education': 'low', 'salary' : '86'},
                       whatif={"education", "experience"},
                       on={"salary"}, 
                       values = {"education" : 'medium', "experience" : 8})
In [8]:
gnb.showProba(pot)

If she had attended college and still had 8 years of experience, Alice's salary would be 91k.

In the previous query, Alice's salary if she had attended college was lower than her actual salary. That is because, in the counterfactual world where she attended college, she had less time to work, hence the lower salary.

In this query, Alice's counterfactual salary is higher than her actual salary (+5k, corresponding to one level of education). That is because, in this counterfactual world, Alice attended college and still had time to work for 8 years, so her salary went up.
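
Both results can be checked against the noise-free structural equations (salaries in thousands of dollars). Since Alice keeps the same noise terms $U_x$ and $U_s$ in the actual and counterfactual worlds, they cancel in the difference:

$$S_{1}(Alice) - S_{0}(Alice) = 2.5 \times \Delta Ex + 5 \times \Delta Ed$$

When only education is changed, $\Delta Ed = 1$ and experience follows its own equation, so $\Delta Ex = -4$:

$$S_{1}(Alice) = 86 + 2.5 \times (-4) + 5 \times 1 = 86 - 10 + 5 = 81$$

When education is changed and experience is held at 8 years, $\Delta Ed = 1$ and $\Delta Ex = 0$:

$$S_{1}(Alice) = 86 + 2.5 \times 0 + 5 \times 1 = 91$$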

If she had more experience:

Still no answer to this question.

In [9]:
pot=csl.counterfactual(cm = csl.CausalModel(edex), 
                       profile = {'experience':8, 'education': 'low', 'salary' : '86'},
                       whatif={"experience"},
                       on={"salary"}, 
                       values = {"experience" : 12})
In [10]:
gnb.showProba(pot)

Latent variable between $U_x$ and $experience$:

In [11]:
# Each latent variable is described as (<latent variable name>, <list of affected variables' ids>).
edexModeleWithOne = csl.CausalModel(edex, [("u1", ["Ux", "experience"])], False)
edexModeleWithOne
Out[11]:
[Figure: causal model with the latent variable u1 pointing to Ux and experience. Other arcs: education → experience, education → salary, experience → salary, Us → salary.]
In [12]:
pot = csl.counterfactual(cm = edexModeleWithOne, 
                         profile = {'experience':8, 'education': "low", 'salary' : "86"},
                         whatif={"education"},
                         on={"salary"}, 
                         values = {"education" : "medium"})
gnb.showProba(pot)

With one latent variable between $U_x$ and $experience$, we get \$96k, corresponding to one education level (we no longer need to worry about experience).
