Author: Aymen Merrouche and Pierre-Henri Wuillemin.

**The Effect of Education and Experience on Salary**

## Counterfactuals¶

In [1]:
from IPython.display import display, Math, Latex,HTML

import pyAgrum as gum
import pyAgrum.lib.notebook as gnb
import pyAgrum.causal as csl
import pyAgrum.causal.notebook as cslnb
import os
import math
import numpy as np
import scipy.stats


In this example we are interested in the effect of experience and education on the salary of an employee, we are in possession of the following data:

Employé EX(u) ED(u) $S_{0}(u)$ $S_{1}(u)$ $S_{2}(u)$
Alice 8 0 86,000 ? ?
Bert 9 1 ? 92,500 ?
Caroline 9 2 ? ? 97,000
David 8 1 ? 91,000 ?
Ernest 12 1 ? 100,000 ?
Frances 13 0 97,000 ? ?
etc
• $EX(u)$ : years of experience of employee $u$. [0,20]
• $ED(u)$ : Level of education of employee $u$ (0:high school degree (low), 1:college degree (medium), 2:graduate degree (high)) [0,2]
• $S_{i}(u)$ [65k,150k] :
• salary (observable) of employee $u$ if $i = ED(u)$,
• Potential outcome (unobservable) if $i \not = ED(u)$, salary of employee $u$ if he had a level of education of $i$.

We are left with the previous data and we want to answer the counterfactual question What would Alice's salary be if she attended college ? (i.e. $S_{1}(Alice)$)

### We create the causal diagram¶

In this model it is assumed that an employee's salary is determined by his level of education and his experience. Years of experience are also affected by the level of education. Having a higher level of education means spending more time studying hence less experience.

In [2]:
edex = gum.fastBN("Ux[-2,10]->experience[0,20]<-education{low|medium|high}->salary[65,150]<-Us[0,25];experience->salary")
edex

Out[2]:

However counterfactual queries are specific to one datapoint (in our case Alice), we need to add additional variables to our model to allow for individual variations:

• Us : unobserved variables that affect salary.[0,25k]
• Ux : unobserved variables that affect experience.[-2,10]
In [3]:
# no prior information about the individual (datapoint)
edex.cpt("Us").fillWith(1).normalize()
edex.cpt("Ux").fillWith(1).normalize()
# education level(supposed)
edex.cpt("education")[:] = [0.4, 0.4, 0.2]


Experience listens to Education and Ux : $$Ex = 10 -4 \times Ed + Ux$$

In [4]:
edex.cpt("experience").fillWithFunction("10-4*education+Ux")
edex.cpt("experience")

Out[4]:
experience
education
Ux
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
low
-2
0.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
-1
0.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
0
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
1
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
2
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.0000
3
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.0000
4
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.0000
5
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.0000
6
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.0000
7
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.0000
8
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.0000
9
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.0000
10
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.0000
medium
-2
0.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
-1
0.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
0
0.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
1
0.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
2
0.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
3
0.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
4
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
5
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
6
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.0000
7
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.0000
8
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.0000
9
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.0000
10
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.0000
high
-2
1.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
-1
0.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
0
0.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
1
0.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
2
0.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
3
0.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
4
0.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
5
0.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
6
0.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
7
0.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
8
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
9
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
10
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.0000

Salary listens to Education, Experience and Us : $$S = 65 + 2.5 \times Ex + 5 \times Ed + Us$$

In [24]:
edex.cpt("salary").fillWithFunction("round(65+2.5*experience+5*education+Us)");

In [25]:
gnb.showInference(edex,size="10")


## Step 1 : Abduction¶

Use the data to retrieve all the information that characterizes Alice

From the data we can retrieve Alice's profile :

• $Ed(Alice)$ : 0
• $Ex(Alice)$ : 8
• $S_{0}(Alice)$ : 86k

We will use Alice's profile to get $U_s$ and $U_x$, which tell Alice apart from the rest of the data.

In [26]:
ie=gum.LazyPropagation(edex)
ie.setEvidence({'experience':8, 'education': 'low', 'salary' : "86"})
ie.makeInference()
newUs = ie.posterior("Us")
newUs

Out[26]:
Us
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
0.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
In [27]:
ie=gum.LazyPropagation(edex)
ie.setEvidence({'experience':8, 'education': 'low', 'salary' : "86"})
ie.makeInference()
newUx = ie.posterior("Ux")
newUx

Out[27]:
Ux
-2
-1
0
1
2
3
4
5
6
7
8
9
10
1.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
In [28]:
gnb.showInference(edex,evs={'experience':8, 'education': "low", 'salary' : "86"},targets={'Ux','Us'})


## Step 2 & 3 : Action And Prediction¶

Change the model to match the hypothesis implied by the query (if she had attended university) and then use the data that characterizes Alice to calculate her salary.

We create a counterfactual world with Alice's idiosyncratic factors, and we operate the intervention:

In [29]:
# the counterfactual world
edexCounterfactual = gum.BayesNet(edex)

In [30]:
# we replace the prior probabilities of idiosyncratic factors with potentials calculated earlier
edexCounterfactual.cpt("Ux").fillWith(newUx)
edexCounterfactual.cpt("Us").fillWith(newUs)
gnb.showInference(edexCounterfactual,size="10")
print("counterfactual world created")

counterfactual world created

In [31]:
# We operate the intervention
edexModele = csl.CausalModel(edexCounterfactual)
cslnb.showCausalImpact(edexModele,"salary",doing="education",values={"education":"medium"})

$$$$P( salary \mid \hookrightarrow\mkern-6.5mueducation) = \sum_{Us,Ux,experience}{P\left(Us\right) \cdot P\left(salary\mid Us,education,experience\right) \cdot P\left(experience\mid Ux,education\right) \cdot P\left(Ux\right)}$$$$
salary
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
0.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00001.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.00000.0000
Causal Model
Explanation : Do-calculus computations
Impact : $P( salary \mid \hookrightarrow\mkern-6.5mueducation=medium)$

Since education has no parents in our model (no graph surgery, no causes to emancipate it from), an intervention is equivalent to an observation, the only thing we need to do is to set the value of education:

In [32]:
gnb.showInference(edexCounterfactual,targets={"salary",'experience'},evs={'education':"medium"},size="10")


The result (salary if she had attended college) is given by the formaula: $$\sum_{salary} salary \times P(salary^* \mid RealSalary = 86k, education = 0, experience = 8, education^*=1)$$ Where variables marked with an asterisk are inobservable.

# We can now use a function that answers counterfactual queries using the previous algorithm (in pyAgrum.causal.counterfactual)¶

In [33]:
help(csl.counterfactual)

Help on function counterfactual in module pyAgrum.causal._causalImpact:

counterfactual(cm: pyAgrum.causal._CausalModel.CausalModel, profile: Union[Dict[str, int], NoneType], on: Union[str, Set[str]], whatif: Union[str, Set[str]], values: Union[Dict[str, int], NoneType] = None) -> pyAgrum.pyAgrum.Potential
Determines the estimation of a counterfactual query following the the three steps algorithm from "The Book Of Why" (Pearl 2018) chapter 8 page 253.

Determines the estimation of the counterfactual query: Given the "profile" (dictionary <variable name>:<value>),what would variables in "on" (single or list of variables) be if variables in "whatif" (single or list of variables) had been as specified in "values" (dictionary <variable name>:<value>)(optional).

This is done according to the following algorithm:
-Step 1-2: compute the twin causal model
-Step 3 : determine the causal impact of the interventions specified in  "whatif" on the single or list of variables "on" in the causal model.

This function returns the potential calculated in step 3, representing the probability distribution of  "on" given the interventions  "whatif", if it had been as specified in "values" (if "values" is omitted, every possible value of "whatif")

:param cm: CausalModel
:param profile: Dictionary
:param on: variable name or variable names set
:param whatif: variable name or variable names set
:param values: Dictionary
:type cm: pyAgrum.causal.CausalModel
:type profile: Union[Dict[str, int], type(None)]
:type on: Union[str, Set[str]]
:type whatif: Union[str, Set[str]]
:type values: Union[Dict[str, int], type(None)]
:return: the computation
:rtype: gum.Potential



### Let's try with the previous query :¶

In [34]:
pot=csl.counterfactual(cm = csl.CausalModel(edex),
profile = {'experience':8, 'education': "low", 'salary' : "86"},
whatif={"education"},
on={"salary"},
values = {"education" : "medium"})

In [35]:
gnb.showProba(pot)


We get the same result !

### If we omit values:¶

We get every potential outcome :

In [36]:
pot=csl.counterfactual(cm = csl.CausalModel(edex),
profile = {'experience':8, 'education': 'low', 'salary' : '86'},
whatif={"education"},
on={"salary"})

In [37]:
gnb.showPotential(pot)

education
salary
low
medium
high
65
0.00000.00000.0000
66
0.00000.00000.0000
67
0.00000.00000.0000
68
0.00000.00000.0000
69
0.00000.00000.0000
70
0.00000.00000.0000
71
0.00000.00000.0000
72
0.00000.00000.0000
73
0.00000.00000.0000
74
0.00000.00000.0000
75
0.00000.00000.0000
76
0.00000.00001.0000
77
0.00000.00000.0000
78
0.00000.00000.0000
79
0.00000.00000.0000
80
0.00000.00000.0000
81
0.00001.00000.0000
82
0.00000.00000.0000
83
0.00000.00000.0000
84
0.00000.00000.0000
85
0.00000.00000.0000
86
1.00000.00000.0000
87
0.00000.00000.0000
88
0.00000.00000.0000
89
0.00000.00000.0000
90
0.00000.00000.0000
91
0.00000.00000.0000
92
0.00000.00000.0000
93
0.00000.00000.0000
94
0.00000.00000.0000
95
0.00000.00000.0000
96
0.00000.00000.0000
97
0.00000.00000.0000
98
0.00000.00000.0000
99
0.00000.00000.0000
100
0.00000.00000.0000
101
0.00000.00000.0000
102
0.00000.00000.0000
103
0.00000.00000.0000
104
0.00000.00000.0000
105
0.00000.00000.0000
106
0.00000.00000.0000
107
0.00000.00000.0000
108
0.00000.00000.0000
109
0.00000.00000.0000
110
0.00000.00000.0000
111
0.00000.00000.0000
112
0.00000.00000.0000
113
0.00000.00000.0000
114
0.00000.00000.0000
115
0.00000.00000.0000
116
0.00000.00000.0000
117
0.00000.00000.0000
118
0.00000.00000.0000
119
0.00000.00000.0000
120
0.00000.00000.0000
121
0.00000.00000.0000
122
0.00000.00000.0000
123
0.00000.00000.0000
124
0.00000.00000.0000
125
0.00000.00000.0000
126
0.00000.00000.0000
127
0.00000.00000.0000
128
0.00000.00000.0000
129
0.00000.00000.0000
130
0.00000.00000.0000
131
0.00000.00000.0000
132
0.00000.00000.0000
133
0.00000.00000.0000
134
0.00000.00000.0000
135
0.00000.00000.0000
136
0.00000.00000.0000
137
0.00000.00000.0000
138
0.00000.00000.0000
139
0.00000.00000.0000
140
0.00000.00000.0000
141
0.00000.00000.0000
142
0.00000.00000.0000
143
0.00000.00000.0000
144
0.00000.00000.0000
145
0.00000.00000.0000
146
0.00000.00000.0000
147
0.00000.00000.0000
148
0.00000.00000.0000
149
0.00000.00000.0000
150
0.00000.00000.0000

### What would Alice's salary be if she had attended college and had 8 years of experience ?¶

In [38]:
pot=csl.counterfactual(cm = csl.CausalModel(edex),
profile = {'experience':8, 'education': 'low', 'salary' : '86'},
whatif={"education", "experience"},on={"salary"},
values = {"education" : 'medium', "experience" : 8})

In [39]:
gnb.showProba(pot)


### if she attended college and had 8 years of experience Alice's salary would be 91k !¶

In the previous query, Alice's salary if she attended college was lower than her actual salary, that's because in the counterfactual world where she attended college she had less time to work hence her diminished salary.

In this query, Alice's counterfactual salary was higher than her actual salary (+5k corresponding to one level of education), that's because in the counterfactual world Alice attended college and still had time to work 8 years, so her salary went up.

### it may happen that a whatif is not possible¶

In [40]:
pot=csl.counterfactual(cm = csl.CausalModel(edex),
profile = {'experience':8, 'education': 'low', 'salary' : '86'},
whatif={"experience"},
on={"salary"},
values = {"experience" : 12})

In [41]:
gnb.showProba(pot)


### indeed experience can not be 12¶

In [42]:
twin=csl.counterfactualModel(cm = csl.CausalModel(edex),
profile = {'experience':8, 'education': 'low', 'salary' : '86'},
whatif={"experience"},
on={"salary"})
gnb.showInference(twin.observationalBN(),size="10")