This notebook uses pandas to learn the parameters of a Bayesian network. Note that the simplest way to learn parameters is to use `BNLearner` :-). Moreover, `BNLearner` lets you add priors, etc. (see the notebooks on learning BNs).
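For reference, that simpler route looks roughly like the sketch below (a minimal sketch only: the CSV name and the smoothing weight are placeholders, and the very same calls are demonstrated for real later in this notebook).

import pyAgrum as gum
# minimal sketch of the BNLearner route (assuming a CSV sample already exists)
bn_template = gum.loadBN("res/asia.bif")
learner = gum.BNLearner("out/test.csv", bn_template)  # bn_template fixes the variables and their labels
learner.useSmoothingPrior(1)                          # optional Laplace-like prior
bn_learned = learner.learnParameters(bn_template.dag())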
%matplotlib inline
from pylab import *
import matplotlib.pyplot as plt
import os
import pyAgrum as gum
import pyAgrum.lib.notebook as gnb
bn=gum.loadBN("res/asia.bif")
bn2=gum.loadBN("res/asia.bif")
gnb.sideBySide(bn,bn2,
captions=['First bn','Second bn'])
bn.generateCPTs()
bn2.generateCPTs()
from IPython.display import HTML
gnb.sideBySide(bn.cpt(3),
bn2.cpt(3),
captions=['<h3>cpt of node 3 in first bn</h3>','<h3>same cpt in second bn</h3>'])
Since the BN is not too big, the exact distance (`gum.ExactBNdistance`) can be computed ...
g1=gum.ExactBNdistance(bn,bn2)
before_learning=g1.compute()
print(f"klPQ computed : {before_learning['klPQ']}")
klPQ computed : 2.5430647111804485
Just to check that the distance between a BN and itself is 0:
g0=gum.ExactBNdistance(bn,bn)
print(g0.compute()['klPQ'])
0.0
gum.generateSample(bn,10000,"out/test.csv",True)
Log2-Likelihood : -59457.80123453925
-59457.80123453925
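The generated CSV can be inspected directly with pandas (just a quick sanity check, not part of the original workflow):

import pandas
# peek at the first rows of the generated sample
pandas.read_csv("out/test.csv").head()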
As an exercise, we will use pandas to learn the parameters.
# using bn as a template for the specification of variables in test.csv
learner=gum.BNLearner("out/test.csv",bn)
bn3=learner.learnParameters(bn.dag())
# the same, but with a Laplace adjustment (smoothing) as a prior
learner=gum.BNLearner("out/test.csv",bn)
learner.useSmoothingPrior(1000) # a count C is replaced by C+1000
bn4=learner.learnParameters(bn.dag())
after_pyAgrum_learning=gum.ExactBNdistance(bn,bn3).compute()
after_pyAgrum_learning_with_smoothing=gum.ExactBNdistance(bn,bn4).compute()
print("without prior:{}".format(after_pyAgrum_learning['klPQ']))
print("with prior smooting(1000):{}".format(after_pyAgrum_learning_with_smoothing['klPQ']))
without prior:0.001173630952397425
with prior smoothing(1000):0.19694665735507919
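To get a feel for how strongly the prior pulls the estimates away from the data, one can vary the smoothing weight (a small sketch reusing only the calls shown above; the list of weights is arbitrary):

# effect of the smoothing weight on the distance to the original BN
for w in [1, 10, 100, 1000]:
    learner = gum.BNLearner("out/test.csv", bn)
    learner.useSmoothingPrior(w)
    bn_w = learner.learnParameters(bn.dag())
    print(f"smoothing({w}) : {gum.ExactBNdistance(bn, bn_w).compute()['klPQ']}")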
import pandas
# We directly generate samples in a DataFrame
df,_=gum.generateSample(bn,10000,None,True)
Log2-Likelihood : -59689.093634113604
df.head()
| | smoking | positive_XraY | lung_cancer | tuberculosis | dyspnoea | tuberculos_or_cancer | bronchitis | visit_to_Asia |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |
| 2 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 |
| 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
We use the crosstab function in pandas
c=pandas.crosstab(df['dyspnoea'],[df['tuberculos_or_cancer'],df['bronchitis']])
c
| tuberculos_or_cancer | 0 | 0 | 1 | 1 |
|---|---|---|---|---|
| bronchitis | 0 | 1 | 0 | 1 |
| dyspnoea | | | | |
| 0 | 1194 | 666 | 1126 | 1403 |
| 1 | 338 | 1696 | 1293 | 2284 |
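Normalizing each column of this crosstab gives the conditional distribution of dyspnoea given its parents; this is the intermediate step used in the reshaping below (a small check, not in the original notebook):

# normalize each column: conditional probabilities of dyspnoea given its parents
p = c/c.sum()
print(p)
print(p.sum())  # every column should sum to 1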
Playing with numpy reshaping, we can recover the correct shape for the CPT from the pandas crosstab:
gnb.sideBySide('<pre>'+str(np.array((c/c.sum().apply(np.float32)).transpose()).reshape(2,2,2))+'</pre>',
bn.cpt(bn.idFromName('dyspnoea')),
captions=["<h3>Learned parameters in crosstab","<h3>Original parameters in bn</h3>"])
def computeCPTfromDF(bn,df,name):
    """
    Compute the CPT of variable "name" in the BN bn from the database df
    """
    id=bn.idFromName(name)
    # names of the variables in the CPT, in the order used for the reshape below
    parents=list(reversed(bn.cpt(id).names))
    domains=[bn.variableFromName(name).domainSize()
             for name in parents]
    # remove the variable itself, keeping only its parents
    parents.pop()
    if len(parents)>0:
        # conditional frequencies: one column per configuration of the parents
        c=pandas.crosstab(df[name],[df[parent] for parent in parents])
        s=c/c.sum().apply(np.float32)
    else:
        # no parent: plain normalized frequencies
        s=df[name].value_counts(normalize=True)
    # reshape the normalized counts to the dimensions expected by the CPT
    bn.cpt(id)[:]=np.array((s).transpose()).reshape(*domains)
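As a quick check, the function can be applied to a single node before learning the whole BN; `bronchitis` below is just an arbitrary example node.

# learn only the CPT of 'bronchitis' in bn2 and compare it with the original
computeCPTfromDF(bn2, df, 'bronchitis')
gnb.sideBySide(bn.cpt(bn.idFromName('bronchitis')),
               bn2.cpt(bn2.idFromName('bronchitis')),
               captions=['<h3>Original CPT</h3>','<h3>CPT learned from df</h3>'])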
def ParametersLearning(bn,df):
    """
    Compute the CPTs of every variable in the BN bn from the database df
    """
    for name in bn.names():
        computeCPTfromDF(bn,df,name)
ParametersLearning(bn2,df)
The KL divergence should have decreased a lot (if everything went well):
g1=gum.ExactBNdistance(bn,bn2)
print("BEFORE LEARNING")
print(before_learning['klPQ'])
print()
print("AFTER LEARNING")
print(g1.compute()['klPQ'])
BEFORE LEARNING
2.5430647111804485

AFTER LEARNING
0.0008003336295384374
And the CPTs should be close:
gnb.sideBySide(bn.cpt(3),
bn2.cpt(3),
captions=["<h3>Original BN","<h3>learned BN</h3>"])
What is the effect of increasing the size of the database on the KL divergence? We expect it to decrease towards 0.
res=[]
for i in range(200,10001,50):
ParametersLearning(bn2,df[:i])
g1=gum.ExactBNdistance(bn,bn2)
res.append(g1.compute()['klPQ'])
fig=figure(figsize=(10,6))
ax = fig.add_subplot(1, 1, 1)
ax.plot(range(200,10001,50),res)
ax.set_xlabel("size of the database")
ax.set_ylabel("KL")
ax.set_title("klPQ(bn,learnedBN(x))");