Parameters learning with pandas

This pyAgrum notebook is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

In [1]:
%matplotlib inline
from pylab import *
import matplotlib.pyplot as plt

import os

Initialisation

  • importing pyAgrum
  • importing pyAgrum.lib tools
  • loading a BN
In [2]:
import pyAgrum as gum
import pyAgrum.lib.notebook as gnb

Loading two BNs

In [3]:
bn=gum.loadBN(os.path.join("res","asia.bif"))
bn2=gum.loadBN(os.path.join("res","asia.bif"))

gnb.sideBySide(bn,bn2,
               captions=['First bn','Second bn'])
[Graphviz rendering of the two identical 'asia' networks, side by side: visit_to_Asia → tuberculosis → tuberculos_or_cancer → {positive_XraY, dyspnoea}; smoking → lung_cancer → tuberculos_or_cancer; smoking → bronchitis → dyspnoea]
First bn
Second bn

Randomizing the parameters

In [4]:
bn.generateCPTs()
bn2.generateCPTs()

Direct comparison of parameters

In [5]:
from IPython.display import HTML

gnb.sideBySide(bn.cpt(3),
               bn2.cpt(3),
               captions=['<h3>cpt of node 3 in first bn</h3>','<h3>same cpt in second bn</h3>'])
cpt of node 3 in first bn:

positive_XraY            |   0    |   1
tuberculos_or_cancer = 0 | 0.6172 | 0.3828
tuberculos_or_cancer = 1 | 0.5956 | 0.4044

same cpt in second bn:

positive_XraY            |   0    |   1
tuberculos_or_cancer = 0 | 0.0825 | 0.9175
tuberculos_or_cancer = 1 | 0.5679 | 0.4321

Exact KL-divergence

Since the BN is not too big, the exact distance between the two BNs (ExactBNdistance) can be computed:

In [6]:
g1=gum.ExactBNdistance(bn,bn2)
before_learning=g1.compute()
print(before_learning['klPQ'])
5.3266932860181795
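
Note that compute() returns more than klPQ. As a quick inspection (a minimal sketch; the exact set of measures may vary with the pyAgrum version):

In [ ]:
# before_learning is the dict returned by ExactBNdistance.compute() above;
# print every measure it contains (klPQ, klQP, hellinger, ... depending on the version)
for measure, value in before_learning.items():
    print("{}: {}".format(measure, value))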

Just to be sure that the distance between a BN and itself is 0:

In [7]:
g0=gum.ExactBNdistance(bn,bn)
print(g0.compute()['klPQ'])
0.0

Generate a database from the original BN

In [8]:
gum.generateCSV(bn,os.path.join("out","test.csv"),10000,True)
 out/test.csv : [ ############################################################ ] 100%
Log2-Likelihood : -65351.76313426007
Out[8]:
-65351.76313426007
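
generateCSV returns the log2-likelihood of the generated sample w.r.t. bn. As a small sanity check (a sketch reusing the value printed above), dividing by the number of rows gives the average number of bits needed to encode one record:

In [ ]:
# average code length (in bits) per record, from the log2-likelihood printed above
ll = -65351.76313426007
print(-ll / 10000)  # about 6.54 bits per record (8 binary variables, hence at most 8 bits)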

Using pandas for counting

As an exercise, we will use pandas to learn the parameters. Note, however, that the simplest way to learn parameters is to use `BNLearner` :-). Moreover, it lets you add priors, etc.

In [9]:
# using bn as a template for the specification of variables in test.csv
learner=gum.BNLearner(os.path.join("out","test.csv"),bn) 
bn3=learner.learnParameters(bn.dag())

# the same, but with a Laplace adjustment (smoothing) as a prior
learner=gum.BNLearner(os.path.join("out","test.csv"),bn) 
learner.useAprioriSmoothing(1000) # a count C is replaced by C+1000
bn4=learner.learnParameters(bn.dag())

after_pyAgrum_learning=gum.ExactBNdistance(bn,bn3).compute()
after_pyAgrum_learning_with_laplace=gum.ExactBNdistance(bn,bn4).compute()
print("without priori :{}".format(after_pyAgrum_learning['klPQ']))
print("with prior smooting(1000):{}".format(after_pyAgrum_learning_with_laplace['klPQ']))
without priori :0.0010335395114929016
with prior smooting(1000):0.19152856217745667
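
A smoothing weight of 1000 is huge compared to the 10000 rows of the database, which explains the degraded KL. As a side experiment (a sketch; bn5 is a name introduced here), a classical Laplace smoothing of 1 should stay much closer to the unsmoothed result:

In [ ]:
# hypothetical follow-up: classical Laplace smoothing (C -> C+1) instead of C -> C+1000
learner=gum.BNLearner(os.path.join("out","test.csv"),bn)
learner.useAprioriSmoothing(1)
bn5=learner.learnParameters(bn.dag())
print(gum.ExactBNdistance(bn,bn5).compute()['klPQ'])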

Now, let's try to learn the parameters with pandas

In [10]:
import pandas
df=pandas.read_csv(os.path.join("out","test.csv"))
df.head()
Out[10]:
tuberculosis visit_to_Asia smoking bronchitis lung_cancer tuberculos_or_cancer positive_XraY dyspnoea
0 0 0 1 0 0 0 1 0
1 0 0 0 0 0 0 0 0
2 0 1 1 0 0 0 0 0
3 1 1 0 0 0 1 0 1
4 1 1 0 0 0 0 0 0

We use the crosstab function from pandas to count the co-occurrences:

In [12]:
c=pandas.crosstab(df['dyspnoea'],[df['tuberculos_or_cancer'],df['bronchitis']])
c
Out[12]:
tuberculos_or_cancer 0 1
bronchitis 0 1 0 1
dyspnoea
0 2261 2309 1225 1082
1 355 462 948 1358

Playing with numpy reshaping, we recover the shape expected for the CPT from the pandas cross-table:

In [13]:
gnb.sideBySide('<pre>'+str(np.array((c/c.sum().apply(np.float32)).transpose()).reshape(2,2,2))+'</pre>',
               bn.cpt(bn.idFromName('dyspnoea')),
               captions=["<h3>Learned parameters in crosstab","<h3>Original parameters in bn</h3>"])
Learned parameters in crosstab:

[[[0.86429664 0.13570336]
  [0.83327319 0.16672681]]

 [[0.56373677 0.43626323]
  [0.44344262 0.55655738]]]

Original parameters in bn:

dyspnoea                                 |   0    |   1
bronchitis = 0, tuberculos_or_cancer = 0 | 0.8529 | 0.1471
bronchitis = 0, tuberculos_or_cancer = 1 | 0.5572 | 0.4428
bronchitis = 1, tuberculos_or_cancer = 0 | 0.8435 | 0.1565
bronchitis = 1, tuberculos_or_cancer = 1 | 0.4549 | 0.5451
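
The transpose/reshape has to match the variable order that pyAgrum uses for the CPT. Before assigning into a potential, it is worth checking this order (a minimal check; toarray() is assumed to be pyAgrum's numpy export of a potential):

In [ ]:
# inspect the variable order and the numpy shape expected when assigning with p[:] = ...
p=bn.cpt(bn.idFromName('dyspnoea'))
print(p.var_names)        # order of the variables in the CPT
print(p.toarray().shape)  # shape of the underlying numpy array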

A generic method for estimating the parameters of a Bayesian network from a CSV file using pandas

In [14]:
def computeCPTfromDF(bn,df,name):
    """
    Compute the CPT of variable "name" in the BN bn from the database df
    """
    nodeId=bn.idFromName(name)
    # domain size of every variable in the CPT (parents first, "name" last)
    domains=[bn.variableFromName(n).domainSize()
             for n in bn.cpt(nodeId).var_names]

    parents=list(bn.cpt(nodeId).var_names)
    parents.pop() # the last variable of the CPT is "name" itself

    # contingency table: values of "name" in rows, parents' configurations in columns
    c=pandas.crosstab(df[name],[df[parent] for parent in parents])

    s=c.sum()

    # if c is one-dimensional (no parent) then s is a scalar and not a Series
    if isinstance(s,pandas.Series):
        s=s.astype(np.float32)
    else:
        s=float(s)

    # normalize the counts and reshape them to fit the CPT's dimensions
    bn.cpt(nodeId)[:]=np.array((c/s).transpose()).reshape(*domains)

def ParametersLearning(bn,df):
    """
    Compute the CPTs of every variable in the BN bn from the database df
    """
    for name in bn.names():
        computeCPTfromDF(bn,df,name)
In [15]:
ParametersLearning(bn2,df)

The KL divergence should have decreased a lot (if everything went well):

In [16]:
g1=gum.ExactBNdistance(bn,bn2)
print("BEFORE LEARNING")
print(before_learning['klPQ'])
print()
print("AFTER LEARNING")
print(g1.compute()['klPQ'])
BEFORE LEARNING
5.3266932860181795
AFTER LEARNING
0.0010335395114929016

And the CPTs should be close:

In [17]:
gnb.sideBySide(bn.cpt(3),
               bn2.cpt(3),
               captions=["<h3>Original BN","<h3>learned BN</h3>"])
Original BN:

positive_XraY            |   0    |   1
tuberculos_or_cancer = 0 | 0.6172 | 0.3828
tuberculos_or_cancer = 1 | 0.5956 | 0.4044

Learned BN:

positive_XraY            |   0    |   1
tuberculos_or_cancer = 0 | 0.6165 | 0.3835
tuberculos_or_cancer = 1 | 0.6007 | 0.3993
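
For a quantitative check (a small sketch, assuming toarray() as the numpy export of a potential), we can look at the largest absolute difference between the two CPTs:

In [ ]:
# largest absolute difference between the two CPTs of node 3
print(np.max(np.abs(bn.cpt(3).toarray() - bn2.cpt(3).toarray())))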

Influence of the size of the database on the quality of learned parameters

What is the effect of increasing the size of the database on the KL divergence? We expect it to decrease towards 0.

In [18]:
res=[]
for i in range(200,10001,50):
    ParametersLearning(bn2,df[:i])
    g1=gum.ExactBNdistance(bn,bn2)
    res.append(g1.compute()['klPQ'])
fig=figure(figsize=(10,6))
ax  = fig.add_subplot(1, 1, 1)
ax.plot(range(200,10001,50),res)
ax.set_xlabel("size of the database")
ax.set_ylabel("KL")
t=ax.set_title("klPQ(bn,learnedBN(x))")