parametersLearningWithPandas

pyAgrum 0.16.3
generation: 2019-10-20 09:16

This pyAgrum notebook is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

In [1]:
%matplotlib inline
from pylab import *
import matplotlib.pyplot as plt

import os

Initialisation

  • importing pyAgrum
  • importing pyAgrum.lib tools
  • loading a BN
In [2]:
import pyAgrum as gum
import pyAgrum.lib.notebook as gnb

Loading two BNs

In [3]:
bn=gum.loadBN(os.path.join("res","asia.bif"))
bn2=gum.loadBN(os.path.join("res","asia.bif"))

gnb.sideBySide(bn,bn2,
               captions=['First bn','Second bn'])
(Graphviz renderings of the two identical "asia" networks, displayed side by side)

First bn
Second bn

Randomizing the parameters

In [4]:
bn.generateCPTs()
bn2.generateCPTs()
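
Note that generateCPTs() draws new random parameters on every run, so the numbers below will differ between executions. A minimal sketch for reproducibility, assuming gum.initRandom is available in this pyAgrum version:

In [ ]:
# hedged sketch: fix pyAgrum's random seed before randomizing the CPTs
# (gum.initRandom is assumed to exist in this version)
gum.initRandom(42)
bn.generateCPTs()
bn2.generateCPTs()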

Direct comparison of parameters

In [5]:
from IPython.display import HTML

gnb.sideBySide(bn.cpt(3),
               bn2.cpt(3),
               captions=['<h3>cpt of node 3 in first bn</h3>','<h3>same cpt in second bn</h3>'])
cpt of node 3 in first bn

tuberculos_or_cancer?   positive_XraY?=0   positive_XraY?=1
0                       0.5009             0.4991
1                       0.2623             0.7377

same cpt in second bn

tuberculos_or_cancer?   positive_XraY?=0   positive_XraY?=1
0                       0.1762             0.8238
1                       0.7762             0.2238

Exact KL-divergence

Since the BN is not too big, the exact KL-divergence can be computed with ExactBNdistance (formerly BruteForceKL).

In [6]:
g1=gum.ExactBNdistance(bn,bn2)
before_learning=g1.compute()
print(before_learning['klPQ'])
2.62293203195502

Just to be sure, let's check that the distance between a BN and itself is 0:

In [7]:
g0=gum.ExactBNdistance(bn,bn)
print(g0.compute()['klPQ'])
0.0
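
Note that compute() actually returns a dictionary of several distance measures, not only 'klPQ'. A small sketch to inspect them all (the exact set of keys, e.g. 'klQP' or 'hellinger', is assumed for this pyAgrum version):

In [ ]:
# print every measure computed by ExactBNdistance
# (key names other than 'klPQ' are assumed for this version)
for key,value in gum.ExactBNdistance(bn,bn2).compute().items():
    print("{} : {}".format(key,value))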

Generate a database from the original BN

In [8]:
gum.generateCSV(bn,os.path.join("out","test.csv"),10000,True)
 out/test.csv : [ ############################################################ ] 100%
Log2-Likelihood : -70030.97151790354
Out[8]:
-70030.97151790354
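
As a side note, generateCSV returns the log2-likelihood of the drawn sample, so dividing by the number of records gives the average number of bits per record (a quick arithmetic check on the value printed above):

In [ ]:
# about -7.0 bits per record on average for this sample
print(-70030.97151790354/10000)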

Using pandas for counting

As an exercise, we will use pandas to learn the parameters. Note, however, that the simplest way to learn parameters is to use `BNLearner` :-). Moreover, it lets you add priors, etc.

In [9]:
# using bn as a template for the specification of variables in test.csv
learner=gum.BNLearner(os.path.join("out","test.csv"),bn) 
bn3=learner.learnParameters(bn.dag())

# the same, but with a Laplace adjustment (smoothing) as a prior
learner=gum.BNLearner(os.path.join("out","test.csv"),bn) 
learner.useAprioriSmoothing(1000) # a count C is replaced by C+1000
bn4=learner.learnParameters(bn.dag())

after_pyAgrum_learning=gum.ExactBNdistance(bn,bn3).compute()
after_pyAgrum_learning_with_laplace=gum.ExactBNdistance(bn,bn4).compute()
print("without prior: {}".format(after_pyAgrum_learning['klPQ']))
print("with prior smoothing(1000): {}".format(after_pyAgrum_learning_with_laplace['klPQ']))
without prior: 0.0013021602565348581
with prior smoothing(1000): 0.1740769276171101
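
To see how strongly the smoothing weight pulls the parameters away from the data, here is a hedged sketch (not in the original notebook) that sweeps over several weights using the same calls as above:

In [ ]:
# KL-divergence to the original BN as a function of the smoothing weight
for weight in [1,10,100,1000]:
    learner=gum.BNLearner(os.path.join("out","test.csv"),bn)
    learner.useAprioriSmoothing(weight) # a count C is replaced by C+weight
    bnw=learner.learnParameters(bn.dag())
    print(weight,gum.ExactBNdistance(bn,bnw).compute()['klPQ'])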

Now, let's try to learn the parameters with pandas

In [10]:
import pandas
df=pandas.read_csv(os.path.join("out","test.csv"))
df.head()
Out[10]:
   visit_to_Asia?  positive_XraY?  tuberculosis?  tuberculos_or_cancer?  smoking?  lung_cancer?  bronchitis?  dyspnoea?
0               1               1              0                      1         1             0            0          0
1               0               0              0                      0         1             1            1          1
2               1               1              1                      0         1             0            1          0
3               1               0              0                      0         0             0            0          0
4               1               1              1                      0         0             0            1          0

We use the crosstab function from pandas:

In [11]:
c=pandas.crosstab(df['dyspnoea?'],[df['tuberculos_or_cancer?'],df['bronchitis?']])
c
Out[11]:
tuberculos_or_cancer?     0           1
bronchitis?               0     1     0     1
dyspnoea?
0                      1436  3141  1150  2348
1                       690   864   260   111
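
Each column of this cross-table corresponds to one configuration of the parents, so normalizing each column turns the counts into the conditional distribution of dyspnoea? given its parents. This is exactly what the next cell does before reshaping:

In [ ]:
# column-wise normalization: counts -> conditional probabilities
c/c.sum()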

Playing with numpy reshaping, we convert the pandas cross-table into the shape expected by the CPT:

In [12]:
gnb.sideBySide('<pre>'+str(np.array((c/c.sum().apply(np.float32)).transpose()).reshape(2,2,2))+'</pre>',
               bn.cpt(bn.idFromName('dyspnoea?')),
               captions=["<h3>Learned parameters in crosstab</h3>","<h3>Original parameters in bn</h3>"])
[[[0.67544685 0.32455315]
  [0.78426966 0.21573034]]

 [[0.81560284 0.18439716]
  [0.9548597  0.0451403 ]]]
bronchitis?   tuberculos_or_cancer?   dyspnoea?=0   dyspnoea?=1
0             0                       0.6673        0.3327
0             1                       0.8183        0.1817
1             0                       0.7933        0.2067
1             1                       0.9470        0.0530

Learned parameters in crosstab

Original parameters in bn
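
The reshape has to follow the variable order of the CPT. To check it, one can print var_names (also used in the next section), where the conditioned variable comes last:

In [ ]:
# the reshape must follow this variable order (conditioned variable last)
print(bn.cpt(bn.idFromName('dyspnoea?')).var_names)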

A global method for estimating Bayesian network parameters from a CSV file using pandas

In [13]:
def computeCPTfromDF(bn,df,name):
    """
    Compute the CPT of variable "name" in the BN bn from the database df
    """
    idx=bn.idFromName(name)
    domains=[bn.variableFromName(n).domainSize() 
             for n in bn.cpt(idx).var_names]

    # var_names lists the parents first and the variable itself last
    parents=list(bn.cpt(idx).var_names)
    parents.pop()

    c=pandas.crosstab(df[name],[df[parent] for parent in parents])

    s=c.sum()
    
    # if c is one-dimensional (no parent) then s is a float and not a Series
    if type(s)==pandas.core.series.Series:
        s=s.apply(np.float32)
    else:
        s=float(s)
    
    # normalize the counts column by column and reshape to the CPT's dimensions
    bn.cpt(idx)[:]=np.array((c/s).transpose()).reshape(*domains)
    
def ParametersLearning(bn,df):
    """
    Compute the CPTs of every variable in the BN bn from the database df
    """
    for name in bn.names():
        computeCPTfromDF(bn,df,name)
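
Note that the transpose is needed because crosstab puts the conditioned variable on the rows, whereas the CPT expects it as its last (fastest-varying) dimension.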
In [14]:
ParametersLearning(bn2,df)

The KL-divergence should have decreased a lot (if everything went well):

In [15]:
g1=gum.ExactBNdistance(bn,bn2)
print("BEFORE LEARNING")
print(before_learning['klPQ'])
print()
print("AFTER LEARNING")
print(g1.compute()['klPQ'])
BEFORE LEARNING
2.62293203195502

AFTER LEARNING
0.0013021602565348581

And the CPTs should be close:

In [16]:
gnb.sideBySide(bn.cpt(3),
               bn2.cpt(3),
               captions=["<h3>Original BN</h3>","<h3>learned BN</h3>"])
Original BN

tuberculos_or_cancer?   positive_XraY?=0   positive_XraY?=1
0                       0.5009             0.4991
1                       0.2623             0.7377

learned BN

tuberculos_or_cancer?   positive_XraY?=0   positive_XraY?=1
0                       0.5055             0.4945
1                       0.2673             0.7327
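
Beyond eyeballing the tables, the gap can be quantified, for instance with the largest absolute difference between the two CPTs (a hedged sketch, assuming Potential.toarray() exists in this pyAgrum version):

In [ ]:
# hedged sketch: largest absolute difference between original and learned CPT
# (Potential.toarray() is assumed to be available)
print(np.abs(bn.cpt(3).toarray()-bn2.cpt(3).toarray()).max())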

Influence of the size of the database on the quality of learned parameters

What is the effect of increasing the size of the database on the KL-divergence? We expect it to decrease towards 0.

In [17]:
res=[]
for i in range(200,10001,50):
    ParametersLearning(bn2,df[:i])
    g1=gum.ExactBNdistance(bn,bn2)
    res.append(g1.compute()['klPQ'])
fig=plt.figure(figsize=(10,6))
ax=fig.add_subplot(1, 1, 1)
ax.plot(range(200,10001,50),res)
ax.set_xlabel("size of the database")
ax.set_ylabel("KL")
t=ax.set_title("klPQ(bn,learnedBN(x))")