parametersLearningWithPandas

pyAgrum 0.15.1
generation: 2019-06-16 19:06

This pyAgrum notebook is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

In [1]:
%matplotlib inline
from pylab import *
import matplotlib.pyplot as plt
import numpy as np

import os

Initialisation

  • importing pyAgrum
  • importing pyAgrum.lib tools
  • loading a BN
In [2]:
import pyAgrum as gum
import pyAgrum.lib.notebook as gnb

Loading two BNs

In [3]:
bn=gum.loadBN(os.path.join("res","asia.bif"))
bn2=gum.loadBN(os.path.join("res","asia.bif"))

gnb.sideBySide(bn,bn2,
               captions=['First bn','Second bn'])
[Two identical renderings of the asia network: visit_to_Asia?->tuberculosis?, tuberculosis?->tuberculos_or_cancer?, lung_cancer?->tuberculos_or_cancer?, smoking?->lung_cancer?, smoking?->bronchitis?, tuberculos_or_cancer?->positive_XraY?, tuberculos_or_cancer?->dyspnoea?, bronchitis?->dyspnoea?]
First bn
Second bn

Randomizing the parameters

In [4]:
bn.generateCPTs()
bn2.generateCPTs()
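generateCPTs fills every CPT of the network with random parameters. Conceptually, it draws one random distribution over the child's values for each configuration of the parents; a minimal numpy sketch of that idea (an illustration, not pyAgrum's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(child_domain, parent_domains):
    """One random distribution over the child's values for each parent
    configuration, drawn from a flat Dirichlet (illustrative sketch)."""
    return rng.dirichlet(np.ones(child_domain), size=tuple(parent_domains))

cpt = random_cpt(2, [2, 2])   # binary child with two binary parents
print(cpt.shape)              # (2, 2, 2): the last axis is the child
print(cpt.sum(axis=-1))       # each conditional distribution sums to 1
```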

Direct comparison of parameters

In [5]:
from IPython.display import HTML

gnb.sideBySide(bn.cpt(3),
               bn2.cpt(3),
               captions=['<h3>cpt of node 3 in first bn</h3>','<h3>same cpt in second bn</h3>'])
                        positive_XraY?
tuberculos_or_cancer?        0       1
                    0   0.6958  0.3042
                    1   0.3452  0.6548

                        positive_XraY?
tuberculos_or_cancer?        0       1
                    0   0.9358  0.0642
                    1   0.3908  0.6092

cpt of node 3 in first bn

same cpt in second bn

Exact KL-divergence

Since the BN is not too big, the exact KL divergence can be computed:

In [6]:
g1=gum.ExactBNdistance(bn,bn2)  # BruteForceKL was renamed ExactBNdistance in pyAgrum>0.12.6
before_learning=g1.compute()
print(before_learning['klPQ'])
3.226842691050376

Just to be sure that the distance between a BN and itself is 0 :

In [7]:
g0=gum.ExactBNdistance(bn,bn)
print(g0.compute()['klPQ'])
0.0
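As a reminder, klPQ is the Kullback-Leibler divergence KL(P||Q) = Σ_x P(x) log(P(x)/Q(x)), computed here over the joint distributions of the two BNs. A minimal numpy sketch on two hand-made discrete distributions (base-2 logarithm below; the base is a convention choice, and this is not pyAgrum's code):

```python
import numpy as np

def kl(p, q):
    """KL(P||Q) for two discrete distributions given as arrays
    (base-2 logarithm; illustrative sketch)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                    # 0*log(0/q) = 0 by convention
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl(p, q))   # strictly positive: the distributions differ
print(kl(p, p))   # 0.0: a distribution is at distance 0 from itself
```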

Generate a database from the original BN

In [8]:
gum.generateCSV(bn,os.path.join("out","test.csv"),10000,True)
 out/test.csv : [ ############################################################ ] 100%
Log2-Likelihood : -65894.21061844246
Out[8]:
-65894.21061844246

Using pandas for counting

In [9]:
import pandas
df=pandas.read_csv(os.path.join("out","test.csv"))
df.head()
Out[9]:
   tuberculos_or_cancer?  smoking?  visit_to_Asia?  lung_cancer?  positive_XraY?  bronchitis?  dyspnoea?  tuberculosis?
0                      1         0               1             0               1            0          1              1
1                      1         0               0             1               0            0          1              1
2                      1         0               0             1               0            1          1              1
3                      0         0               0             1               0            0          0              0
4                      1         0               1             0               1            1          1              1

We use the pandas crosstab function to count the co-occurrences:

In [10]:
c=pandas.crosstab(df['dyspnoea?'],[df['tuberculos_or_cancer?'],df['bronchitis?']])
c
Out[10]:
tuberculos_or_cancer?     0           1
bronchitis?            0     1     0     1
dyspnoea?
0                   1465  2931   750    19
1                    278   487   872  3198
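Dividing each column of such a cross-table by its sum gives the maximum-likelihood estimate of P(dyspnoea? | tuberculos_or_cancer?, bronchitis?). Here is the mechanism on a tiny hand-made DataFrame (hypothetical data, for illustration only):

```python
import pandas as pd

df2 = pd.DataFrame({"child":  [0, 0, 1, 1, 1, 0],
                    "parent": [0, 0, 0, 1, 1, 1]})

c2 = pd.crosstab(df2["child"], df2["parent"])   # raw co-occurrence counts
p2 = c2 / c2.sum()                              # normalize each column
print(p2)   # each column estimates P(child | parent) and sums to 1
```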

With a transpose and a numpy reshape, we can bring the pandas cross-table into the shape pyAgrum expects for the CPT:

In [11]:
gnb.sideBySide('<pre>'+str(np.array((c/c.sum().apply(np.float32)).transpose()).reshape(2,2,2))+'</pre>',
               bn.cpt(bn.idFromName('dyspnoea?')),
               captions=["<h3>Learned parameters in crosstab","<h3>Original parameters in bn</h3>"])
[[[0.84050488 0.15949512]
  [0.85751902 0.14248098]]

 [[0.46239211 0.53760789]
  [0.00590612 0.99409388]]]
tuberculos_or_cancer?  bronchitis?  dyspnoea?=0  dyspnoea?=1
                    0            0       0.8490       0.1510
                    1            0       0.4564       0.5436
                    0            1       0.8491       0.1509
                    1            1       0.0052       0.9948

Learned parameters in crosstab

Original parameters in bn
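Why the transpose and the reshape? crosstab puts the child variable on the rows with one column per parent configuration, whereas the CPT wants the parents first and the child as the last (fastest-varying) axis. A small self-contained numpy sketch of that reordering (made-up counts):

```python
import numpy as np

# counts as crosstab lays them out: 2 child values (rows) x
# 4 parent configurations (columns, i.e. two binary parents)
counts = np.array([[10, 20, 30, 40],
                   [90, 80, 70, 60]])

probs = counts / counts.sum(axis=0)        # normalize each column
cpt3 = probs.transpose().reshape(2, 2, 2)  # -> (parent1, parent2, child)
print(cpt3.sum(axis=-1))                   # every conditional row sums to 1
```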

A global method for estimating the parameters of a Bayesian network from a CSV file using pandas

In [12]:
def computeCPTfromDF(bn,df,name):
    """
    Compute the CPT of variable "name" in the BN bn from the database df
    """
    idx=bn.idFromName(name)
    domains=[bn.variableFromName(v).domainSize()
             for v in bn.cpt(idx).var_names]

    parents=list(bn.cpt(idx).var_names)
    parents.pop()  # the last variable in var_names is "name" itself

    c=pandas.crosstab(df[name],[df[parent] for parent in parents])

    s=c.sum()

    # if c is one-dimensional (no parents) then s is a scalar, not a Series
    if isinstance(s,pandas.Series):
        s=s.apply(np.float32)
    else:
        s=float(s)

    bn.cpt(idx)[:]=np.array((c/s).transpose()).reshape(*domains)

def ParametersLearning(bn,df):
    """
    Compute the CPTs of every variable in the BN bn from the database df
    """
    for name in bn.names():
        computeCPTfromDF(bn,df,name)
In [13]:
ParametersLearning(bn2,df)
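A caveat of this maximum-likelihood estimator: a (child value, parent configuration) pair that never appears in the database gets probability exactly 0 in the learned CPT, which makes klPQ infinite as soon as the original BN gives that event positive probability. A classical remedy is add-alpha (Laplace) smoothing of the counts before normalization; the helper below is a hypothetical sketch, not part of pyAgrum:

```python
import pandas as pd

def smoothed_conditional(data, child, parents, alpha=1.0):
    """Estimate P(child | parents) after adding alpha to every observed
    cell, so that no learned probability is exactly 0 (hypothetical helper)."""
    c = pd.crosstab(data[child], [data[p] for p in parents]) + alpha
    return c / c.sum()

# child=1 never occurs with parent=1: the raw ML estimate would be 0
toy = pd.DataFrame({"child":  [0, 1, 0, 0],
                    "parent": [0, 0, 1, 1]})
print(smoothed_conditional(toy, "child", ["parent"]))   # no zero entries
```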

The KL divergence should have decreased a lot (if everything went well):

In [14]:
g1=gum.ExactBNdistance(bn,bn2)
print("BEFORE LEARNING")
print(before_learning['klPQ'])
print()
print("AFTER LEARNING")
print(g1.compute()['klPQ'])
BEFORE LEARNING
3.226842691050376

AFTER LEARNING
0.0014089086465641461

And the learned CPTs should be close to the original ones:

In [15]:
gnb.sideBySide(bn.cpt(3),
               bn2.cpt(3),
               captions=["<h3>Original BN","<h3>learned BN</h3>"])
                        positive_XraY?
tuberculos_or_cancer?        0       1
                    0   0.6958  0.3042
                    1   0.3452  0.6548

                        positive_XraY?
tuberculos_or_cancer?        0       1
                    0   0.6848  0.3152
                    1   0.3412  0.6588

Original BN

learned BN

Influence of the size of the database on the quality of learned parameters

What is the effect of the size of the database on the quality of the learned parameters? We expect the KL divergence to decrease towards 0 as the database grows.

In [16]:
res=[]
for i in range(200,10001,50):
    ParametersLearning(bn2,df[:i])
    g1=gum.ExactBNdistance(bn,bn2)
    res.append(g1.compute()['klPQ'])
fig=figure(figsize=(10,6))
ax  = fig.add_subplot(1, 1, 1)
ax.plot(range(200,10001,50),res)
ax.set_xlabel("size of the database")
ax.set_ylabel("KL")
t=ax.set_title("klPQ(bn,learnedBN(x))")