This notebook uses pandas to learn the parameters of a Bayesian network. Note that the simplest way to learn parameters is to use `BNLearner` :-). Moreover, `BNLearner` lets you add priors, etc. (see the notebooks on learning BNs).
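For reference, that simpler route looks roughly like the sketch below (a minimal sketch only: the CSV name and the smoothing weight are placeholders, and the very same calls are demonstrated for real later in this notebook).

import pyAgrum as gum
# minimal sketch of the BNLearner route (assuming a CSV sample already exists)
bn_template = gum.loadBN("res/asia.bif")
learner = gum.BNLearner("out/test.csv", bn_template)  # bn_template fixes the variables and their labels
learner.useSmoothingPrior(1)                          # optional Laplace-like prior
bn_learned = learner.learnParameters(bn_template.dag())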
%matplotlib inline
from pylab import *
import matplotlib.pyplot as plt
import os
import pyAgrum as gum
import pyAgrum.lib.notebook as gnb
bn=gum.loadBN("res/asia.bif")
bn2=gum.loadBN("res/asia.bif")
gnb.sideBySide(bn,bn2,
captions=['First bn','Second bn'])
bn.generateCPTs()
bn2.generateCPTs()
from IPython.display import HTML
gnb.sideBySide(bn.cpt(3),
bn2.cpt(3),
captions=['<h3>cpt of node 3 in first bn</h3>','<h3>same cpt in second bn</h3>'])
Since the BN is not too big, the exact distance (`gum.ExactBNdistance`) can be computed ...
g1=gum.ExactBNdistance(bn,bn2)
before_learning=g1.compute()
print(f"klPQ computed : {before_learning['klPQ']}")
klPQ computed : 2.5430647111804485
Just to check that the distance between a BN and itself is 0:
g0=gum.ExactBNdistance(bn,bn)
print(g0.compute()['klPQ'])
0.0
gum.generateSample(bn,10000,"out/test.csv",True)
Log2-Likelihood : -59457.80123453925
-59457.80123453925
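The generated CSV can be inspected directly with pandas (just a quick sanity check, not part of the original workflow):

import pandas
# peek at the first rows of the generated sample
pandas.read_csv("out/test.csv").head()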
As an exercise, we will use pandas to learn the parameters.
# using bn as a template for the specification of variables in test.csv
learner=gum.BNLearner("out/test.csv",bn)
bn3=learner.learnParameters(bn.dag())
# the same, but with a Laplace adjustment (smoothing) as a prior
learner=gum.BNLearner("out/test.csv",bn)
learner.useSmoothingPrior(1000) # a count C is replaced by C+1000
bn4=learner.learnParameters(bn.dag())
after_pyAgrum_learning=gum.ExactBNdistance(bn,bn3).compute()
after_pyAgrum_learning_with_smoothing=gum.ExactBNdistance(bn,bn4).compute()
print("without prior:{}".format(after_pyAgrum_learning['klPQ']))
print("with prior smooting(1000):{}".format(after_pyAgrum_learning_with_smoothing['klPQ']))
without prior:0.001173630952397425
with prior smoothing(1000):0.19694665735507919
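To get a feel for how strongly the prior pulls the estimates away from the data, one can vary the smoothing weight (a small sketch reusing only the calls shown above; the list of weights is arbitrary):

# effect of the smoothing weight on the distance to the original BN
for w in [1, 10, 100, 1000]:
    learner = gum.BNLearner("out/test.csv", bn)
    learner.useSmoothingPrior(w)
    bn_w = learner.learnParameters(bn.dag())
    print(f"smoothing({w}) : {gum.ExactBNdistance(bn, bn_w).compute()['klPQ']}")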
import pandas
# We directly generate samples in a DataFrame
df,_=gum.generateSample(bn,10000,None,True)
Log2-Likelihood : -59689.093634113604
df.head()
| | smoking | positive_XraY | lung_cancer | tuberculosis | dyspnoea | tuberculos_or_cancer | bronchitis | visit_to_Asia |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |
| 2 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 |
| 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
We use the crosstab function in pandas
c=pandas.crosstab(df['dyspnoea'],[df['tuberculos_or_cancer'],df['bronchitis']])
c
| tuberculos_or_cancer | 0 | 0 | 1 | 1 |
|---|---|---|---|---|
| bronchitis | 0 | 1 | 0 | 1 |
| dyspnoea | | | | |
| 0 | 1194 | 666 | 1126 | 1403 |
| 1 | 338 | 1696 | 1293 | 2284 |
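Normalizing each column of this crosstab gives the conditional distribution of dyspnoea given its parents; this is the intermediate step used in the reshaping below (a small check, not in the original notebook):

# normalize each column: conditional probabilities of dyspnoea given its parents
p = c/c.sum()
print(p)
print(p.sum())  # every column should sum to 1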
Playing with numpy reshaping, we can recover the correct shape for the CPT from the pandas crosstab:
gnb.sideBySide('<pre>'+str(np.array((c/c.sum().apply(np.float32)).transpose()).reshape(2,2,2))+'</pre>',
bn.cpt(bn.idFromName('dyspnoea')),
captions=["<h3>Learned parameters in crosstab","<h3>Original parameters in bn</h3>"])
def computeCPTfromDF(bn,df,name):
    """
    Compute the CPT of variable "name" in the BN bn from the database df
    """
    id=bn.idFromName(name)
    # names of the variables in the CPT, in the order used for the reshape below
    parents=list(reversed(bn.cpt(id).names))
    domains=[bn.variableFromName(name).domainSize()
             for name in parents]
    # remove the variable itself, keeping only its parents
    parents.pop()
    if len(parents)>0:
        # conditional frequencies: one column per configuration of the parents
        c=pandas.crosstab(df[name],[df[parent] for parent in parents])
        s=c/c.sum().apply(np.float32)
    else:
        # no parent: plain normalized frequencies
        s=df[name].value_counts(normalize=True)
    # reshape the normalized counts to the dimensions expected by the CPT
    bn.cpt(id)[:]=np.array((s).transpose()).reshape(*domains)
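As a quick check, the function can be applied to a single node before learning the whole BN; `bronchitis` below is just an arbitrary example node.

# learn only the CPT of 'bronchitis' in bn2 and compare it with the original
computeCPTfromDF(bn2, df, 'bronchitis')
gnb.sideBySide(bn.cpt(bn.idFromName('bronchitis')),
               bn2.cpt(bn2.idFromName('bronchitis')),
               captions=['<h3>Original CPT</h3>','<h3>CPT learned from df</h3>'])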
def ParametersLearning(bn,df):
    """
    Compute the CPTs of every variable in the BN bn from the database df
    """
    for name in bn.names():
        computeCPTfromDF(bn,df,name)
ParametersLearning(bn2,df)
The KL divergence should have decreased a lot (if everything went well):
g1=gum.ExactBNdistance(bn,bn2)
print("BEFORE LEARNING")
print(before_learning['klPQ'])
print()
print("AFTER LEARNING")
print(g1.compute()['klPQ'])
BEFORE LEARNING
2.5430647111804485

AFTER LEARNING
0.0008003336295384374
And the CPTs should be close:
gnb.sideBySide(bn.cpt(3),
bn2.cpt(3),
captions=["<h3>Original BN","<h3>learned BN</h3>"])
What is the effect of increasing the size of the database on the KL divergence? We expect it to decrease towards 0.
res=[]
for i in range(200,10001,50):
ParametersLearning(bn2,df[:i])
g1=gum.ExactBNdistance(bn,bn2)
res.append(g1.compute()['klPQ'])
fig=figure(figsize=(10,6))
ax = fig.add_subplot(1, 1, 1)
ax.plot(range(200,10001,50),res)
ax.set_xlabel("size of the database")
ax.set_ylabel("KL")
ax.set_title("klPQ(bn,learnedBN(x))");