%matplotlib inline
from pylab import *
import matplotlib.pyplot as plt
import os
import pyAgrum as gum
import pyAgrum.lib.notebook as gnb
import pyAgrum.lib.explain as explain
import pyAgrum.lib.bn_vs_bn as bnvsbn
gum.about()
gnb.configuration()
pyAgrum 1.8.1 (c) 2015-2023 Pierre-Henri Wuillemin, Christophe Gonzales This is free software; see the source code for copying conditions. There is ABSOLUTELY NO WARRANTY; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. For details, see 'pyAgrum.warranty'.
Library | Version |
---|---|
OS | posix [linux] |
Python | 3.10.10 (main, Mar 5 2023, 22:26:53) [GCC 12.2.1 20230201] |
IPython | 8.13.2 |
Matplotlib | 3.7.1 |
Numpy | 1.24.3 |
pyDot | 1.4.2 |
pyAgrum | 1.8.1 |
bn=gum.loadBN("res/asia.bif")
bn
gum.generateSample(bn,50000,"out/sample_asia.csv",True);
out/sample_asia.csv: 100%|█████████████████████████████████|
Log2-Likelihood : -161064.05636409108
with open("out/sample_asia.csv","r") as src:
for _ in range(10):
print(src.readline(),end="")
visit_to_Asia,dyspnoea,positive_XraY,lung_cancer,tuberculos_or_cancer,smoking,bronchitis,tuberculosis 1,0,1,1,1,1,1,1 1,0,0,1,1,0,0,1 1,0,1,1,1,1,0,1 1,0,1,1,1,1,0,1 1,0,1,1,1,0,0,1 1,1,1,1,1,1,1,1 1,1,1,1,1,0,0,1 1,0,1,1,1,0,0,1 1,1,1,1,1,1,1,1
learner=gum.BNLearner("out/sample_asia.csv",bn) #using bn as template for variables
print(learner)
Filename : out/sample_asia.csv Size : (50000,8) Variables : visit_to_Asia[2], tuberculosis[2], tuberculos_or_cancer[2], positive_XraY[2], lung_cancer[2], smoking[2], bronchitis[2], dyspnoea[2] Induced types : False Missing values : False Algorithm : Greedy Hill Climbing Score : BDeu Correction : MDL (Not used for score-based algorithms) Prior : -
print(f"Row of visit_to_Asia : {learner.idFromName('visit_to_Asia')}") # first row is 0
Row of visit_to_Asia : 0
print(f"Variable in row 4 : {learner.nameFromId(4)}")
Variable in row 4 : lung_cancer
The BNLearner is capable of recognizing missing values in databases. For this purpose, just indicate as a last argument the list of the strings that represent missing values.
# it is possible to add as a last argument a list of the symbols that represent missing values:
# whenever a cell of the database is equal to one of these strings, it is considered as a
# missing value
learner=gum.BNLearner("res/asia_missing.csv",bn, ['?', 'N/A'] )
print(f"Are there missing values in the database ? {learner.state()['Missing values'][0]}")
Are there missing values in the database ? True
When reading a csv file, BNLearner can try to find the correct type for discrete variable. Especially for numeric values.
%%writefile out/testTypeInduction.csv
A,B,C,D
1,2,0,hot
0,3,-2,cold
0,1,2,hot
1,2,2,warm
Overwriting out/testTypeInduction.csv
print("* by default, type induction is on (True) :")
learner=gum.BNLearner("out/testTypeInduction.csv")
bn3=learner.learnBN()
for v in sorted(bn3.names()):
print(f" - {bn3.variable(v)}")
print("")
print("* but you can disable it :")
learner=gum.BNLearner("out/testTypeInduction.csv",["?"],False)
bn3=learner.learnBN()
for v in sorted(bn3.names()):
print(f" - {bn3.variable(v)}")
print("")
print("Note that when a Labelized variable is found, the labesl are alphabetically sorted.")
* by default, type induction is on (True) : - A:Range([0,1]) - B:Range([1,3]) - C:Integer({-2|0|2}) - D:Labelized({cold|hot|warm}) * but you can disable it : - A:Labelized({0|1}) - B:Labelized({1|2|3}) - C:Labelized({-2|0|2}) - D:Labelized({cold|hot|warm}) Note that when a Labelized variable is found, the labesl are alphabetically sorted.
We give the $bn$ as a parameter for the learner in order to have the variables and the order of the labels for each variables. Please try to remove the argument $bn$ in the first line below to see the difference ...
learner=gum.BNLearner("out/sample_asia.csv",bn) #using bn as template for variables and labels
bn2=learner.learnParameters(bn.dag())
gnb.showBN(bn2)
from IPython.display import HTML
gnb.sideBySide("<H3>Original BN</H3>","<H3>Learned NB</H3>",
bn.cpt ('visit_to_Asia'),bn2.cpt ('visit_to_Asia'),
bn.cpt ('tuberculosis'),bn2.cpt ('tuberculosis'),
ncols=2)
Note that, currently, the BNLearner is not yet able to learn in the presence of missing values. This is the reason why, when it discovers that there exist such values, it raises a gum.MissingValueInDatabase
exception.
with open("res/asia_missing.csv","r") as asiafile:
for _ in range(10):
print(asiafile.readline(),end="")
try:
learner=gum.BNLearner("res/asia_missing.csv",bn, ['?', 'N/A'] )
bn2=learner.learnBN()
except gum.MissingValueInDatabase:
print ( "exception raised: there are missing values in the database" )
smoking,lung_cancer,bronchitis,visit_to_Asia,tuberculosis,tuberculos_or_cancer,dyspnoea,positive_XraY 0,0,0,1,1,0,0,0 1,1,0,1,1,1,0,1 1,1,1,1,1,1,1,1 1,1,0,1,1,1,0,N/A 0,1,0,1,1,1,1,1 1,1,1,1,1,1,1,1 1,1,1,1,1,1,0,1 1,1,0,1,1,1,0,1 1,1,1,1,1,1,1,1 exception raised: there are missing values in the database
For now, there are three scored-based algorithms that are wrapped in pyAgrum : LocalSearchWithTabuList, GreedyHillClimbing and K2
learner=gum.BNLearner("out/sample_asia.csv",bn) #using bn as template for variables
learner.useLocalSearchWithTabuList()
print(learner)
bn2=learner.learnBN()
print("Learned in {0}ms".format(1000*learner.currentTime()))
gnb.flow.row(bn,bn2,explain.getInformation(bn2),captions=["Original BN","Learned BN","information"])
Filename : out/sample_asia.csv Size : (50000,8) Variables : visit_to_Asia[2], tuberculosis[2], tuberculos_or_cancer[2], positive_XraY[2], lung_cancer[2], smoking[2], bronchitis[2], dyspnoea[2] Induced types : False Missing values : False Algorithm : Local Search with Tabu List Tabu list size : 2 Score : BDeu Correction : MDL (Not used for score-based algorithms) Prior : - Learned in 8.558417ms
To apprehend the distance between the original and the learned BN, we have several tools :
kl=gum.ExactBNdistance(bn,bn2)
kl.compute()
{'klPQ': 0.00034141582374694747, 'errorPQ': 0, 'klQP': 0.00030181045210608067, 'errorQP': 128, 'hellinger': 0.011018931976990266, 'bhattacharya': 6.0704631579847706e-05, 'jensen-shannon': 8.532721667204006e-05}
gcmp=bnvsbn.GraphicalBNComparator(bn,bn2)
gnb.flow.add(bnvsbn.graphDiff(bn,bn2))
gnb.flow.add(bnvsbn.graphDiffLegend())
gnb.flow.new_line()
gnb.flow.add_html("<br/>".join([f"{k} : {v:.2f}" for k,v in gcmp.skeletonScores().items() if k!='count']),"Skeleton scores")
gnb.flow.add_html("<br/>".join([f"{k} : {v:.2f}" for k,v in gcmp.scores().items() if k!='count']),"Scores")
gnb.flow.display()
A greedy Hill Climbing algorithm (with insert, remove and change arc as atomic operations).
learner=gum.BNLearner("out/sample_asia.csv",bn) #using bn as template for variables
learner.useGreedyHillClimbing()
print(learner)
bn2=learner.learnBN()
print("Learned in {0}ms".format(1000*learner.currentTime()))
gnb.sideBySide(bn,bn2,gnb.getBNDiff(bn,bn2),explain.getInformation(bn2),captions=["Original BN","Learned BN","Graphical diff","information"])
Filename : out/sample_asia.csv Size : (50000,8) Variables : visit_to_Asia[2], tuberculosis[2], tuberculos_or_cancer[2], positive_XraY[2], lung_cancer[2], smoking[2], bronchitis[2], dyspnoea[2] Induced types : False Missing values : False Algorithm : Greedy Hill Climbing Score : BDeu Correction : MDL (Not used for score-based algorithms) Prior : - Learned in 6.244893ms
And a K2 for those who likes it :)
learner=gum.BNLearner("out/sample_asia.csv",bn) #using bn as template for variables
learner.useK2([0,1,2,3,4,5,6,7])
print(learner)
bn2=learner.learnBN()
print("Learned in {0}ms".format(1000*learner.currentTime()))
gnb.sideBySide(bn,bn2,gnb.getBNDiff(bn,bn2),explain.getInformation(bn2),captions=["Original BN","Learned BN","Graphical diff","information"])
Filename : out/sample_asia.csv Size : (50000,8) Variables : visit_to_Asia[2], tuberculosis[2], tuberculos_or_cancer[2], positive_XraY[2], lung_cancer[2], smoking[2], bronchitis[2], dyspnoea[2] Induced types : False Missing values : False Algorithm : K2 K2 order : visit_to_Asia, tuberculosis, tuberculos_or_cancer, positive_XraY, lung_cancer, smoking, bronchitis, dyspnoea Score : BDeu Correction : MDL (Not used for score-based algorithms) Prior : - Learned in 3.376704ms
K2 can be very good if the order is the good one (a topological order of nodes in the reference)
learner=gum.BNLearner("out/sample_asia.csv",bn) #using bn as template for variables
learner.useK2([7,6,5,4,3,2,1,0])
print(learner)
bn2=learner.learnBN()
print("Learned in {0}s".format(learner.currentTime()))
gnb.sideBySide(bn,bn2,gnb.getBNDiff(bn,bn2),explain.getInformation(bn2),captions=["Original BN","Learned BN","Graphical diff","information"])
Filename : out/sample_asia.csv Size : (50000,8) Variables : visit_to_Asia[2], tuberculosis[2], tuberculos_or_cancer[2], positive_XraY[2], lung_cancer[2], smoking[2], bronchitis[2], dyspnoea[2] Induced types : False Missing values : False Algorithm : K2 K2 order : dyspnoea, bronchitis, smoking, lung_cancer, positive_XraY, tuberculos_or_cancer, tuberculosis, visit_to_Asia Score : BDeu Correction : MDL (Not used for score-based algorithms) Prior : - Learned in 0.0032919299999999998s
import numpy as np
%matplotlib inline
learner=gum.BNLearner("out/sample_asia.csv",bn) #using bn as template for variables
learner.useLocalSearchWithTabuList()
# we could prefere a log2likelihood score
# learner.useScoreLog2Likelihood()
learner.setMaxTime(10)
# representation of the error as a pseudo log (negative values really represents negative epsilon
@np.vectorize
def pseudolog(x):
res=np.log(x)#np.log(y)
return res if x>0 else -res
# in order to control the complexity, we limit the number of parents
learner.setMaxIndegree(7) # no more than 3 parent by node
learner.setEpsilon(1e-10)
gnb.animApproximationScheme(learner,
scale=pseudolog) # scale by default is np.log10
bn2=learner.learnBN()
learner=gum.BNLearner("out/sample_asia.csv",bn) #using bn as template for variables
learner.useGreedyHillClimbing()
learner.setMaxIndegree(1) # no more than 1 parent by node
print(learner)
bntree=learner.learnBN()
gnb.sideBySide(bn,bntree,gnb.getBNDiff(bn,bntree),explain.getInformation(bntree),captions=["Original BN","Learned BN","Graphical diff","information"])
Filename : out/sample_asia.csv Size : (50000,8) Variables : visit_to_Asia[2], tuberculosis[2], tuberculos_or_cancer[2], positive_XraY[2], lung_cancer[2], smoking[2], bronchitis[2], dyspnoea[2] Induced types : False Missing values : False Algorithm : Greedy Hill Climbing Score : BDeu Correction : MDL (Not used for score-based algorithms) Prior : - Constraint Max InDegree : 1 (Used only for score-based algorithms.)
learner=gum.BNLearner("out/sample_asia.csv",bn) #using bn as template for variables
learner.useGreedyHillClimbing()
# I know that smoking causes cancer
learner.addMandatoryArc("smoking","lung_cancer") # smoking->lung_cancer
# I know that visit to Asia may change the risk of tuberculosis
learner.addMandatoryArc("visit_to_Asia","tuberculosis") # visit_to_Asia->tuberculosis
print(learner)
bn2=learner.learnBN()
gnb.sideBySide(bn,bn2,gnb.getBNDiff(bn,bn2),explain.getInformation(bn2),captions=["Original BN","Learned BN","Graphical diff","information"])
Filename : out/sample_asia.csv Size : (50000,8) Variables : visit_to_Asia[2], tuberculosis[2], tuberculos_or_cancer[2], positive_XraY[2], lung_cancer[2], smoking[2], bronchitis[2], dyspnoea[2] Induced types : False Missing values : False Algorithm : Greedy Hill Climbing Score : BDeu Correction : MDL (Not used for score-based algorithms) Prior : - Constraint Mandatory Arcs : {visit_to_Asia->tuberculosis, smoking->lung_cancer}
By default, a BDEU score is used. But it can be changed.
learner=gum.BNLearner("out/sample_asia.csv",bn) #using bn as template for variables
learner.useGreedyHillClimbing()
# I know that smoking causes cancer
learner.addMandatoryArc(0,1)
# we prefere a log2likelihood score
learner.useScoreLog2Likelihood()
# in order to control the complexity, we limit the number of parents
learner.setMaxIndegree(1) # no more than 1 parent by node
print(learner)
bn2=learner.learnBN()
kl=gum.ExactBNdistance(bn,bn2)
gnb.sideBySide(bn,bn2,gnb.getBNDiff(bn,bn2),
"<br/>".join(["<b>"+k+"</b> :"+str(v) for k,v in kl.compute().items()]),
captions=["original","learned BN","diff","distances"])
Filename : out/sample_asia.csv Size : (50000,8) Variables : visit_to_Asia[2], tuberculosis[2], tuberculos_or_cancer[2], positive_XraY[2], lung_cancer[2], smoking[2], bronchitis[2], dyspnoea[2] Induced types : False Missing values : False Algorithm : Greedy Hill Climbing Score : Log2Likelihood Correction : MDL (Not used for score-based algorithms) Prior : - Constraint Max InDegree : 1 (Used only for score-based algorithms.) Constraint Mandatory Arcs : {visit_to_Asia->tuberculosis}
There are multiple ways to compare Bayes net...
help(gnb.getBNDiff)
Help on function getBNDiff in module pyAgrum.lib.notebook: getBNDiff(bn1, bn2, size=None, noStyle=False) get a HTML string representation of a graphical diff between the arcs of _bn1 (reference) with those of _bn2. if `noStyle` is False use 4 styles (fixed in pyAgrum.config) : - the arc is common for both - the arc is common but inverted in `bn2` - the arc is added in `bn2` - the arc is removed in `bn2` Parameters ---------- bn1: pyAgrum.BayesNet the reference bn2: pyAgrum.BayesNet the compared one size: float|str size (for graphviz) of the rendered graph noStyle: bool with style or not. Returns ------- str the HTML representation of the comparison
gnb.showBNDiff(bn,bn2)
import pyAgrum.lib.bn_vs_bn as gbnbn
help(gbnbn.graphDiff)
Help on function graphDiff in module pyAgrum.lib.bn_vs_bn: graphDiff(bnref, bncmp, noStyle=False) Return a pydot graph that compares the arcs of bnref to bncmp. graphDiff allows bncmp to have less nodes than bnref. (this is not the case in GraphicalBNComparator.dotDiff()) if noStyle is False use 4 styles (fixed in pyAgrum.config) : - the arc is common for both - the arc is common but inverted in _bn2 - the arc is added in _bn2 - the arc is removed in _bn2 See graphDiffLegend() to add a legend to the graph. Warning ------- if pydot is not installed, this function just returns None Returns ------- pydot.Dot the result dot graph or None if pydot can not be imported
gbnbn.GraphicalBNComparator?
Init signature: gbnbn.GraphicalBNComparator(name1, name2, delta=1e-06) Docstring: BNGraphicalComparator allows to compare in multiple way 2 BNs...The smallest assumption is that the names of the variables are the same in the 2 BNs. But some comparisons will have also to check the type and domainSize of the variables. The bns have not exactly the same role : _bn1 is rather the referent model for the comparison whereas _bn2 is the compared one to the referent model. Parameters ---------- name1 : str or pyAgrum.BayesNet a BN or a filename for reference name2 : str or pyAgrum.BayesNet another BN or antoher filename for comparison File: ~/.virtualenvs/devAgrum/lib/python3.10/site-packages/pyAgrum/lib/bn_vs_bn.py Type: type Subclasses:
gcmp=gbnbn.GraphicalBNComparator(bn,bn2)
gnb.sideBySide(bn,bn2,gcmp.dotDiff(),gbnbn.graphDiffLegend(),
bn2,bn,gbnbn.graphDiff(bn2,bn),gbnbn.graphDiffLegend(),
ncols=4)