
This pyAgrum notebook is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

In [1]:
import pandas
import os
import math
import numpy as np  # used by forAge() below, which returns np.nan for unparsable ages
import pyAgrum as gum
import pyAgrum.lib.notebook as gnb
from pyAgrum.lib.bn2roc import showROC

Titanic: Machine Learning from Disaster

This notebook is an introduction to the Kaggle titanic challenge. The goal here is not to produce the best possible classifier, at least not yet, but to show how pyAgrum and Bayesian Networks can be used to easily and quickly explore and understand data.

To understand this notebook, basic knowledge of Bayesian Networks is required. If you are looking for an introduction to pyAgrum, check this notebook.

This notebook presents three different Bayesian Network techniques to answer the Kaggle Titanic challenge. In the first approach, we will answer the challenge without using the training set, relying only on our prior knowledge about shipwrecks. In the second approach we will only use the training set with pyAgrum's machine learning algorithms. Finally, in the third approach we will combine prior knowledge about shipwrecks with machine learning.

Before we start, some disclaimers about aGrUM and pyAgrum.

aGrUM is a C++ library designed for easily building applications using graphical models such as Bayesian networks, influence diagrams, decision trees or Markov decision processes.

pyAgrum is a Python wrapper for the C++ aGrUM library. It provides a high-level interface to the part of aGrUM that allows you to create, manipulate and compute with Bayesian Networks. The module is mainly generated with the SWIG interface generator, and custom-written code is added to simplify and extend the aGrUM API.

Both projects are open source and can be freely downloaded from aGrUM's gitlab repository or installed using pip or anaconda.

If you have questions, remarks or suggestions, feel free to contact us at info@agrum.org.

Pretreatment

We will use pandas to prepare the learning data so that it fits pyAgrum's requirements.

In [2]:
traindf=pandas.read_csv(os.path.join('res', 'titanic', 'train.csv'))

testdf=pandas.merge(pandas.read_csv(os.path.join('res', 'titanic', 'test.csv')),
                    pandas.read_csv(os.path.join('res', 'titanic', 'gender_submission.csv')),
                    on="PassengerId")

This merges the test base with the information about whether each passenger survived or not.

In [3]:
traindf.var()
Out[3]:
PassengerId    66231.000000
Survived           0.236772
Pclass             0.699015
Age              211.019125
SibSp              1.216043
Parch              0.649728
Fare            2469.436846
dtype: float64
In [4]:
for k in traindf.keys():
    print('{0}: {1}'.format(k, len(traindf[k].unique())))
PassengerId: 891
Survived: 2
Pclass: 3
Name: 891
Sex: 2
Age: 89
SibSp: 7
Parch: 7
Ticket: 681
Fare: 248
Cabin: 148
Embarked: 4

Looking at the number of unique values for each variable is necessary since Bayesian Networks are discrete models. We will want to reduce the domain size of some discrete variables (like Age) and discretize continuous variables (like Fare).
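
For continuous variables, pandas can do the binning for us. The snippet below is a minimal sketch, not reused in the rest of this notebook, showing how Fare could be split into quartiles (the bin labels are purely illustrative):

# quantile-based bins give each category roughly the same number of passengers
fare_bins = pandas.qcut(traindf['Fare'], q=4, labels=['low', 'medium', 'high', 'luxury'])
print(fare_bins.value_counts())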

For starters, you can filter out variables with a large number of values. Keeping variables with large domains has an impact on performance, which boils down to how much CPU and RAM you have at your disposal. Here, we choose to filter out any variable with more than 15 different outcomes.

In [5]:
for k in traindf.keys():
    if len(traindf[k].unique())<=15:
        print(k)
Survived
Pclass
Sex
SibSp
Parch
Embarked

This leaves us with 6 variables, not many, but still enough to learn a Bayesian Network. We will just add one more variable by reducing the cardinality of the Age variable.

In [6]:
testdf=pandas.merge(pandas.read_csv(os.path.join('res', 'titanic', 'test.csv')),
                    pandas.read_csv(os.path.join('res', 'titanic', 'gender_submission.csv')),
                    on="PassengerId")

def forAge(row):
    try:
        age = float(row['Age'])
        if age < 1:
            #return '[0;1['
            return 'baby'
        elif age < 6:
            #return '[1;6['
            return 'toddler'
        elif age < 12:
            #return '[6;12['
            return 'kid'
        elif age < 21:
            #return '[12;21['
            return 'teen'
        elif age < 80:
            #return '[21;80['
            return 'adult'
        else:
            #return '[80;200]'
            return 'old'
    except ValueError:
        return np.nan
    
def forBoolean(row, col):
    # maps a count (e.g. the number of siblings) to a boolean label
    try:
        val = int(row[col])
        if val >= 1:
            return "True"
        else:
            return "False"
    except ValueError:
        return "False"
    
def forGender(row):
    if row['Sex'] == "male":
        return "Male"
    else:
        return "Female"
        

testdf
Out[6]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Survived
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q 0
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S 1
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q 0
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S 0
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S 1
5 897 3 Svensson, Mr. Johan Cervin male 14.0 0 0 7538 9.2250 NaN S 0
6 898 3 Connolly, Miss. Kate female 30.0 0 0 330972 7.6292 NaN Q 1
7 899 2 Caldwell, Mr. Albert Francis male 26.0 1 1 248738 29.0000 NaN S 0
8 900 3 Abrahim, Mrs. Joseph (Sophie Halaut Easu) female 18.0 0 0 2657 7.2292 NaN C 1
9 901 3 Davies, Mr. John Samuel male 21.0 2 0 A/4 48871 24.1500 NaN S 0
10 902 3 Ilieff, Mr. Ylio male NaN 0 0 349220 7.8958 NaN S 0
11 903 1 Jones, Mr. Charles Cresson male 46.0 0 0 694 26.0000 NaN S 0
12 904 1 Snyder, Mrs. John Pillsbury (Nelle Stevenson) female 23.0 1 0 21228 82.2667 B45 S 1
13 905 2 Howard, Mr. Benjamin male 63.0 1 0 24065 26.0000 NaN S 0
14 906 1 Chaffee, Mrs. Herbert Fuller (Carrie Constance... female 47.0 1 0 W.E.P. 5734 61.1750 E31 S 1
15 907 2 del Carlo, Mrs. Sebastiano (Argenia Genovesi) female 24.0 1 0 SC/PARIS 2167 27.7208 NaN C 1
16 908 2 Keane, Mr. Daniel male 35.0 0 0 233734 12.3500 NaN Q 0
17 909 3 Assaf, Mr. Gerios male 21.0 0 0 2692 7.2250 NaN C 0
18 910 3 Ilmakangas, Miss. Ida Livija female 27.0 1 0 STON/O2. 3101270 7.9250 NaN S 1
19 911 3 Assaf Khalil, Mrs. Mariana (Miriam")" female 45.0 0 0 2696 7.2250 NaN C 1
20 912 1 Rothschild, Mr. Martin male 55.0 1 0 PC 17603 59.4000 NaN C 0
21 913 3 Olsen, Master. Artur Karl male 9.0 0 1 C 17368 3.1708 NaN S 0
22 914 1 Flegenheim, Mrs. Alfred (Antoinette) female NaN 0 0 PC 17598 31.6833 NaN S 1
23 915 1 Williams, Mr. Richard Norris II male 21.0 0 1 PC 17597 61.3792 NaN C 0
24 916 1 Ryerson, Mrs. Arthur Larned (Emily Maria Borie) female 48.0 1 3 PC 17608 262.3750 B57 B59 B63 B66 C 1
25 917 3 Robins, Mr. Alexander A male 50.0 1 0 A/5. 3337 14.5000 NaN S 0
26 918 1 Ostby, Miss. Helene Ragnhild female 22.0 0 1 113509 61.9792 B36 C 1
27 919 3 Daher, Mr. Shedid male 22.5 0 0 2698 7.2250 NaN C 0
28 920 1 Brady, Mr. John Bertram male 41.0 0 0 113054 30.5000 A21 S 0
29 921 3 Samaan, Mr. Elias male NaN 2 0 2662 21.6792 NaN C 0
... ... ... ... ... ... ... ... ... ... ... ... ...
388 1280 3 Canavan, Mr. Patrick male 21.0 0 0 364858 7.7500 NaN Q 0
389 1281 3 Palsson, Master. Paul Folke male 6.0 3 1 349909 21.0750 NaN S 0
390 1282 1 Payne, Mr. Vivian Ponsonby male 23.0 0 0 12749 93.5000 B24 S 0
391 1283 1 Lines, Mrs. Ernest H (Elizabeth Lindsey James) female 51.0 0 1 PC 17592 39.4000 D28 S 1
392 1284 3 Abbott, Master. Eugene Joseph male 13.0 0 2 C.A. 2673 20.2500 NaN S 0
393 1285 2 Gilbert, Mr. William male 47.0 0 0 C.A. 30769 10.5000 NaN S 0
394 1286 3 Kink-Heilmann, Mr. Anton male 29.0 3 1 315153 22.0250 NaN S 0
395 1287 1 Smith, Mrs. Lucien Philip (Mary Eloise Hughes) female 18.0 1 0 13695 60.0000 C31 S 1
396 1288 3 Colbert, Mr. Patrick male 24.0 0 0 371109 7.2500 NaN Q 0
397 1289 1 Frolicher-Stehli, Mrs. Maxmillian (Margaretha ... female 48.0 1 1 13567 79.2000 B41 C 1
398 1290 3 Larsson-Rondberg, Mr. Edvard A male 22.0 0 0 347065 7.7750 NaN S 0
399 1291 3 Conlon, Mr. Thomas Henry male 31.0 0 0 21332 7.7333 NaN Q 0
400 1292 1 Bonnell, Miss. Caroline female 30.0 0 0 36928 164.8667 C7 S 1
401 1293 2 Gale, Mr. Harry male 38.0 1 0 28664 21.0000 NaN S 0
402 1294 1 Gibson, Miss. Dorothy Winifred female 22.0 0 1 112378 59.4000 NaN C 1
403 1295 1 Carrau, Mr. Jose Pedro male 17.0 0 0 113059 47.1000 NaN S 0
404 1296 1 Frauenthal, Mr. Isaac Gerald male 43.0 1 0 17765 27.7208 D40 C 0
405 1297 2 Nourney, Mr. Alfred (Baron von Drachstedt")" male 20.0 0 0 SC/PARIS 2166 13.8625 D38 C 0
406 1298 2 Ware, Mr. William Jeffery male 23.0 1 0 28666 10.5000 NaN S 0
407 1299 1 Widener, Mr. George Dunton male 50.0 1 1 113503 211.5000 C80 C 0
408 1300 3 Riordan, Miss. Johanna Hannah"" female NaN 0 0 334915 7.7208 NaN Q 1
409 1301 3 Peacock, Miss. Treasteall female 3.0 1 1 SOTON/O.Q. 3101315 13.7750 NaN S 1
410 1302 3 Naughton, Miss. Hannah female NaN 0 0 365237 7.7500 NaN Q 1
411 1303 1 Minahan, Mrs. William Edward (Lillian E Thorpe) female 37.0 1 0 19928 90.0000 C78 Q 1
412 1304 3 Henriksson, Miss. Jenny Lovisa female 28.0 0 0 347086 7.7750 NaN S 1
413 1305 3 Spector, Mr. Woolf male NaN 0 0 A.5. 3236 8.0500 NaN S 0
414 1306 1 Oliva y Ocana, Dona. Fermina female 39.0 0 0 PC 17758 108.9000 C105 C 1
415 1307 3 Saether, Mr. Simon Sivertsen male 38.5 0 0 SOTON/O.Q. 3101262 7.2500 NaN S 0
416 1308 3 Ware, Mr. Frederick male NaN 0 0 359309 8.0500 NaN S 0
417 1309 3 Peter, Master. Michael J male NaN 1 1 2668 22.3583 NaN C 0

418 rows × 12 columns

When pretreating data, you will want to wrap your changes inside a function; this will help you keep track of your changes and compare them easily.

In [7]:
def pretreat(df):
    if 'Survived' in df.columns:
        df['Survived'] = df.apply(lambda row: forBoolean(row, 'Survived'), axis=1).dropna()
    df['Age'] = df.apply(forAge, axis=1).dropna()
    df['SibSp'] = df.apply(lambda row: forBoolean(row, 'SibSp'), axis=1).dropna()
    df['Parch'] = df.apply(lambda row: forBoolean(row, 'Parch'), axis=1).dropna()
    df['Sex'] = df.apply(forGender, axis=1).dropna()
    dropped_cols = [col for col in ['PassengerId', 'Name', 'Ticket', 'Fare', 'Cabin'] if col in df.columns]
    df = df.drop(dropped_cols, axis=1)
    df = df.rename(index=str, columns={'Sex': 'Gender', 'SibSp': 'Siblings', 'Parch': 'Parents'})
    return df

traindf = pandas.read_csv(os.path.join('res', 'titanic', 'train.csv'))
testdf  = pandas.merge(pandas.read_csv(os.path.join('res', 'titanic', 'test.csv')),
                       pandas.read_csv(os.path.join('res', 'titanic', 'gender_submission.csv')),
                       on="PassengerId")
traindf = pretreat(traindf)
testdf = pretreat(testdf)

We will need to save this intermediate learning database, since pyAgrum only accepts files as input. As a rule of thumb, save your CSV files using commas as separators and without quoting values when you plan to use them with pyAgrum.

In [8]:
import csv
traindf.to_csv(os.path.join('res', 'titanic', 'post_train.csv'), index=False)
testdf.to_csv(os.path.join('res', 'titanic', 'post_test.csv'), index=False)

Modeling without learning

In some cases we might not have any data to learn from; we can then rely on experts to provide correlations between variables and conditional probabilities.

It can be simpler to start with a simple topology, leaving room to add more complex correlations as the model is confronted with data. Here, we will use three hypotheses:

  • All variables are conditionally independent of each other given whether the passenger survived or not.
  • Women and children are more likely to survive.
  • The more siblings or parents aboard, the less likely the passenger is to survive.

The first assumption results in the following DAG for our Bayesian Network:

In [9]:
bn = gum.BayesNet("Surviving Titanic")

survived = gum.LabelizedVariable('Survived', 'Did the passenger survive?', 0)
survived.addLabel("False")
survived.addLabel("True")
bn.add(survived)
print(survived)

age = gum.LabelizedVariable('Age', "The passenger's age category", 0)
age.addLabel("baby")
age.addLabel("toddler")
age.addLabel("kid")
age.addLabel("teen")
age.addLabel("adult")
age.addLabel("old")
bn.add(age)
print(age)

gender = gum.LabelizedVariable('Gender', "The passenger's gender", 0)
gender.addLabel("Female")
gender.addLabel("Male")
bn.add(gender)
print(gender)

siblings = gum.LabelizedVariable('Siblings', "Did the passenger have siblings aboard?", 0)
siblings.addLabel("False")
siblings.addLabel("True")
bn.add(siblings)
print(siblings)

parents = gum.LabelizedVariable('Parents', "Did the passenger have parents aboard?", 0)
parents.addLabel("False")
parents.addLabel("True")
bn.add(parents)
print(parents)

bn.addArc('Survived', 'Age')
bn.addArc('Survived', 'Gender')
bn.addArc('Survived', 'Siblings')
bn.addArc('Survived', 'Parents')

bn
Survived<False,True>
Age<baby,toddler,kid,teen,adult,old>
Gender<Female,Male>
Siblings<False,True>
Parents<False,True>
Out[9]:
[Graph: Survived → Age, Survived → Gender, Survived → Siblings, Survived → Parents]

Hypotheses two and three can help us define the parameters of this Bayesian Network. Remember that we assume we do not have any data to learn from, so we will use simple statements such as "a woman is 10 times more likely to survive than a man". We can then normalize the values to obtain a proper conditional probability distribution.

This technique may not be the most precise or scientifically sound; however, it has the advantage of being easy to use.

In [10]:
bn.cpt('Survived')[:] = [100, 1]
bn.cpt('Survived').normalizeAsCPT()
bn.cpt('Survived')
Out[10]:
Survived  |  False   |  True
          |  0.9901  |  0.0099
In [11]:
bn.cpt('Age')[0:] = [ 1, 1, 1, 10, 10, 1]
bn.cpt('Age')[1:] = [ 10, 10, 10, 1, 1, 10]
bn.cpt('Age').normalizeAsCPT()
bn.cpt('Age')
Out[11]:
          |                           Age
Survived  |  baby    | toddler |  kid    |  teen   |  adult  |  old
False     |  0.0417  | 0.0417  | 0.0417  | 0.4167  | 0.4167  | 0.0417
True      |  0.2381  | 0.2381  | 0.2381  | 0.0238  | 0.0238  | 0.2381
In [12]:
bn.cpt('Gender')[0:] = [ 1, 1]
bn.cpt('Gender')[1:] = [ 10, 1]
bn.cpt('Gender').normalizeAsCPT()
bn.cpt('Gender')
Out[12]:
          |      Gender
Survived  |  Female  |  Male
False     |  0.5000  | 0.5000
True      |  0.9091  | 0.0909
In [13]:
bn.cpt('Siblings')[0:] = [ 1, 10]
bn.cpt('Siblings')[1:] = [ 10, 1]
bn.cpt('Siblings').normalizeAsCPT()
bn.cpt('Siblings')
Out[13]:
          |     Siblings
Survived  |  False   |  True
False     |  0.0909  | 0.9091
True      |  0.9091  | 0.0909
In [14]:
bn.cpt('Parents')[0:] = [ 1, 10]
bn.cpt('Parents')[1:] = [ 10, 1]
bn.cpt('Parents').normalizeAsCPT()
bn.cpt('Parents')
Out[14]:
          |     Parents
Survived  |  False   |  True
False     |  0.0909  | 0.9091
True      |  0.9091  | 0.0909

Now we can start using the Bayesian Network and check that our hypotheses hold.

In [15]:
gnb.showInference(bn,size="10")
[Inference display: marginal distributions of Survived, Age, Gender, Siblings and Parents]

We can see here that most passengers (99% of them) will not survive and that we have almost as many women (50.4%) as men (49.6%). The majority of passengers are either teenagers or adults. Finally, most passengers had siblings or parents aboard.

Recall that we have not used any data to learn the Bayesian Network's parameters, and that our expert did not have any knowledge about the passengers aboard the Titanic.
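
These marginals can also be read programmatically from an inference engine rather than off the plots. A minimal sketch (ie_check is just an illustrative name):

ie_check = gum.LazyPropagation(bn)
ie_check.makeInference()
print(ie_check.posterior('Survived'))   # approximately [0.99, 0.01]
print(ie_check.posterior('Gender'))     # approximately [0.504, 0.496]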

In [16]:
gnb.showInference(bn,size="10", evs={'Survived':'False'})
gnb.showInference(bn,size="10", evs={'Survived':'True'})
[Inference displays: posterior distributions given Survived=False, and given Survived=True]

Here, we can see that our second and third hypotheses hold: when we enter evidence that a passenger survived, she is more likely to be a woman with no siblings or parents aboard. On the contrary, if we observe that a passenger did not survive, he is more likely to be a man with siblings or parents aboard.

In [17]:
gnb.showInference(bn,size="10", evs={'Survived':'True', 'Gender':'Male'})
gnb.showInference(bn,size="10", evs={'Gender':'Male'})
[Inference displays: posterior distributions given {Survived=True, Gender=Male}, and given {Gender=Male}]

This validates our first hypothesis: if we know whether a passenger survived or not, then evidence about that passenger does not change our beliefs about the other variables. On the contrary, if we do not know whether a passenger survived, then evidence about the passenger will change our beliefs about the other variables, including whether he or she survived.
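
This conditional independence can also be checked numerically: once Survived is observed, adding evidence on Gender should leave the posterior of Siblings unchanged. A minimal sketch, assuming setEvidence accepts a dictionary mapping variable names to labels (as in recent pyAgrum versions):

ie_ci = gum.LazyPropagation(bn)
ie_ci.setEvidence({'Survived': 'True'})
ie_ci.makeInference()
print(ie_ci.posterior('Siblings'))

ie_ci.setEvidence({'Survived': 'True', 'Gender': 'Male'})
ie_ci.makeInference()
print(ie_ci.posterior('Siblings'))      # should be identical to the previous posterior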

In [18]:
ie=gum.LazyPropagation(bn)

def init_belief(engine):
    # Initialize evidence for every variable except the target
    for var in engine.BN().names():
        if var != 'Survived':
            engine.addEvidence(var, 0)

def update_beliefs(engine, bayesNet, row):
    # Update the evidence with the values of the given row, except for the Survived variable
    for var in bayesNet.names():
        if var == "Survived":
            continue
        try:
            label = str(row.to_dict()[var])
            idx = bayesNet.variable(var).index(label)
            engine.chgEvidence(var, idx)
        except gum.NotFound:
            # this can happen when a value is missing in the test base
            pass
    engine.makeInference()

def is_well_predicted(engine, bayesNet, threshold, row):
    # Compare the posterior probability of survival against a decision threshold
    update_beliefs(engine, bayesNet, row)
    marginal = engine.posterior('Survived')
    outcome = row.to_dict()['Survived']
    if outcome == "False": # Did not survive
        if marginal.toarray()[1] < threshold:
            return "True Positive"
        else:
            return "False Negative"
    else: # Survived
        if marginal.toarray()[1] >= threshold:
            return "True Negative"
        else:
            return "False Positive"

init_belief(ie)
ie.addTarget('Survived')
result = testdf.apply(lambda x: is_well_predicted(ie, bn, 0.5, x), axis=1)
result.value_counts(True)
Out[18]:
True Positive     0.516746
False Positive    0.322967
False Negative    0.119617
True Negative     0.040670
dtype: float64
In [19]:
good_predictions = sum(result.map(lambda x: 1 if x.startswith("True") else 0))
total = result.count()
print("{0:.2f}% good predictions".format(good_predictions/total*100))
55.74% good predictions

This first model achieves 55% good predictions; not a great result, but we have plenty of room to improve it.

Pre-learning

We will now learn a Bayesian Network from the training set without any prior knowledge about shipwrecks.

Before learning a Bayesian Network, we first need to create a template. This is not mandatory; however, it is sometimes useful since not all variables' values are present in the learning base (in this example, the number of relatives).

If the algorithm encounters an unknown value during the learning step, it will raise an error. This would be an issue if we wanted to automate our classifier, but here we will directly use values that work with both the test and learning bases. This is not ideal, but the objective here is to explore the data quickly, not thoroughly.

To help create the template Bayesian Network that we will use to learn our classifier, let us first recall all the variables we have at our disposal.

In [20]:
df = pandas.read_csv(os.path.join('res', 'titanic', 'post_train.csv'))
for k in df.keys():
    print('{0}: {1}'.format(k, len(df[k].unique())))
Survived: 2
Pclass: 3
Gender: 2
Age: 6
Siblings: 2
Parents: 2
Embarked: 4

From here, creating the BayesNet is straightforward: for each variable we use either the RangeVariable class or the LabelizedVariable class.

The RangeVariable class creates a discrete random variable over an integer range. With LabelizedVariable, you add each label one by one; note, however, that you can instead pass a list of labels, or a number of labels to be created starting from 0. A short sketch of these options follows.
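
Here is a minimal sketch of these options; the Class_* variables are purely illustrative and are not added to any network:

# option 1: add the labels one by one
v1 = gum.LabelizedVariable('Class_a', 'passenger class', 0)
for lab in ['first', 'second', 'third']:
    v1.addLabel(lab)

# option 2: ask for 3 labels, which creates the default labels '0', '1', '2'
v2 = gum.LabelizedVariable('Class_b', 'passenger class', 3)

# option 3: a RangeVariable spans an integer interval, here 1..3
v3 = gum.RangeVariable('Class_c', 'passenger class', 1, 3)

print(v1)
print(v2)
print(v3)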

In [21]:
template=gum.BayesNet()
template.add(gum.LabelizedVariable("Survived", "Survived", ['False', 'True']))
template.add(gum.RangeVariable("Pclass", "Pclass",1,3))
template.add(gum.LabelizedVariable("Gender", "The passenger's gender",['Female', 'Male']))
template.add(gum.LabelizedVariable("Siblings", "Siblings",['False', 'True']))
template.add(gum.LabelizedVariable("Parents", "Parents",['False', 'True']))
template.add(gum.LabelizedVariable("Embarked", "Embarked", ['', 'C', 'Q', 'S']))
template.add(gum.LabelizedVariable("Age", "The passenger's age category", ["baby", "toddler", "kid", "teen", "adult", "old"]))             
gnb.showBN(template)
[Graph: template network with nodes Survived, Pclass, Gender, Siblings, Parents, Embarked and Age, and no arcs]

You can also let the learning algorithm create the BayesNet's random variables, as sketched below. However, please be aware that the algorithm will not be able to handle values absent from the learning database.
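
For reference, the template-free version would look like the following sketch (learner_nt and bn_nt are illustrative names); the variables and their domains are then inferred directly from the CSV columns:

learner_nt = gum.BNLearner(os.path.join('res', 'titanic', 'post_train.csv'))
bn_nt = learner_nt.learnBN()
gnb.showBN(bn_nt)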

Learning

We can now learn our first Bayesian Network. As you will see, this is really easy.

In [22]:
file = os.path.join('res', 'titanic', 'post_train.csv')
learner = gum.BNLearner(file, template)
bn = learner.learnBN()
bn
Out[22]:
[Graph: learned network with arcs Pclass→Survived, Pclass→Siblings, Gender→Survived, Siblings→Gender, Parents→Gender, Parents→Siblings, Parents→Age, Embarked→Pclass, Age→Embarked]

In a notebook, a Bayesian Network is automatically shown graphically; you can also use the helper function gnb.showBN(bn).

Exploring the data

Now that we have a BayesNet, we can start looking at how the variables correlate with each other. pyAgrum offers the perfect tool for that: the information graph.

In [23]:
gnb.showInformation(bn,{},size="20")
[Information graph over the learned network; node entropies range from about 0.79 to 1.67]

To read this graph, you must understand what the entropy of a variable means: the higher the value, the more uncertain the variable's marginal probability distribution is (maximum entropy is reached for the uniform distribution); the lower the value, the more certain the distribution is.

A consequence of how entropy is calculated is that entropy tends to get bigger when the random variable has many modalities.
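
As a reminder, the entropy of a discrete variable X is H(X) = -sum over x of p(x) * log2 p(x). A quick standalone sketch showing how it grows with the number of modalities and shrinks when the distribution becomes skewed:

def entropy(probs):
    # H(X) = -sum p(x) * log2 p(x)
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # uniform over 2 modalities: 1.0 bit
print(entropy([1/6] * 6))     # uniform over 6 modalities: about 2.58 bits
print(entropy([0.9, 0.1]))    # skewed binary distribution: about 0.47 bits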

What the information graph tells us is that the Age variable has a high entropy. Thus, we can conclude that the passengers' ages are spread over all of its modalities.

What it also tells us is that variables with low entropy, such as Parents or Siblings, are not evenly distributed.

Let us look at the variables' marginal probabilities by using the showInference() function.

In [24]:
gnb.showInference(bn)
[Inference display: marginal distributions of all variables of the learned network]

The showInference() function is really useful as it shows the marginal probability distribution of each random variable of a BayesNet.

We can now confirm what the entropy told us: Parents and Siblings are unevenly distributed and Age is more evenly distributed.

Let's focus on the Kaggle challenge now and look at the Survived variable. We can show a single posterior using the showPosterior() function.

In [25]:
gnb.showPosterior(bn,evs={},target='Survived')

So roughly 40% of the passengers in our learning database survived.

So how can we use this BayesNet as a classifier? Given a set of evidence, we can infer an updated posterior distribution of the target variable Survived.
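
The same query can be made programmatically; a minimal sketch using the gum.getPosterior helper (the exact helper signature may vary slightly across pyAgrum versions):

# posterior of Survived for an adult male passenger, returned as a Potential
p = gum.getPosterior(bn, evs={'Gender': 'Male', 'Age': 'adult'}, target='Survived')
print(p)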

Let's look at the odds of surviving as an adult man.

In [26]:
gnb.showPosterior(bn,evs={"Gender": "Male", "Age": 'adult'},target='Survived')

And now the odds of survival for an old lady.

In [27]:
gnb.showPosterior(bn,evs={"Gender": "Female", "Age": 'old'},target='Survived')

Well, women and children first, right?

One last piece of information we will need is which variables are required to predict the Survived variable. To do so, we will use the Markov blanket of Survived.

In [28]:
gnb.sideBySide(bn, gum.MarkovBlanket(bn, 'Survived'), captions=["Learned Bayesian Network", "Markov blanket of 'Survived'"])
[Side by side: the learned Bayesian network (left) and the Markov blanket of 'Survived' (right), which contains Survived, Pclass and Gender]
Learned Bayesian Network | Markov blanket of 'Survived'

The Markov blanket of the Survived variable tells us that we only need to observe Gender and Pclass in order to predict Survived. Not really useful here, but on larger Bayesian Networks it can save you a lot of time and CPU.
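
We can verify this numerically: conditioning on the full Markov blanket (Gender and Pclass) should give the same posterior for Survived no matter what other evidence we add. A minimal sketch, assuming setEvidence accepts a dictionary mapping variable names to labels:

ie_mb = gum.LazyPropagation(bn)
ie_mb.setEvidence({'Gender': 'Male', 'Pclass': '3'})
ie_mb.makeInference()
print(ie_mb.posterior('Survived'))

# evidence outside the Markov blanket should not change the posterior
ie_mb.setEvidence({'Gender': 'Male', 'Pclass': '3', 'Age': 'adult', 'Embarked': 'S'})
ie_mb.makeInference()
print(ie_mb.posterior('Survived'))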

So how do we use this learned BayesNet as a classifier? We simply infer the posterior of the Survived variable given the available evidence, and if the passenger's odds of survival are above some threshold, he or she will be tagged as a survivor.

To compute the best threshold given the BayesNet and our training database, we can use the showROC() function.

In [29]:
showROC(bn, os.path.join('res', 'titanic', 'post_train.csv'), 'Survived', 'True', True, True)
 res/titanic/post_train.csv : [ ############################################## ] 100%
 result in res/titanic/post_train.csv-ROC_unnamed-Survived-True.png