PGM 01: Bayesian Networks

lol, I started the difficult Probabilistic Graphical Models course on Coursera.

Can I keep going? Yes, I will.
I am planning to cover all 9 weeks following the schedule, though maybe not all of the assignments. I will blog about some of the key points, but not all of them. Currently, I am also reading DL textbooks and blogs, and working on other projects.

Flow of Probabilistic Influence

Bayesian networks are directed graphs, so it is natural to think in terms of causality: X -> Y means X is the cause and Y is the effect (X leads to Y). We will use this notation throughout.

In the example given here, when can X influence Y?

[Figure: the possible trails between X and Y through a middle variable W]

Here we assume we cannot observe W.

We define an active trail (when nothing is observed): a trail X1 ⇌ … ⇌ Xk is active if it contains no v-structure Xi-1 -> Xi <- Xi+1.

But what if W can be observed? Or W’s children can be observed?

[Figure: flow of influence when W or its descendants are observed]

[Figure: the student network example (Difficulty, Intelligence, Grade, Letter)]

If Letter can be observed, then a change in the probability of Difficulty can influence the probability of Intelligence, because Letter is a descendant of the v-structure node Grade, so observing it activates the trail.
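To make the rule concrete, here is a minimal sketch (not from the course) that checks whether a single trail is active given a set of observed nodes; the trail_is_active and descendants helpers and the parent/child maps are my own, and the test case is the student network:

```python
def descendants(node, children):
    """All descendants of a node in the DAG, given a child map."""
    out, stack = set(), [node]
    while stack:
        for c in children.get(stack.pop(), ()):
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def trail_is_active(trail, parents, children, observed):
    """A v-structure A -> B <- C on the trail is active only if B or one of its
    descendants is observed; any other triple is blocked when its middle node
    is observed."""
    observed = set(observed)
    for a, b, c in zip(trail, trail[1:], trail[2:]):
        if a in parents.get(b, ()) and c in parents.get(b, ()):   # v-structure at b
            if b not in observed and not (descendants(b, children) & observed):
                return False
        elif b in observed:                                       # other triples are blocked by evidence
            return False
    return True

# Student network: Difficulty -> Grade <- Intelligence, Grade -> Letter
parents  = {"G": {"D", "I"}, "L": {"G"}}
children = {"D": {"G"}, "I": {"G"}, "G": {"L"}}
print(trail_is_active(["D", "G", "I"], parents, children, observed=[]))     # False: the v-structure blocks it
print(trail_is_active(["D", "G", "I"], parents, children, observed=["L"]))  # True: observing Letter activates it
```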

Naïve Bayes Model

The Naive Bayes model is efficient in domains with many weakly relevant features, and it is very easy to construct.

Consider a classifier example:

[Figure: Naive Bayes model for the "Cat" class with observed features]
If a sample belongs to the "Cat" class, then it has the following features: fluffy, 4 legs, a tail, and so on.
The class "Cat" determines the other features, and those features can be observed; the classifier infers the (hidden) class from the observed features. All of the features are independent given the class:

P(C, X_1, \dots, X_n) = P(C) \prod_{i=1}^{n} P(X_i \mid C)

(applying the chain rule together with the conditional independence assumption)
This factorization is the Naive Bayes model, and using it to infer the class gives the Naive Bayes classifier.
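To classify, we apply Bayes' rule to this joint distribution: the posterior over the class is proportional to the prior times the product of the per-feature CPDs,

P(C \mid X_1, \dots, X_n) \propto P(C) \prod_{i=1}^{n} P(X_i \mid C)

and we pick the class with the highest posterior.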

Bernoulli Naive Bayes for Text
[Figure: Bernoulli Naive Bayes model for documents, with binary word variables and their CPDs]
Consider a document classifier. A document contains a list of words, and we infer the class label from those words. For example, a financial document has a higher probability of containing "buy" or "sell" than "cat" or "dog", as shown in the CPD at the top right. The words are the document's features, and every feature is binary: the word either appears in the document or it does not.
In this model, the number of features equals the size of the whole dictionary, regardless of how long the document is.
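As a toy illustration (my own made-up numbers, not from the course), the per-word Bernoulli CPDs can be used directly to score a document against each class:

```python
import numpy as np

# Hypothetical per-word CPDs: P(word appears in the document | class).
vocab = ["buy", "sell", "cat", "dog"]
p_word_given_class = {
    "finance": np.array([0.60, 0.50, 0.01, 0.01]),
    "pets":    np.array([0.02, 0.02, 0.50, 0.40]),
}
prior = {"finance": 0.5, "pets": 0.5}

def classify(doc_words):
    """Bernoulli Naive Bayes: P(class) times, for EVERY dictionary word,
    P(word present | class) or P(word absent | class)."""
    x = np.array([w in doc_words for w in vocab])
    scores = {}
    for c, p in p_word_given_class.items():
        scores[c] = prior[c] * np.prod(np.where(x, p, 1 - p))
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

print(classify({"buy", "sell"}))  # heavily "finance"
print(classify({"cat"}))          # heavily "pets"
```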

Multinomial Naive Bayes for Text

Another similar model is illustrated below:
[Figure: Multinomial Naive Bayes model for documents, one word variable per position]
The features are again words, but now the number of features depends on the length of the document: W1 is the first word and Wn is the last. Each feature takes a value from the whole dictionary, so all of the features share the same CPD, and each row of that CPD sums to 1.
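For contrast, here is a quick sketch with scikit-learn (assuming it is installed): CountVectorizer produces word counts over one shared dictionary, and MultinomialNB fits the shared per-word CPD for each class. The documents and labels are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training documents and labels.
docs   = ["buy low sell high", "sell the stock and buy bonds",
          "my cat chased the dog", "the dog and the cat sleep"]
labels = ["finance", "finance", "pets", "pets"]

vec = CountVectorizer()              # one shared dictionary; features are word counts
X   = vec.fit_transform(docs)
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vec.transform(["sell the dog"])))  # whichever class wins on these toy counts
```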

Markov Assumption

Let's think about dynamic events. In the real world, things happen in a temporal order, like weather forecasts or stock prices. Such data carries timestamps; in data science we call it "time-series" data.

We use X^{(t)} to denote the variable X at time t.
X^{(t:t')} represents the series of states of X from time t to time t'.
We want to represent P(X^{(t:t')}) for any two given time points t and t'.

P(X^{(0:T)}) = P(X^{(0)}) \prod_{t=0}^{T-1} P(X^{(t+1)} \mid X^{(0:t)})

This is the chain rule applied in the direction time flows: P(X^{(0)}) is the distribution of the initial state, and each conditional probability P(X^{(t+1)} \mid X^{(0:t)}) is the distribution of the state at time t+1 given all of the past states.
We now make the Markov assumption:

(X^{(t+1)} \perp X^{(0:t-1)}) \mid X^{(t)}

that is, given the present state, the next state is independent of all of the past states. Under this assumption the joint distribution simplifies to

P(X^{(0:T)}) = P(X^{(0)}) \prod_{t=0}^{T-1} P(X^{(t+1)} \mid X^{(t)})

An example is shown here:
[Figure: Markov chain over the weather on consecutive days]
This is a weather model: we assume the weather on Monday can directly affect the weather on Tuesday, and so on. Given the Markov assumption, if Tuesday can be observed, then Monday and Wednesday are conditionally independent, which means that to predict Wednesday's weather you only need Tuesday; Monday tells you nothing more.
But if Tuesday cannot be observed, then Monday and Wednesday are not independent.
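Here is a small numeric sketch of this point, with a made-up 2-state transition matrix (sunny/rainy); Wednesday's distribution depends on Monday only through Tuesday:

```python
import numpy as np

states = ["sunny", "rainy"]
T = np.array([[0.8, 0.2],   # P(tomorrow | today = sunny)
              [0.4, 0.6]])  # P(tomorrow | today = rainy)

monday    = np.array([1.0, 0.0])  # Monday observed to be sunny
tuesday   = monday @ T            # distribution over Tuesday's weather
wednesday = tuesday @ T           # Wednesday only uses Tuesday's distribution
print(dict(zip(states, wednesday)))

# If Tuesday is actually observed (say, rainy), Wednesday is just the
# corresponding row of T and Monday no longer matters:
print(dict(zip(states, T[1])))
```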

Hidden Markov Models

If we go further with our weather model:
[Figure: the weather Markov chain extended with observed nodes]
The yellow nodes are the ones that can be observed.
More generally, the model can be represented as:
[Figure: generic HMM structure with hidden states S, S' and observations]

From S to S' there is a transition matrix, giving the probability of moving from one state to another.

Here is an example illustrating an HMM for a word.
[Figure: HMM for the word "nine"]
Starting from state 0, the chain progresses through the states that make up the word "nine".
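To close, here is a minimal forward-algorithm sketch (hypothetical 2-state parameters, not the "nine" model from the slide) that computes the likelihood of an observation sequence under an HMM:

```python
import numpy as np

A  = np.array([[0.7, 0.3],   # transition matrix P(S' | S)
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],   # emission probabilities P(observation | S)
               [0.2, 0.8]])
pi = np.array([0.5, 0.5])    # initial state distribution

def forward(obs):
    """Forward algorithm: P(observation sequence) by summing over hidden states."""
    alpha = pi * B[:, obs[0]]            # alpha_0(s) = P(o_0, S_0 = s)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate through A, then weight by the emission of o
    return alpha.sum()

print(forward([0, 1, 1, 0]))  # likelihood of a short observation sequence
```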
