lol, I started the (difficult) Probabilistic Graphical Models course on Coursera.
Will I keep going? Yes, I will.
I am planning to cover the whole 9 weeks following the schedule, though maybe not all the assignments. I will post on some of the key points, but not all of them, on my blog. Currently, I am also reading DL textbooks and blogs, and working on other projects as well.
Flow of Probabilistic Influence
Bayesian networks are directed graphs, so it is natural to think in terms of causality: X -> Y means X is the cause and Y is the effect (X leads to Y). We will use this notation throughout.
Consider the student example given here (Difficulty, Intelligence, Grade, Letter).
When can X influence Y?
Here we assume we cannot observe W.
We define an active trail (when no evidence is observed): a trail $X_1 \rightleftharpoons \cdots \rightleftharpoons X_n$ is active if it contains no v-structure $X_{i-1} \rightarrow X_i \leftarrow X_{i+1}$.
But what if W can be observed? Or W’s children can be observed?
If Letter can be observed, then changing the probability of Difficulty can influence the probability of Intelligence, because observing a descendant of the middle node activates the v-structure.
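To make this concrete, here is a minimal brute-force sketch (not from the course materials): it encodes a tiny Difficulty → Grade ← Intelligence network with Grade → Letter, using made-up CPD numbers, and checks by enumeration whether changing Difficulty moves the posterior over Intelligence, with and without observing Letter.

```python
import itertools

# Tiny student network: Difficulty -> Grade <- Intelligence, Grade -> Letter.
# All CPD numbers are made up; only the graph structure matters for the argument.
P_D = {0: 0.6, 1: 0.4}                       # P(Difficulty)
P_I = {0: 0.7, 1: 0.3}                       # P(Intelligence)
P_G = {                                      # P(Grade | Difficulty, Intelligence)
    (0, 0): {0: 0.30, 1: 0.40, 2: 0.30},
    (0, 1): {0: 0.90, 1: 0.08, 2: 0.02},
    (1, 0): {0: 0.05, 1: 0.25, 2: 0.70},
    (1, 1): {0: 0.50, 1: 0.30, 2: 0.20},
}
P_L = {0: {0: 0.10, 1: 0.90},                # P(Letter | Grade)
       1: {0: 0.40, 1: 0.60},
       2: {0: 0.99, 1: 0.01}}

def joint(d, i, g, l):
    return P_D[d] * P_I[i] * P_G[(d, i)][g] * P_L[g][l]

def p_intelligence(d, letter=None):
    """P(Intelligence | Difficulty=d [, Letter=letter]) by brute-force enumeration."""
    probs = {0: 0.0, 1: 0.0}
    for i, g, l in itertools.product((0, 1), (0, 1, 2), (0, 1)):
        if letter is not None and l != letter:
            continue
        probs[i] += joint(d, i, g, l)
    z = sum(probs.values())
    return {i: p / z for i, p in probs.items()}

# Without observing Letter, the v-structure blocks the trail:
# changing Difficulty does not change the distribution over Intelligence.
print(p_intelligence(d=0), p_intelligence(d=1))
# Observing Letter (a descendant of Grade) activates the trail, so Difficulty now matters.
print(p_intelligence(d=0, letter=1), p_intelligence(d=1, letter=1))
```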
Naïve Bayes Model
NBM is efficient in domains with many weakly relevant features, and very easy to construct.
Consider a classifier example:
If the sample belongs to the “Cat” class, then it has the following features: fluffy, 4 legs, a tail…
The class “Cat” determines the other features, and those features can be observed; the classifier infers the (hidden) class from the observed features. All the features are independent given the class, so applying the chain rule:

$$P(C, X_1, \dots, X_n) = P(C)\prod_{i=1}^{n} P(X_i \mid C)$$

Conditioning on the observed features gives $P(C \mid X_1, \dots, X_n) \propto P(C)\prod_{i=1}^{n} P(X_i \mid C)$, and this is the naive Bayes classifier.
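As a rough sketch of that factorization, here is a toy “Cat vs. Dog” classifier; the classes, features, and probabilities are invented for illustration, and the posterior is just the prior times one conditional per feature, normalized over the classes.

```python
# A toy naive Bayes classifier; classes, features, and probabilities are invented.
priors = {"cat": 0.5, "dog": 0.5}            # P(Class)
likelihoods = {                              # P(feature present | Class)
    "cat": {"fluffy": 0.90, "four_legs": 0.99, "tail": 0.95},
    "dog": {"fluffy": 0.40, "four_legs": 0.99, "tail": 0.90},
}

def posterior(observed):
    """P(Class | observed features), assuming features are independent given the class."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for feature, present in observed.items():
            p = likelihoods[c][feature]
            score *= p if present else (1.0 - p)
        scores[c] = score
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

print(posterior({"fluffy": True, "four_legs": True, "tail": True}))
```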
Bernoulli Naive Bayes for Text
Consider a document classifier. A document contains a list of words, and we infer the class label from those words. Say a financial document has a higher probability of containing “buy” or “sell” than “cat” or “dog”, as shown in the top-right CPD. The words are the document's features, and every feature is binary: the word either appears in the document or it does not.
In this model, the number of features equals the size of the whole dictionary, regardless of how long the document is.
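Here is a rough sketch of this Bernoulli model using scikit-learn; the toy documents, labels, and query are made up just to show the shape of the pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

docs = ["buy low sell high", "sell stock buy bond",
        "the cat chased the dog", "my dog and my cat"]
labels = ["finance", "finance", "pets", "pets"]

# binary=True keeps only appear / not-appear: one binary feature per dictionary word,
# regardless of how long each document is.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

clf = BernoulliNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["buy the dip and sell"])))
```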
Multinomial Naive Bayes for Text
Another similar model is illustrated below:
The features are again words, but now the number of features depends on the length of the document: W1 stands for the 1st word and Wn stands for the last word. For each word, the number of states equals the size of the whole dictionary, and all the features share the same CPD. Each row of the CPD sums to 1.
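A minimal sketch of the shared-CPD idea (toy numbers, not a trained model): each class has one row P(word | class) that sums to 1, and the same row scores every word position in the document.

```python
# Toy multinomial naive Bayes: one shared CPD P(word | class); each row sums to 1.
word_cpd = {
    "finance": {"buy": 0.40, "sell": 0.40, "cat": 0.10, "dog": 0.10},
    "pets":    {"buy": 0.05, "sell": 0.05, "cat": 0.45, "dog": 0.45},
}
priors = {"finance": 0.5, "pets": 0.5}

def score(doc_words, c):
    """P(class) times the product over word positions of P(word | class)."""
    p = priors[c]
    for w in doc_words:
        p *= word_cpd[c][w]
    return p

doc = ["buy", "cat", "sell", "sell"]          # number of features = length of this document
scores = {c: score(doc, c) for c in priors}
z = sum(scores.values())
print({c: s / z for c, s in scores.items()})  # posterior over the two classes
```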
Markov Assumption
Let's think about dynamic events. In the real world, things happen in an order, like weather forecasts, stock prices, etc. Such data have timestamps; in data science this is called “time-series” data.
We use $X^{(t)}$ to denote the variable X at time t.
$X^{(t:t')}$ represents the states of X from time t to time t'. This is a series.
We want to represent $P(X^{(t:t')})$ for any two given time points t and t'.
$$P(X^{(0:T)}) = P(X^{(0)}) \prod_{t=0}^{T-1} P(X^{(t+1)} \mid X^{(0:t)})$$
shows how the distribution unrolls when time flows forward.
$X^{(0)}$ is the initial state. The conditional probability $P(X^{(t+1)} \mid X^{(0:t)})$ denotes the state at t+1, given all of the past states.
We assume the Markov assumption holds:
$$(X^{(t+1)} \perp X^{(0:t-1)} \mid X^{(t)})$$
that is, given the present state, the next step is independent of all of the past steps.
An example is shown here:
This is a weather model: we assume the weather on Monday directly affects the weather on Tuesday, and so on. Under the Markov assumption, if Tuesday is observed, then Monday and Wednesday are independent, which means that if you want to know the weather on Wednesday, it can be inferred from Tuesday alone and has nothing to do with Monday.
But if Tuesday cannot be observed, then Monday and Wednesday are not independent.
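A tiny numeric sketch of this chain (the transition probabilities are invented): once Tuesday is fixed, Wednesday's distribution is just a row of the transition matrix and no longer depends on Monday.

```python
import numpy as np

# Toy two-state weather chain (0 = sunny, 1 = rainy); the numbers are made up.
# T[i, j] = P(tomorrow = j | today = i); each row sums to 1.
T = np.array([[0.8, 0.2],
              [0.4, 0.6]])

monday = np.array([1.0, 0.0])   # suppose Monday is sunny
tuesday = monday @ T            # distribution over Tuesday
wednesday = tuesday @ T         # distribution over Wednesday (marginalizing over Tuesday)

print("P(Wednesday):", wednesday)
# Once Tuesday is observed (say, rainy), Monday adds nothing:
print("P(Wednesday | Tuesday = rainy):", T[1])
```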
Hidden Markov Models
If we go further with our weather model:
We now also have yellow nodes, the observations, which can be observed.
More generally, here is a representation of the model:
From S to S', there is a transition matrix giving the probability of moving from one state to another.
Here is an example illustrating a word HMM.
From the start state 0, the model traces the whole process of producing the word “nine”.
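As a rough illustration of how a transition matrix and an emission CPD combine (this is not the course's speech example; all probabilities are invented), here is a minimal forward-algorithm sketch for a two-state HMM.

```python
import numpy as np

# Toy HMM: hidden weather (0 = rainy, 1 = sunny), observed umbrella (0 = no, 1 = yes).
T = np.array([[0.7, 0.3],        # transition matrix: T[i, j] = P(S' = j | S = i)
              [0.3, 0.7]])
E = np.array([[0.1, 0.9],        # emission: E[i, o] = P(observation = o | state = i)
              [0.8, 0.2]])
pi = np.array([0.5, 0.5])        # initial state distribution

obs = [1, 1, 0]                  # observed: umbrella, umbrella, no umbrella

alpha = pi * E[:, obs[0]]        # forward messages: alpha[i] = P(obs so far, current state = i)
for o in obs[1:]:
    alpha = (alpha @ T) * E[:, o]

print("P(observations) =", alpha.sum())
print("P(last state | observations) =", alpha / alpha.sum())
```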