Last Updated On: 2020-07-11 16:31
- Types of Autoencoders
- Vanilla Encoder Decoder model and its variants
- Regularized Autoencoders
- Sparse Autoencoders
- Denoising Autoencoders
- Contractive Autoencoders
- Variational Autoencoders
- Conditional Variational Autoencoders
- Adversarial Autoencoders
Vanilla Autoencoder model and its variants
- Simple Encoder-Decoder models.
- Condensed vector representation of the input.
- An autoencoder without an activation function and with mean squared error as the loss function is the same as PCA (Principal Component Analysis).
- Sometimes the optimal model turns out to be a linear model (PCA), i.e. without activation functions {this can be proved using singular value decomposition}.
- Undercomplete and overcomplete autoencoders (hidden dimension smaller vs. larger than the input dimension).
- One-layer encoder
  $h = g(a(X)) = \sigma(b^{(1)} + W^{(1)}X)$
- One-layer decoder
  $\hat{X} = o(\hat{a}(h)) = \sigma(b^{(2)} + W^{(2)}h)$
- Loss functions and their derivatives
  $$L(\hat{y}, y) = \frac{1}{n}\sum_{i=1}^{n}(\hat{y_i} - y_i)^2, \qquad \nabla L(\hat{y}, y) = \hat{y} - y \tag{1}$$
  $$L(\hat{y}, y) = -\sum_{i=1}^{n}\big(y\log(\hat{y}) + (1-y)\log(1-\hat{y})\big), \qquad \nabla L(\hat{y}, y) = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})} \tag{2}$$
  (1) mean squared error, maximum likelihood, sum squared error, or squared Euclidean distance
  (2) Bernoulli negative log-likelihood, or binary cross-entropy
- Backpropagation (weight update)
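- A minimal NumPy sketch of the one-layer encoder/decoder above, trained with the MSE loss of equation (1) and a hand-written backpropagation step; the toy data, layer sizes, learning rate and iteration count are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d, k = 64, 8, 3                  # samples, input dim, hidden dim (toy sizes, assumed)
X = rng.random((n, d))

W1, b1 = rng.normal(0, 0.1, (d, k)), np.zeros(k)   # encoder parameters
W2, b2 = rng.normal(0, 0.1, (k, d)), np.zeros(d)   # decoder parameters
lr = 0.5

for step in range(2000):
    # forward pass: h = sigma(b1 + X W1),  X_hat = sigma(b2 + h W2)
    h = sigmoid(X @ W1 + b1)
    X_hat = sigmoid(h @ W2 + b2)

    loss = np.mean((X_hat - X) ** 2)            # equation (1)

    # backward pass: nabla L = (X_hat - X) as in equation (1) (constants
    # absorbed into the learning rate), chained through the sigmoids
    d_out = (X_hat - X) * X_hat * (1 - X_hat)   # gradient at the decoder pre-activation
    dW2, db2 = h.T @ d_out / n, d_out.mean(axis=0)
    d_hid = (d_out @ W2.T) * h * (1 - h)        # gradient at the encoder pre-activation
    dW1, db1 = X.T @ d_hid / n, d_hid.mean(axis=0)

    # gradient-descent weight update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final reconstruction MSE:", np.mean((X_hat - X) ** 2))
```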
- L1 and L2 Regularization
  - L1 regularization: $L(\hat{y}, y) = \frac{1}{n}\sum_{i=1}^{n}(\hat{y_i} - y_i)^2 + \lambda|\theta|$
  - L2 regularization: $L(\hat{y}, y) = \frac{1}{n}\sum_{i=1}^{n}(\hat{y_i} - y_i)^2 + \lambda\|\theta\|^2$
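- A small sketch of how the L1/L2 penalty can be added to a reconstruction loss in PyTorch; `model`, `lambda_` and the function interface are illustrative assumptions (in practice the L2 case is often handled by the optimizer's `weight_decay` option).

```python
import torch

# Reconstruction loss plus an L1 or L2 penalty on the parameters theta.
# `model` is any autoencoder module; `lambda_` and `kind` are assumptions.
def regularized_loss(model, x, x_hat, lambda_=1e-4, kind="l2"):
    recon = torch.mean((x_hat - x) ** 2)            # (1/n) * sum (x_hat_i - x_i)^2
    theta = torch.cat([p.flatten() for p in model.parameters()])
    if kind == "l1":
        penalty = lambda_ * theta.abs().sum()       # lambda * |theta|
    else:
        penalty = lambda_ * (theta ** 2).sum()      # lambda * ||theta||^2
    return recon + penalty
```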
- DAEs take a partially corrupted input and are trained to recover the original undistorted input.
- A stochastic mapping procedure is used to corrupt the data.
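- A sketch of the stochastic corruption step of a denoising autoencoder; additive Gaussian noise and random masking are two common choices, and the noise level / masking probability here are assumptions.

```python
import torch

def corrupt(x, noise_std=0.3, drop_prob=0.3):
    # Additive Gaussian noise and random masking to zero: two common
    # stochastic corruption schemes (parameters are illustrative).
    noisy = x + noise_std * torch.randn_like(x)
    mask = (torch.rand_like(x) > drop_prob).float()
    masked = x * mask
    return noisy, masked

# Training uses the corrupted input but the *clean* target, e.g.
# x_hat = autoencoder(corrupt(x)[0]); loss = mse(x_hat, x)
```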
- A sparse autoencoder may include more (rather than fewer) hidden units than inputs, but only a small number of the hidden units are allowed to be active at once.
- Loss function
  $$L(\hat{y}, y) = \frac{1}{n}\sum_{i=1}^{n}(\hat{y_i} - y_i)^2 + \Omega(h) \tag{3}$$
  $\Omega(h)$ : sparsity penalty
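- A sketch of one common choice of the sparsity penalty $\Omega(h)$: a KL divergence between a target sparsity `rho` and the mean activation of each hidden unit (an L1 penalty on `h` is another option); `rho`, `beta` and the tensor shapes are assumptions.

```python
import torch

def sparsity_penalty(h, rho=0.05, beta=1.0):
    # h: hidden activations in (0, 1), shape (batch, hidden_dim).
    # Omega(h) as a KL divergence between the target sparsity rho and the
    # average activation rho_hat of each hidden unit.
    rho_hat = h.mean(dim=0).clamp(1e-6, 1 - 1e-6)
    kl = rho * torch.log(rho / rho_hat) + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))
    return beta * kl.sum()

# loss = mse(x_hat, x) + sparsity_penalty(h)    # equation (3)
```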
- Loss function
  $$L(\hat{y}, y) = \frac{1}{n}\sum_{i=1}^{n}(\hat{y_i} - y_i)^2 + \lambda\sum_i \|\nabla_x h_i\|^2 \tag{4}$$
  $\lambda\sum_i \|\nabla_x h_i\|^2$ : squared Frobenius norm of the Jacobian of the hidden activations with respect to the input (the contractive penalty)
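- A sketch of the contractive penalty for a sigmoid encoder, where the squared Frobenius norm of the Jacobian $\partial h / \partial x$ has a closed form; `lam` and the weight layout (hidden x input, as in `torch.nn.Linear`) are assumptions.

```python
import torch

def contractive_penalty(h, W, lam=1e-3):
    # h: hidden activations sigma(Wx + b), shape (batch, hidden)
    # W: encoder weight matrix, shape (hidden, input)
    # For a sigmoid encoder, ||dh/dx||_F^2 = sum_j (h_j(1 - h_j))^2 * sum_i W_ji^2.
    dh = (h * (1 - h)) ** 2                       # (batch, hidden)
    w_sq = (W ** 2).sum(dim=1)                    # (hidden,)
    return lam * (dh * w_sq).sum(dim=1).mean()    # averaged over the batch

# loss = mse(x_hat, x) + contractive_penalty(h, encoder.weight)   # equation (4)
```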
- Expectation Value
- Monte Carlo Estimation of Expectation
- Bayes Theorem
- Naive Bayes Algorithm and its training strategies
- Variational Inference
- KL divergence
- Expectation or expected value of a random variable X with (finite) outcomes $\{x_1, x_2, \ldots, x_k\}$ and probabilities $\{p_{x_1}, p_{x_2}, \ldots, p_{x_k}\}$:
  $$E(X) = \sum_{i=1}^{k} x_i\, p_{x_i}$$
  Monte Carlo estimate from a sample set $Z = \{z_1, z_2, \ldots, z_k\}$:
  $$E(X) \approx \frac{1}{|Z|}\sum_{z \in Z} X(z)$$
- $|Z|$ is called the cardinality of $Z$.
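- A quick NumPy sketch comparing the exact expectation with its Monte Carlo estimate; the distribution and sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Exact expectation of a discrete random variable X.
x = np.array([1.0, 2.0, 3.0])
p = np.array([0.2, 0.5, 0.3])
exact = np.sum(x * p)                    # E(X) = sum_i x_i p_{x_i}

# Monte Carlo estimate: draw a finite sample set Z from p and average.
Z = rng.choice(x, size=10_000, p=p)
estimate = Z.sum() / len(Z)              # (1/|Z|) * sum_{z in Z} X(z)

print(exact, estimate)                   # the two values should be close
```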
- A more elaborate guide is here
- In probability theory and statistics, Bayes' theorem (alternatively Bayes's theorem, Bayes's law, or Bayes's rule) describes the probability of an event based on prior knowledge of conditions that might be related to the event [Wikipedia definition].
- Bayes' theorem gives the probability of a hypothesis $z$, given some new data $x$:
  $$p(z|x) = \frac{p(x|z)\,p(z)}{p(x)}$$
- Its derivation:
  $$\begin{aligned}
  p(x \cap y) &= p(y \cap x) \\
  p(x|y)\,p(y) &= p(y|x)\,p(x) \\
  p(x|y) &= \frac{p(y|x)\,p(x)}{p(y)}
  \end{aligned}$$
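- A tiny numeric check of the formula for a binary hypothesis $z$ and an observation $x = 1$; all probability values below are arbitrary assumptions.

```python
p_z = {0: 0.7, 1: 0.3}                 # prior p(z)
p_x1_given_z = {0: 0.1, 1: 0.8}        # likelihood p(x=1 | z)

p_x1 = sum(p_x1_given_z[z] * p_z[z] for z in p_z)              # evidence p(x=1)
posterior = {z: p_x1_given_z[z] * p_z[z] / p_x1 for z in p_z}  # p(z | x=1)

print(posterior, sum(posterior.values()))   # posterior probabilities sum to 1
```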
Naive Bayes Algorithm and its training strategies
Ref : http://www.cs.cornell.edu/courses/cs4780/2018sp/lectures/lecturenote05.html
- Bayesian classification: let us assume that one of the labels $L = \{l_1, l_2, \ldots, l_k\}$ has to be predicted from a data point $X = \{x_1, x_2, x_3, \ldots, x_n\}$; the probabilistic output (i.e. class score) of the event is
  $$P(L|X)$$
  and, according to Bayes' theorem,
  $$P(L|X) = \frac{P(X|L)\,P(L)}{P(X)}$$
- So, if we can somehow calculate the value of $P(X|L)$, the problem is solved. This type of probabilistic model is called generative modelling, in which we generate data given its label. In general such training is very difficult.
- First naive Bayes assumption: the feature values ($X$) are independent given the label ($L$).
  $$P(x_i \mid L, \{x_j : j \neq i\}) = P(x_i \mid L) \quad\Rightarrow\quad P(X|L) = \prod_{i=1}^{n} P(x_i|L)$$
- Hence, with the above assumption, the problem simplifies to maximizing the posterior:
  $$\begin{aligned}
  h(X) &= \underset{l}{\arg\max}\, P(L|X) \\
  &= \underset{l}{\arg\max}\, \frac{P(X|L)\,P(L)}{P(X)} \\
  &= \underset{l}{\arg\max}\, \frac{\prod_{i=1}^{n} P(x_i|L)\,P(L)}{P(X)} \\
  &= \underset{l}{\arg\max}\, \sum_{i=1}^{n} \log P(x_i|L) + \log P(L) - \log P(X) \\
  &= \underset{l}{\arg\max}\, \sum_{i=1}^{n} \log P(x_i|L) + \log P(L)
  \end{aligned}$$
  (the $\log P(X)$ term does not depend on $L$, so it can be dropped.)
- But when the sample points come from an arbitrary, unknown distribution, the problem becomes intractable. If we assume a definite distribution for the input data, such as a Gaussian, we can model the data points by computing a mean and variance particular to each label.
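- A minimal Gaussian naive Bayes sketch: fit a per-label mean and variance for each feature, then score a point with $\sum_i \log P(x_i|L) + \log P(L)$; the toy data is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(100, 2))      # class 0 samples (toy data)
X1 = rng.normal(3.0, 1.0, size=(100, 2))      # class 1 samples (toy data)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

def log_gaussian(x, mu, var):
    # log density of a Gaussian, evaluated per feature
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# "training": per-label mean, variance and prior
stats = {}
for label in np.unique(y):
    Xl = X[y == label]
    stats[label] = (Xl.mean(axis=0), Xl.var(axis=0), len(Xl) / len(X))

def predict(x):
    scores = {}
    for label, (mu, var, prior) in stats.items():
        scores[label] = log_gaussian(x, mu, var).sum() + np.log(prior)
    return max(scores, key=scores.get)    # argmax_l sum_i log P(x_i|l) + log P(l)

print(predict(np.array([0.2, -0.1])), predict(np.array([2.8, 3.1])))   # expect 0, then 1
```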
- The Variational Autoencoder has its basis in the Naive Bayes algorithm.
- Problem with Bayes' theorem: the denominator term in the equation is intractable, since the computer has to integrate over all values of $z$.
  $$p(z|x) = \frac{p(x|z)\,p(z)}{\int_z p(x|z)\,p(z)\,dz} \quad \forall\, z = \{z_1, z_2, z_3, \ldots\}$$
- So the way around is to approximate it by assuming a function $q(z|x)$ such that
  $$p(z_1, z_2, z_3, \ldots|X) \approx q(z_1, z_2, z_3, \ldots|X)$$
  where $q(z|x)$ is tractable while $p(z|x)$ is not.
- Then, find the setting of the parameters that makes the chosen distribution close to the posterior of interest.
- One method to achieve variational inference is the KL divergence.
- KL divergence measures the closeness of two distributions.
  The higher the probability of an event, the lower its information content, i.e. information is inversely related to the probability of the event.
  For example: a low temperature on any day in June (summer) is less likely, but it carries high information content about the weather conditions of that day.
- Information content of an event $x$ w.r.t. probability distributions $\{p, q\}$:
  $$I_p(x) = -\log(p(x)), \qquad I_q(x) = -\log(q(x))$$
- KL divergence is the expectation value of the change in information content:
  $$D_{KL}(q(x)\,||\,p(x)) = E_{\sim q}[\Delta I] = \int q(x)\log\!\left(\frac{q(x)}{p(x)}\right)dx$$
- The Kullback-Leibler (KL) divergence is not symmetric:
  $$D_{KL}(q(x)\,||\,p(x)) \neq D_{KL}(p(x)\,||\,q(x))$$
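- A short NumPy check of the definition and the asymmetry on two made-up discrete distributions (the numbers are arbitrary assumptions).

```python
import numpy as np

def kl(q, p):
    # D_KL(q || p) = sum_x q(x) * log(q(x) / p(x)) for discrete distributions
    q, p = np.asarray(q, float), np.asarray(p, float)
    return np.sum(q * np.log(q / p))

q = [0.5, 0.4, 0.1]
p = [0.3, 0.3, 0.4]

print(kl(q, p), kl(p, q))   # the two values differ: KL is not symmetric
```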
- Evidence Lower Bound (ELBO):
- We actually can’t minimize the KL divergence exactly, but we can minimize a function that is equal to it up to a constant.
- As naive Bayes suggests, computing $p(z)\,p(x|z)$ is a generative-modelling problem, so we start from that.
- Assume $p_\theta(z)\,p_\theta(x|z)$ as the generative model; but computing it requires the posterior $p_\theta(z|x)$, which gives the equation
  $$p_\theta(z|x) = \frac{p_\theta(x|z)\,p_\theta(z)}{p_\theta(x)} \tag{5}$$
- Because $p_\theta(x) = \int_z p_\theta(x|z)\,p_\theta(z)\,dz$ is intractable, the whole of equation (5) becomes intractable.
- So, using variational inference, assume a variational approximation $q_\phi(z|x)$ (which works as a recognition model) for $p_\theta(z|x)$.
- There are methods to find the parameters $\phi$ of $q_\phi(z|x)$, called variational Bayesian methods, such as mean-field variational inference, factorized approximations, and KL-divergence minimization.
- But in a VAE the parameters $\phi$ of $q_\phi(z|x)$ are learned jointly with the generative model parameters $\theta$:
  $$L(\phi, \theta) = -E_{\sim q}\log(p_\theta(x|Z)) + D_{KL}(q_\phi(Z|x)\,||\,p_\theta(Z))$$
  $$\begin{aligned}
  D_{KL}(q_\phi(Z|x)\,||\,p_\theta(Z|x)) &= E_{q_\phi(Z|x)}[I_p - I_q] \\
  &= E_{q_\phi(Z|x)}\big[\log(q_\phi(Z|x)) - \log(p_\theta(Z|x))\big] \\
  &= E_{q_\phi(Z|x)}\left[\log(q_\phi(Z|x)) - \log\!\left(\frac{p_\theta(x|Z)\,p_\theta(Z)}{p_\theta(x)}\right)\right] \\
  &= E_{q_\phi(Z|x)}\big[\log(q_\phi(Z|x)) - \log(p_\theta(x|Z)) - \log(p_\theta(Z)) + \log(p_\theta(x))\big] \\
  D_{KL}(q_\phi(Z|x)\,||\,p_\theta(Z|x)) - \log(p_\theta(x)) &= E_{q_\phi(Z|x)}\big[\log(q_\phi(Z|x)) - \log(p_\theta(x|Z)) - \log(p_\theta(Z))\big] \\
  &= -E_{q_\phi(Z|x)}\log(p_\theta(x|Z)) + E_{q_\phi(Z|x)}\big[\log(q_\phi(Z|x)) - \log(p_\theta(Z))\big] \\
  &= -E_{q_\phi(Z|x)}\log(p_\theta(x|Z)) + D_{KL}(q_\phi(Z|x)\,||\,p_\theta(Z)) \\
  \log(p_\theta(x)) - D_{KL}(q_\phi(Z|x)\,||\,p_\theta(Z|x)) &= E_{q_\phi(Z|x)}\log(p_\theta(x|Z)) - D_{KL}(q_\phi(Z|x)\,||\,p_\theta(Z))
  \end{aligned}$$
  Now: $D_{KL}(q_\phi(Z|x)\,||\,p_\theta(Z|x)) \geq 0$
  Proof: //TODO
- On combining the above two equations:
  $$\log(p_\theta(x)) \geq E_{q_\phi(Z|x)}\log(p_\theta(x|Z)) - D_{KL}(q_\phi(Z|x)\,||\,p_\theta(Z))$$
- The right-hand side of the equation is the Evidence Lower Bound (ELBO).
- Therefore, maximizing the ELBO maximizes the log probability of our data by proxy. This is the core idea of variational inference.
- The Kullback-Leibler term in the ELBO is a regularizer, because it is a constraint on the form of the approximate posterior.
- The expectation term is called the reconstruction term, because it is a measure of the likelihood of the reconstructed data output at the decoder.
- The negation of the ELBO is the loss function for the VAE.
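- A sketch of the negated ELBO as a VAE loss, assuming (as is standard, though not derived in these notes) a diagonal-Gaussian $q_\phi(Z|x)$, a standard-normal prior $p_\theta(Z)$, and a Bernoulli decoder so that the reconstruction term is binary cross-entropy; the function name and interface are assumptions.

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, log_var):
    # Negated ELBO. mu and log_var parameterize q_phi(Z|x) = N(mu, diag(exp(log_var)));
    # with a standard-normal prior the KL term has the closed form below.
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")       # -E_q log p_theta(x|Z)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())  # D_KL(q_phi(Z|x) || p_theta(Z))
    return recon + kl
```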
- For the weight update we need $\nabla L(\phi, \theta)$, i.e. $\nabla_\phi L(\phi, \theta)$ taking $\theta$ as constant and $\nabla_\theta L(\phi, \theta)$ taking $\phi$ as constant.
- Differentiation w.r.t. $\theta$ is fine, but w.r.t. $\phi$ it is problematic because of the term $E_{q_\phi(Z|x)}\log(p_\theta(x|Z))$: the expectation itself depends on the variable $\phi$.
- Using Monte Carlo estimation, with samples $Z^{(i)} \sim q_\phi(Z|x)$:
  $$\log(p_\theta(x)) \geq \frac{1}{L}\sum_{i=1}^{L}\log(p_\theta(x|Z^{(i)})) - D_{KL}(q_\phi(Z|x)\,||\,p_\theta(Z))$$
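- A sketch of the single-sample ($L = 1$) Monte Carlo estimate: draw $Z$ from $q_\phi(Z|x)$ and feed it to the decoder. Writing the sample as $z = \mu + \sigma\,\epsilon$ with $\epsilon \sim N(0, I)$ (the standard reparameterization trick, not covered above) keeps the gradient with respect to $\phi$ well defined; `mu`, `log_var` and the decoder call are assumptions.

```python
import torch

def sample_z(mu, log_var):
    # One sample Z ~ q_phi(Z|x) = N(mu, diag(exp(log_var))), written as
    # mu + sigma * eps so that gradients flow back to mu and log_var (phi).
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

# z = sample_z(mu, log_var); x_hat = decoder(z)
# loss = vae_loss(x_hat, x, mu, log_var)     # the sketch above, with L = 1
```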
Matrix Maths and Training
A full implementation of the different autoencoders is available on my Kaggle account; the link is here.
- Chapter 14: Autoencoders {Deep Learning; Ian Goodfellow}.