R語(yǔ)言實(shí)現(xiàn):混合正態(tài)分布EM最大期望估計(jì)法
原文鏈接:http://tecdat.cn/?p=4815
?
因?yàn)榻谠诜治鰯?shù)據(jù)時(shí)用到了EM最大期望估計(jì)法這個(gè)算法,在參數(shù)估計(jì)中也用到的比較多。然而,發(fā)現(xiàn)國(guó)內(nèi)在R軟件上實(shí)現(xiàn)高斯混合分布的EM的實(shí)例并不多,大多數(shù)是關(guān)于1到2個(gè)高斯混合分布的實(shí)現(xiàn),不易于推廣,因此這里分享一下自己編寫(xiě)的k個(gè)高斯混合分布的EM算法實(shí)現(xiàn)請(qǐng)大神們多多指教。并結(jié)合EMCluster包對(duì)結(jié)果進(jìn)行驗(yàn)算。
??????本文使用的密度函數(shù)為下面格式:

???對(duì)應(yīng)的函數(shù)原型為 em.norm(x,means,covariances,mix.prop)
x為原數(shù)據(jù),means為初始均值,covariances為數(shù)據(jù)的協(xié)方差矩陣,mix.prop為混合參數(shù)初始值。
使用的數(shù)據(jù)為MASS包里面的synth.te數(shù)據(jù)的前兩列
x <- synth.te[,-3]
首先安裝需要的包,并讀取原數(shù)據(jù)。
install.packages("MASS")
library(MASS)
install.packages("EMCluster")
library(EMCluster)
install.packages("ggplot2")
library(ggplot2)
Y=synth.te[,c(1:2)]
qplot(x=xs, y=ys, data=Y)?
然后繪制相應(yīng)的變量相關(guān)圖:

從圖上我們可以大概估計(jì)出初始的平均點(diǎn)為(-0.7,0.4) (-0.3,0.8)(0.5,0.6)
當(dāng)然 為了試驗(yàn)的嚴(yán)謹(jǐn)性,我可以從兩個(gè)初始均值點(diǎn)的情況開(kāi)始估計(jì)
首先輸入初始參數(shù):
mustart = rbind(c(-0.5,0.3),c(0.4,0.5)) ???
covstart = list(cov(Y), cov(Y))
probs = c(.01, .99)
?
然后編寫(xiě)em.norm函數(shù),注意其中的clusters值需要根據(jù)不同的初始參數(shù)進(jìn)行修改,
em.norm = function(X,mustart,covstart,probs){
??params = list(mu=mustart, var=covstart, probs=probs) ??
??clusters = 2?
??tol=.00001
??maxits=100
??showits=T
??require(mvtnorm)
??N = nrow(X)
??mu = params$mu
??var = params$var
??probs = params$probs
??ri = matrix(0, ncol=clusters, nrow=N) ????????
??ll = 0 ???????????????????????????????????????
??it = 0 ????????????????????????????????????????
??converged = FALSE ???????????????????????????
??if (showits) ????????????????????????????????
????cat(paste("Iterations of EM:", "\n"))
??while (!converged & it < maxits) {?
????probsOld = probs
????
????llOld = ll
????riOld = ri
????
???
????# Compute responsibilities
????for (k in 1:clusters){
??????ri[,k] = probs[k] * dmvnorm(X, mu[k,], sigma = var[[k]], log=F)
????}
????ri = ri/rowSums(ri)
????
??
????rk = colSums(ri) ????????????????????????????
????probs = rk/N
????for (k in 1:clusters){
??????varmat = matrix(0, ncol=ncol(X), nrow=ncol(X)) ????????
??????for (i in 1:N){
????????varmat = varmat + ri[i,k] * X[i,]%*%t(X[i,])
??????}
??????mu[k,] = (t(X) %*% ri[,k]) / rk[k]
??????var[[k]] = ?varmat/rk[k] - mu[k,]%*%t(mu[k,])
??????ll[k] = -.5 * sum( ri[,k] * dmvnorm(X, mu[k,], sigma = var[[k]], log=T) )
????}
????ll = sum(ll)
????
?????
????parmlistold = ?c(llOld, probsOld) ???????????
????parmlistcurrent = c(ll, probs) ????????????
????it = it + 1
????if (showits & it == 1 | it%%5 == 0) ????????
??????cat(paste(format(it), "...", "\n", sep = ""))
????converged = min(abs(parmlistold - parmlistcurrent)) <= tol
??}
??
??clust = which(round(ri)==1, arr.ind=T) ??????
??clust = clust[order(clust[,1]), 2] ??????????
??out = list(probs=probs, mu=mu, var=var, resp=ri, cluster=clust, ll=ll)
}?
結(jié)果,可以用圖像化來(lái)表示:
qplot(x=xs, y=ys, data=Y)?
ggplot(aes(x=xs, y=ys), data=Y) +
???geom_point(aes(color=factor(test$cluster)))?


?
?
?類(lèi)似的其他情況這里不呈現(xiàn)了,另外r語(yǔ)言提供了EMCluster包可以比較方便的實(shí)現(xiàn)EM進(jìn)行參數(shù)估計(jì)和結(jié)果的誤差分析。
ret <- init.EM(Y, nclass = 2)
em.aic(x=Y,emobj=list(pi = ret$pi, Mu = ret$Mu, LTSigma = ret$LTSigma))#計(jì)算結(jié)果的AIC
通過(guò)比較不同情況的AIC,我們可以篩選出適合的聚類(lèi)數(shù)參數(shù)值。(歡迎轉(zhuǎn)載,請(qǐng)注明出處。 )
?
?