

Consensus clustering


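Consensus clustering reruns a clustering algorithm on many resampled subsets of the data, records how frequently each pair of samples ends up in the same cluster, and uses those frequencies to judge how robust the clusters are.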
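To make the idea concrete, here is a minimal sketch in base R that builds a consensus matrix by repeated subsampling. The toy data, the variable names (`together`, `sampled`, `consensus`), and the choice of Euclidean distance with complete-linkage hierarchical clustering are all assumptions made for illustration; this is not how ConsensusClusterPlus is implemented internally.

```r
# Minimal sketch: consensus matrix from repeated subsampling (base R only).
set.seed(1)

# Hypothetical toy data: 20 samples in rows, 5 features, two well-separated groups.
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 5),
           matrix(rnorm(50, mean = 6), ncol = 5))
n <- nrow(x); k <- 2; reps <- 50; p_item <- 0.8

together <- matrix(0, n, n)  # times a pair fell into the same cluster
sampled  <- matrix(0, n, n)  # times a pair was drawn in the same subsample

for (r in seq_len(reps)) {
  idx <- sort(sample(n, size = floor(p_item * n)))          # subsample 80% of the samples
  cl  <- cutree(hclust(dist(x[idx, ]), method = "complete"), k = k)
  sampled[idx, idx]  <- sampled[idx, idx] + 1
  together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
}

# Co-clustering frequency in [0, 1]; pairs never sampled together stay at 0.
consensus <- together / pmax(sampled, 1)
```

Values near 1 or 0 mean a pair is consistently placed together or apart, while values in between point to unstable assignments.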

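To choose the number of clusters k, the CDFs of the consensus values are compared across k, and k is increased only as long as the CDF still changes meaningfully.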
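One common way to quantify "how much the CDF changes" is the area under the CDF and its relative change between consecutive k. The sketch below assumes that reading; the helper names `cdf_area` and `delta_area` and the two 4 x 4 matrices are hypothetical, and indexing conventions for the relative change differ between implementations.

```r
# Area under the empirical CDF of the pairwise consensus values on [0, 1].
cdf_area <- function(M, grid = seq(0, 1, by = 0.01)) {
  Fn <- ecdf(M[lower.tri(M)])
  sum(diff(grid) * Fn(grid[-1]))  # right Riemann sum
}

# Relative change in area between consecutive k; the first entry is the area itself.
delta_area <- function(areas) {
  c(areas[1], diff(areas) / head(areas, -1))
}

# Hypothetical consensus matrices for k = 2 and k = 3 (toy values only).
M2 <- matrix(c(1, 1, 0, 0,
               1, 1, 0, 0,
               0, 0, 1, 1,
               0, 0, 1, 1), nrow = 4, byrow = TRUE)
M3 <- matrix(c(1.0, 0.6, 0.0, 0.0,
               0.6, 1.0, 0.0, 0.0,
               0.0, 0.0, 1.0, 0.4,
               0.0, 0.0, 0.4, 1.0), nrow = 4, byrow = TRUE)

areas  <- sapply(list("K=2" = M2, "K=3" = M3), cdf_area)
deltas <- delta_area(areas)
```

With real output, the same helpers could be applied to `results[[k]]$consensusMatrix` for each k, and one would look for the k after which the relative change becomes small.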


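A 2014 paper in Scientific Reports showed that the CDF-based selection described above is not guaranteed to find the optimal k.

The authors therefore recommend choosing k with the PAC (proportion of ambiguously clustered pairs) instead.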

Critical limitations of consensus clustering in class discovery, Scientific Reports (2014)

https://www.nature.com/articles/srep06207
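As a small worked example of the PAC idea: it is the fraction of sample pairs whose consensus value falls in an intermediate interval, computed as CDF(x2) - CDF(x1). The sketch below uses a hypothetical helper `pac` and two toy 4 x 4 matrices; the thresholds 0.1 and 0.9 match the ones used in the R code further down.

```r
# PAC: share of sample pairs with consensus values in the ambiguous interval (x1, x2].
pac <- function(M, x1 = 0.1, x2 = 0.9) {
  Fn <- ecdf(M[lower.tri(M)])
  Fn(x2) - Fn(x1)
}

# Toy matrix 1: two perfectly stable clusters, {1, 2} and {3, 4}.
stable <- matrix(c(1, 1, 0, 0,
                   1, 1, 0, 0,
                   0, 0, 1, 1,
                   0, 0, 1, 1), nrow = 4, byrow = TRUE)

# Toy matrix 2: sample 3 joins every other sample only half of the time.
unstable <- stable
unstable[3, c(1, 2, 4)] <- 0.5
unstable[c(1, 2, 4), 3] <- 0.5

pac(stable)    # 0: every pair is either always together or always apart
pac(unstable)  # 0.5: 3 of the 6 pairs have an intermediate value
```

A lower PAC means fewer ambiguous pairs, which is why the k with the smallest PAC is taken as the best candidate.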


R code for computing PAC

https://www.biostars.org/p/198789/


######################################################## 
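library(ConsensusClusterPlus) # Bioconductor package
seed = 11111
set.seed(seed) # seed the simulated data too; the seed argument below is only applied inside ConsensusClusterPlus
d = matrix(rnorm(200000,0,1),ncol=200) # 200 samples in columns, 1000 genes in rows
colnames(d) = paste("Samp",1:200,sep="")
rownames(d) = paste("Gene",1:1000,sep="")
d = sweep(d,1, apply(d,1,median,na.rm=TRUE)) # median-center each gene (row)
maxK = 6 # maximum number of clusters to try
results = ConsensusClusterPlus(d,maxK=maxK,reps=50,pItem=0.8,pFeature=1,title="test_run",
innerLinkage="complete",seed=seed,plot="pdf")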

# Note that we implement consensus clustering with innerLinkage="complete". 
# We advise against using innerLinkage="average" which is the default value in this package as average linkage is not robust to outliers.

############## PAC implementation ##############
Kvec = 2:maxK
x1 = 0.1; x2 = 0.9 # threshold defining the intermediate sub-interval
PAC = rep(NA,length(Kvec)) 
names(PAC) = paste("K=",Kvec,sep="") # from 2 to maxK
for(i in Kvec){
  M = results[[i]]$consensusMatrix
  Fn = ecdf(M[lower.tri(M)])
  PAC[i-1] = Fn(x2) - Fn(x1)
}#end for i
# The optimal K
optK = Kvec[which.min(PAC)]
########################################################

