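This method (consensus clustering) measures how frequently each pair of samples is grouped together across repeated clustering runs, and then uses those frequencies to assess how robust the clusters are.
The number of clusters is increased only as long as the CDF of the consensus values still shows a meaningful change.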
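As a rough illustration of the idea (not the ConsensusClusterPlus implementation), the base-R sketch below builds a consensus matrix by repeatedly subsampling 80% of the samples, clustering each subsample with complete-linkage hierarchical clustering, and recording how often each pair of samples ends up in the same cluster. The toy data, the variable names (`together`, `sampled`, `consensus`), and the choice of k = 2 with 50 repetitions are all assumptions made for this example.

```r
# Minimal consensus-matrix sketch in base R (illustrative only).
set.seed(1)

# Toy data (assumed): 20 samples in rows, 5 features, two shifted groups
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 5),
           matrix(rnorm(50, mean = 5), ncol = 5))
n <- nrow(x)
k <- 2          # number of clusters to evaluate
reps <- 50      # number of subsampling runs
p_item <- 0.8   # fraction of samples drawn in each run

together <- matrix(0, n, n)  # times a pair fell into the same cluster
sampled  <- matrix(0, n, n)  # times a pair was drawn in the same subsample

for (r in seq_len(reps)) {
  idx <- sample(n, size = floor(p_item * n))
  cl  <- cutree(hclust(dist(x[idx, ]), method = "complete"), k = k)
  same <- outer(cl, cl, "==")            # TRUE where two samples share a cluster
  sampled[idx, idx]  <- sampled[idx, idx] + 1
  together[idx, idx] <- together[idx, idx] + same
}

# Co-clustering frequency in [0, 1]; pmax() avoids 0/0 for pairs never drawn together
consensus <- together / pmax(sampled, 1)
```

Values of `consensus` near 0 or 1 mean a pair is consistently separated or consistently grouped; values in between indicate unstable assignments.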
However, a 2014 paper in Scientific Reports showed that this approach may not guarantee the optimal k,
and recommended selecting k with PAC (proportion of ambiguously clustered pairs) instead.
Critical limitations of consensus clustering in class discovery, Scientific Reports
https://www.nature.com/articles/srep06207
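For reference, PAC can be written in terms of the empirical CDF of the pairwise consensus values. The notation here is chosen to match the R code in this post: x_1 and x_2 are the thresholds of the intermediate sub-interval (0.1 and 0.9 in that code), M^{(K)} is the consensus matrix for K clusters, and N (an assumed symbol) is the number of samples.

```latex
\mathrm{CDF}_K(c) = \frac{\#\{(i,j) : i < j,\; M^{(K)}_{ij} \le c\}}{N(N-1)/2},
\qquad
\mathrm{PAC}_K = \mathrm{CDF}_K(x_2) - \mathrm{CDF}_K(x_1),
\qquad
\hat{K} = \arg\min_{K} \mathrm{PAC}_K
```

A low PAC means few sample pairs have intermediate consensus values, that is, most pairs are either almost always or almost never clustered together.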
R code for computing PAC
https://www.biostars.org/p/198789/
########################################################
library(ConsensusClusterPlus) # Bioconductor package that provides ConsensusClusterPlus()
seed = 11111
set.seed(seed) # make the simulated data below reproducible
d = matrix(rnorm(200000,0,1),ncol=200) # 200 samples in columns, 1000 genes in rows
colnames(d) = paste("Samp",1:200,sep="")
rownames(d) = paste("Gene",1:1000,sep="")
d = sweep(d,1, apply(d,1,median,na.rm=TRUE)) # median-center each gene (row)
maxK = 6 # maximum number of clusters to try
results = ConsensusClusterPlus(d,maxK=maxK,reps=50,pItem=0.8,pFeature=1,title="test_run",
                               innerLinkage="complete",seed=seed,plot="pdf")
# Note that we implement consensus clustering with innerLinkage="complete".
# We advise against using innerLinkage="average" which is the default value in this package as average linkage is not robust to outliers.
############## PAC implementation ##############
Kvec = 2:maxK
x1 = 0.1; x2 = 0.9 # threshold defining the intermediate sub-interval
PAC = rep(NA,length(Kvec))
names(PAC) = paste("K=",Kvec,sep="") # from 2 to maxK
for(i in Kvec){
  M = results[[i]]$consensusMatrix
  Fn = ecdf(M[lower.tri(M)]) # empirical CDF of the pairwise consensus values
  PAC[i-1] = Fn(x2) - Fn(x1) # fraction of pairs with consensus in (x1, x2]
}#end for i
# The optimal K
optK = Kvec[which.min(PAC)]
########################################################