본문 바로가기

카테고리 없음

EGFR variant III detector.py

728x90
반응형

python3로 EGFRvariant III를 잡기위한 스크립트임

EGFR variant 3는 exon2부터 exon7까지 deletion된 형태의 EGFR로 glioblastoma에서 많이 나타남

해당 mutation은 WGS나 RNAseq으로만 잡을 수 있음.

아래 스크립트는 RNAseq에서 exon1번과 exon8번에 겹쳐있는 RNAread를 bam file에서 잡아내는 것임.

exon1과 8의 sequence는 GRCH37 reference genome based에서 가져왔으며

아래 스크립트는 bam file이 hg19에 alignment 된 것을 가져옴.

 

주의할 점은 굉장히 기초 수준의 스크립트이므로 exon1이나 exon8에 single nucleotide variant가 있을 경우 (특히 exon1의 3' end, exon8의 5' end) 못잡아 낼 가능성이 있음.

최소 5개의 sequence가 각각의 양 끝에 겹칠 경우 report해줌.

EGFRvIII_detector.py
0.00MB

# sudo apt-get install python3-pip
# python3 -m pip install pysam
# python3 -m pip install progressbar
# Import libraries
import pysam
import os
import progressbar
import numpy
# Set parameters sequence model from GRCH37.75
result_file='EGFRvIII_calling_results2.txt'
exon1_end_nucleotide_sequence='GCCGGAGTCCCGAGCTAGCCCCGGCGGCCGCCGCCGCCCAGACCGGACGACAGGCCACCTCGTCGGCGTCCGCCCGAGTCCCCGCCTCGCCGCCAACGCCACAACCACCGCGCACGGCCCCCTGACTCCGTCCAGTATTGATCGGGAGAGCCGGAGCGAGCTCTTCGGGGAGCAGCGATGCGACCCTCCGGGACGGCCGGGGCAGCGCTCCTGGCGCTGCTGGCTGCGCTCTGCCCGGCGAGTCGGGCTCTGGAGGAAAAGAAAG' # samtools faidx Homo_sapiens.GRCh37.75.dna.primary_assembly.fa 7:55086794-55087058
exon8_start_nucleotide_sequence='GTAATTATGTGGTGACAGATCACGGCTCGTGCGTCCGAGCCTGTGGGGCCGACAGCTATGAGATGGAGGAAGACGGCGTCCGCAAGTGTAAGAAGTGCGAAGGGCCTTGCCGCAAAG' #samtools faidx Homo_sapiens.GRCh37.75.dna.primary_assembly.fa 7:55223523-55223639

bam_files=os.listdir()
bam_files=list(filter(lambda x:'bam' in x, bam_files))
bai_files=list(filter(lambda x:'bai' in x, bam_files))
bam_files=list(set(bam_files)-set(bai_files))
progress_bar=progressbar.ProgressBar(maxval=len(bam_files)).start()

# Functions
def totuple(a):
    try:
        return tuple(totuple(i) for i in a)
    except TypeError:
        return a


# Conduct analysis
header=['BAM_files','EGFRvIII_supporting_sequence','Number_of_nucleotide_hanging_on_Exon8']
results=[header]
for idx, bam_file in enumerate(bam_files):
print(bam_files[idx])
progress_bar.update(idx)
loaded_bam=pysam.AlignmentFile(bam_files[idx],'rb')
exon1_start=[]
# Get reads starting from exon1 of EGFR
for read in loaded_bam.fetch('chr7',55086794,55087058):
exon1_start.append(str(read))


# Take nucleotide sequences from the exon1_start reads
exon1_start2=[]
for read in exon1_start:
tab_delim=read.split('\t')
exon1_start2.append(tab_delim[9])


# Get overlaps by comparing exon8_start_nucleotide_sequence and exon1_start2
overlap_exon8=[]
for read in exon1_start2:
checker=[]
for k in reversed(range(50,len(read)-3)):
read2=read[(k-1):len(read)]
locates=exon8_start_nucleotide_sequence.find(read2)
if locates!=-1 and locates==0:
checker.append(k)
if len(checker)==0:
overlap_exon8.append(0)
else:
overlap_exon8.append(len(read)-min(checker)+1)


# Get sequences with overhang exon8 region
x=numpy.array(overlap_exon8)
indexes=list(numpy.concatenate(numpy.where(x!=0)).tolist())
egfrvIII_reads=[]
overhang_num=[]
if indexes!=[0] and indexes!=[]:
for j in indexes:
egfrvIII_reads.append(exon1_start2[j])
overhang_num.append(overlap_exon8[j])
else:
egfrvIII_reads=['Nothing']
overhang_num=['Nothing']

egfrvIII_reads = ','.join(egfrvIII_reads)
overhang_num = ','.join(map(str,overhang_num))
# Export results with supporting reads
res=[bam_files[idx],egfrvIII_reads,overhang_num]

#res='\t'.join(res)
results.append(res)
loaded_bam.close()


# Write table
final=results

import csv
with open(result_file,'w') as csvfile:
writer = csv.writer(csvfile,delimiter='\t')
[writer.writerow(r) for r in final]

EGFRvIII_detector.py
0.00MB / 0.00MB

 

728x90
반응형