Python 3

A script that finds where a given subsequence occurs within a full sequence.

def find_aligning_region(fullseq,subseq,extending=False):
    '''
    Locate every occurrence of a sub-sequence (subseq) within a full sequence (fullseq)
    and return the start and end position of each occurrence.
    Overlapping occurrences are reported separately unless extending=True.
    # Input
    fullseq : Full sequence (string).
    subseq : Sub-sequence (string). It must not be longer than fullseq.
    extending (default : False) : subseq may be the unit of a repeat region in fullseq.
    In that case, extending=True merges hits whose start positions are consecutive into one region spanning the whole run.
    # Return
    list : [[start,end],[start2,end2],...,[startN,endN]], 0-indexed. Each 'end' is inclusive, i.e. 'start'+len(subseq)-1 (or the end of the merged run when extending=True).
    If subseq is not found in fullseq, [['N/A','N/A']] is returned.
    '''
    # Find the start position of every (possibly overlapping) occurrence
    if subseq in fullseq:
        full_length,sub_length=len(fullseq),len(subseq)
        idx=[[start,start+sub_length-1] for start in range(full_length-sub_length+1) if fullseq.startswith(subseq,start)]
    else:
        # No occurrence at all: return the placeholder pair
        return [['N/A','N/A']]
    # Merge runs of hits whose start positions are consecutive (overlapping hits shifted by one base)
    if extending:
        extended_idx=[list(idx[0])]
        prev_start=idx[0][0]
        for start,end in idx[1:]:
            if start-prev_start==1:
                # Same run: stretch the current region to this hit's end
                extended_idx[-1][1]=end
            else:
                # Gap: start a new region
                extended_idx.append([start,end])
            prev_start=start
        return extended_idx
    else:
        return idx
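
The same overlapping search can also be sketched with the standard `re` module. This is an alternative under stated assumptions, not the function's own logic: the name `find_overlapping_hits` is hypothetical, and `re.escape` is applied on the assumption that `subseq` should be matched literally rather than as a pattern.

```python
import re

def find_overlapping_hits(fullseq, subseq):
    """Return [start, end] (0-based, end inclusive) for every occurrence of
    subseq in fullseq, including occurrences that overlap each other."""
    # A lookahead (?=...) matches without consuming characters, so the scan
    # moves forward one position at a time and overlapping hits are kept.
    pattern = re.compile('(?=' + re.escape(subseq) + ')')
    return [[m.start(), m.start() + len(subseq) - 1]
            for m in pattern.finditer(fullseq)]

print(find_overlapping_hits('AAATTGGAAAAAGAAA', 'AAA'))
# [[0, 2], [7, 9], [8, 10], [9, 11], [13, 15]]
```

Note that this sketch returns an empty list when nothing matches, rather than the `[['N/A','N/A']]` placeholder.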

## extending=False option
find_aligning_region(fullseq='AAATTGGAAAAAGAAA',subseq='AAA',extending=False)
#[[0, 2], [7, 9], [8, 10], [9, 11], [13, 15]]
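
Because the returned 'end' is inclusive, slicing a region back out of the sequence needs `end + 1`. A small sketch of that check follows; `slice_regions` is a hypothetical helper, not part of the script above.

```python
def slice_regions(fullseq, regions):
    """Return the substring covered by each [start, end] pair (end inclusive)."""
    # Python slices exclude the stop index, hence end + 1.
    return [fullseq[start:end + 1] for start, end in regions]

fullseq = 'AAATTGGAAAAAGAAA'
regions = [[0, 2], [7, 9], [8, 10], [9, 11], [13, 15]]
print(slice_regions(fullseq, regions))
# ['AAA', 'AAA', 'AAA', 'AAA', 'AAA']
```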

## extending=True option
find_aligning_region(fullseq='AAATTGGAAAAAGAAA',subseq='AAA',extending=True)
#[[0, 2], [7, 11], [13, 15]]
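
With `extending=True`, hits are merged when their start positions differ by exactly one, which suits single-base repeats such as `AAA`. For longer repeat units, a generic interval merge may be a closer fit. The sketch below is an alternative under that assumption, not the function's own logic; `merge_regions` is a hypothetical helper that joins hits which overlap or directly touch.

```python
def merge_regions(regions):
    """Merge [start, end] pairs (end inclusive) that overlap or are adjacent."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or directly follows the previous region: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

# 'AT' occurs at 2-3, 4-5 and 6-7 in 'GGATATATCC'
print(merge_regions([[2, 3], [4, 5], [6, 7]]))
# [[2, 7]]
print(merge_regions([[0, 2], [7, 9], [8, 10], [9, 11], [13, 15]]))
# [[0, 2], [7, 11], [13, 15]]
```

Unlike the start-differs-by-one rule, this also joins back-to-back hits that do not overlap, so it is only appropriate when tandem copies should count as one region.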

## subseq is not included in fullseq
find_aligning_region(fullseq='AAATTGGAAAAAGAAA',subseq='C',extending=True)
#[['N/A', 'N/A']]