giddy.sequence.Sequence
- class giddy.sequence.Sequence(y, subs_mat=None, dist_type=None, indel=None, cluster_type=None)
Pairwise sequence analysis.
Dynamic programming is used when distances are computed by optimal matching.
- Parameters:
- y : array
One row per sequence of neighborhood types for each spatial unit. Sequences may be of varying lengths.
- subs_mat : array
(k, k), substitution cost matrix. Should be hollow (0 cost between the same type), symmetric and non-negative.
- dist_type : string
Distance type. One of:
  - “hamming”: Hamming distance (substitution only, at a constant cost of 1), from sklearn.metrics;
  - “markov”: empirical transition probabilities are used to define substitution costs;
  - “interval”: differences between states are used to define substitution costs, and indel=k-1;
  - “arbitrary”: arbitrary distance for cases without strong theoretical guidance: substitution=0.5, indel=1;
  - “tran”: transition-oriented optimal matching on the sequence of transitions. Based on [Bie11].
- indel : float
Insertion/deletion cost.
- cluster_type : string
Cluster algorithm (specification) used to generate neighborhood types, such as “ward”, “kmeans”, etc.
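The three requirements on `subs_mat` (hollow, symmetric, non-negative) can be checked before a matrix is passed in. The following is a minimal sketch in plain Python; `check_subs_mat` is a hypothetical helper written for illustration and is not part of giddy.

```python
def check_subs_mat(subs_mat, tol=1e-12):
    """Return a list of problems found in a (k, k) substitution cost matrix.

    Hypothetical helper, not part of giddy. An empty list means the matrix
    is square, hollow, symmetric and non-negative.
    """
    k = len(subs_mat)
    if any(len(row) != k for row in subs_mat):
        return ["matrix is not square"]
    problems = []
    for i in range(k):
        # Hollow: zero cost between a type and itself.
        if abs(subs_mat[i][i]) > tol:
            problems.append(f"diagonal entry ({i}, {i}) is not 0")
        for j in range(k):
            # Non-negative: no cost may be below zero.
            if subs_mat[i][j] < 0:
                problems.append(f"entry ({i}, {j}) is negative")
            # Symmetric: compare each pair above the diagonal once.
            if j > i and abs(subs_mat[i][j] - subs_mat[j][i]) > tol:
                problems.append(f"entries ({i}, {j}) and ({j}, {i}) differ")
    return problems


unit_costs = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
print(check_subs_mat(unit_costs))  # []
```

Returning a list of messages rather than raising on the first failure makes it easier to see every violated requirement at once.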
Examples
>>> import numpy as np
>>> from giddy.sequence import Sequence
1. Testing on unequal string sequences
1.1 Substitution cost matrix and indel cost are not given, and will be generated based on the distance type “interval”
>>> seq1 = 'ACGGTAG'
>>> seq2 = 'CCTAAG'
>>> seq3 = 'CCTAAGC'
>>> seqAna = Sequence([seq1,seq2,seq3],dist_type="interval")
>>> seqAna.k
4
>>> seqAna.classes
array(['A', 'C', 'G', 'T'], dtype='<U1')
>>> seqAna.subs_mat
array([[0., 1., 2., 3.],
       [1., 0., 1., 2.],
       [2., 1., 0., 1.],
       [3., 2., 1., 0.]])
>>> seqAna.seq_dis_mat
array([[ 0.,  7., 10.],
       [ 7.,  0.,  3.],
       [10.,  3.,  0.]])
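To make the “interval” costs and the dynamic programming step concrete, here is a sketch of a textbook optimal-matching recursion in plain Python. It is not giddy's source code; `interval_costs` and `om_distance` are hypothetical names, and the cost rule (substitution cost equal to the difference between sorted class positions, indel = k - 1) is read off the `subs_mat` output and the `dist_type` description above. Under those assumptions it reproduces the distance matrix of example 1.1.

```python
def interval_costs(classes):
    """Substitution cost |i - j| between the i-th and j-th class; indel = k - 1."""
    k = len(classes)
    subs = [[float(abs(i - j)) for j in range(k)] for i in range(k)]
    return subs, float(k - 1)


def om_distance(a, b, subs, indel, index):
    """Optimal-matching distance between sequences a and b by dynamic programming."""
    n, m = len(a), len(b)
    # d[i][j] holds the cheapest cost of turning a[:i] into b[:j].
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel  # delete the first i elements of a
    for j in range(1, m + 1):
        d[0][j] = j * indel  # insert the first j elements of b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,  # delete a[i-1]
                d[i][j - 1] + indel,  # insert b[j-1]
                # substitute a[i-1] with b[j-1] (cost 0 if they match)
                d[i - 1][j - 1] + subs[index[a[i - 1]]][index[b[j - 1]]],
            )
    return d[n][m]


seqs = ["ACGGTAG", "CCTAAG", "CCTAAGC"]
classes = sorted(set("".join(seqs)))  # ['A', 'C', 'G', 'T']
index = {c: i for i, c in enumerate(classes)}
subs, indel = interval_costs(classes)
dist = [[om_distance(a, b, subs, indel, index) for b in seqs] for a in seqs]
for row in dist:
    print(row)
# [0.0, 7.0, 10.0]
# [7.0, 0.0, 3.0]
# [10.0, 3.0, 0.0]
```

The entry 3 between the second and third sequences is a single insertion of 'C' at cost indel = 3, which shows how sequences of unequal length are handled.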
1.2 User-defined substitution cost matrix and indel cost
>>> subs_mat = np.array(
...     [
...         [0, 0.76, 0.29, 0.05],
...         [0.30, 0, 0.40, 0.60],
...         [0.16, 0.61, 0, 0.26],
...         [0.38, 0.20, 0.12, 0]
...     ]
... )
>>> indel = subs_mat.max()
>>> seqAna = Sequence([seq1,seq2,seq3], subs_mat=subs_mat, indel=indel)
>>> seqAna.seq_dis_mat
array([[0.  , 1.94, 2.46],
       [1.94, 0.  , 0.76],
       [2.46, 0.76, 0.  ]])
1.3 Calculating “hamming” distance will fail on unequal sequences
>>> seqAna = Sequence([seq1,seq2,seq3], dist_type="hamming")
Traceback (most recent call last):
ValueError: hamming distance cannot be calculated for sequences of unequal lengths!
2. Testing on equal string sequences
>>> seq1 = 'ACGGTAG'
>>> seq2 = 'CCTAAGA'
>>> seq3 = 'CCTAAGC'
2.1 Calculating “hamming” distance
>>> seqAna = Sequence([seq1,seq2,seq3], dist_type="hamming")
>>> seqAna.seq_dis_mat
array([[0., 6., 6.],
       [6., 0., 1.],
       [6., 1., 0.]])
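The Hamming distance in example 2.1 is simply the number of positions at which two equal-length sequences differ. The class description says giddy takes this distance from sklearn.metrics; the plain-Python function below is only an illustrative sketch with a hypothetical name, and it reproduces the same matrix.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError(
            "hamming distance cannot be calculated for sequences of unequal lengths!"
        )
    return sum(x != y for x, y in zip(a, b))


seqs = ["ACGGTAG", "CCTAAGA", "CCTAAGC"]
dist = [[hamming(a, b) for b in seqs] for a in seqs]
for row in dist:
    print(row)
# [0, 6, 6]
# [6, 0, 1]
# [6, 1, 0]
```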
2.2 User-defined substitution cost matrix and indel cost (distance between different types is always 1 and indel cost is 2) - gives the same sequence distance matrix as the “hamming” distance
>>> subs_mat = np.array(
...     [
...         [0., 1., 1., 1.],
...         [1., 0., 1., 1.],
...         [1., 1., 0., 1.],
...         [1., 1., 1., 0.]
...     ]
... )
>>> indel = 2
>>> seqAna = Sequence([seq1,seq2,seq3], subs_mat=subs_mat, indel=indel)
>>> seqAna.seq_dis_mat
array([[0., 6., 6.],
       [6., 0., 1.],
       [6., 1., 0.]])
2.3 User-defined substitution cost matrix and indel cost (distance between different types is always 1 and indel cost is 1) - gives a slightly different sequence distance matrix from the “hamming” distance because insertions and deletions are now used
>>> subs_mat = np.array(
...     [
...         [0., 1., 1., 1.],
...         [1., 0., 1., 1.],
...         [1., 1., 0., 1.],
...         [1., 1., 1., 0.]
...     ]
... )
>>> indel = 1
>>> seqAna = Sequence([seq1,seq2,seq3], subs_mat=subs_mat, indel=indel)
>>> seqAna.seq_dis_mat
array([[0., 5., 5.],
       [5., 0., 1.],
       [5., 1., 0.]])
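The difference between examples 2.2 and 2.3 can be seen by scoring two explicit alignments of the first two sequences: one without gaps and one with a deletion and an insertion. The sketch below is illustrative only; `alignment_cost` and the `-` gap symbol are hypothetical conventions, and it scores the given alignments rather than searching for the best one.

```python
def alignment_cost(top, bottom, indel, sub=1.0, gap="-"):
    """Cost of one explicit alignment given as two equal-length gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    cost = 0.0
    for x, y in zip(top, bottom):
        if x == gap or y == gap:
            cost += indel  # insertion or deletion
        elif x != y:
            cost += sub    # substitution between different types
    return cost


ungapped = ("ACGGTAG", "CCTAAGA")   # substitutions only
gapped = ("ACGGTAG-", "CCTA-AGA")   # delete 'T', insert a trailing 'A'

for indel in (2, 1):
    print(indel, alignment_cost(*ungapped, indel=indel), alignment_cost(*gapped, indel=indel))
# 2 6.0 7.0
# 1 6.0 5.0
```

With indel=2 the gapped alignment (3 substitutions plus 2 indels) costs 7, so the substitution-only alignment at 6 is cheaper and the result coincides with the Hamming distance. With indel=1 the gapped alignment costs 5, which matches the value reported in example 2.3.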
Not passing proper parameters will raise an error
>>> seqAna = Sequence([seq1,seq2,seq3])
Traceback (most recent call last):
ValueError: Please specify a proper `dist_type` or `subs_mat` and `indel` to proceed!
>>> seqAna = Sequence([seq1,seq2,seq3], subs_mat=subs_mat)
Traceback (most recent call last):
ValueError: Please specify a proper `dist_type` or `subs_mat` and `indel` to proceed!
>>> seqAna = Sequence([seq1,seq2,seq3], indel=indel)
Traceback (most recent call last):
ValueError: Please specify a proper `dist_type` or `subs_mat` and `indel` to proceed!
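The three failing calls above suggest a simple rule: costs must come either from a recognized `dist_type` or from `subs_mat` and `indel` supplied together. The sketch below encodes that rule as a standalone function. It is an assumption about the behavior, not giddy's implementation: `resolve_cost_source` is a hypothetical name, the set of accepted strings is taken from the parameter description, and giving `dist_type` precedence when everything is supplied is a guess.

```python
# Distance types listed in the parameter description above.
KNOWN_DIST_TYPES = {"hamming", "markov", "interval", "arbitrary", "tran"}


def resolve_cost_source(dist_type=None, subs_mat=None, indel=None):
    """Hypothetical helper: decide where substitution and indel costs come from."""
    if dist_type in KNOWN_DIST_TYPES:
        # Assumption: a recognized dist_type takes precedence.
        return "dist_type"
    if dist_type is None and subs_mat is not None and indel is not None:
        # Both user-defined pieces are present.
        return "user-defined"
    raise ValueError(
        "Please specify a proper `dist_type` or `subs_mat` and `indel` to proceed!"
    )


print(resolve_cost_source(dist_type="interval"))                # dist_type
print(resolve_cost_source(subs_mat=[[0, 1], [1, 0]], indel=1))  # user-defined
```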
- Attributes:
- seq_dis_mat : array
(n, n), distance/dissimilarity matrix for each pair of sequences
- classes : array
(k, ), unique classes
- k : int
number of unique classes
- label_dict : dict
dictionary - {input label: int value between 0 and k-1 (k is the number of unique classes for the pooled data)}
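The relationship between `classes`, `k` and `label_dict` can be illustrated by pooling the labels of all sequences and numbering the unique values. This is a plain-Python sketch, not giddy's code; assigning integers in sorted order is an assumption that is consistent with the sorted `classes` array shown in example 1.1.

```python
seqs = ["ACGGTAG", "CCTAAG", "CCTAAGC"]

# Pool the labels of all sequences, then keep the sorted unique values.
pooled = [label for seq in seqs for label in seq]
classes = sorted(set(pooled))
k = len(classes)

# Map each input label to an integer between 0 and k-1 (sorted order assumed).
label_dict = {label: i for i, label in enumerate(classes)}

# Integer-encoded sequences, which may still have different lengths.
encoded = [[label_dict[label] for label in seq] for seq in seqs]

print(classes)     # ['A', 'C', 'G', 'T']
print(k)           # 4
print(label_dict)  # {'A': 0, 'C': 1, 'G': 2, 'T': 3}
print(encoded[0])  # [0, 1, 2, 2, 3, 0, 2]
```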
Methods
__init__(y[, subs_mat, dist_type, indel, ...])