giddy.sequence.Sequence

class giddy.sequence.Sequence(y, subs_mat=None, dist_type=None, indel=None, cluster_type=None)

Pairwise sequence analysis.

Dynamic programming is used when the distance is computed by optimal matching.

Parameters:

y (array): one row per sequence of neighborhood types for each spatial unit. Sequences may have different lengths.
subs_mat (array): (k,k), substitution cost matrix. Should be hollow (0 cost between the same type), symmetric and non-negative.
dist_type (string): distance type, one of:
    “hamming”: Hamming distance (substitution only, with a constant cost of 1), from sklearn.metrics;
    “markov”: empirical transition probabilities are used to define substitution costs;
    “interval”: differences between states are used to define substitution costs, and indel=k-1;
    “arbitrary”: arbitrary costs for use when no strong theory guides the choice: substitution=0.5, indel=1;
    “tran”: transition-oriented optimal matching, applied to the sequence of transitions. Based on [Bie11].
indel (float): insertion/deletion cost.
cluster_type (string): clustering algorithm (specification) used to generate neighborhood types, such as “ward”, “kmeans”, etc.
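
The roles of the substitution costs and the indel cost can be illustrated with a small dynamic-programming sketch of optimal matching. This is a generic textbook edit-distance recurrence written in plain Python, not giddy's implementation; the names `om_distance` and `unit_cost` are hypothetical and introduced only for illustration.

```python
def om_distance(a, b, subs_cost, indel):
    """Optimal-matching distance between sequences a and b (illustrative sketch).

    subs_cost is a callable giving the cost of substituting one state for
    another; indel is the cost of a single insertion or deletion.
    """
    n, m = len(a), len(b)
    # d[i][j] holds the cheapest cost of turning a[:i] into b[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel  # delete the first i states of a
    for j in range(1, m + 1):
        d[0][j] = j * indel  # insert the first j states of b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,                                # deletion
                d[i][j - 1] + indel,                                # insertion
                d[i - 1][j - 1] + subs_cost(a[i - 1], b[j - 1]),    # substitution or match
            )
    return d[n][m]


def unit_cost(x, y):
    """Hypothetical cost function: 0 for identical states, 1 otherwise."""
    return 0.0 if x == y else 1.0


print(om_distance("ACGGTAG", "CCTAAGC", unit_cost, indel=1))  # 5.0
print(om_distance("ACGGTAG", "CCTAAGC", unit_cost, indel=2))  # 6.0
```

With unit substitution costs, lowering the indel cost from 2 to 1 makes an alignment that uses insertions and deletions cheaper than substituting position by position, which is why the two calls return different values.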

Examples

>>> import numpy as np
>>> from giddy.sequence import Sequence

1. Testing on unequal string sequences

1.1 Substitution cost matrix and indel cost are not given and are generated from the distance type “interval”

>>> seq1 = 'ACGGTAG'
>>> seq2 = 'CCTAAG'
>>> seq3 = 'CCTAAGC'
>>> seqAna = Sequence([seq1,seq2,seq3],dist_type="interval")
>>> seqAna.k
4
>>> seqAna.classes
array(['A', 'C', 'G', 'T'], dtype='<U1')
>>> seqAna.subs_mat
array([[0., 1., 2., 3.],
       [1., 0., 1., 2.],
       [2., 1., 0., 1.],
       [3., 2., 1., 0.]])
>>> seqAna.seq_dis_mat
array([[ 0.,  7., 10.],
       [ 7.,  0.,  3.],
       [10.,  3.,  0.]])
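
The substitution matrix shown above is consistent with taking the absolute difference between the positions of two classes in the sorted class list, with the indel cost set to k-1. A minimal sketch of that construction in plain Python follows; it assumes the classes are ordered as in `seqAna.classes` and is not taken from giddy's source.

```python
classes = ["A", "C", "G", "T"]  # same order as seqAna.classes above
k = len(classes)

# Cost of substituting class i for class j: distance between their positions
subs_mat = [[float(abs(i - j)) for j in range(k)] for i in range(k)]

# Indel cost for the "interval" distance type, as stated in the parameter list
indel = k - 1

for row in subs_mat:
    print(row)
print("indel =", indel)
```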

1.2 User-defined substitution cost matrix and indel cost

>>> subs_mat = np.array(
...     [
...         [0, 0.76, 0.29, 0.05],
...         [0.30, 0, 0.40, 0.60],
...         [0.16, 0.61, 0, 0.26],
...         [0.38, 0.20, 0.12, 0]
...     ]
... )
>>> indel = subs_mat.max()
>>> seqAna = Sequence([seq1,seq2,seq3], subs_mat=subs_mat, indel=indel)
>>> seqAna.seq_dis_mat
array([[0.  , 1.94, 2.46],
       [1.94, 0.  , 0.76],
       [2.46, 0.76, 0.  ]])
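
The parameter description asks for a substitution matrix that is hollow, symmetric and non-negative. A quick check of those three properties can be sketched as below; `check_subs_mat` is a hypothetical helper, not part of giddy. Applied to the matrix in this example, it reports that the matrix is hollow and non-negative but not symmetric (for instance 0.76 versus 0.30).

```python
def check_subs_mat(m, tol=1e-12):
    """Report whether a square cost matrix is hollow, symmetric and non-negative."""
    k = len(m)
    square = all(len(row) == k for row in m)
    hollow = square and all(abs(m[i][i]) <= tol for i in range(k))
    symmetric = square and all(
        abs(m[i][j] - m[j][i]) <= tol for i in range(k) for j in range(i + 1, k)
    )
    non_negative = all(v >= 0 for row in m for v in row)
    return {"hollow": hollow, "symmetric": symmetric, "non_negative": non_negative}


example = [
    [0, 0.76, 0.29, 0.05],
    [0.30, 0, 0.40, 0.60],
    [0.16, 0.61, 0, 0.26],
    [0.38, 0.20, 0.12, 0],
]
print(check_subs_mat(example))
```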

1.3 Calculating “hamming” distance will fail on unequal sequences

>>> seqAna = Sequence([seq1,seq2,seq3], dist_type="hamming")
Traceback (most recent call last):
ValueError: hamming distance cannot be calculated for sequences of unequal lengths!

2. Testing on equal string sequences

>>> seq1 = 'ACGGTAG'
>>> seq2 = 'CCTAAGA'
>>> seq3 = 'CCTAAGC'

2.1 Calculating “hamming” distance

>>> seqAna = Sequence([seq1,seq2,seq3], dist_type="hamming")
>>> seqAna.seq_dis_mat
array([[0., 6., 6.],
       [6., 0., 1.],
       [6., 1., 0.]])
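
The values in this matrix can be reproduced by counting the positions at which two equal-length sequences differ. The sketch below does that in plain Python; the helper name `hamming` and its error message are hypothetical, and the class itself is documented as relying on sklearn.metrics.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


seqs = ["ACGGTAG", "CCTAAGA", "CCTAAGC"]
# Pairwise distance matrix as nested lists
matrix = [[hamming(s, t) for t in seqs] for s in seqs]
for row in matrix:
    print(row)
```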

2.2 User-defined substitution cost matrix and indel cost (the cost between different types is always 1 and the indel cost is 2). This gives the same sequence distance matrix as the “hamming” distance.

>>> subs_mat = np.array(
...     [
...         [0., 1., 1., 1.],
...         [1., 0., 1., 1.],
...         [1., 1., 0., 1.],
...         [1., 1., 1., 0.]
...     ]
... )
>>> indel = 2
>>> seqAna = Sequence([seq1,seq2,seq3], subs_mat=subs_mat, indel=indel)
>>> seqAna.seq_dis_mat
array([[0., 6., 6.],
       [6., 0., 1.],
       [6., 1., 0.]])

2.3 User-defined substitution cost matrix and indel cost (the cost between different types is always 1 and the indel cost is 1). This gives a slightly different sequence distance matrix from the “hamming” distance because insertions and deletions are now used in the optimal alignment.

>>> subs_mat = np.array(
...     [
...         [0., 1., 1., 1.],
...         [1., 0., 1., 1.],
...         [1., 1., 0., 1.],
...         [1., 1., 1., 0.]
...     ]
... )
>>> indel = 1
>>> seqAna = Sequence([seq1,seq2,seq3], subs_mat=subs_mat, indel=indel)
>>> seqAna.seq_dis_mat
array([[0., 5., 5.],
       [5., 0., 1.],
       [5., 1., 0.]])

3. Not passing proper parameters raises an error

>>> seqAna = Sequence([seq1,seq2,seq3])
Traceback (most recent call last):
ValueError: Please specify a proper `dist_type` or `subs_mat` and `indel` to proceed!

>>> seqAna = Sequence([seq1,seq2,seq3], subs_mat=subs_mat)
Traceback (most recent call last):
ValueError: Please specify a proper `dist_type` or `subs_mat` and `indel` to proceed!

>>> seqAna = Sequence([seq1,seq2,seq3], indel=indel)
Traceback (most recent call last):
ValueError: Please specify a proper `dist_type` or `subs_mat` and `indel` to proceed!

Attributes:

seq_dis_mat (array): (n,n), distance/dissimilarity matrix for each pair of sequences
classes (array): (k, ), unique classes
k (int): number of unique classes
label_dict (dict): maps each input label to an int value between 0 and k-1 (k is the number of unique classes for the pooled data)
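
How `classes`, `k` and `label_dict` relate to the pooled input can be sketched in plain Python. This assumes labels are assigned in sorted order, which matches the `classes` output in the examples but is not confirmed for `label_dict`; the variable `encoded` is hypothetical and only shows the mapping being applied.

```python
sequences = ["ACGGTAG", "CCTAAG", "CCTAAGC"]

# Unique classes across all sequences (pooled data), in sorted order
classes = sorted(set("".join(sequences)))
k = len(classes)

# Map each input label to an integer between 0 and k-1
label_dict = {label: code for code, label in enumerate(classes)}

# Apply the mapping to every sequence
encoded = [[label_dict[state] for state in seq] for seq in sequences]

print(classes)     # ['A', 'C', 'G', 'T']
print(k)           # 4
print(label_dict)  # {'A': 0, 'C': 1, 'G': 2, 'T': 3}
```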

__init__(y, subs_mat=None, dist_type=None, indel=None, cluster_type=None)

Methods

__init__(y[, subs_mat, dist_type, indel, ...])