{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Alignment-based sequence methods\n", "\n", "This notebook introduces the alignment-based sequence methods (operationalized by the Optimal Matching (OM) algorithm), which was originally developed for matching protein and DNA sequences in biology and used extensively for analyzing strings in computer science and recently widely applied to explore the neighborhood change. \n", "\n", "It generally works by finding the minimum cost for aligning one sequence to match another using a combination of operations including substitution, insertion, deletion and transposition. The cost of each operation can be parameterized diferently and may be theory-driven or data-driven. The minimum cost is considered as the distance between the two sequences.\n", "\n", "The `sequence` module in `giddy` provides a suite of alignment-based sequence methods. \n", "\n", "**Author: Wei Kang **" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "import warnings\n", "\n", "with warnings.catch_warnings():\n", " warnings.simplefilter(\"ignore\")\n", " # ignore NumbaDeprecationWarning: gh-pysal/libpysal#560\n", " import libpysal\n", " import mapclassify as mc" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0, 0, 0, ..., 0, 0, 0],\n", " [2, 2, 2, ..., 1, 1, 0],\n", " [0, 0, 0, ..., 0, 0, 0],\n", " ...,\n", " [1, 1, 1, ..., 0, 0, 0],\n", " [3, 3, 2, ..., 2, 2, 2],\n", " [3, 3, 3, ..., 4, 4, 4]])" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f = libpysal.io.open(libpysal.examples.get_path(\"usjoin.csv\"))\n", "pci = np.array([f.by_col[str(y)] for y in range(1929, 2010)])\n", "q5 = np.array([mc.Quantiles(y, k=5).yb for y in pci]).transpose()\n", "q5" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(48, 81)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q5.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import `Sequence` class from `giddy.sequence`:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "with warnings.catch_warnings():\n", " warnings.simplefilter(\"ignore\")\n", " # ignore NumbaDeprecationWarning: gh-pysal/libpysal#560\n", " from giddy.sequence import Sequence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `\"hamming\"`\n", "\n", "* substitution cost = 1\n", "* insertion/deletion cost = $\\infty$" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq_hamming = Sequence(q5, dist_type=\"hamming\")\n", "seq_hamming" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0., 75., 7., ..., 21., 81., 78.],\n", " [75., 0., 80., ..., 79., 57., 73.],\n", " [ 7., 80., 0., ..., 14., 81., 81.],\n", " ...,\n", " [21., 79., 14., ..., 0., 81., 81.],\n", " [81., 57., 81., ..., 81., 0., 51.],\n", " [78., 73., 81., ..., 81., 51., 0.]])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq_hamming.seq_dis_mat # pairwise sequence distance matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `\"interval\"`\n", "\n", "Assuming there are $k$ states in the sequences and they are ordinal/continuous.\n", "\n", "* substitution cost = differences between states\n", "* insertion/deletion cost = $k-1$" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq_interval = Sequence(q5, dist_type=\"interval\")\n", "seq_interval" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0., 123., 7., ..., 21., 190., 225.],\n", " [123., 0., 130., ..., 116., 69., 108.],\n", " [ 7., 130., 0., ..., 14., 197., 232.],\n", " ...,\n", " [ 21., 116., 14., ..., 0., 183., 218.],\n", " [190., 69., 197., ..., 183., 0., 61.],\n", " [225., 108., 232., ..., 218., 61., 0.]])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq_interval.seq_dis_mat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `\"arbitrary\"`\n", "\n", "* substitution cost = 0.5\n", "* insertion/deletion cost = 1" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq_arbitrary = Sequence(q5, dist_type=\"arbitrary\")\n", "seq_arbitrary" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0. , 37.5, 3.5, ..., 10.5, 40.5, 39. ],\n", " [37.5, 0. , 40. , ..., 39.5, 28.5, 36.5],\n", " [ 3.5, 40. , 0. , ..., 7. , 40.5, 40.5],\n", " ...,\n", " [10.5, 39.5, 7. , ..., 0. , 40.5, 40.5],\n", " [40.5, 28.5, 40.5, ..., 40.5, 0. , 25.5],\n", " [39. , 36.5, 40.5, ..., 40.5, 25.5, 0. ]])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq_arbitrary.seq_dis_mat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `\"markov\"`\n", "\n", "* substitution cost = $1-\\frac{p_{ij}+p_{ji}}{2}$ where $p_{ij}$ is the empirical rate of transitioning from state $i$ to $j$\n", "* insertion/deletion cost = 1" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Markov Chain is irreducible and is composed by:\n", "1 Recurrent class (indices):\n", "[0 1 2 3 4]\n", "0 Transient classes.\n", "The Markov Chain has 0 absorbing states.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq_markov = Sequence(q5, dist_type=\"markov\")\n", "seq_markov" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0. , 72.31052406, 6.34073233, ..., 19.02219698,\n", " 80.2334688 , 77.48002783],\n", " [72.31052406, 0. , 77.05042347, ..., 74.77437281,\n", " 50.75696949, 65.9128181 ],\n", " [ 6.34073233, 77.05042347, 0. , ..., 12.68146465,\n", " 80.97128589, 80.51785856],\n", " ...,\n", " [19.02219698, 74.77437281, 12.68146465, ..., 0. ,\n", " 80.10306616, 80.46369148],\n", " [80.2334688 , 50.75696949, 80.97128589, ..., 80.10306616,\n", " 0. , 41.57088046],\n", " [77.48002783, 65.9128181 , 80.51785856, ..., 80.46369148,\n", " 41.57088046, 0. ]])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq_markov.seq_dis_mat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `\"tran\"`\n", "\n", "* **Biemann, T.** (2011). A Transition-Oriented Approach to Optimal Matching. Sociological Methodology, 41(1), 195–221. DOI: [10.1111/j.1467-9531.2011.01235.x](https://doi.org/10.1111/j.1467-9531.2011.01235.x)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq_tran = Sequence(q5, dist_type=\"tran\")\n", "seq_tran" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0., 23., 8., ..., 12., 24., 21.],\n", " [23., 0., 17., ..., 16., 28., 22.],\n", " [ 8., 17., 0., ..., 4., 18., 16.],\n", " ...,\n", " [12., 16., 4., ..., 0., 21., 15.],\n", " [24., 28., 18., ..., 21., 0., 23.],\n", " [21., 22., 16., ..., 15., 23., 0.]])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq_tran.seq_dis_mat" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0., 23., 8., ..., 12., 24., 21.],\n", " [23., 0., 17., ..., 16., 28., 22.],\n", " [ 8., 17., 0., ..., 4., 18., 16.],\n", " ...,\n", " [12., 16., 4., ..., 0., 21., 15.],\n", " [24., 28., 18., ..., 21., 0., 23.],\n", " [21., 22., 16., ..., 15., 23., 0.]])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq_tran.seq_dis_mat" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:py311_giddy]", "language": "python", "name": "conda-env-py311_giddy-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 4 }