Squiggle¶
Squiggle is a two-dimensional DNA sequence visualization library that can turn FASTA sequence files like this:
>lcl|NC_000011.10_cds_NP_000509.1_1 [gene=HBB]
ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG
TTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGG
GGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGT
GCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACT
GTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCA
TCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAAT
GCCCTGGCCCACAAGTATCACTAA
into gorgeous, interactive visualizations like this:
Installation¶
If you don’t have Python 3.4 or greater installed, be sure to get it. To get the current stable version of Squiggle, run:
$ pip install squiggle
Or, alternatively, if you want to get the latest development version:
$ pip install git+https://github.com/Lab41/squiggle.git
Usage¶
Using Squiggle is as easy as:
$ squiggle your_sequence.fasta
Squiggle has tons of options available to make beautiful, interactive visualizations of DNA sequences. To get a full rundown of the various options, take a look at the User Guide.
Citation¶
To be determined!
Table of Contents¶
Visualization Methods¶
There are a variety of ways to visualize DNA sequences in two dimensions. Squiggle provides its own novel visualization method as well as implementations of various other methods. Each method captures a different aspect of a sequence, so it is highly recommended to try using multiple methods in order to get a feel for a sequence.
Squiggle¶
Squiggle’s DNA visualization method is based on the UCSC .2bit format and the Qi et al. Huffman coding method. In essence, a DNA sequence is first converted into binary using the 2bit encoding scheme, which maps T to 00, C to 01, A to 10, and G to 11. For example:
ATGC
becomes:
10001101
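The conversion is a per-base table lookup. Here is a minimal Python sketch of it; the name `to_two_bit` is illustrative and is not part of Squiggle's API:

```python
# Two-bit codes exactly as given in the text: T=00, C=01, A=10, G=11.
TWO_BIT = {"T": "00", "C": "01", "A": "10", "G": "11"}


def to_two_bit(sequence):
    """Return the 2bit encoding of an unambiguous DNA sequence as a bit string."""
    return "".join(TWO_BIT[base] for base in sequence.upper())


print(to_two_bit("ATGC"))  # 10001101
```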
Then, starting at the origin, for each bit, the following vectors are laid end to end:
This mapping has the effect of giving each nucleotide a distinctive shape:
This encoding method has several handy features:
- Based on an open, common bioinformatics format.
- No degeneracy in the encoding (an encoding can only map to one sequence and vice versa).
- The overall GC-content can be inferred at a glance based on whether the endpoint of the graph is above or below zero.
- Regions inside the gene with varying GC-content can be seen as peaks and valleys.
- The graph is limited to quadrants I and IV and is a function (each \(x\) value has exactly one \(y\) value)
- The \(x\)-axis corresponds directly with nucleotide position
- Supports ambiguous nucleotides (which are displayed as horizontal lines)
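To make the walk concrete, here is a rough Python sketch of the coordinate computation. It rests on an assumption, since the vector figure is not reproduced here: a 1 bit is taken to move half a base right and half a unit up, and a 0 bit half a base right and half a unit down. The function name `squiggle_walk` is illustrative and is not Squiggle's API.

```python
# Two-bit codes as given in the text: T=00, C=01, A=10, G=11.
TWO_BIT = {"T": "00", "C": "01", "A": "10", "G": "11"}

# Assumption: each bit advances x by 0.5; a 1 bit raises y by 0.5
# and a 0 bit lowers it by 0.5.
STEP = {"1": 0.5, "0": -0.5}


def squiggle_walk(sequence):
    """Return (x, y) coordinate lists for the walk, starting at the origin."""
    x, y = [0.0], [0.0]
    for base in sequence.upper():
        bits = TWO_BIT.get(base)
        # Ambiguous nucleotides have no 2bit code; draw them as a flat segment.
        deltas = [STEP[bit] for bit in bits] if bits else [0.0, 0.0]
        for dy in deltas:
            x.append(x[-1] + 0.5)
            y.append(y[-1] + dy)
    return x, y


x, y = squiggle_walk("ATGC")
print(x[-1])  # 4.0
print(y)      # [0.0, 0.5, 0.0, -0.5, -1.0, -0.5, 0.0, -0.5, 0.0]
```

Because every base advances \(x\) by exactly one unit, the \(x\) coordinate of the sketch tracks nucleotide position, and since \(x\) never decreases the result is a function confined to quadrants I and IV.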
For an example, let’s look at the human β-globin gene using the squiggle method:
$ squiggle example_seqs/human_HBB.fasta
Gates¶
In Gates’s method, DNA sequences are converted into 2D walks in which Ts, As, Cs, and Gs are up, down, left, and right, respectively. This gives each sequence a “shape.” However, there is degeneracy, meaning that a visualization is not necessarily unique. For example, TGAC is a square (up, right, down, and left), but so is GTCA (right, up, left, down).
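The degeneracy is easy to reproduce. The following Python sketch uses the direction mapping described above; the function name `gates_walk` is illustrative and is not Squiggle's API:

```python
# Direction per base, as described in the text:
# T = up, A = down, C = left, G = right.
GATES_STEPS = {"T": (0, 1), "A": (0, -1), "C": (-1, 0), "G": (1, 0)}


def gates_walk(sequence):
    """Return the list of (x, y) points visited, starting at the origin."""
    points = [(0, 0)]
    for base in sequence.upper():
        dx, dy = GATES_STEPS[base]
        px, py = points[-1]
        points.append((px + dx, py + dy))
    return points


print(gates_walk("TGAC"))  # [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
print(gates_walk("GTCA"))  # [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]

# Both walks trace the same unit square, just in opposite directions.
print(set(gates_walk("TGAC")) == set(gates_walk("GTCA")))  # True
```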
To see an example of Gates’s method, we’ll again look at human β-globin:
$ squiggle example_seqs/human_HBB.fasta --method=gates
Yau¶
Yau et al.’s method uses unit vectors with upward vectors indicating pyrimidine bases (C and T) and downward vectors indicating purine bases (A and G). Similar to Squiggle, this method has no degeneracy.
Specifically,
\(A\rightarrow\left(\frac{1}{2},-\frac{\sqrt{3}}{2}\right)\), \(T\rightarrow\left(\frac{1}{2},\frac{\sqrt{3}}{2}\right)\), \(G\rightarrow\left(\frac{\sqrt{3}}{2}, -\frac{1}{2}\right)\), \(C\rightarrow\left(\frac{\sqrt{3}}{2}, \frac{1}{2}\right)\).
Warning
The \(x\)-coordinate in Yau’s method is not equivalent to base position.
It produces a visualization of β-globin like this:
$ squiggle example_seqs/human_HBB.fasta --method=yau
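The vector mapping given above for Yau’s method can be sketched in Python as a running sum of unit vectors. The function name `yau_walk` is illustrative and is not Squiggle's API:

```python
import math

H = math.sqrt(3) / 2

# Unit vectors from the text: pyrimidines (T, C) point up, purines (A, G) down.
YAU_VECTORS = {
    "A": (0.5, -H),
    "T": (0.5, H),
    "G": (H, -0.5),
    "C": (H, 0.5),
}


def yau_walk(sequence):
    """Return (x, y) coordinate lists for the walk, starting at the origin."""
    x, y = [0.0], [0.0]
    for base in sequence.upper():
        dx, dy = YAU_VECTORS[base]
        x.append(x[-1] + dx)
        y.append(y[-1] + dy)
    return x, y


# Two bases do not always advance x by the same amount, which is why
# x is not equivalent to base position in this method.
print(yau_walk("AT")[0][-1])
print(yau_walk("GC")[0][-1])
```

Here "AT" ends near \(x = 1\) while "GC" ends near \(x = \sqrt{3}\), even though both sequences are two bases long.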
Yau-BP¶
Unique to Squiggle is the Yau-BP method, a slight modification of Yau’s method that ensures that the \(x\)-axis is equivalent to the base position. It preserves the salient feature of Yau’s method, which is the purine/pyrimidine split.
Randić and Qi¶
Randić et al. and Qi and Qi’s methods are similar to tablature, with a different base (or 2-mer in the case of Qi’s method) assigned to each \(y\) value. The best way to visualize it is through an example.
Let’s look at the Randić visualization of GATC:
Looks pretty good. However, this visualization method isn’t well suited to long sequences, as we’ll see when we look at β-globin:
$ squiggle example_seqs/human_HBB.fasta --method=randic
Qi’s method produces very similar results, just with a much larger range of \(y\) values:
$ squiggle example_seqs/human_HBB.fasta --method=qi
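The tablature idea behind Randić’s method can be sketched in a few lines of Python. The row assigned to each base below is a hypothetical choice made for illustration; the order Squiggle actually uses may differ, and `randic_points` is not part of its API:

```python
# Hypothetical row assignment: each base gets its own y value.
# The real ordering used by Squiggle may differ.
ROWS = {"A": 3, "T": 2, "G": 1, "C": 0}


def randic_points(sequence):
    """Return one (position, row) point per base, like notes on a tablature."""
    return [(i, ROWS[base]) for i, base in enumerate(sequence.upper())]


print(randic_points("GATC"))  # [(0, 1), (1, 3), (2, 2), (3, 0)]
```

Qi’s method follows the same pattern with 2-mers instead of single bases, so it needs one row for each of the 16 possible 2-mers rather than 4.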
User Guide¶
Squiggle is designed to be easy to use while still providing complete flexibility to the user. For the sake of demonstration, we’ll be using four different species’ β-globin genes (human, chimpanzee, rhesus macaque, and Norway rat).
For a full list of the command line options and their meanings, see the CLI Reference.
Basic Usage¶
The easiest way to visualize a sequence is by passing a FASTA file to Squiggle:
$ squiggle human_HBB.fasta
To use a different visualization method, provide --method with a setting (see Visualization Methods for a description of the supported methods):
$ squiggle human_HBB.fasta --method=gates
Plotting Multiple Sequences¶
If your FASTA file has multiple sequences, they will get plotted together automatically. If, however, your sequences are in separate files, you can still plot them together by passing multiple files to Squiggle:
$ squiggle human_HBB.fasta chimpanzee_HBB.fasta norway_rat_HBB.fasta rhesus_HBB.fasta
To put them on separate plots, use the --separate flag:
$ squiggle human_HBB.fasta chimpanzee_HBB.fasta --separate
By default, their \(x\)-axes are linked. This can be disabled with --no-link-x (try it for yourself by panning around):
$ squiggle human_HBB.fasta chimpanzee_HBB.fasta --separate --no-link-x
Similarly, the \(y\)-axes can be linked and unlinked with --link-y/--no-link-y.
Note that when plotting separately, Squiggle will try to make the layout as square as possible. If you want to specify the number of columns, you can do so with the -c option.
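One common way to pick a near-square grid is to take the ceiling of the square root of the number of plots as the column count. The Python sketch below illustrates that idea only; it is an assumption about how such a layout could be computed, not a description of Squiggle's internals, and `square_layout` is a hypothetical helper:

```python
import math


def square_layout(n_plots, n_cols=None):
    """Return (rows, columns) for a grid holding n_plots plots.

    If n_cols is not given, choose it so the grid is as square as possible.
    """
    if n_plots < 1:
        raise ValueError("n_plots must be at least 1")
    if n_cols is None:
        n_cols = math.ceil(math.sqrt(n_plots))
    n_rows = math.ceil(n_plots / n_cols)
    return n_rows, n_cols


print(square_layout(4))            # (2, 2)
print(square_layout(5))            # (2, 3)
print(square_layout(4, n_cols=1))  # (4, 1)
```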
If you want to compare FASTA files, you can use the --mode=file flag to treat each file as a separate entity, as opposed to each sequence. The --mode=auto flag (which is the default) will attempt to visualize each sequence independently unless there are too many, in which case it will switch to file mode.
As an example, let’s compare the highly expressed genes of E. coli and B. anthracis:
$ squiggle ecol.heg.fasta banth1.heg.fasta