Visualization Methods

There are a variety of ways to visualize DNA sequences in two dimensions. Squiggle provides its own novel visualization method as well as implementations of several previously published methods. Each method captures a different aspect of a sequence, so it is worth trying more than one to get a feel for a sequence.

Squiggle

Squiggle’s DNA visualization method is based on the UCSC .2bit format and the Huffman coding method of Qi et al. In essence, a DNA sequence is first converted into binary using the 2bit encoding scheme, which maps T to 00, C to 01, A to 10, and G to 11. For example:

ATGC

becomes:

10001101
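The encoding step can be sketched in a few lines of Python. This is only an illustration of the mapping stated above, not Squiggle's own code; the names `TWO_BIT`, `encode_2bit`, and `decode_2bit` are hypothetical.

```python
# Mapping taken from the text above: T=00, C=01, A=10, G=11.
TWO_BIT = {"T": "00", "C": "01", "A": "10", "G": "11"}


def encode_2bit(sequence):
    """Return the 2bit string for an unambiguous DNA sequence."""
    return "".join(TWO_BIT[base] for base in sequence.upper())


def decode_2bit(bits):
    """Invert encode_2bit by reading the string two bits at a time."""
    inverse = {code: base for base, code in TWO_BIT.items()}
    return "".join(inverse[bits[i:i + 2]] for i in range(0, len(bits), 2))


print(encode_2bit("ATGC"))      # 10001101
print(decode_2bit("10001101"))  # ATGC
```

Because every nucleotide gets a distinct two-bit code, the round trip is lossless for sequences made up only of A, C, G, and T.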

Then, starting at the origin, the following vectors are laid end to end, one for each bit:

Bokeh Plot

This mapping has the effect of giving each nucleotide a distinctive shape:

Bokeh Plot

This encoding method has several handy features:

  • It is based on an open, common bioinformatics format.
  • There is no degeneracy in the encoding (a visualization maps to exactly one sequence and vice versa).
  • The overall GC-content can be inferred at a glance from whether the endpoint of the graph is above or below zero.
  • Regions inside the gene with varying GC-content can be seen as peaks and valleys.
  • The graph is limited to quadrants I and IV and is a function (each \(x\) value has exactly one \(y\) value).
  • The \(x\)-axis corresponds directly with nucleotide position.
  • Ambiguous nucleotides are supported (they are displayed as horizontal lines).
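To make the walk concrete, here is a minimal sketch in Python. The per-bit vectors are defined in the plot above rather than in the text, so the sketch assumes that a 1 bit moves up and to the right by half a unit and a 0 bit moves down and to the right by half a unit. That assumption is consistent with the listed features (two bits per nucleotide, so \(x\) advances by one per base), but the exact vertical step may differ from Squiggle's. The function name `squiggle_walk` is hypothetical.

```python
TWO_BIT = {"T": "00", "C": "01", "A": "10", "G": "11"}


def squiggle_walk(sequence, step=0.5):
    """Return (xs, ys) for a Squiggle-style walk.

    Assumption: bit 1 -> (+0.5, +step), bit 0 -> (+0.5, -step).
    Any character without a 2bit code (e.g. N) is drawn as a
    horizontal line one nucleotide wide.
    """
    xs, ys = [0.0], [0.0]
    for base in sequence.upper():
        bits = TWO_BIT.get(base)
        for i in range(2):
            if bits is None:
                dy = 0.0  # ambiguous nucleotide: stay level
            elif bits[i] == "1":
                dy = step
            else:
                dy = -step
            xs.append(xs[-1] + 0.5)
            ys.append(ys[-1] + dy)
    return xs, ys


xs, ys = squiggle_walk("ATGC")
print(xs[-1])  # 4.0, one unit of x per nucleotide
print(ys)      # [0.0, 0.5, 0.0, -0.5, -1.0, -0.5, 0.0, -0.5, 0.0]
```

Since \(x\) only ever increases, the result is a function of \(x\) and never leaves quadrants I and IV.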

For an example, let’s look at the human β-globin gene using the squiggle method:

$ squiggle example_seqs/human_HBB.fasta
Bokeh Application

Gates

In Gates’s method, DNA sequences are converted into 2D walks in which Ts, As, Cs, and Gs are up, down, left, and right, respectively. This gives each sequence a “shape.” However, there is degeneracy, meaning that a visualization is not necessarily unique. For example, TGAC is a square (up, right, down, and left), but so is GTCA (right, up, left, down).
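The degeneracy is easy to reproduce with a short Python sketch of the walk described above. It uses only the directions stated in the text; the names `GATES_STEPS` and `gates_walk` are hypothetical and this is not Squiggle's implementation.

```python
# Directions from the text: T up, A down, C left, G right.
GATES_STEPS = {"T": (0, 1), "A": (0, -1), "C": (-1, 0), "G": (1, 0)}


def gates_walk(sequence):
    """Return the list of (x, y) points visited, starting at the origin."""
    points = [(0, 0)]
    for base in sequence.upper():
        dx, dy = GATES_STEPS[base]
        x, y = points[-1]
        points.append((x + dx, y + dy))
    return points


print(gates_walk("TGAC"))  # [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
print(gates_walk("GTCA"))  # [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]

# Both walks trace the same unit square, just in opposite directions.
print(set(gates_walk("TGAC")) == set(gates_walk("GTCA")))  # True
```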

To see an example of Gates’s method, we’ll again look at human β-globin:

$ squiggle example_seqs/human_HBB.fasta --method=gates
Bokeh Application

Yau

Yau et al.’s method uses unit vectors, with upward vectors indicating pyrimidine bases (C and T) and downward vectors indicating purine bases (A and G). Like Squiggle, this method has no degeneracy.

Specifically,

\(A\rightarrow\left(\frac{1}{2},-\frac{\sqrt{3}}{2}\right)\), \(T\rightarrow\left(\frac{1}{2},\frac{\sqrt{3}}{2}\right)\), \(G\rightarrow\left(\frac{\sqrt{3}}{2}, -\frac{1}{2}\right)\), \(C\rightarrow\left(\frac{\sqrt{3}}{2}, \frac{1}{2}\right)\).
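As a rough illustration, the vectors above can be accumulated into a walk with a few lines of Python. The names `YAU_STEPS` and `yau_walk` are hypothetical, and this is a sketch of the published mapping rather than Squiggle's own code.

```python
import math

S = math.sqrt(3) / 2

# Unit vectors from the mapping above.
YAU_STEPS = {
    "A": (0.5, -S),
    "T": (0.5, S),
    "G": (S, -0.5),
    "C": (S, 0.5),
}


def yau_walk(sequence):
    """Return (xs, ys) obtained by laying the unit vectors end to end."""
    xs, ys = [0.0], [0.0]
    for base in sequence.upper():
        dx, dy = YAU_STEPS[base]
        xs.append(xs[-1] + dx)
        ys.append(ys[-1] + dy)
    return xs, ys


xs, ys = yau_walk("ATGC")
# Four bases end at x = 1 + sqrt(3), not at x = 4.
print(xs[-1], ys[-1])
```

Every step moves right, so the walk never doubles back, but A and T advance \(x\) by less than G and C do.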

Warning

The \(x\)-coordinate in Yau’s method is not equivalent to base position.

Bokeh Plot

Yau’s method produces a visualization of β-globin like this:

$ squiggle example_seqs/human_HBB.fasta --method=yau
Bokeh Application

Yau-BP

Unique to Squiggle is the Yau-BP method, a slight modification of Yau’s method that ensures that the \(x\)-axis is equivalent to the base position. It preserves the salient feature of Yau’s method, which is the purine/pyrimidine split.

Bokeh Plot
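One way to picture the modification is the Python sketch below. The text only states that \(x\) equals the base position and that pyrimidines go up while purines go down; the specific \(y\) increments used here are illustrative assumptions and may not match the values Squiggle uses. The names `YAU_BP_DY` and `yau_bp_walk` are hypothetical.

```python
# Assumed y increments (illustrative only): pyrimidines (T, C) go up,
# purines (A, G) go down, and x always advances by exactly one.
YAU_BP_DY = {"T": 1.0, "C": 0.5, "A": -1.0, "G": -0.5}


def yau_bp_walk(sequence):
    """Return (xs, ys) with x equal to the base position."""
    xs, ys = [0], [0.0]
    for position, base in enumerate(sequence.upper(), start=1):
        xs.append(position)
        ys.append(ys[-1] + YAU_BP_DY[base])
    return xs, ys


xs, ys = yau_bp_walk("ATGC")
print(xs)  # [0, 1, 2, 3, 4]
print(ys)  # [0.0, -1.0, 0.0, -0.5, 0.0]
```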

Randić and Qi

Randić et al.’s and Qi and Qi’s methods are similar to tablature, with a different base (or 2-mer in the case of Qi’s method) assigned to each \(y\) value. The best way to visualize this is through an example.

Let’s look at the Randić visualization of GATC:

Squiggle Visualization

Looks pretty good. However, this visualization method isn’t well suited to long sequences, as we’ll see when we look at β-globin:

$ squiggle example_seqs/human_HBB.fasta --method=randic
Bokeh Application

Qi’s method produces very similar results, just with a much larger range of \(y\) values:

$ squiggle example_seqs/human_HBB.fasta --method=qi
Bokeh Application
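The tablature idea behind both methods can be sketched in Python as follows. The text does not say which base or 2-mer goes on which row, so the row orders below are hypothetical, as is the use of overlapping 2-mers for the Qi-style variant. The sketch is only meant to show why the Qi-style plot spans a larger range of \(y\) values (16 rows instead of 4).

```python
from itertools import product

# Hypothetical row assignments; the real methods may order rows differently.
RANDIC_ROWS = {"A": 3, "T": 2, "G": 1, "C": 0}
QI_ROWS = {a + b: i for i, (a, b) in enumerate(product("ACGT", repeat=2))}


def randic_points(sequence):
    """One (position, row) point per base."""
    return [(i, RANDIC_ROWS[base]) for i, base in enumerate(sequence.upper())]


def qi_points(sequence):
    """One (position, row) point per overlapping 2-mer (an assumption)."""
    seq = sequence.upper()
    return [(i, QI_ROWS[seq[i:i + 2]]) for i in range(len(seq) - 1)]


print(randic_points("GATC"))  # [(0, 1), (1, 3), (2, 2), (3, 0)]
print(qi_points("GATC"))      # [(0, 8), (1, 3), (2, 13)]
```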