Neuroscience

de Brujin graphs and Velvet Optimiser

I'm working with the Velvet Assembler as part of my virus identification project. When I'm not trying to write a perl module to complement the already-bulky script files that I'm working with, I like to do something that I usually wish most other bioinformatics scientists I've met would do as well: I like to delve into the mathematics behind the program.

Velvet Assembler constructs de Brujin graphs in a manner like so:

source

Confusing, right? Velvet uses the separated small reads to make overlapping regions that be analyzed for errors and compressed for simplicity. These small reads are constructed into a de Brujin graph, which contains a list of all the overlapping segments. See those arrows in the picture above? Those are edges. And the rectangles are our nodes. We have to tell the computer the rules of how to play the game of read alignment before it can go crazy in the sandbox, so we've decided on a directed graph that can follow certain parameters. When the ending of one region connects with another, it forms a "bubble" that can be corrected using the Tour Bus algorithm, which is similar to the Dijkstra's algorithm, which we'll get to in another post.

And that's basically all I need to know about it for my research project. But, as with most mathematical oddities, that's not as much as I want to know.

With the help of my friend, Rosalind, I was introduced to de Brujin graphs. To construct one, we could easily write a program that iterates through a list of k-mers from each given segment and, then, record the ones that overlap by all but one of their base pairs. (My solution for the Rosalind problem can be found here, by the way.) This raises a lot of questions about how to reconstruct an entire genome from the smaller segments. The favorite problem of every computer scientist, the Traveling Salesman Problem, (and in some cases, something a bit more unique) arises when we have to decide what is the best way to connect each smaller segment into a larger picture.

I hope to optimize the Velvet Assembler portion of the Virus Detection program with a greater understanding of the mathematics behind it. So far, it seems that all my work simply relies on my ability to use different commands and put them in script files. And, recently, I've been bogged down with error message after error message in the implementation of a self-written subroutine module. Sometimes you need to relax with some puzzles. Even if the pieces are all so big that they overlap with one another.

- Visualization Of Whole-brain Connectivity
Growing up in the internet age, most of us are very familiar with network diagrams of all sorts. Indeed, this is also the age of the network. But surprisingly, while there are diagrams for almost every type of network you can conceive of, there is a remarkable...

- Longest Common Subsequence And Numpy
Perl may be crafty and efficient like a ninja, Ruby may be written like a prose or work of fiction, but, for most purposes, Python, with its simplicity and elegance, is usually my weapon of choice when it comes to programming languages. (To be frank,...

- Comparing Genome Alignment Methods
One of my current projects in the Matthew Hahn Lab is to investigate the effectiveness of a few different full-genome alignment methods. My mentor and I have been studying a new program called progressiveCactus, and comparing its output to other alignment...

- Building A Heap
Binary heaps are binary trees that are sorted in such a way that the parent leaf is either greater or lesser than both of its child nodes. When the root node is the greatest and each parent is greater than its children, then it's a max heap....

- Suboptimal Alignment Algorithm
Despite the existence of the Lalign program which could find internal duplications between two strings, I've developed my own algorithm for carrying out a similar function as part of this Rosalind problem. import distanceb="GACTCCTTTGTTTGCCTTAAATAGATACATATTTACTCTTGACTCTTTTGTTGGCCTTAAATAGATACATATTTGTGCGACTCCACGAGTGATTCGTA"a="ATGGACTCCTTTGTTTGCCTTAAATAGATACATATTCAACAAGTGTGCACTTAGCCTTGCCGACTCCTTTGTTTGCCTTAAATAGATACATATTTG"c=0da=""db=""for...

Neuroscience