Many data analysis problems are not easily parallelizable, often because they require an all-by-all analysis step. Applying heuristics instead often requires approximation, which introduces errors, noise, and bias. Recently, in confronting the sequence assembly problem, my lab has applied probabilistic data structures as a lightweight and scalable way to partition large assembly graphs so that the partitions can be treated independently. While the specific technique is not directly generalizable, the overall approach should adapt to the general problem of decoupling or optimally "breaking" large, sparsely connected graphs into minimally coupled subsets. I believe the approach could apply to social networks, clustering, and decomposition of large matrices into block-diagonal form.
Basic programming knowledge is assumed, but no biology background is required. All the (actually very simple!) computer science concepts will be explained. Our implementation is open source.
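To make the idea concrete, here is a minimal sketch of graph partitioning on top of a probabilistic data structure. It is an illustration under stated assumptions, not the lab's implementation: it assumes a Bloom filter as the probabilistic structure, treats k-mers (length-k substrings) as graph nodes connected when they overlap by k-1 characters, ignores reverse complements, and assumes every read has at least k characters. The names BloomFilter, kmers, neighbors, and partition_reads are hypothetical and chosen only for this example.

```python
import hashlib
from collections import deque


class BloomFilter:
    """Fixed-size bit array queried with several hashes.

    Membership tests can return false positives but never false negatives.
    """

    def __init__(self, num_bits=100_000, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(seq, k):
    """Return every length-k substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def neighbors(kmer, graph):
    """Yield k-mers overlapping kmer by k-1 characters that the filter reports."""
    for base in "ACGT":
        for candidate in (kmer[1:] + base, base + kmer[:-1]):
            if candidate in graph:
                yield candidate


def partition_reads(reads, k=5):
    """Assign each read the id of the connected component it belongs to."""
    graph = BloomFilter()
    for read in reads:
        for km in kmers(read, k):
            graph.add(km)

    label = {}  # k-mer -> partition id, filled in by traversal
    next_id = 0
    result = []
    for read in reads:
        start = kmers(read, k)[0]
        if start not in label:
            # Breadth-first traversal of the implicit graph from this read.
            label[start] = next_id
            queue = deque([start])
            while queue:
                current = queue.popleft()
                for nb in neighbors(current, graph):
                    if nb not in label:
                        label[nb] = next_id
                        queue.append(nb)
            next_id += 1
        result.append(label[start])
    return result


reads = ["ACGTACGGA", "TACGGATTC", "TTTTTGGGGG"]
partitions = partition_reads(reads, k=5)
```

In this toy input the first two reads share k-mers and land in one partition, while the third shares none and gets its own. Because the graph is stored only as bits in the filter, memory use is fixed up front; the trade-off is that a false positive can occasionally add a spurious edge and merge partitions that are truly separate, but it cannot split a partition that is truly connected.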
About Dr. C. Titus Brown
C. Titus Brown is an assistant professor in the Department of Computer Science and Engineering and the Department of Microbiology and Molecular Genetics. He earned his PhD ('06) in developmental molecular biology from the California Institute of Technology. Brown is director of the laboratory for Genomics, Evolution, and Development (GED) at Michigan State University. He is a member of the Python Software Foundation and an active contributor to the open source software community. His research interests include computational biology, bioinformatics, open source software development, and software engineering.