Rudi Cilibrasi, a contributor to the Avian Flu Help blog and a machine learning researcher in the Netherlands, wrote CompLearn, an open-source data mining toolkit, and is using it for H5N1 analysis. Using CompLearn, he has generated a tree of a different set of 30 H5N1 strains (PDF version / PostScript version), which demonstrates that the avian flu virus is mutating toward a strain closer to human-to-human (H2H) transmission, which is surely a cause for concern. According to Rudi, the numbers around the edges are all very low, indicating that the viruses are all quite closely related, except for the “k2” subtree, which includes duckShandong0932004, duckYokohamaaq102003, and the others off to the right past “k0”. That subtree is bordered by high numbers and has several high numbers within it, suggesting a fit that is not very close, perhaps reflecting greater genetic distance.
This suggests there may have been more intermediate steps, which we might explore by taking different hypothetical subsets of 15-50 viruses and seeing what the most likely phylogeny leading up to them is. The “k10” subtree, meanwhile, confirms an earlier comment by Dr. Niman that the Mongolian and Novosibirsk strains are very closely related. The “k11” subtree, with its very low numbers, suggests that there was a transmission of virus between Korea and Japan. Overall, the S(T) score of 0.990241 means that the computer believes it has figured out the structure nearly perfectly. Now it's our job to figure out why.
(Suggestions for better choices of species to try are always welcome.) Rudi stresses that he would love to try more recent data, and thinks the most important use for this type of chart at the moment is tracking which strains are going where: when new strains pop up, we can match them to the nearest previously known strain, in the hope that this sheds light on the epidemiology of the situation.
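For readers who like to see the idea in code, here is a minimal Python sketch of matching a new strain to its nearest previously known one. Everything in it is hypothetical: the strain names and the short sequences are made up, and the distance is a simple stand-in (the fraction of positions that differ) chosen only to keep the example short. It is not Rudi's method; any pairwise distance function could be plugged in instead.

```python
from typing import Callable, Dict, Tuple


def mismatch_fraction(a: str, b: str) -> float:
    """Stand-in distance: fraction of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("this toy distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nearest_known_strain(
    new_seq: str,
    known: Dict[str, str],
    distance: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Return the name of the closest known strain and its distance to new_seq."""
    best_name = min(known, key=lambda name: distance(new_seq, known[name]))
    return best_name, distance(new_seq, known[best_name])


# Hypothetical toy data, not real H5N1 sequences.
known_strains = {
    "strain_A": "ATGGAGAAAATAGTGCTTCT",
    "strain_B": "ATGGAGAGAATAGTGCATCT",
    "strain_C": "TTGCAGAAACTAGTCCTTGT",
}
new_isolate = "ATGGAGAAAATAGTGCTTCA"

name, dist = nearest_known_strain(new_isolate, known_strains, mismatch_fraction)
print(f"nearest known strain: {name} (distance {dist:.2f})")
```

With real data the sequences would be full genomes or gene segments, and the stand-in distance would be replaced by whatever measure the analysis actually uses.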
Rudi has clarified that the sequence data and the distance approximation technique (read more about this on the FluWikie) were obtained from a fellow member of the Bird Flu Discussion Forum (AvianFluTalk) whose online pseudonym is “gs”; it is presumed that “gs” extracted the data from one of the two databases at FluWikie. Rudi was computing the Normalized Compression Distance (NCD), a modern measure that in certain situations is more robust than conventional alignment of the kind you get from BLAST. More information about this is available from Paul Vitanyi on the FluWikie as well as on the web, or just search for Rudi's full name, which turns up a number of papers along with examples. More importantly, the measure is a lot easier to operate, as it is essentially parameter-free, which makes it a good choice for a first-approximation analysis of a medium to large group of samples of unknown relation.
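To give a feel for what the Normalized Compression Distance computes, here is a rough Python sketch using the standard formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The choice of zlib as the compressor is an assumption made for this example, and the sequences are randomly generated stand-ins rather than real H5N1 data. It is only an illustration of the idea, not the CompLearn implementation.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (compressor choice is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_dna(rng: random.Random, length: int) -> bytes:
    """Hypothetical helper: a random sequence over the letters A, C, G, T."""
    return "".join(rng.choice("ACGT") for _ in range(length)).encode("ascii")


def mutate(seq: bytes, rng: random.Random, n_changes: int) -> bytes:
    """Hypothetical helper: overwrite n_changes random positions with random bases."""
    out = bytearray(seq)
    for pos in rng.sample(range(len(out)), n_changes):
        out[pos] = rng.choice(b"ACGT")
    return bytes(out)


# Toy data: one sequence, a lightly mutated copy, and an unrelated sequence.
rng = random.Random(0)
original = random_dna(rng, 2000)
close_variant = mutate(original, rng, 10)
unrelated = random_dna(rng, 2000)

print("NCD(original, close variant):", ncd(original, close_variant))
print("NCD(original, unrelated):    ", ncd(original, unrelated))
```

The lightly mutated copy should come out much closer to the original than the unrelated sequence does. Note that there is nothing to tune beyond picking a compressor, which is what "essentially parameter-free" amounts to in practice.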
On a final note, readers and bloggers are kindly requested to suggest to Rudi a better group of 30 samples to use, in order to gain more insight into the situation (please email your suggestions to cilibrar at ofb.net).
2 comments
Froody stuff. Probably over a lot of people’s heads, but really, really, REALLY cool. Blogged.
Just to clarify the phrase
“Overall the S(T) score of 0.990241 means that the computer believes it has figured out the structure nearly perfectly.”
in the posting by Angelo Embuldeniya above: it means that the tree represents, in a qualitative sense, the relative distances between the virus genomes in the supplied distance matrix almost perfectly. This translates into true relatedness among the viruses insofar as the distances have captured it. The distance used is NCD, which has previously been used successfully to determine phylogeny trees from the full mtDNA of species such as mammals and fungi. So there is reason to believe its output is dependable, but it is certainly not infallible. In that sense NCD is comparable to alignment distance, but it is less sensitive to the position of substrings within the sequence.