As benchmarking has become more developed in bioinformatics, a number of generally desirable properties for benchmark tests have been proposed, so let's take a moment to consider some of those now.

First, relevance. It's important that the benchmark test should address real-world problems. The issue of synthetic data versus real data is relevant here: benchmark tests based on real biological data are preferable to those based on synthetic data, as they are closer to the real-world problems at hand.

Second, solvability. The benchmark test should be neither too easy nor too difficult, so that different algorithms can be discriminated between. If the benchmark test is too easy, all tools will perform very well and it will not be possible to choose among them, which defeats the object of the benchmark test; the same goes for a benchmark test that is too difficult.

Third, accessibility. The benchmark test needs to be very widely available. It would defeat the object of the benchmark if it were only applied to a small set of tools or algorithms; the whole point of the benchmark test is that all tools and algorithms can be compared in a standardized, uniform manner.

Fourth, scalability. Particularly in bioinformatics, which is a burgeoning field, it's important that the benchmark doesn't become obsolete as progress is made in the development of algorithms and tools. So the benchmark task should be extensible, so that current methods and future methods can be compared with each other.

Fifth, independence. To avoid circularity, the benchmark test should not be informed by the very methods that are to be tested by that benchmark.

And lastly, evolution. It's important that a benchmark should be continually updated. Otherwise, what might happen is that tools and algorithms are developed to do very well in the benchmark test, under the very specific gold standard of that benchmark, but this is an over-fitting scenario: the tools and algorithms will not generalize so well.

As I mentioned before, in bioinformatics there is a very wide range of biological questions which algorithms and tools have been designed to address, and each of these biological questions needs its own benchmark. Here, what I'm going to do is narrow down the scope. So instead of dealing with all the questions of 3D protein structure, multiple sequence alignment and image analysis, I'm just going to consider a couple of biological questions and demonstrate some of the efforts that have been made to create benchmark tests in these limited scenarios.

The first case study I'm going to consider is a benchmark test for the prediction of protein biological function. The object of these methods is essentially a supervised machine learning classifier that takes in biological data and tries to predict the biological function of proteins. In order to benchmark such methods, the authors, Huttenhower et al., devised a benchmark data set for protein function prediction. What the authors did is they took microarray data and other genomic data and provided those as inputs to a suite of protein function prediction algorithms. The results of the predictions were then tested experimentally, and those results were used to generate a gold standard.
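To make this concrete, here is a minimal sketch of how predicted protein functions might be scored against a gold standard of experimentally validated annotations. This is an illustration only, not the authors' actual pipeline: the proteins, functions and the scoring function are invented for demonstration.

```python
# Minimal sketch: scoring predicted protein functions against a gold standard.
# Not the pipeline used by Huttenhower et al.; all annotations are invented.

# Gold standard: protein -> set of experimentally validated functions
gold_standard = {
    "protein_A": {"DNA repair", "cell cycle"},
    "protein_B": {"glycolysis"},
    "protein_C": {"signal transduction"},
}

# Output of a hypothetical classifier: protein -> set of predicted functions
predictions = {
    "protein_A": {"DNA repair"},
    "protein_B": {"glycolysis", "cell cycle"},
    "protein_C": set(),
}

def precision_recall(gold, pred):
    """Overall precision and recall of predicted annotations versus the gold standard."""
    true_positives = sum(len(gold[p] & pred.get(p, set())) for p in gold)
    predicted_total = sum(len(functions) for functions in pred.values())
    gold_total = sum(len(functions) for functions in gold.values())
    precision = true_positives / predicted_total if predicted_total else 0.0
    recall = true_positives / gold_total if gold_total else 0.0
    return precision, recall

precision, recall = precision_recall(gold_standard, predictions)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

The ability of a classifier to recover the gold-standard annotations, measured with scores of this kind, is what the benchmark ultimately reports.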
This gold standard was then re-input into the protein function prediction algorithms in order to further improve the predictions, which were again experimentally validated, and so on; the process was iterated, giving an iterative improvement in the resulting gold standard of protein function. The result of this is a set of proteins with associated biological functions that are of very high confidence, and these can be treated as a ground truth. The capability of supervised machine learning classifiers for protein function to recover this ground truth in their analysis can then be used as a measure of the performance of the machine learning classifier. Interestingly, the authors of this paper noticed that their initial predictions of protein function were actually better than they first seemed. This was because there was an incomplete knowledge of the ground truth, so they were initially recovering protein functions which they didn't know were correct.

The second example we'll look at is in the context of the analysis of Affymetrix microarray data. Microarrays work by hybridizing RNA to a chip which contains an array of probes, and there are multiple probes that correspond to each individual gene. The probe-level hybridization data must then be summarized to produce an estimate of the expression level of the given gene, and there are different competing methods for performing this summarization process. In addition to this, there is also the normalization process, which I describe in another talk. In normalization, the aim is to remove as many non-biological sources of variation as possible, and there are different methods for achieving this. So the Affycomp benchmark is designed to test the performance of summarization and normalization of Affymetrix microarray data, and it does this with a series of graphics. You can find out more about the Affycomp benchmark suite from these websites.
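To illustrate the summarization step just described, here is a minimal sketch in Python: several probe-level intensities per gene are collapsed into a single gene-level expression estimate, here simply the median of the log2-transformed intensities. The probe names and values are invented, and the methods actually compared by Affycomp (such as RMA) are considerably more sophisticated than this.

```python
import math
from collections import defaultdict
from statistics import median

# Minimal sketch of probe-level summarization for a single microarray sample.
# Each gene is measured by several probes; the probe intensities are reduced
# to one gene-level expression estimate by taking the median of the
# log2-transformed intensities. Probe names and values are invented.
probe_intensities = {
    "gene1_probe1": 820.0, "gene1_probe2": 950.0, "gene1_probe3": 700.0,
    "gene2_probe1": 150.0, "gene2_probe2": 210.0,
}

def summarize(probes):
    """Collapse probe-level intensities to gene-level log2 expression estimates."""
    by_gene = defaultdict(list)
    for probe_id, intensity in probes.items():
        gene = probe_id.split("_")[0]  # toy probe-to-gene mapping
        by_gene[gene].append(math.log2(intensity))
    return {gene: median(values) for gene, values in by_gene.items()}

print(summarize(probe_intensities))  # e.g. {'gene1': ~9.68, 'gene2': ~7.47}
```

A benchmark such as Affycomp would compare the gene-level estimates produced by competing summarization and normalization methods against spike-in or dilution data where the true expression changes are known.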