0:24

And in these casinos, the favorite game was Cho-han, where the dealer would toss two dice and the visitors would bet on the outcome, the outcome being the sum of the numbers on the two dice.

So "cho" means, as you probably guessed, "even," and "han" means "odd."

This is, of course, a fair game: if the dice are not loaded, there is a 50/50 probability of winning.

However, in Bakuto casinos, the dealers often used biased, "loaded" dice.

In this case, the outcome may be shifted towards the casino owner or towards insider backers.

Also, it may be fun to play Cho-han in a Yakuza-run casino.

We will play an equivalent game, where the dealer simply flips a coin, and we bet on heads or tails.

The challenge here is that the dealer actually has a choice of two coins: fair and biased.

For the fair coin, the probability of heads is 1/2, of course, and for the biased coin, the probability of heads is 3/4.

On this slide, I show the coins in blue and green, but in reality the coins are completely identical, so you have no clue which coin the dealer used.

The question arises: suppose the dealer made 100 flips and 63 of them are heads; which coin did the dealer use, fair or biased?

Â 2:13

This question, of course, doesn't make sense, because any coin, fair or biased, can result in 63 heads.

The right question to ask is: which coin is more likely if 63 out of 100 flips resulted in heads?

How do we answer this question?

Here's a hint: 63 is a little bit closer to 75 than to 50.

Should we then assume that the biased coin is more likely in this series of flips?

Â 2:47

Before we come to a conclusion, let's try to estimate some probabilities.

Consider a sequence of n flips, denoted X = x1 x2 ... xn, with k heads.

The probability that this sequence was generated by the fair coin is, of course, (1/2)^n.

Â 3:09

But the probability that it was generated by the biased coin is (3/4)^k * (1/4)^(n-k).

How can we decide which coin is more likely?

Of course, if the probability of X being generated by the fair coin is larger than the probability of X being generated by the biased coin, then the fair coin is more likely.

Let's refer to these probabilities as the blue probability and the green probability, respectively, when we talk about the probability of X being generated by the fair coin or the biased coin.

Likewise, if the blue probability is smaller than the green probability, then the biased coin is more likely.

However, when these probabilities are the same, when blue probability = green probability, we can compute, by doing some arithmetic, that in this case

k = n / log_2(3)

Or, in other words, k will be approximately equal to 0.631 multiplied by n.
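The "some arithmetic" is short: setting the blue and green probabilities equal and taking base-2 logarithms of both sides,

(1/2)^n = (3/4)^k * (1/4)^(n-k)

-n = k * log_2(3/4) + (n-k) * log_2(1/4) = k * log_2(3) - 2n

so k * log_2(3) = n, that is, k = n / log_2(3) ≈ 0.631 * n.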

This means that when the fraction of heads is smaller than 0.631, the dealer is more likely to have used the fair coin.

Therefore, our original intuition actually was incorrect.

If there are 63 heads in a series of 100 coin tosses, then the dealer was more likely to have used the fair coin, even though 63 is closer to 75 than to 50.
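This comparison is easy to check numerically; working with log-probabilities avoids underflow for n = 100 (a minimal sketch, not part of the lecture):

```python
from math import log2

n, k = 100, 63  # 100 flips, 63 of them heads

# Log-probabilities (base 2) of this outcome under each coin.
log_p_fair = n * log2(1 / 2)                            # (1/2)^n
log_p_biased = k * log2(3 / 4) + (n - k) * log2(1 / 4)  # (3/4)^k * (1/4)^(n-k)

print(log_p_fair, log_p_biased)  # the fair coin wins, but only barely
print(n / log2(3))               # equilibrium threshold: ~63.09 heads
```

Since 63 falls just below the threshold of about 63.09 heads, the fair coin is (slightly) more likely.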

The dealers in traditional Yakuza-run casinos were shirtless to avoid accusations of tampering with the dice, because a shirtless dealer is less likely to switch the fair dice for the biased dice.

But the reality was that even shirtless dealers were able to switch the dice, and we will model these situations with two coins up the dealer's sleeve.

Instead of the previous case, where the dealer ran a series of

Â 5:36

coin flips but always used either the fair or the biased coin, we will assume that the dealer can now change the coin at any moment, with probability 0.1.

So, after watching a sequence of flips, can you tell when the dealer was using the fair coin and when he was using the biased coin?
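To make this setup concrete, here is a minimal simulation of such a dealer (the function name and seed handling are my own, not from the lecture):

```python
import random

def crooked_dealer(n, switch_prob=0.1, seed=0):
    """Simulate n flips by a dealer who may switch coins between flips.

    Returns (flips, coins), where coins records the hidden choice:
    'F' = fair coin (P(heads) = 1/2), 'B' = biased coin (P(heads) = 3/4).
    """
    rng = random.Random(seed)
    coin = rng.choice("FB")
    flips, coins = [], []
    for _ in range(n):
        p_heads = 0.5 if coin == "F" else 0.75
        flips.append("H" if rng.random() < p_heads else "T")
        coins.append(coin)
        if rng.random() < switch_prob:  # switch coins with probability 0.1
            coin = "F" if coin == "B" else "B"
    return "".join(flips), "".join(coins)

flips, coins = crooked_dealer(20)
print(flips)
print(coins)  # the hidden sequence we are trying to reconstruct
```

We observe only the flips; the coin sequence stays hidden, which is exactly what makes the problem hard.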

Our attempt to read the dealer's mind brings us to the following casino problem: given a sequence of coin flips, determine when the dealer was using the fair coin and when he was using the biased coin.

Do you think it is a well-formulated computational problem?

Â 6:20

Of course it is not, because any outcome of coin tosses could have been generated by any combination of fair and biased coins.

Therefore, we need to grade different scenarios, for example BBBBB or FFFFF.

Â 6:43

But if we try to do this, then the question arises: how can we explore and grade 2^n possible scenarios for coin tosses?

Here's a simple approach to reading the dealer's mind one window at a time.

Let's choose a small window, say a window of five tosses, and for each such window, decide which coin is more likely to have generated the outcome in this window.

We already know how to answer this question.

For example, in this window, HHHTH, 80% of the tosses are heads, so it is more likely to have been generated by the biased coin.

In the second window, only 60% of the tosses are heads, and therefore it is more likely to have been generated by the fair coin.

We will actually use the ratio of the blue and green probabilities, and we will decide which coin generated the given outcome by simply computing this ratio.

If this ratio is smaller than 1, then the biased coin is more likely.

If this ratio is larger than 1, then the fair coin is more likely.

We will also introduce the log-odds ratio, which is simply the base-2 logarithm of this ratio; in this case, as we know, the base-2 logarithm of the ratio equals

(number of tosses) - log_2(3) * (number of heads)

Our decision rule for deciding whether a given window was generated by the biased or the fair coin will simply be: if the log-odds ratio is less than zero, then it is the biased coin; if it is larger than zero, then it is the fair coin.

This decision is represented by this simple rule.

Â 8:39

So, to continue reading the dealer's mind, we will go through the whole sequence, each time computing which coin is more likely, and classify all the windows as more likely to have been generated by the biased or the fair coin.
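The sliding-window procedure can be sketched in a few lines of code; the function names and the choice to slide one position at a time are my own, not from the lecture:

```python
from math import log2

LOG2_3 = log2(3)

def log_odds(window):
    """Base-2 log of P(window | fair) / P(window | biased):
    (number of tosses) - log2(3) * (number of heads).
    Positive -> fair coin more likely, negative -> biased coin."""
    return len(window) - LOG2_3 * window.count("H")

def classify_windows(flips, w=5):
    """Label each length-w window 'F' (fair) or 'B' (biased)."""
    return ["F" if log_odds(flips[i:i + w]) > 0 else "B"
            for i in range(len(flips) - w + 1)]

print(log_odds("HHHTH"))  # negative: biased coin more likely (80% heads)
print(log_odds("HHTTH"))  # positive: fair coin more likely (60% heads)
```

The two example windows match the lecture's 80%-heads and 60%-heads cases.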

Â 8:57

The question, however, arises: what are the disadvantages of this approach?

The obvious disadvantage of the sliding-window approach is that different windows may classify the same flip differently.

Also, the choice of the window length always affects our conclusions.

But how do we know how to choose the window length?

Â 9:21

All of that may be an interesting introduction to makeshift casinos and coin flipping, but what does it have to do with biology?

To explain what coin flipping has to do with biological applications, we will start from a simple example of CG-islands and then move to more complex examples.

Â 10:06

The human genome has GC content 46%, while platypus has GC content 58%, which means that, on average, if the distribution were uniform, you would expect both nucleotides G and C to appear in the human genome with a probability of 23%.

Therefore, you would expect each of the dinucleotides CC, CG, GC, and GG to appear in the genome with frequency 5.29%.
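This expected frequency is just the uniform-model arithmetic, which a quick check confirms:

```python
gc_content = 0.46           # GC content of the human genome (from the lecture)
p_c = p_g = gc_content / 2  # 0.23 each, assuming a uniform distribution
expected_cg = p_c * p_g     # expected frequency of any fixed dinucleotide
print(expected_cg)          # ~0.0529, i.e. 5.29%
```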

But in reality, the frequency of CG in the human genome is 5 times smaller.

Â Why?

Methylation is a DNA modification that adds a methyl (CH3) group to a cytosine nucleotide, and it often happens to the C in a CG dinucleotide.

Â 11:01

The resulting methylated cytosine has a tendency to deaminate into thymine, and as a result of methylation, CG is the least frequent dinucleotide in many genomes.

However, methylation is often suppressed in the areas around genes, called CG-islands, where CG appears frequently.

In this example, this would be a CG-island.

And the question arises: suppose you want to find genes; how would you search for CG-islands as a prerequisite for finding genes?

Well, if we were to use the same paradigm of the log-odds ratio, and simply classify CG-islands as more likely in areas where this log-odds ratio is less than 0, and non-CG-islands as more likely in areas where the log-odds ratio is larger than 0, then we arrive at a classification algorithm for CG-islands.
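By analogy with the coin windows, such a classifier could score each genomic window by a CG log-odds ratio. The sketch below uses illustrative, assumed CG frequencies (the lecture gives no specific numbers); the function name and model are mine:

```python
from math import log2

# Illustrative, assumed CG-dinucleotide frequencies (NOT from the lecture):
# CG is rare genome-wide but enriched inside CG-islands.
P_CG_GENOME = 0.01
P_CG_ISLAND = 0.05

def cg_log_odds(window):
    """Base-2 log of P(window | non-island) / P(window | island),
    scoring each overlapping dinucleotide as CG or non-CG.
    Positive -> non-island more likely, negative -> CG-island."""
    cg = sum(window[i:i + 2] == "CG" for i in range(len(window) - 1))
    non_cg = (len(window) - 1) - cg
    return (cg * log2(P_CG_GENOME / P_CG_ISLAND)
            + non_cg * log2((1 - P_CG_GENOME) / (1 - P_CG_ISLAND)))

print(cg_log_odds("ACGCGT"))  # negative: looks like a CG-island
print(cg_log_odds("AATTAA"))  # positive: looks like background genome
```

This inherits exactly the sliding-window weaknesses discussed next, which is what motivates a better model.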

However, there are a few issues.

As before, different windows may classify the same position in the genome differently.

It is not clear how to choose the length of the window used to find CG-islands, and, very importantly, it is not clear whether we should use the same window length for different regions of the genome.

In the next section, I will describe the notion of Hidden Markov Models, which provides a new, better paradigm for problems like finding CG-islands and many, many other problems in bioinformatics.