The Great Flip Experiment: Heads vs. Tails

It started out simply enough. I was writing an article on Ultimate strategies (part of Paul’s How to Build a Championship Ultimate Team series) and I came to the part of the story on whether it’s better to pull or receive to open the game. Anyone who knows me knows that I believe that the team receiving the opening pull tends to win more games than chance would seem to allow. So naturally, the best way to ensure your team receives the pull is to win the pre-game disc flip and elect to receive. Otherwise, you’re relying on your opponent to make that decision in your favor.

I am also of the opinion that the disc flip is biased toward either heads or tails. My suspicion is that the disc has a greater probability of landing on tails (face down). This means that when presented with the offer to call the flip, it makes sense to call “tails” for a single flip or “even” on a double flip. It’s kind of a bold statement to say the disc lands upside down more often than not. Fortunately, it’s a quick and easy test to perform; I’ll just go out in my yard, flip the disc a couple of times and tally the results. It shouldn’t take more than 5 minutes to accomplish, then it’s back inside where it’s warm and back to the story.

Collecting the Data

Before I went outside, I decided to make a quick call to my son, Ray (Sockeye #99). Ray works in the University of Washington’s Applied Physics Lab and has a degree in Applied Mathematics. He’s a good resource for questions about collecting meaningful test data. I asked Ray how many flips I needed to be confident of the results. “At least 500,” he answered. I looked out the window and could see the remains of the previous day’s sleet storm still on the ground. The temperature was still hovering around freezing. The prospect of standing out in freezing weather and flipping a disc 500 times was not appealing to me. So I then asked the fateful question, “Can I do the test indoors on carpet?” Would it yield the same results? Like a good son, Ray said yes.

I spent the next two hours flipping a disc in my living room. I recorded the data in 10 groups of 50 tosses per Ray’s suggested testing procedure. The results far exceeded my expectations. The data strongly supported my contention that the disc lands more often on tails than it does on heads. I sent the data off to Ray for a confidence assessment and he sent back a warning that the data seemed a bit odd. The heads/tails variations between the 10 groups of 50 flips seemed a bit too skewed for his comfort. This is the very reason probability tests are sometimes done in smaller subgroups. You can compare the subgroups to each other, which can give you a more complete understanding of the data than just looking at the overall heads/tails ratio for the 500 flips. In most natural phenomena there is an expected variability in the groups, something called a Gaussian (Normal) Distribution or bell curve. If the differences in the groups are near the Gaussian Distribution, then you can use basic statistics to make statements about your confidence in your conclusions given the data.

So the data distribution was a bit odd and, in addition, Ray felt the probability for a tails result was much higher than he expected. I got the answer I wanted, but it was too good. There’s an interesting thing about researchers: good researchers keep looking for the answer even after they find one they like. This is what makes them different from radio talk show hosts and conspiracy theorists.

Well, off to the great & cold outdoors. I repeated the test procedure out in my yard and sent the data off to Ray for assessment. It was close but the data wasn’t giving me the answer I wanted. This data showed a slight tendency towards landing heads. Ray came back with the observation that the data was much better behaved and the results were closer to what he expected. Ray had also expected a close to even split, with a slight edge to tails. This was all well and good, but something had caught my attention during the testing. The disc bounced more on the carpet than it did on the grass. Is this disc bounce important? Inquiring minds want to know.

With that in mind, I went off to a local sports complex and repeated the test again on artificial field turf. Field turf is bouncier than grass but less bouncy than my living room carpet and lacks any bumps to catch the disc edge. As I expected, the results were somewhere in between grass and carpet. The data was decidedly biased towards tails, but not as strongly as it was for testing on my carpet.

There was still a piece missing however. In one final test, I went off to the local kiddy playground and tested on wood chips. Wood chips are much softer than any other surface I tested on. Wood chips were also bumpy enough to catch the disc edge. While games aren’t played on wood chips, I felt the wood chips were a reasonable analog for a muddy field without my having to deal with the associated mess. As I had somewhat expected, the test results now favored the disc landing right side up.

In the four tests the results were mixed. Wood chips yielded a strong heads result, my grass lawn yielded a weak heads result, and both field turf and carpet yielded a strong tails result.

Having completed the single disc flip probability testing I was ready to take a look at single disc flips vs. double disc flips. There have been attempts to eliminate any bias in the pre-game disc flip by going to a double disc flip. The double disc flip does two things: first, it reduces the bias in the probability (tendency for the flipping outcome to favor heads or tails) and second, it guarantees that an odd combination of disc results is never more probable than an even result (if you’re using two discs with the same flipping bias). This is because, if the probability of one disc flip result is greater than the other (it doesn’t matter which), the combinations that make for an even result have a higher probability than the combinations that make for an odd result. The math to support this statement follows along with the data presentation below.

So folks, here’s the summary. On a hard surfaced field, a disc flip is more likely to end up tails. On a soft field, it is more likely to end up heads. In a double disc flip, the probability never favors odd. You read it here first!

Two Fun Facts:

Your mind can play tricks on you when you are standing for hours flipping discs.
When standing in a kiddy play area flipping discs, women with small children really try to avoid you.

What follows below is the more scholarly presentation of the test data.

The Results

Figure 1, Raw Data

Figure 1, Raw Data contains a bar chart representation of the data collected in each of the four trials. Each trial consisted of 500 flips of a single disc. The same disc was used for all four trials. The data shows differences in the number of heads and tails depending on the surface being tested on. Or does it? A big part of any testing protocol is including measures to show your data is correct and not a statistical fluke.

For these trials, we decided to collect the data in 10 groups of 50 flips each. That way we could compare the 10 groups within each trial against each other and look at the statistical variations. If the variations create a certain pattern then there is some definable confidence in the results. Unfortunately, the ten groups of data collected are not sufficient in number to adequately describe the overall distribution of the heads/tails variations. To continue pursuing this analysis technique, the data needs to be enhanced using some tricks out of the statistician’s tool bag.

Ray used a process called bootstrapping to enhance the data which can give us an estimate of the accuracy of our results. I’ll let Ray explain it:

“Since each disc flip (or observation) is independent from any prior or future flips, the order of the recorded data isn’t really important. The 10 groups of 50 flips that were realized during this experiment represent only one way to choose those 10 groups from the data, and hence, form an incomplete description of the variance in the data. One could imagine randomizing the observations over and over again while keeping track of the heads/tails ratio in the new groups of 50. If you do this enough times and graphically plot your data, you will have a more accurate representation of the distribution of some statistic in your data (in this case the statistic is the heads/tails ratio). This is essentially the process of bootstrapping (with a few important things left out to keep its definition simple). In the case where the results form a normal distribution for your statistic, bootstrapping becomes an approximation to more basic statistical methods like calculating the standard error of a sample statistic.”

That should be clear enough. Now we can look at the individual trials to assess their goodness.
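
For readers who want to try the resampling Ray describes, here is a minimal Python sketch. It is not Ray’s actual code, and it does not use the experiment’s data: the 500 flips are simulated with a made-up tails probability of 0.55, and the function names are invented for illustration. It uses the textbook form of bootstrapping (drawing groups of 50 with replacement), which may differ in detail from what Ray did.

```python
import random

def bootstrap_tails_fraction(flips, group_size=50, n_resamples=2000, seed=0):
    """Draw many groups of flips (with replacement) and return each group's tails fraction."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_resamples):
        group = rng.choices(flips, k=group_size)  # sampling with replacement
        fractions.append(sum(group) / group_size)
    return fractions

def percentile_interval(values, level=0.95):
    """Return the central interval holding roughly `level` of the resampled values."""
    ordered = sorted(values)
    tail = (1 - level) / 2
    low_index = int(tail * (len(ordered) - 1))
    high_index = int((1 - tail) * (len(ordered) - 1))
    return ordered[low_index], ordered[high_index]

# Hypothetical data: 1 = tails, 0 = heads, simulated with an assumed 55% tails bias.
data_rng = random.Random(42)
flips = [1 if data_rng.random() < 0.55 else 0 for _ in range(500)]

fractions = bootstrap_tails_fraction(flips)
low, high = percentile_interval(fractions)
print(f"observed tails fraction: {sum(flips) / len(flips):.3f}")
print(f"mean of resampled groups: {sum(fractions) / len(fractions):.3f}")
print(f"middle 95% of groups of 50: {low:.2f} to {high:.2f}")
```

Plotting a histogram of `fractions` gives a picture comparable in spirit to the probability distribution figures that follow.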

Wood Chips

Figure 2, Wood Chip Probability Distribution

Not a bad distribution; it’s fairly well behaved. It’s a bit lopsided, but that could be a function of the resampling.

Grass

Figure 3, Grass Probability Distribution

Figure 3 shows the distribution of disc tosses landing on grass. Again, it isn’t too bad, but it does have the same sort of asymmetry as the wood chip tosses.

Turf

Figure 4, Field Turf Probability Distribution

Figure 4 shows the distribution of toss results on synthetic field turf.

Carpet

Figure 5, Indoor Carpet Probability Distribution

Figure 5 shows the distribution of the tosses conducted using indoor carpet as the testing surface. Although this was the trial which caused questions initially, it does seem relatively well behaved.

What does it all mean?

Figure 6, Confidence Results

Figure 6 is the actual meat and potatoes of this report.

The boxes graphically represent the probability of the heads and tails results of the disc tosses for the four surfaces tested. For any single test there are two boxes: a notched, hourglass-shaped blue box (heads) and a rectangular red box (tails). The red horizontal line in the middle of each box is the average or expected value of heads or tails for that test session. If you add the heads average to the tails average, you’ll see it adds up to 100% for each trial.

The height of each box is a way of showing that the actual probability can’t be resolved to a single number. Rather, the expected probability can be found within an interval that has a certain degree of confidence associated with it. In the case of this graph, 95% confidence intervals were reported, which means that if we repeated these experiments 100 times, we would expect the actual probability to have been within these limits 95 of those 100 times. The confidence limits for a given experiment are determined by the shape of the measured probability distribution data (the variance) and by the number of data points. If you have lots of well behaved data the boxes are short. If you have just a few misbehaving data points, the boxes can be tall. Because our data is binary in nature (takes on only two possible values, 1 and 0) and all experiments collected the same number of data points, we should expect the 95% confidence limits to be nearly equal across experiments.
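
To see why the boxes should be about the same height, here is a small Python sketch using the standard normal-approximation formula for a proportion, rather than the bootstrap used for Figure 6. The tails counts below are hypothetical, chosen only to span a range of biases, and `proportion_ci` is a name made up for this example.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion (z=1.96 gives about 95%)."""
    p_hat = successes / n
    std_err = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * std_err, p_hat + z * std_err

N_FLIPS = 500  # flips per trial, as in the experiment

# Hypothetical tails counts, not the measured results.
for tails in (225, 250, 275, 300):
    low, high = proportion_ci(tails, N_FLIPS)
    half_width = (high - low) / 2
    print(f"{tails}/{N_FLIPS} tails: {low:.3f} to {high:.3f} (half-width {half_width:.3f})")
```

Under this approximation the half-width barely changes as the bias moves between 45% and 60%, because the number of flips is what mostly sets the interval size.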

To understand the significance of our results, we can compare the confidence intervals to one another. For a given experiment, if the heads and tails intervals don’t overlap then we can say they are statistically different (we’ve taken a few liberties here with this statement to keep things simple). Across experiments, we can also compare the intervals (heads to heads or tails to tails) to see if they overlap (share % values). If so, then we can’t say with 95% confidence that results aren’t due to chance.

So if we look at the wood chip data we can be pretty sure that you’ll get more heads than tails in a disc flip. If, on the other hand, you toss on synthetic field turf or in my living room, you can be pretty sure that you’ll get tails more often than heads. On grass things aren’t so clear cut. The data shows a slight tendency for the toss to land on heads, but a change in just 7 tosses would have given a different answer.

What we do know is that things harder than grass seem to produce tails and things softer than grass seem to produce heads. As for grass itself, there need to be more tests. How many more, you ask? What I’m hoping will happen is that the confidence intervals will shrink, but the expected values will remain somewhat stationary. This will validate the trend we see if we collect enough data. However, we must be prepared for the scenario where the confidence intervals shrink, but the ratio averages also move closer to 50/50, so our results never motivate us to reject the hypothesis that the disc/surface is unbiased. Such is statistics: we can’t prove anything, only disprove things.

Coin Flip

Figure 7, Coin Flip Simulation

Figure 7, Coin Flip Simulation shows that 5,000 flips should be enough if the data trends hold. In the computer simulation, Ray flipped a computerized coin 5,000 times and plotted the cumulative average as it changed during the course of the 5,000 flips. Ray repeated this simulation 5,000 times and then plotted all 5,000 simulations on one graph. Looking at this graph, you can see that as the number of tosses increases, the coin flip split gets closer to 50% and the variance approaches 0. It converges quickly in the beginning, but then less so after about 500-600 tosses. The graph shows that 500 coin tosses have a variation of ±8%. This is consistent with Figure 6, Confidence Results.
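
A scaled-down version of this kind of simulation can be sketched in Python as follows. This is not Ray’s code: it runs 200 repetitions instead of 5,000 to keep the run time short, the seed is arbitrary, and the function names are made up. Instead of plotting, it reports the largest distance from 50% seen across all runs at a few checkpoints.

```python
import random

def running_average(n_flips, p_heads=0.5, rng=random):
    """Return the cumulative fraction of heads after each flip."""
    heads = 0
    averages = []
    for flip_number in range(1, n_flips + 1):
        if rng.random() < p_heads:
            heads += 1
        averages.append(heads / flip_number)
    return averages

def worst_deviation(n_runs, n_flips, checkpoints, seed=0):
    """Largest |running average - 0.5| across all runs at each checkpoint."""
    rng = random.Random(seed)
    worst = {c: 0.0 for c in checkpoints}
    for _ in range(n_runs):
        averages = running_average(n_flips, rng=rng)
        for c in checkpoints:
            worst[c] = max(worst[c], abs(averages[c - 1] - 0.5))
    return worst

# Assumed settings: 200 runs of 5,000 fair flips each.
spread = worst_deviation(n_runs=200, n_flips=5000, checkpoints=(50, 500, 5000))
for flips, deviation in spread.items():
    print(f"after {flips:>4} flips: worst deviation = ±{deviation:.1%}")
```

The exact numbers depend on the seed and on how many runs are overlaid, since more runs give more chances for an unusually lopsided streak.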

If you take a magnifying glass and look way out at the right end of the graph, you can see that the error is less than ±2%. All we need to do is get 100 people flipping 100 discs 50 times on three different surfaced fields and we’ll have a better answer.

Now, why should you always call even on a double disc flip?

The double disc flip is used in Ultimate because it is less biased than a single disc flip. In the above study, it can be seen that a disc flip has a bias towards heads or tails depending on the surface conditions. If you have a good assessment of the field conditions, you can have the advantage on calling the disc flip. The double disc flip reduces any advantage. Here’s how:

Let’s say the conditions favor a 55% chance of the disc landing tails. This is the probability breakdown of the 4 possible toss combinations:

Tails – Tails = .55 x .55 = .3025

Tails – Heads = .55 x .45 = .2475

Heads – Tails = .45 x .55 = .2475

Heads – Heads = .45 x .45 = .2025

So, the probability of the double toss being even is .3025 + .2025 = .505 or 50.5%

The probability of the double toss being odd is .2475 + .2475 = .495 or 49.5%

You can see that the advantage to tails is much reduced, but it has been replaced with a sure advantage to an even call. It turns out you get the same probability for an even result if the original probability had favored heads.
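
The same arithmetic works for any bias. If p is the chance of tails on one disc, then the chance of even minus the chance of odd works out to (2p - 1) squared, which can never be negative. The short Python sketch below checks this over a range of biases. It assumes, as the example above does, that the two discs share the same bias and land independently of each other; `double_flip` is just an illustrative name.

```python
def double_flip(p_tails):
    """Return (P(even), P(odd)) for two independent discs with the same tails probability."""
    p_heads = 1 - p_tails
    even = p_tails * p_tails + p_heads * p_heads  # tails-tails or heads-heads
    odd = 2 * p_tails * p_heads                   # tails-heads or heads-tails
    return even, odd

for percent in range(0, 101, 5):
    p = percent / 100
    even, odd = double_flip(p)
    # even - odd equals (2p - 1)**2, so it is zero only at p = 0.5
    print(f"P(tails) = {p:.2f}: even = {even:.4f}, odd = {odd:.4f}, gap = {even - odd:.4f}")
```

Note that a 5-point edge on a single flip (55% against 50%) shrinks to half a point on the double flip.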

Figure 8, Double Disc Flip Probability

The blue line is the probability of the double disc flip being even. The red line is the probability of the flip being odd. For any probability of a heads result from a single disc flip, the double disc flip favors even (except at that one center spot where the chances are even).

Conclusion

So there you have it folks; the outcome of the pregame disc flip isn’t completely random. A hard field has a higher likelihood of producing tails, a soft field favors producing heads and the double flip always favors even. While my study certainly wasn’t definitive and it misses some criteria for a valid scientific study, I do believe the trends it indicates are real. If winning the pregame disc flip is important to your game strategy, this is something to keep in mind. I know that I will always call even.