Rankings: Even further under the hood

May 15, 2012

Recently Lou Burruss wrote a series of articles about the USAU rankings algorithm. Along with a long RSD thread, this has spurred a lot of conversation about the rankings, how they are generated, and how they should be used. In this article, I’ll try to go deep into the mathematics of the algorithm, and discuss what it means in relation to various proposals to change it or to use it differently. This is for geeks only; if you’re only interested in the algorithm insofar as it affects your team, you should stop reading.

Convergence

The algorithm is an iterative one, which brings up the question of convergence: will the numbers converge to a single numerical rating for each team, or might they swing around indefinitely? (I’m not sure why USAU only does 20 iterations – I’d go 1000 rounds at least. Why not?) In one of his articles, Lou and some commenters discussed the possibility of chaotic behavior. The good news is that we can prove this will almost certainly not happen. If your team has played in any round-robin pool of three or more teams, your rating will converge. If you’ve played a team that has played a round-robin, your rating will converge. If you’ve played a team that has played a team that has played a round-robin, your rating will converge, and so on. The only realistic scenario in which a team’s rating wouldn’t converge is one in which two teams only play each other. But teams like that are unlikely to meet the minimum-games requirement, barring serious collusion. If USAU implemented a rule that teams must play at least two different opponents for their games to count towards any rankings, this would effectively eliminate any possibility of non-convergence, and it would not affect the rankings of any other team.
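To make the discussion concrete, here is a minimal sketch of the iterative core in Python. The differentials and weights below are stand-in numbers, not USAU’s actual formulas; the point is the structure: each pass replaces every team’s rating with the weighted average of its game ratings.

```python
# Each game: (winner, loser, differential, weight). The differentials
# here are made-up placeholders; USAU derives them from the game score.
games = [
    ("A", "B", 400, 1.0),
    ("A", "C", 200, 1.0),
    ("B", "C", 150, 1.0),  # together: a three-team round-robin
]

teams = {t for g in games for t in g[:2]}
ratings = {t: 1000.0 for t in teams}  # seed every team at 1000

for _ in range(1000):  # iterate well past USAU's 20 rounds
    sums = {t: 0.0 for t in teams}
    wts = {t: 0.0 for t in teams}
    for winner, loser, diff, w in games:
        # Symmetry: the winner's game score is the loser's rating plus
        # diff; the loser's is the winner's rating minus the SAME diff,
        # and both sides use the SAME weight w.
        sums[winner] += w * (ratings[loser] + diff)
        sums[loser] += w * (ratings[winner] - diff)
        wts[winner] += w
        wts[loser] += w
    ratings = {t: sums[t] / wts[t] for t in teams}

print({t: round(ratings[t]) for t in sorted(teams)})
# -> {'A': 1200, 'B': 917, 'C': 883}
```

Run it and the numbers settle within a few dozen passes; the round-robin (an odd cycle) is what guarantees that, as the next section explains.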

The mathematical proof that the algorithm converges outside of those limited exceptions relies on the symmetry of rating differentials and of weights. That is, the rating differential for a given game is the same for both teams (if team A beats team B, then team A’s game score is team B’s rating plus some number, and team B’s game score is team A’s rating minus that same number), and the weight placed on the game is the same for team A as for team B.
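One way to frame it (my sketch, not the official write-up): a single pass of the algorithm is the affine map

```latex
r^{(t+1)} = P\,r^{(t)} + c,
\qquad
P_{ij} = \frac{w_{ij}}{\sum_k w_{ik}},
\qquad
c_i = \frac{\sum_j w_{ij}\, d_{ij}}{\sum_k w_{ik}},
```

where w_ij is the weight on the game between teams i and j, and d_ij is the signed differential from team i’s perspective. The matrix P is row-stochastic, so on a connected game graph its powers settle down as long as the graph is aperiodic; any odd cycle, such as a three-team round-robin, guarantees that, while two teams who play only each other form a periodic two-cycle that can oscillate forever. And the symmetry conditions (w_ij = w_ji and d_ij = -d_ji) are what make the winners’ gains cancel the losers’ losses in the long-run average, so the iteration settles at a fixed point rather than drifting upward or downward without bound.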

Understanding these preconditions allows us to better evaluate potential modifications to the algorithm. For instance, someone suggested that weights should be based on whether the tournament was the first, second, etc. of the season for the team. However, this would break the weight symmetry whenever one team was at its second tourney but its opponent was at its first, and thus it might affect the convergence of the algorithm in unexpected ways.

Tweaking the Algorithm

Of course, just because the algorithm converges to a set of ratings doesn’t mean the numbers it converges to are any good. Many people have put forward suggestions for changes – some minor, some major. I submitted my own detailed proposal last summer. Most of the rest of this article will talk about how some of these changes might affect the working of the algorithm.

From the standpoint of the algorithm, the easiest tweaks to make are ones that just affect the rating differentials and the weights. As long as the symmetry isn’t broken, one could use any numbers for the rating differentials and weights that one chooses, and the algorithm will still converge. Whether they make sense is another matter entirely.

Right now, the differential that accrues from a 15-10 game is closer to that of a 15-14 game than to that of a 15-8 game. I would suggest that this is exactly the opposite of how it ought to be. This has been brought up in a number of places, so I won’t dwell on it.

Regarding weights, it has been pointed out that the relative weights of games change depending on the date on which the algorithm is run. For instance, the ratings that came out right after Easterns weighted Easterns results 2.16 times as heavily as Queen City results; the next week’s ratings weighted Easterns only 1.78 times as much. Thus, the rankings can change even if no new games have been played. One could switch to an exponential system in which the weight for each week is a fixed percentage higher than the weight for the previous week, or one could dispense with formulas and simply divide the season into blocks and assign games in a given block a fixed weight, as Lou Burruss has suggested. (USAU could then require that teams play games with some minimum total weight in order to have their rating count, rather than just requiring 10 games, period.)

I would also advocate weighting games based not only on date, but also on duration – a game that ends at 6-2 should not be as meaningful as a game that ends 15-5. Again, changes like this wouldn’t have any effect on the convergence of the algorithm.
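Here is a sketch of what those alternatives could look like in code. All of the constants (the 1.08 weekly ratio, the block boundaries, the duration normalization, the eligibility threshold) are illustrative inventions, not proposals:

```python
def exp_week_weight(week: int, ratio: float = 1.08) -> float:
    """Exponential scheme: weight grows by a fixed ratio each week, so the
    relative weight of week j vs. week k is always ratio**(j - k), no
    matter when the algorithm is run."""
    return ratio ** week

def block_weight(week: int) -> float:
    """Block scheme: divide the season into blocks with fixed weights."""
    for last_week, w in [(8, 0.5), (12, 0.75)]:  # illustrative boundaries
        if week <= last_week:
            return w
    return 1.0

def duration_factor(winner_score: int, loser_score: int) -> float:
    """Discount short games: a 6-2 game says less than a 15-5 game.
    (Normalizing by 20 total points is an arbitrary illustrative choice.)"""
    return min(1.0, (winner_score + loser_score) / 20.0)

# A 15-5 game in week 12 vs. a 6-2 game in week 8:
w_long = exp_week_weight(12) * duration_factor(15, 5)   # full duration credit
w_short = exp_week_weight(8) * duration_factor(6, 2)    # only 0.4 of it
print(round(w_long / w_short, 2))  # this ratio never changes as weeks pass

# And instead of a flat 10-game minimum, USAU could require a minimum
# total weight before a team's rating counts:
def eligible(game_weights) -> bool:
    return sum(game_weights) >= 8.0  # threshold purely illustrative
```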

Blowouts

Those are the easy tweaks. What about the problem where teams are penalized for playing inferior opponents, even if they win 15-0? This means that a TD could really handicap a team simply by placing a bad team or two into its pool. The fix that everybody seems to agree on is that such games should be ignored – in other words, given weight zero – if the winning team’s rating is more than the losing team’s rating plus 606 (the current maximum rating differential). I said above that you can change weights of games as much as you want and it wouldn’t affect the convergence of the algorithm… so is there any potential problem?

Well, there is. The issue is that you are using the ratings themselves to determine which games should be given weight zero. This creates a feedback loop: calculate the ratings (that is, run 1000 iterations until convergence), figure out which scores to ignore, re-run the ratings ignoring those scores… but then there might be more scores that need to be ignored, requiring another calculation of the ratings, and so on. I can imagine a situation where a game in which team A blows out team B is ignored, but the subsequent increase in team A’s rating causes the algorithm to no longer ignore some games in which team A got crushed, which drops team A’s rating, and all of a sudden team A is no longer that far ahead of team B, and the rule would now reinstate that game. And on we go…

There is a pretty simple solution, though: flag any game that is ignored on one pass and then not ignored on a subsequent pass, and decree that it will henceforth be included no matter what. If the status of a game has fluctuated between ‘ignore’ and ‘don’t ignore’, then the difference between the two teams’ ratings is very likely close to 606 one way or the other, so including the game will not make much difference.
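Here’s a sketch of that outer loop, with the flip-flop flagging built in. The inner iteration is the same placeholder setup as the first sketch, and 606 is the current cap:

```python
CAP = 606  # the current maximum rating differential

def converge(games, rounds=1000):
    """Inner loop: games is a list of (winner, loser, diff, weight)."""
    teams = {t for g in games for t in g[:2]}
    ratings = {t: 1000.0 for t in teams}
    for _ in range(rounds):
        sums = {t: 0.0 for t in teams}
        wts = {t: 0.0 for t in teams}
        for winner, loser, diff, w in games:
            sums[winner] += w * (ratings[loser] + diff)
            sums[loser] += w * (ratings[winner] - diff)
            wts[winner] += w
            wts[loser] += w
        # A team whose every game is ignored keeps its old rating.
        ratings = {t: sums[t] / wts[t] if wts[t] > 0 else ratings[t]
                   for t in teams}
    return ratings

def rate_with_blowout_rule(games):
    locked_in = set()  # games that flip-flopped: always included from now on
    ignored = set()    # indices of games currently given weight zero
    while True:
        weighted = [(wn, ls, d, 0.0 if i in ignored else w)
                    for i, (wn, ls, d, w) in enumerate(games)]
        ratings = converge(weighted)
        new_ignored = {i for i, (wn, ls, d, w) in enumerate(games)
                       if i not in locked_in
                       and ratings[wn] > ratings[ls] + CAP}
        # Anything ignored last pass but not this pass has flip-flopped:
        # lock it in so it can never be ignored again.
        locked_in |= ignored - new_ignored
        if new_ignored == ignored:
            return ratings  # the ignore set is stable; we're done
        ignored = new_ignored
```

Because a game can only be locked in after leaving the ignore set, and between lockings the ignore set can only grow, this loop is guaranteed to terminate.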

As a side note, much the same technique can be applied to games for which no score is reported, only a winner and a loser. One can argue that such games shouldn’t be taken into account at all, but if they are, what should be done with them? In a comment on this page, Adam Lerman claims to have deduced that USAU currently treats such games as 15-0 scores. This seems like a bad policy to me – for one thing, that’s far from where the game likely ended up; for another, it gives the winning team (which is more likely to have a rating that matters) every incentive to keep the score ambiguous. Instead, I think such games should be given the lowest possible rating differential (right now, that’s around 160) and low weight, and should be ignored entirely if the victor’s rating is more than 160 above the loser’s. That way, unreported scores can’t hurt the winner’s rating, but won’t help either if the winner was generally favored going into the game. (Think about it: if you think team A should be rated around 1350 and team B around 1000, and I told you that team A beat team B by an unspecified amount, would that change your opinion of their ratings much?)
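In code, the policy I’m suggesting for winner-only reports might look like this (160 is roughly today’s minimum differential; the low weight is a placeholder):

```python
MIN_DIFF = 160     # roughly the current minimum rating differential
LOW_WEIGHT = 0.25  # placeholder for "low weight"

def unreported_game(winner_rating, loser_rating):
    """Return (differential, weight) for a game reported without a score."""
    if winner_rating > loser_rating + MIN_DIFF:
        return 0.0, 0.0           # favorite won by some amount: ignore it
    return MIN_DIFF, LOW_WEIGHT   # otherwise count it, at the minimum diff

print(unreported_game(1350.0, 1000.0))  # -> (0.0, 0.0): changes nothing
print(unreported_game(1000.0, 1350.0))  # an upset still counts: (160, 0.25)
```

Note that this rule, like the blowout rule, consults the ratings to decide a weight, so it belongs in the same flag-and-lock outer loop described above.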

Other Modifications

Fiddling with differentials and weights is relatively straightforward. Other suggested modifications may have more profound implications for the algorithm (e.g., a “regression to the mean” factor). These would alter the algorithm’s properties in ways that make the convergence proof inapplicable, and they would have to be tested independently. Alternatively, they could be applied non-iteratively, as a one-time modification to the numbers after the algorithm has run its course.

For instance, let’s say you want to penalize forfeits without advantaging the team that was forfeited to. To do so in the iterative part of the algorithm would require a break in the symmetry of the weights or the differentials, which might have undesirable implications for convergence. It would be simpler to lower a team’s rating by a fixed number per forfeit only after all other calculations have been made.
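As a sketch (the penalty size is an arbitrary stand-in), the post-processing step is just:

```python
FORFEIT_PENALTY = 100  # illustrative value, not a proposal

def apply_forfeit_penalties(ratings, forfeits):
    """After the iterative algorithm has fully converged, dock each team a
    fixed amount per forfeit. Opponents are untouched, so the symmetry
    that the convergence proof depends on is never broken."""
    return {team: r - FORFEIT_PENALTY * forfeits.get(team, 0)
            for team, r in ratings.items()}
```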

Using the Ratings

We can use our understanding of the mechanics of the ratings system to better shape how bids are allocated. There are at least two ways I can think of. First, we should realize that even if the algorithm were optimal, there would always be a lot of noise and arbitrariness in the ratings: a team rated 1603 is essentially indistinguishable from one rated 1598. (As I write this, there are six open teams in that range, ranked 17 through 22.) That inherent uncertainty (among other things) leads me to believe that the last bids to nationals shouldn’t be based merely on a comparison of one team’s rating to another’s, but rather should use a broader measure of regional strength. Such a policy would dampen the effects of noise a bit, and would also reduce teams’ ability to manipulate the rankings for the purpose of nationals bids. Admittedly, some philosophical arguments have been raised against this. For proposals and lengthy discussion, see here or here.

Another implication is the following. Irrespective of the algorithm, there is a reasonable argument to be made that no region should earn more than, say, 4-5 bids to nationals. After all, the point of the series is to crown a national champion, and if a team can’t finish in the top 25% (or so) of teams at regionals, they’re unlikely to be worthy of the honor.

But there is also a danger that the algorithm could systematically overrate an entire set of teams from a given region. Here’s how it might happen: let’s say that no team from a certain region X has played a team from outside. The algorithm runs, and it assigns teams ratings like normal, with teams in X and outside of X alike being rated between 500 and 1500. Now, if one 1000-rated team from region X subsequently plays one game against a 1000-rated team from outside the region and wins 15-7, what will happen is that every single X rating will go up by 303, and every single non-X rating will drop by the same amount. In other words, one single game between mediocre teams could cause an entire region to gain a 606-point advantage over everyone else.
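This is easy to verify numerically. The shift splits evenly (+303 and −303) because the one cross-regional game pins region X’s ratings exactly 606 above where they would otherwise sit relative to everyone else, while the overall average is preserved. A sketch, reusing the converge routine defined in the blowout section above and taking 606 as the differential for a 15-7 win:

```python
# Two regions, each a symmetric round-robin, with no games between them.
region_x = [("X1", "X2", 300, 1.0), ("X2", "X3", 300, 1.0), ("X3", "X1", 300, 1.0)]
region_y = [("Y1", "Y2", 300, 1.0), ("Y2", "Y3", 300, 1.0), ("Y3", "Y1", 300, 1.0)]

before = converge(region_x + region_y)
print({t: round(r) for t, r in sorted(before.items())})
# -> every team sits at exactly 1000

# Now one 1000-rated X team beats one 1000-rated Y team 15-7 (diff 606):
cross = [("X1", "Y1", 606, 1.0)]
after = converge(region_x + region_y + cross)
print({t: round(r) for t, r in sorted(after.items())})
# -> X1, X2, X3 all at 1303; Y1, Y2, Y3 all at 697
```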

Obviously, that is an extreme example. But if fuel prices continue to rise, cross-country match-ups may become rarer and rarer, and relative ratings between regions will depend on fewer and fewer results. A team that has a good Presidents’ Day but has three of its stars come down with food poisoning on the weekend of Chicago Invite could then have an outsized impact on an entire region or regions, especially if a couple of tournaments that usually draw a nationwide field get rained out. A cap on the number of bids one region can earn would be a cheap bit of insurance against an extreme effect like this. (This is especially worrisome in D-3, where most teams don’t stray far from home and there are fewer wildcards; I’d recommend a cap of 3 per region in that division.)

Conclusion

The rankings system is a legacy tool, developed decades ago by Sholom Simon for unofficial purposes and inherited unchanged by today’s USAU. It has its flaws, some more obvious than others. As it is used for more and more official purposes, it is only proper that it come under heightened scrutiny. It is incumbent upon USAU to make sure the algorithm accords with common sense as much as possible, and to use the numbers it spits out with an understanding of the mathematical processes that produced them.

Feature photo: Observer Steve Wang crouches in the distance as a play develops at the Mixed Club Championships. (Photo by Brandon Wu – UltiPhotos.com)
