While Steelers quarterback Ben Roethlisberger had a slightly down year in 2016, his performance over most of his career has shown steady improvement. At a time when we are looking for the smallest, most insignificant nugget of football news to cling to as we wait out the final few weeks until training camp begins, many of us begin to wonder how we can extrapolate past performance into predictions for the upcoming season.
As it turns out, it’s really not that hard, if you are willing to settle for basic statistical modeling.
Using a standard statistical concept called linear regression, we can plot discrete data points and determine, from those points, a line that roughly approximates the overall trajectory. By extending that line, we can estimate future performance with some degree of accuracy.
In other words, we can make at least an educated guess.
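For the curious, here is a minimal sketch of what fitting that line involves, using the ordinary least-squares method. The function names and the season totals are made up for illustration; they are not Roethlisberger's actual numbers.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Read a value off the trend line at position x."""
    return slope * x + intercept


# Hypothetical season numbers and yardage totals, for illustration only.
seasons = [1, 2, 3, 4, 5]
yards = [3000, 3200, 3100, 3500, 3600]

slope, intercept = fit_line(seasons, yards)
next_season = predict(slope, intercept, 6)

print(slope)        # 150.0 yards gained per season along the trend line
print(next_season)  # 3730.0 projected yards for season 6
```

Extending the line one season past the data is the "educated guess"; the further out you go, the less the line can be trusted.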
Of course, there are far too many variables to model into the distant future: age, injury, the arrival or departure of other players, even legal, financial or familial issues. But we can use a player’s past career data to predict, at the very least, their performance in the upcoming season.
In the case of quarterback Ben Roethlisberger, we have a confluence of two factors that make linear regression especially useful: plenty of data, and a fairly steady improvement over time.
So, with all that said, how could 2017 stack up against past seasons for the potential future Hall-of-Fame passer?
A New Normal
Let’s start with the most basic quarterback statistic: yards. Of course, it’s impossible to compare even this, the simplest of numbers, from one season to the next, without massaging the data. That’s because injuries change the number of games played from one year to the next. So, we have to extrapolate (“normalize”) yards for each season to determine performance in a perfect, 16-game season.
In the accompanying chart, the blue line represents actual yards, while the red line represents extrapolated yards for 16 games. The straight lines are the trend lines, showing the linear regression for each data set.
The actual yards, with a varying number of games, show a broader deviation from the trend line, while the normalized yards form a tighter grouping. By leveling the playing field, we can eliminate injury as a variable. Don’t worry, though; we can still make use of injury history to determine the likelihood of missed time, which can be included in the final projections.
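The 16-game normalization described above is just per-game yardage scaled up to a full season. A quick sketch, with a hypothetical function name and made-up inputs:

```python
def normalize_to_full_season(yards, games_played, full_season_games=16):
    """Scale a season's yards to what the same per-game pace
    would produce over a full season."""
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    return yards / games_played * full_season_games


# Hypothetical example: 3,000 yards in a 12-game, injury-shortened season.
print(normalize_to_full_season(3000, 12))  # 4000.0

# A full 16-game season is left unchanged.
print(normalize_to_full_season(3200, 16))  # 3200.0
```

This assumes the per-game pace would have held for the missed games, which is exactly the simplification the normalized line makes.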
Sample size matters here, and the concept is simple: the larger the sample size, the more reliable the trend line. In a large data set, a single large deviation from the rest of the data has a small impact on the trend line. In a small sample, though, that same deviation results in a much larger standard error, which is basically the “wiggle room” for any future value predicted by the trend line. As we can see below for Le’Veon Bell, a small sample size leads to inaccurate estimation of a trend line.
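To see the sample-size effect in isolation, here is a rough sketch that drops the same outlier into a 4-point data set and a 12-point data set. It assumes "standard error" means the standard error of the regression (the typical distance of points from the trend line, with n - 2 degrees of freedom); the data is synthetic and the helper names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def standard_error(xs, ys):
    """Standard error of the regression: sqrt(SSE / (n - 2))."""
    slope, intercept = fit_line(xs, ys)
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))


def line_with_one_outlier(n, bump=300):
    """Synthetic data: a perfect line of slope 100, with the last point
    pushed up by `bump`. Not real player statistics."""
    xs = list(range(1, n + 1))
    ys = [100 * x for x in xs]
    ys[-1] += bump
    return xs, ys


small_xs, small_ys = line_with_one_outlier(4)
large_xs, large_ys = line_with_one_outlier(12)

slope_small, _ = fit_line(small_xs, small_ys)
slope_large, _ = fit_line(large_xs, large_ys)
se_small = standard_error(small_xs, small_ys)
se_large = standard_error(large_xs, large_ys)

print(round(slope_small, 1), round(slope_large, 1))  # 190.0 111.5
print(round(se_small), round(se_large))              # 116 80
```

The true slope is 100 in both cases. One outlier nearly doubles the fitted slope with four points, but barely moves it with twelve, and the standard error is noticeably larger in the small sample.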
When the standard error is viewed as a ratio of the total spread of values (the difference between the minimum value and the maximum value), we see how much a small sample size affects standard error. In Roethlisberger’s case, the spread for actual yards is 1,798 yards, while the standard error is 525 yards, for a ratio of 0.29. For Bell, the spread for actual yards is 805 yards, while the standard error is 452.5 yards. That’s a ratio of 0.56, indicating Bell’s margin of error is a much larger share of his observed range than Roethlisberger’s.
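The ratio arithmetic is easy to check. In this short sketch the spread and standard-error figures come straight from the paragraph above; only the function name is made up.

```python
def error_to_spread_ratio(standard_error, spread):
    """Standard error as a fraction of the max-minus-min spread."""
    return standard_error / spread


roethlisberger_ratio = error_to_spread_ratio(525, 1798)
bell_ratio = error_to_spread_ratio(452.5, 805)

print(round(roethlisberger_ratio, 2))  # 0.29
print(round(bell_ratio, 2))            # 0.56
```

The closer this ratio gets to 1, the less the trend line tells us beyond "somewhere in the range we have already seen."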
What all of this shows us is that we can use statistical modeling to predict future performance with a realistic degree of accuracy, but only for the upcoming season. Additionally, we can see that more seasons of data make for more accurate predictions.
That’s a lot to digest. But, in the rest of this series, we will apply these principles to make predictions for key numbers for the Steelers in 2017 — and then revisit them when the season is over to see how we did.