Methods for Evaluating Historical Players


Because we have gone to great lengths to level the playing field, and because you can use our Batting Register and Pitching Register reports to sort these players on a wide variety of statistics, it might appear that the statistics on this disk represent Diamond Mind's rankings of the greatest players in history.

In a way, that's exactly what they are. But we're not claiming that this is the last word on the subject of ranking historical players, and we're not putting this disk forth as our attempt to enter the debate about the best way to do this. That debate has gone on forever and will continue to go on forever, and we're quite content to let others carry on that debate.

Our goal was simply to provide you with another way to enjoy Diamond Mind Baseball. We wanted to take advantage of the available statistics and make reasonable adjustments for era and park effects. We've made every effort to do that to the best of our ability, and if you feel that we have accomplished that much, we're happy.

In fact, we do not believe there is a perfect way to compare players from one era to another, even though a number of people have used different methods to come up with rankings of the greatest players in baseball history, and some have written entire books on the subject.

In this section, we'll point out the strengths and weaknesses of some of the methods we're familiar with, including our own, and see if we can convince you that there's no one right way to rank historical players.

Plus/minus versus percentages

Even if people agree that it's important to evaluate players relative to league average, there's room for disagreement about how to do that.

One approach is to use plus/minus differences, as in "he hit five more homers than the average player would have hit in the same number of plate appearances."

Another is to use percentages, as in "he hit 70% more homers per plate appearance than the average player." A third approach is to use standard deviations (more on this in a moment), as in "he was 2.3 standard deviations above the mean."

We used the plus/minus approach, which has the advantage of avoiding gross distortions when a player puts up big numbers in a category where the average is low.

In 1919, for example, Babe Ruth hit 29 homers in 532 plate appearances. That's roughly 55 homers per 1000 PA in a league where the average player hit 6 per 1000 PA. In other words, Ruth's rate was about 900% of the league average.

A strict use of the percentage method would cause one to project Ruth for a little over 240 homers per 1000 plate appearances if he played today, or over 150 homers in a 162-game season.

The plus/minus approach gives Ruth credit for hitting 49 more homers per 1000 plate appearances than the average hitter. If a season consists of 650 plate appearances, that's about 32 more homers per season.

Today, the average player hits about 28 homers per 1000 PA, or about 18 in a season of 650 PA. Using the plus/minus method, Ruth's 1919 season translates into about 50 homers in today's environment. We're much more comfortable with results like this than with a strict application of the percentage approach.
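The two projections above can be reproduced with a few lines of arithmetic. The following is only a minimal sketch in Python, not the code used to build this disk; the function and constant names are made up for illustration, and the inputs are the figures quoted in the text. Because it uses Ruth's unrounded rate rather than the rounded 55 per 1000 PA, its results differ slightly from the rounded numbers in the paragraphs above.

```python
# Figures quoted in the text; rates are home runs per 1000 plate appearances.
RUTH_1919_HR = 29
RUTH_1919_PA = 532
LEAGUE_1919_RATE = 6.0   # average hitter in Ruth's 1919 league
MODERN_RATE = 28.0       # average hitter "today", per the text
SEASON_PA = 650          # season length used in the text


def rate_per_1000(events, opportunities):
    """Convert a raw count into a rate per 1000 opportunities."""
    return 1000.0 * events / opportunities


def project_plus_minus(player_rate, old_league_rate, new_league_rate):
    """Carry the player's difference from league average into the new era."""
    return new_league_rate + (player_rate - old_league_rate)


def project_percentage(player_rate, old_league_rate, new_league_rate):
    """Carry the player's ratio to league average into the new era."""
    return new_league_rate * (player_rate / old_league_rate)


ruth_rate = rate_per_1000(RUTH_1919_HR, RUTH_1919_PA)
plus_minus = project_plus_minus(ruth_rate, LEAGUE_1919_RATE, MODERN_RATE)
percentage = project_percentage(ruth_rate, LEAGUE_1919_RATE, MODERN_RATE)

# Scale the per-1000 rates down to a 650 PA season.
print(f"plus/minus: {plus_minus * SEASON_PA / 1000:.0f} HR per {SEASON_PA} PA")
print(f"percentage: {percentage * SEASON_PA / 1000:.0f} HR per {SEASON_PA} PA")
```

The plus/minus projection comes out near 50 homers per 650 PA, while the percentage projection lands well above 150.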

But the plus/minus approach is not without limitations. If a pitcher in the dead-ball era allowed no homers in 1000 batters faced, and the average pitcher allowed 6, that pitcher gets credit for preventing 6 homers. If you apply that result to a season when the normal homerun rate was 24 per 1000 BF, you'd rate that pitcher to allow 18 homers per 1000 BF.

Maybe that's the right answer, maybe it's not. The percentage method would rate him to allow zero homers today as well, and that doesn't seem reasonable. But it's not possible to say that 18 is definitely the right number, either.

Furthermore, a current-day pitcher can earn a homerun difference of -12 because he's pitching in an environment where the norm for homers is in the twenties. It's impossible for a pitcher from the dead-ball era to earn a homerun difference that low. It is also much less likely for a dead-ball era pitcher to earn a high homerun difference, and maybe that evens things out. But there's no way to be sure.
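The pitcher example and the floor on homerun differences can be checked the same way. This is a small illustrative sketch in Python using the rates quoted above (6 and 24 homers per 1000 BF); the function names are hypothetical and are not taken from the game or its tools.

```python
DEAD_BALL_RATE = 6.0   # league homers allowed per 1000 BF, dead-ball era
LATER_RATE = 24.0      # league homers allowed per 1000 BF in the later season


def project_plus_minus(player_rate, old_league_rate, new_league_rate):
    """Carry the pitcher's difference from league average into the new era."""
    return new_league_rate + (player_rate - old_league_rate)


def project_percentage(player_rate, old_league_rate, new_league_rate):
    """Carry the pitcher's ratio to league average into the new era."""
    return new_league_rate * (player_rate / old_league_rate)


def lowest_possible_difference(league_rate):
    """A pitcher cannot allow fewer than zero homers, so the best
    difference he can earn is zero minus the league rate."""
    return 0.0 - league_rate


# A dead-ball pitcher who allowed no homers at all.
print(project_plus_minus(0.0, DEAD_BALL_RATE, LATER_RATE))   # 18.0
print(project_percentage(0.0, DEAD_BALL_RATE, LATER_RATE))   # 0.0

# The floor depends on the era, which is why -12 is out of reach at 6 per 1000.
print(lowest_possible_difference(DEAD_BALL_RATE))            # -6.0
print(lowest_possible_difference(LATER_RATE))                # -24.0
```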

The bottom line is that we don't believe either approach is perfect. We prefer the results we get using the plus/minus approach, and that approach is consistent with how we've done all of our season disks, so we went with it.

Standard deviations and the quality of the player pool

One particularly interesting approach was described in great detail by Michael Schell in his books "Baseball's All-time Best Hitters" and "Baseball's All-time Best Sluggers". The first book is focused on batting average, while the second covers all aspects of batting performance.

Schell's method consists of four key elements:

(a) he didn't want to penalize players for late career declines, so he evaluated all hitters on their first 8000 atbats,

(b) he adjusted for the ups and downs in league batting average by evaluating all players relative to their leagues,

(c) he used standard deviations to adjust for the overall quality of play, and

(d) he adjusted for park effects.

Our approach differs on the first point. We didn't want to penalize players for getting called up at a very young age, so we chose to go with each player's best series of consecutive seasons, rather than always starting at the beginning of his career.

Schell's goal was to rank the 100 best hitters, so he could afford to limit his work to players with at least 8000 career atbats. We needed a full set of statistics, not just batting average, so we used plate appearances instead of atbats. We needed many more than 100 players, so we set the bar at 4000 plate appearances.

Like Schell, we adjusted for park effects and normalized against the league averages.

The use of standard deviations is one of the more interesting topics for discussion. Standard deviation is a measure of the extent to which a set of data points is clustered around the average versus being spread out. The greater the spread, the higher the standard deviation.

Schell argued that if the overall level of talent in a league is low, the good players are able to dominate the weaker ones to a greater extent, and the spread in batting average from top to bottom is greater. As the overall level of quality improves, it becomes harder for one player to separate himself from the pack, so the spread decreases.

In other words, if you measure the standard deviation of batting average, you can use it as a measure of the quality of that league. High values indicate low quality and vice versa.
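To make the idea concrete, here is a minimal sketch in Python of how the spread of a league affects a standard-deviation score. The two leagues are invented numbers, not real data, and the calculation is the textbook standard score rather than a reproduction of Schell's full procedure, which involves other adjustments.

```python
from statistics import mean, pstdev


def z_score(value, league_values):
    """Number of standard deviations that value sits above the league mean."""
    return (value - mean(league_values)) / pstdev(league_values)


# Hypothetical batting averages for two leagues with the same mean (.260).
wide_league = [0.220, 0.240, 0.260, 0.280, 0.300]     # large spread
narrow_league = [0.250, 0.255, 0.260, 0.265, 0.270]   # small spread

hitter = 0.330  # the same raw batting average in both leagues

print(f"wide league SD:   {pstdev(wide_league):.4f}")
print(f"narrow league SD: {pstdev(narrow_league):.4f}")
print(f"z in wide league:   {z_score(hitter, wide_league):.2f}")
print(f"z in narrow league: {z_score(hitter, narrow_league):.2f}")
```

The same .330 average earns a much higher standard score in the tightly bunched league, which is how a method based on standard deviations rewards hitters from low-spread seasons.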

Schell's book includes charts showing the changes in standard deviation over time for both leagues. The standard deviations were much higher in the early part of the 20th century but have settled down since. The implication is that the quality of baseball was much lower in the earlier years, so it was easier for players like Ty Cobb and Rogers Hornsby to dominate their leagues than it later was for players like Tony Gwynn and Alex Rodriguez.

That makes intuitive sense, and Schell put a lot of weight on it when evaluating his list of hitters. This is the adjustment that propelled Tony Gwynn to the top of his rankings and relegated Ty Cobb to number two.

We gave serious thought to using Schell's method for this project, but we were not convinced it was the right way to go.

Schell demonstrates that the batting average standard deviation has shown no upward or downward trend since the 1930s. It was higher in the first part of the 20th century, dropped steadily until the 1930s, and has drifted sideways since then.

Since the 1930s, though, there have been a lot of year-to-year fluctuations. And those fluctuations don't seem to fit the theory that standard deviation is a good measure of the level of talent.

We know the level of talent went down during World War II, but the standard deviation in the AL dropped in 1943, 1944, and 1945 before rising sharply in 1946 and 1947. That's exactly the opposite of what this theory would predict. In the NL during those years, the direction of the changes was more consistent with the theory, but the magnitude of the changes was very small.

The theory predicts that standard deviations should rise in expansion years, but this has not been the case. In the AL, the standard deviation was below the long-term average when the league expanded in 1961 and only slightly above average in the expansions of 1969 and 1977. In the NL, it was below average in 1962 and 1993. The 1969 NL was well above average, as expected, but overall, these values don't lend a lot of credibility to the idea.

In another curious shift, from the early 1960s to the early 1970s, the standard deviations in the NL rose to levels that had not been seen in sixty years. We can't think of a reason related to player quality that explains this pattern.

While the theory makes sense to us, we just don't see enough consistency in the data to feel comfortable using standard deviations as the basis for our rating system. With so many unexplained fluctuations, we could end up over-rating and under-rating players just because the standard deviations happened to go one way or the other during their peak years.

To get another angle on the quality of play question, we designed and carried out our own study. We identified all players who earned a significant amount of playing time in consecutive seasons, and then we looked at how their stats changed from the first year to the second year.

The theory behind this study is that an expansion year introduces a significant number of new players into a league. The returning players are presumably of a higher quality than the new players, so the returning players should see their stats improve because they're now getting some of their atbats against expansion-quality players.

Looking at all years since 1901, not just expansion years, we noticed that returning players tend to perform at a slightly lower level in the second year, though the drop from one year to the next was only a few points of OPS.

This tendency for returning players to decline may indicate a general rise in the quality of play or a selection bias. Our sample included only those players who met a minimum playing time threshold in both seasons, and it's possible that players in the decline phase of their career were more likely to qualify than were the younger, better players who were about to take their jobs.
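The returning-players comparison described above can be sketched as follows. This is only an illustration in Python under stated assumptions: the playing time threshold of 300 PA, the unweighted average, the record layout, and the sample seasons are all hypothetical, since none of those details are specified here.

```python
from collections import defaultdict

MIN_PA = 300  # hypothetical threshold for "significant playing time"


def returning_player_changes(seasons, min_pa=MIN_PA):
    """Return {second_year: mean OPS change} over players who met the
    playing time threshold in both of two consecutive seasons.

    seasons is an iterable of (player, year, plate_appearances, ops).
    """
    by_player_year = {(p, y): (pa, ops) for p, y, pa, ops in seasons}
    deltas = defaultdict(list)
    for (player, year), (pa, ops) in by_player_year.items():
        following = by_player_year.get((player, year + 1))
        if following is None:
            continue  # player did not appear the next year
        next_pa, next_ops = following
        if pa >= min_pa and next_pa >= min_pa:
            deltas[year + 1].append(next_ops - ops)
    # Unweighted mean per second year (an assumption, not a stated method).
    return {year: sum(d) / len(d) for year, d in sorted(deltas.items())}


# Invented sample data: player C misses the threshold in the second year,
# so he drops out of the comparison (the selection effect noted above).
seasons = [
    ("A", 1960, 600, 0.800), ("A", 1961, 610, 0.820),
    ("B", 1960, 550, 0.700), ("B", 1961, 500, 0.710),
    ("C", 1960, 580, 0.750), ("C", 1961, 120, 0.600),
]

for year, change in returning_player_changes(seasons).items():
    print(f"{year}: mean OPS change {change:+.3f}")
```

A positive average change in a given year would point to returning players benefiting from weaker competition, while a small negative change matches the usual pattern described above.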

Schell's standard deviation work suggests that the quality of play was noticeably lower in those early years, improved rapidly until the 1930s, and then settled down to a steady state. If that were true, we would have expected to see a more rapid rate of decline among returning players in those early years.

But the rate of decline of the returning players was no different in the early years than at other times. That's consistent with the idea that the quality of the player pool has been improving, slowly but steadily, from day one.

We did see plenty of evidence of a decline in quality in the war years and expansion years. In every case, the returning players improved. This supports the idea that they now had the opportunity to beat up on the weaker players who had come into the league in that second season, enough to overcome the normal rate of decline.

Those expansion effects were relatively small, however, and didn't last very long. As a result, they weren't enough to make an impact on player ratings that were based on seven or eight seasons of playing time, and we chose not to make any expansion year adjustments.

We did adjust for the World War II years of 1943 to 1945. The change in performance was more striking for those three seasons, and it immediately reversed itself when the players came back in 1946. Even this adjustment didn't make a big difference for the players who were affected, because these players were also being rated on several other seasons in which they faced the best players of the day.

And when we chose to include all years back to 1876 for the 2006 update to this disk, we began discounting the early years of professional baseball to account for the lower quality of play during that era. Without these adjustments, too many of the top players from the 1880s ranked at or near the top at their respective positions.

Summing up

The standard deviation approach provides some evidence supporting the plausible notion that it was easier to dominate a 1906 league than a 1996 league, but it doesn't support the idea that the quality of play declined in WWII and in the expansion years.

The returning players study suggests that the quality of play has been improving slowly since the beginning and continues to do so. And it does support the idea that quality is diluted when war or expansion introduces a lot of new players at one time.

The bottom line is that we chose not to make a timeline adjustment like the one Schell made. We reached that conclusion for several reasons.

First, it is difficult to decide how to quantify the rate of improvement and over what period of time to apply it. The standard deviation work and the returning players work suggest different patterns of improvement, and neither provided us with a clear result that we felt comfortable with.

Second, we think that anyone who believes they can compare 1906 and 1996 stats with mathematical precision is fooling themselves. No matter how much work we do with differences, percentages, or standard deviations, we're never going to know what numbers Ty Cobb would have put up in today's game.

Maybe Cobb's career stats would translate in a direct way, or maybe he would take one look at today's smaller parks and change his approach, trading off thirty points of batting average for another fifteen homers. We just don't know.

Third, if we assume that the quality of play has indeed been improving steadily and continues to do so, today's players must be far better than those of a hundred years ago. That argument makes sense when you consider the substantial and measurable improvements that have been made in other athletic endeavors like track and field. Modern athletes are stronger, faster, better conditioned, better fed, and have access to better medicine than their counterparts from a century ago.

If we really believe that, and if we follow through on that belief by adding a timeline adjustment for all players, the heroes from the early 1900s wouldn't look so good. We could end up turning Ty Cobb into Lenny Dykstra and Babe Ruth into Bobby Abreu. And we don't think that would make for a very interesting disk.

Finally, when we looked at the results, we didn't see a compelling need to discount those early performances. It's not as if those players wound up dominating our rankings. In fact, we found that all eras were well represented at the top of our leader boards, suggesting that there are no serious era-based biases in our method.

It's true that several of the top batting averages on this disk belong to players from the dead-ball era, but that doesn't mean those players are overrated. Players from other eras have greater overall value because they supplement their batting averages with more power, higher walk rates, or both.

Using our approach, Ty Cobb looks like the Ty Cobb we've all read about. He's got the best projected batting average on the disk, good doubles and triples power, and he can run. He's not the best player on the disk, but he is in the top ten. We can live with that. In our view, that's much better than the alternatives.

At the risk of repeating ourselves, we're not claiming that this is the one right answer, and we're not at all sure there is one right way that a lot of people can agree on. Time and further study may change our thinking about standard deviations and some of the other decisions we've made.

For now, however, we're very happy with the way things turned out, and we hope you are, too.