Some Regression Results...I finally got around to doing the regression I said I would in the first couple threads. What I did was very close to WOWYR (which you can read about
here and
here) but with four differences:
1. The regularization penalty (lambda) was chosen using generalized cross validation rather than k-fold cross validation.
2. By accident, I am missing all games from June, which covers the Finals and sometimes part of the Conference Finals. I don't expect this to change the results, but I plan to eventually re-scrape the missing June data, and if it does make a difference I'll be sure to make a post about it.
3. If you read Elgee's post on WOWYR, you'll know it excludes players who do not meet certain minute thresholds. I do the same, though I made 25 minutes a strict requirement rather than allowing sub-cases where players with fewer minutes can still find a place in the regression.
4. Perhaps the biggest difference: I still use each player's season MPG throughout, rather than switching to his actual MP in each game post-1984. I did this because I noticed the top players in Elgee's results are dominated by those who played after 1984. I suspected the regression is more likely to underestimate player impact than overestimate it, which we expect given the regularization, and the more granular post-1984 data lets the regression make better predictions there, ones that are not underestimated to the same degree, inflating post-1984 players relative to earlier ones.
Note: The results I arrived at are still dominated by players who played after 1984, but in comparison, their fitted values are not so clearly ahead of those for players who played before 1984.
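For anyone curious about point 1: generalized cross validation scores each lambda in closed form from a single fit, instead of refitting on held-out folds. Here's a minimal sketch of the ridge GCV criterion on made-up data (this is illustrative only, not my actual design matrix or pipeline):

```python
import numpy as np

def gcv_score(X, y, lam):
    """GCV criterion for ridge regression at penalty lam.

    GCV(lam) = n * ||y - y_hat||^2 / (n - tr(H))^2,
    where H = X (X'X + lam*I)^{-1} X' is the ridge hat matrix.
    """
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - H @ y
    edf = np.trace(H)  # effective degrees of freedom
    return n * (resid @ resid) / (n - edf) ** 2

# Toy stand-in for the (game margin, player minutes) data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(size=200)

# Pick the lambda that minimizes the GCV score over a grid.
lams = np.logspace(-2, 3, 30)
best_lam = min(lams, key=lambda l: gcv_score(X, y, l))
```

The practical appeal is that you never have to split games into folds, which matters when some players appear in only a handful of games.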
One of the first things I did was run a regression on the whole data set from 1957 to 2017. This has advantages and disadvantages:
1. The advantage is the regression has plenty of data on all the players. Consistently, when running the regression over smaller spans of years, the players at the top are those who played just one or two seasons. This also affects players who played a reasonable number of seasons during the time frame, since their values are influenced by everyone else's; it's possible to find examples where including or excluding a single season radically changes a player's fitted value, largely because what the regression thinks of his teammates changes.
2. The disadvantage is twofold. First, we are not isolating each player's prime. Jordan will get knocked for his Wizards years, Kareem and Duncan for their last couple seasons, Kobe for the start and end of his career, and so on. Second, the regression assigns only a single value per player. Thus, players whose teammates were very young or very old, and who later became or once were great, will be knocked as well.
Here are the top-40, excluding players who played fewer than six full seasons.
Some general comments:
- I won't dwell on it long, since the topic has passed, but I feel obligated to mention it. For all the people saying that Kareem wasn't impactful: he's second all-time in this regression, and shows similarly well in regressions focused on specific players' primes. Aside from a couple of ten-game samples, there was never evidence to the contrary.
- Like WOWYR, we have a good number of unexpected names in the top-40, but otherwise we see all-time greats. The number of strange results I got doesn't seem to be any higher than what Elgee got, which is reassuring.
- Oscar is far ahead of the pack. What's remarkable is that he looks this great even if you don't include his Bucks years. My suspicion was that he was getting a massive boost from joining the Bucks, when their SRS shot up, but that wasn't the case. I'm surprised there hasn't been more support for him thus far.
A Problem...So, while it's nice to have these results, there are problems. I'll walk through an illustrative example.
Let's look at Tim Duncan's results from 1998 to 2003. He shows up as the very best with a fitted value of 5.79, which isn't too surprising. Now, what about from 1998 to 2002? Duncan still shows up well, with a fitted value of 4.29, but he's no longer at the very top. Was Duncan's 2003 season just so good that it totally changes the results? Well, if we run the regression on just the 2003 season, Duncan looks like a 3.04 (though results that season seemed deflated). Okay, let's extend the sample and run it over 2002-2003. This time Duncan appears as a 2.77. You should be thinking this is quite strange.
What's going on is that Tim Duncan's fitted values are being greatly affected by his teammates. In the 1998 to 2002 regression, David Robinson is a 2.95, while in the 1998 to 2003 regression Robinson is a 1.36, and this is because he shows up as a -1.96 in the 2003 season.
Five seasons isn't a large sample, so we don't want to put too much stock into the Duncan and Robinson results I've been discussing. However, it should be recognized that a player's fitted values can depend greatly on the regression's impression of other players, and in particular his teammates.
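The mechanism behind this is near-collinearity: when two teammates are almost always on the floor together, the regression can pin down the sum of their values far better than the split between them, and the split swings with the sample. Here's a toy ridge fit on fake data (again, not my actual model or dataset) illustrating the point:

```python
import numpy as np

def ridge(X, y, lam=1.0):
    """Ridge solution: beta = (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
n = 300
# "Players" A and B share the floor about 90% of the time, so their
# columns are near-collinear; player C's minutes are unrelated to theirs.
a = rng.integers(0, 2, n).astype(float)
b = np.where(rng.random(n) < 0.9, a, 1.0 - a)
c = rng.integers(0, 2, n).astype(float)
X = np.column_stack([a, b, c])

# True impacts: A = +4, B = +1, C = +2, plus game-to-game noise.
y = 4.0 * a + 1.0 * b + 2.0 * c + rng.normal(size=n)

full = ridge(X, y)               # fit on all 300 "games"
half = ridge(X[:150], y[:150])   # fit on half the sample
# C's coefficient stays near 2 in both fits, and A + B stays near 5,
# but how the regression divides that 5 between A and B can shift
# noticeably between the two samples.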
Even with a seemingly reasonable sample for the player of interest, we must ask whether the sample is also large enough for that player's teammates, which makes running the regression on just a player's prime very difficult.
Now, with a larger-sample regression, we don't expect this to be as much of a problem. After all, to minimize the loss function (for ridge, roughly ||y - Xb||^2 + lambda*||b||^2), the regression will generally need to be more accurate on a player with ten years in the sample than on a player with five, since errors on the former accumulate over more games. However, this does prevent us from isolating a player's prime.
Providing an alternative that uses multiple seasons of data to yield estimates for smaller sets of seasons is something I will work on in the upcoming days. That, and measuring the error.
A Request...I'm also going to make a request for help. I would like to get my hands on some RAPM (lineup) data to test the noise. The websites that had several years available for free are down, as are the data sets which fpliii uploaded a year or so ago. I have a half-made scraper for the RAPM data, but right now I want to focus on improving my WOWYR-type model, and I have a backlog of research papers I'm reading through. In other words, I'm short on time and would really appreciate a helping hand. Even a single year would be great, if someone happens to have the data on hand.