Penalized Regression of WOWY data

Moderators: Doctor MJ, trex_8063, penbeast0, PaulieWal, Clyde Frazier

Moonbeam
Forum Mod - Blazers
Posts: 10,200
And1: 5,055
Joined: Feb 21, 2009
Location: Sydney, Australia
     

Penalized Regression of WOWY data 

Post#1 » by Moonbeam » Fri Jul 28, 2023 1:28 pm

In the Top 100 project, I made a post about some estimates I have calculated via penalized regression of WOWY data, which I’ve called RWOWY, RWOWY-Ridge, RWOWY-Lasso, and RWOWY-ENet: viewtopic.php?p=107785464#p107785464

A form of penalized regression is used in calculating RAPM, so the metric RWOWY-Ridge is analogous to it, except it is applied to WOWY data instead of +/- data.

Some posters were interested in the data, so I’ve put together a document explaining the methods here (Version 1.1, added August 3 2023). This document walks through an example of calculating these estimates for a 5-year window from 1982-86 and provides a critical evaluation of the results including a comparison to 5-year RAPM, some ideas for possible extensions, and some graphs with player comparisons.

A few quick takeaways:

It’s challenging to determine which players to include in the sample due to the nature of the data. Including all players would likely make deep bench guys on good teams appear to be the most impactful players as they might tend to only play in blowouts their team won. Setting some minimum MPG threshold is one way to try to counter this, but it gives rise to other anomalies.

RWOWY-Ridge is modestly positively correlated with RAPM data for equivalent 5-year periods. The correlation is about 0.41 on average with players who played at least 5000 possessions over a 5-year period (roughly 92% of players who played at least 18 MPG in one season), but this correlation increases a little bit when looking at players with more consistent minute profiles and those who won league awards.

I’m still in the process of obtaining box scores, so I don’t have estimates for the entire history of the league yet, but I imagine I will in the next few weeks. I’d be happy to collate and share this data if there is interest.

I’m happy to take any feedback you may have or ideas for modifications I haven’t considered.

Version history:

Spoiler:
Version 1.1: Uploaded August 3, 2023
Version 1.0: Uploaded July 28, 2023.
AEnigma
Assistant Coach
Posts: 4,042
And1: 5,838
Joined: Jul 24, 2022
 

Re: Penalized Regression of WOWY data 

Post#2 » by AEnigma » Fri Jul 28, 2023 2:16 pm

Outstanding work. Hope somehow you are able to leverage this into a formal publication. I love the comparative graphs you put together (sad to see no Lanier though ;) ). Regarding the correlation to RAPM, definitely a step down from net on/off (which is itself a step down from RAPM), but in the absence of on/off and sufficient RAPM samples for all but a few players from that era, I think this makes for a strong step forward in assessing pre-databall player impact. Thank you for putting this together!
MyUniBroDavis wrote:Some people are clearly far too overreliant on data without context and look at good all in one or impact numbers and get wowed by that rather than looking at how a roster is actually built around a player
WestGOAT
Veteran
Posts: 2,589
And1: 3,497
Joined: Dec 20, 2015

Re: Penalized Regression of WOWY data 

Post#3 » by WestGOAT » Fri Jul 28, 2023 2:33 pm

Thanks for putting this all together so quickly, and especially for summarizing the methodology in a .pdf.

This looks like a great exercise to get more familiar with this type of modelling so I am very interested in just simply replicating this myself. I have a couple of questions related to this after having a quick look:

    - Could you share the "Games [league] [year].txt" files you used in this 5-year window from 1982-86?
    - How did you define "WOWYMatrix"? Is this a dataframe you loaded somewhere earlier?
    - Similarly, how is "homemargins$Margin" defined?

It's great that you had a critical look at the models you built, who knows maybe Ed Nealy was truly underrated during his career 8-). Looking forward to reading this more comprehensively !

General question, and probably very naive of me, but is it possible to also calculate coefficients for players using non-linear models?
Image
spotted in Bologna
Jaivl
Head Coach
Posts: 7,023
And1: 6,684
Joined: Jan 28, 2014
Location: A Coruña, Spain
   

Re: Penalized Regression of WOWY data 

Post#4 » by Jaivl » Fri Jul 28, 2023 2:45 pm

Dude.
This place is a cesspool of mindless ineptitude, mental decrepitude, and intellectual lassitude. I refuse to be sucked any deeper into this whirlpool of groupthink sewage. My opinions have been expressed. I'm going to go take a shower.
homecourtloss
RealGM
Posts: 11,275
And1: 18,686
Joined: Dec 29, 2012

Re: Penalized Regression of WOWY data 

Post#5 » by homecourtloss » Fri Jul 28, 2023 5:01 pm

Moonbeam wrote:.


This is incredible work. I’ve been reading the paper instead of working this morning :lol: I hope this can be stickied.

AEnigma wrote:Outstanding work. Hope somehow you are able to leverage this into a formal publication. I love the comparative graphs you put together (sad to see no Lanier though ;) ). Regarding the correlation to RAPM, definitely a step down from net on/off (which is itself a step down from RAPM), but in the absence of on/off and sufficient RAPM samples for all but a few players from that era, I think this makes for a strong step forward in assessing pre-databall player impact. Thank you for putting this together!


Looks like it. Great to see new information added to the discussion. For one thing, makes me want to look at a few new things when watching older games.

Jaivl wrote:Dude.


Basically :lol:
lessthanjake wrote:Kyrie was extremely impactful without LeBron, and basically had zero impact whatsoever if LeBron was on the court.

lessthanjake wrote: By playing in a way that prevents Kyrie from getting much impact, LeBron ensures that controlling for Kyrie has limited effect…
DraymondGold
Senior
Posts: 587
And1: 747
Joined: May 19, 2022

Re: Penalized Regression of WOWY data 

Post#6 » by DraymondGold » Fri Jul 28, 2023 9:37 pm

Hi Moonbeam, fascinating stuff! I've added a link to this thread in the PC Board Projects (~Unofficial Projects~) list. :D

I have a few slightly more technical questions. In lieu of asking you to do this calculation for a lot more timespans or minute thresholds (and because I thought it might be a fun exercise for me), would you be willing to share your code, either here or in DMs? If I get the code working, I might want to post the results here, e.g. in future Top 100 threads (if that's okay with you!), but I would obviously give you full credit for the methodology and code if I did.

It looks like you coded it in R? Not a language I'm super familiar with, but I'm trying my best to translate. At the moment I've gotten Step 1 working by scraping the 'Schedule and Results' pages (https://www.basketball-reference.com/leagues/NBA_1980_games-october.html), but I'm trying to figure out how to do Step 2 and Step 3 (I'm starting with Unpenalized RWOWY).

For Step 2:
-Where on Basketball Reference do you scrape the player data, to check e.g. who played more than 18 minutes per game? I imagine it would be links like this one (https://www.basketball-reference.com/leagues/NBA_1982_per_game.html)?
-Where do you get the game data for each player? Perhaps using links like this, https://www.basketball-reference.com/players/a/abdulka01/gamelog/1982, where you have some code that takes each player name from the set of players who averaged 18+ minutes and generates the URL nickname for that player ('abdulka01')? The player Game Logs don't always seem to include rows for games a player missed: e.g. that Kareem page only has 72 rows for 72 games, with no indicator of which 4 games he missed. So presumably you also need some sort of code to 1) figure out which games were missed / which rows were missing, and 2) figure out which games on the Game Log were home and away games, presumably both done by referencing the Game Margins table.

Thanks again for this work! Any help here would be appreciated.
WestGOAT
Veteran
Posts: 2,589
And1: 3,497
Joined: Dec 20, 2015

Re: Penalized Regression of WOWY data 

Post#7 » by WestGOAT » Fri Jul 28, 2023 11:31 pm

DraymondGold wrote:
Hi Moonbeam, fascinating stuff! I've added a link to this thread in the PC Board Projects (~Unofficial Projects~) list. :D

I have a few slightly more technical questions. In lieu of asking you to do this calculation for a lot more timespans or minute thresholds (and because I thought it might be a fun exercise for me), would you be willing to share your code, either here or in DMs? If I get the code working, I might want to post the results here, e.g. in future Top 100 threads (if that's okay with you!), but I would obviously give you full credit for the methodology and code if I did.

It looks like you coded it in R? Not a language I'm super familiar with, but I'm trying my best to translate. At the moment I've gotten Step 1 working by scraping the 'Schedule and Results' pages (https://www.basketball-reference.com/leagues/NBA_1980_games-october.html), but I'm trying to figure out how to do Step 2 and Step 3 (I'm starting with Unpenalized RWOWY).

For Step 2:
-Where on Basketball Reference do you scrape the player data, to check e.g. who played more than 18 minutes per game? I imagine it would be links like this one (https://www.basketball-reference.com/leagues/NBA_1982_per_game.html)?
-Where do you get the game data for each player? Perhaps using links like this, https://www.basketball-reference.com/players/a/abdulka01/gamelog/1982, where you have some code that takes each player name from the set of players who averaged 18+ minutes and generates the URL nickname for that player ('abdulka01')? The player Game Logs don't always seem to include rows for games a player missed: e.g. that Kareem page only has 72 rows for 72 games, with no indicator of which 4 games he missed. So presumably you also need some sort of code to 1) figure out which games were missed / which rows were missing, and 2) figure out which games on the Game Log were home and away games, presumably both done by referencing the Game Margins table.

Thanks again for this work! Any help here would be appreciated.


For step 2, I think you might need to collect box-score data and compile them in a single table. The schedule provides links for each box-score: https://www.basketball-reference.com/boxscores/198110300LAL.html

This way you can also filter on minutes played. Not sure if there is an easier way, but that should work.
Moonbeam
Forum Mod - Blazers
Posts: 10,200
And1: 5,055
Joined: Feb 21, 2009
Location: Sydney, Australia
     

Re: Penalized Regression of WOWY data 

Post#8 » by Moonbeam » Sat Jul 29, 2023 12:15 am

AEnigma wrote:Outstanding work. Hope somehow you are able to leverage this into a formal publication. I love the comparative graphs you put together (sad to see no Lanier though ;) ). Regarding the correlation to RAPM, definitely a step down from net on/off (which is itself a step down from RAPM), but in the absence of on/off and sufficient RAPM samples for all but a few players from that era, I think this makes for a strong step forward in assessing pre-databall player impact. Thank you for putting this together!


Thanks! I agree that there's some value in this for pre-databall players. I'm going to examine some of the potential tweaks I've mentioned to see if it can't be improved a bit.

Here's a graph with Lanier instead of Parish with those 70s centers. I'm happy to make graphs with other comparisons you or others may be interested in as well.

Image
Moonbeam
Forum Mod - Blazers
Posts: 10,200
And1: 5,055
Joined: Feb 21, 2009
Location: Sydney, Australia
     

Re: Penalized Regression of WOWY data 

Post#9 » by Moonbeam » Sat Jul 29, 2023 12:24 am

WestGOAT wrote:Thanks for putting this all together so quickly, and especially for summarizing the methodology in a .pdf.

This looks like a great exercise to get more familiar with this type of modelling so I am very interested in just simply replicating this myself. I have a couple of questions related to this after having a quick look:

    - Could you share the "Games [league] [year].txt" files you used in this 5-year window from 1982-86?
    - How did you define "WOWYMatrix"? Is this a dataframe you loaded somewhere earlier?
    - Similarly, how is "homemargins$Margin" defined?

It's great that you had a critical look at the models you built, who knows maybe Ed Nealy was truly underrated during his career 8-). Looking forward to reading this more comprehensively !

General question, and probably very naive of me, but is it possible to also calculate coefficients for players using non-linear models?


Great that you are interested in replicating this! This is something I'd certainly encourage. If others may also be interested, I'll look to see if I can share some stuff via Github or some other repository, but that may take a bit for me to get set up, so feel free to send me a PM in the meantime and I could send via email.

All of this is coded in R. WOWYMatrix is a function I wrote to create the matrix of 1s, 0s, and -1s for players. In truth, there are lots of functions I wrote to make the code not appear so massive --- there's an "NBA Functions.R" file that I have which has functions for this and other stuff I've been working on which is 2661 lines long. :lol: I'd guess the functions for this project make up roughly 500 of those lines. The homemargins object is called a 'data frame' in R, which is a structure of rows and columns that can store categorical and numerical data. To call a specific column, you use the $ sign, so homemargins$Margin is simply taking the computed 'Margin' column from the larger homemargins data frame.
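For anyone translating this outside of R, here is a rough sketch in Python of what a matrix like that could look like. It is not the actual WOWYMatrix function: the name `build_wowy_matrix`, the dictionary keys, and the toy games are all made up for illustration, and the sign convention (+1 for a player who appeared for the home team, -1 for the away team, 0 for a player who did not appear) is an assumption chosen so that the response can be the home margin.

```python
import numpy as np

def build_wowy_matrix(games, players):
    """Return (X, y) for a list of games.

    X[i, j] is +1 if players[j] appeared for the home team in game i,
    -1 if he appeared for the away team, and 0 if he did not appear.
    y[i] is the home team's margin in game i.
    """
    col = {name: j for j, name in enumerate(players)}
    X = np.zeros((len(games), len(players)))
    y = np.zeros(len(games))
    for i, g in enumerate(games):
        for name in g["home_players"]:
            if name in col:              # players outside the sample are skipped
                X[i, col[name]] = 1.0
        for name in g["away_players"]:
            if name in col:
                X[i, col[name]] = -1.0
        y[i] = g["home_pts"] - g["away_pts"]
    return X, y

# Toy input with made-up players and scores, purely for illustration.
players = ["A1", "A2", "B1", "B2"]
games = [
    {"home_players": ["A1", "A2"], "away_players": ["B1", "B2"],
     "home_pts": 110, "away_pts": 100},
    {"home_players": ["B1"], "away_players": ["A1", "A2"],
     "home_pts": 95, "away_pts": 99},
]
X, y = build_wowy_matrix(games, players)
```

In the second toy game, B2 sits out, so his entry in that row is 0 while the other three players get +1 or -1.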

Regarding the modelling, thanks for asking this! As I imagine these sorts of comparisons are predominantly used for individual player evaluation, a linear model makes sense. A linear model is indeed quite simple. It is essentially trying to find the individual impact of each player, assuming that the impact of each player is merely additive to the team and therefore independent of the other players on the floor. If we wanted to determine the impact of 5-player lineups, a more complex structure would be useful.
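The additive assumption can be read as: home margin ≈ (sum of the coefficients of the home players who appeared) minus (sum of the coefficients of the away players who appeared). A ridge penalty then shrinks those coefficients toward zero. Below is a minimal sketch in Python/NumPy using the textbook closed-form ridge solution on simulated data. The function name `ridge_fit`, the penalty values, and the simulated "impacts" are hypothetical and are not taken from the document or from the R code.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam * I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
true_beta = np.array([4.0, -2.0, 1.0])            # made-up "player impacts"
X = rng.choice([-1.0, 0.0, 1.0], size=(200, 3))   # +1 home, -1 away, 0 absent
y = X @ true_beta + rng.normal(scale=10.0, size=200)

beta_small = ridge_fit(X, y, lam=1e-8)    # essentially ordinary least squares
beta_big = ridge_fit(X, y, lam=500.0)     # heavy penalty shrinks the estimates
```

The heavily penalized coefficients are pulled toward zero relative to the near-unpenalized ones, which is the behaviour that keeps low-sample players from receiving extreme estimates.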
Moonbeam
Forum Mod - Blazers
Posts: 10,200
And1: 5,055
Joined: Feb 21, 2009
Location: Sydney, Australia
     

Re: Penalized Regression of WOWY data 

Post#10 » by Moonbeam » Sat Jul 29, 2023 12:29 am

DraymondGold wrote:
Hi Moonbeam, fascinating stuff! I've added a link to this thread in the PC Board Projects (~Unofficial Projects~) list. :D

I have a few slightly more technical questions. In lieu of asking you to do this calculation for a lot more timespans or minute thresholds (and because I thought it might be a fun exercise for me), would you be willing to share your code, either here or in DMs? If I get the code working, I might want to post the results here, e.g. in future Top 100 threads (if that's okay with you!), but I would obviously give you full credit for the methodology and code if I did.

It looks like you coded it in R? Not a language I'm super familiar with, but I'm trying my best to translate. At the moment I've gotten Step 1 working by scraping the 'Schedule and Results' pages (https://www.basketball-reference.com/leagues/NBA_1980_games-october.html), but I'm trying to figure out how to do Step 2 and Step 3 (I'm starting with Unpenalized RWOWY).

For Step 2:
-Where on Basketball Reference do you scrape the player data, to check e.g. who played more than 18 minutes per game? I imagine it would be links like this one (https://www.basketball-reference.com/leagues/NBA_1982_per_game.html)?
-Where do you get the game data for each player? Perhaps using links like this, https://www.basketball-reference.com/players/a/abdulka01/gamelog/1982, where you have some code that takes each player name from the set of players who averaged 18+ minutes and generates the URL nickname for that player ('abdulka01')? The player Game Logs don't always seem to include rows for games a player missed: e.g. that Kareem page only has 72 rows for 72 games, with no indicator of which 4 games he missed. So presumably you also need some sort of code to 1) figure out which games were missed / which rows were missing, and 2) figure out which games on the Game Log were home and away games, presumably both done by referencing the Game Margins table.

Thanks again for this work! Any help here would be appreciated.


Yep, this is all coded in R. I'm happy to share the code, though I'd have to modify it slightly to take out references to my folder directory. Having between 900 and 1300 box scores for each season from 1977 in a single folder along with other files would make for one large folder, haha. I'll see what I can set up in the way of a repository.

For Step 2, I had previously downloaded (by hand!) all of the available player data for the regular season and playoffs via BBR's export functionality, but you could hypothetically use a web scraper for this. For the box score data, you are right that player game logs sometimes exclude rows for games players have missed, but the date is in the rows for the games they did play, so the log could be compared to a full schedule to find the gaps. In practice, I've gotten box score data from BBR for this purpose to make it easier.
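If you do go the game-log route, the schedule comparison could be sketched roughly like this in Python. The function name `missed_game_dates` and the dates are made up for illustration, and this is not how my R code does it. It also assumes one team per player per season, so a player traded mid-season would need to be checked against each team's schedule separately.

```python
from datetime import date

def missed_game_dates(team_schedule, player_log):
    """Dates on the team's schedule that have no matching row in the player's log."""
    return sorted(set(team_schedule) - set(player_log))

# Made-up dates for illustration only.
schedule = [date(1981, 10, 30), date(1981, 11, 1), date(1981, 11, 3), date(1981, 11, 4)]
log = [date(1981, 10, 30), date(1981, 11, 3)]
missed = missed_game_dates(schedule, log)
```

Here `missed` holds the two schedule dates that are absent from the log.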
Moonbeam
Forum Mod - Blazers
Posts: 10,200
And1: 5,055
Joined: Feb 21, 2009
Location: Sydney, Australia
     

Re: Penalized Regression of WOWY data 

Post#11 » by Moonbeam » Sat Jul 29, 2023 12:35 am

One thing that could be another challenge data-wise is how to reconcile the ABA and the NBA. I'm not sure of the degree of crossover between the leagues throughout the ABA's existence, but if it's not sufficiently large, it could lead to an extreme swing for one league at the expense of the other. For longer windows, it may not be as much of a problem, particularly for periods that overlap seasons with only 1 league in existence, but it will be interesting to compare the results of ABA 1972-76, NBA 1972-76, and ABA+NBA 1972-76, for instance.
AEnigma
Assistant Coach
Posts: 4,042
And1: 5,838
Joined: Jul 24, 2022
 

Re: Penalized Regression of WOWY data 

Post#12 » by AEnigma » Sat Jul 29, 2023 3:59 pm

Moonbeam wrote:Here's a graph with Lanier instead of Parish with those 70s centers.

Image

As a responsible stat-user, I look forward to spamming this in my nomination/vote posts. :nod:

I'm happy to make graphs with other comparisons you or others may be interested in as well.

One bit I noticed was that your regression is to some extent “lower” on Sidney Moncrief than the regressions ElGee did. The tradeoff is that Paul Pressey looks a lot more impressive by your method, and separately, so does Dennis Johnson. Because of how a lot of these discussions go, many people who cited those WOWYR values as support for this idea of Moncrief as a nearly unparalleled guard defender may find themselves considering Dennis Johnson and Paul Pressey instead.

(For my part, I have always been high on Pressey — although as more of a short peak/prime third option — and to whatever extent someone would argue Moncrief as being more impactful than DJ, I tended to attribute that more to the scoring disparity between the two.)

So with that and the top 100 project in mind, I would be curious to see the regression graph with Calvin Murphy, Don Buse, Gus Williams, and Micheal Ray Richardson — whenever is convenient for you.
Doctor MJ
Senior Mod
Posts: 52,686
And1: 21,622
Joined: Mar 10, 2005
Location: Cali
     

Re: Penalized Regression of WOWY data 

Post#13 » by Doctor MJ » Sat Jul 29, 2023 8:25 pm

So cool! Very interested in more of this work.

Looking at graphs, and please do let me know folks if I read anything wrong.

- Magic & Bird look completely unlike the rest as outliers - particularly critical if I'm misinterpreting the data here because it does look so impressive, and I'd like to know how significant this looks to Moonbeam.

- The way Parish tops other '70s->80s bigs so consistently, and the way Kareem is just slightly below him, compared to Moses & Artis is interesting.

- Robinson with the most consistent impact, Olajuwon with comparable run, Ewing a bit under and not necessarily better longevity, and Daugherty surprisingly getting all the way up there with the rest at his best.

- Okay now Bobby Jones. Not exactly a shock when his impact looks big in any per-possession metric, but if I'm understanding your approach here, this isn't a per-possession thing (rather there's a minute threshold). This would seem to imply that it might be even more impressive that Jones was able to show these numbers despite playing fewer minutes.

- Lucas & Hayes kinda go like I'd expect.

- Um, the 80s-90s power forward graph is really something. McHale looks like I'd expect - great, but short. Barkley topping Malone for most of their career is NOT what I would expect. Nance looking like he might be the best of the bunch makes me smile. Wow.

- 90s point guards seem about right. All peaking around the same levels, but with clear longevity advantage Stockton > Payton > KJ > Price. None are ever up there the way Magic (and Bird) consistently were.

- The '80s scorers graph reinforces certain ideas about King (high, short peak), English (less high, but longer), and Dantley (capable of solid if not worldbeating impact, but often not really). Interesting seeing Marques. Thought he might look better at peak.

- Small forwards. I'm not surprised Pippen comes out on top, nor that Mullin has greater longevity than the other two, interesting the way Worthy & Wilkins seem to trade places.

- '80s-90s shooting guards gives a clear pecking order of Jordan > Drexler > Dumars, but it's interesting that the gap between them is typically small compared to that between Dumars and Richmond.

- Man look at DJ! Isiah still probably has the most area on his graph, but DJ seems to have the strongest prime here and he's doing this with so much team success. Very interesting for a guy who placed high on the 2006 & 2008 100's and then has been absent the last 4 iterations while his teammates Sikma & Williams have been regulars. I may feel a need to champion DJ in this 100.

- Moncrief looks legit like we'd expect - with limited longevity. Cheeks looks good and with a better longevity, but does feel like the Ringo of the 4.

- I'm really surprised that Erving looks so much stronger than Gervin. I'm a much bigger fan of Erving, believe his peak impact to be huge, and honestly would have expected Gervin > Erving over their NBA run.

I'm not sure what specific queries I'd have. I'd love to gain better visibility to the data beneath. Short of that I'd just like to see more of the stuff that stands out to you Moonbeam.
Getting ready for the RealGM 100 on the PC Board

Come join the WNBA Board if you're a fan!
OhayoKD
Lead Assistant
Posts: 5,899
And1: 3,848
Joined: Jun 22, 2022
 

Re: Penalized Regression of WOWY data 

Post#14 » by OhayoKD » Sat Jul 29, 2023 9:31 pm


would it be greedy of me to ask for a graph charting magic, bird, hakeem, and jordan specifically?
its my last message in this thread, but I just admit, that all the people, casual and analytical minds, more or less have consencus who has the weight of a rubberized duck. And its not JaivLLLL
lessthanjake
Veteran
Posts: 2,813
And1: 2,552
Joined: Apr 13, 2013

Re: Penalized Regression of WOWY data 

Post#15 » by lessthanjake » Sun Jul 30, 2023 1:37 am

Interesting stuff, and I think it is very similar to Thinking Basketball’s WOWYR.

It feels to me like there’s really just not a lot of data for a lot of this though, particularly in these past eras where players missed very few games. For instance, Doc mentioned being surprised about Karl Malone being below Charles Barkley, so I’ll use Karl Malone as an example. Karl Malone is well below Barkley in the 1989-1993 timeframe. But what is the data for Karl Malone’s 1989-1993 timeframe based on? Well, here’s the number of games missed by each Jazz player who averaged over 18 minutes a game in a season during that period (counting only the years in which they actually averaged 18+ minutes per game):

Karl Malone: 3
John Stockton: 4
Mark Eaton: 3
Thurl Bailey: 0
Darrell Griffith: 0
Bob Hansen: 37
Blue Edwards: 22
Jeff Malone: 17
Tyrone Corbin: 13
Mike Brown: 0
Jay Humphries: 4
David Benoit: 0
Larry Krystkowiak: 11

So, at least as I understand it (and I’ve admittedly not read through the actual paper, so sorry if I’m misinterpreting anything!), the model is basically trying to figure out Karl Malone’s impact essentially based on regressing what occurred in those missed games. There’s some missed games there, but nothing particularly substantial for anyone and the most substantial number of missed games are from relatively minor players. I don’t really see how a WOWY-based model can in any way accurately assess Karl Malone’s impact without almost any missed games from Malone, Stockton, or Eaton, and with several other relevant players having 0 missed games at all. It seems like the results would inevitably just be based on statistical noise centered largely around what randomly happened to occur when pretty inconsequential players like Bob Hansen and Blue Edwards were out.
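The identification worry above can be illustrated with a toy example. This is only a sketch under made-up assumptions (three hypothetical players, ten invented game margins, one row per game, and an arbitrary penalty of 1.0); it is not the model from the document. When two players never miss a game, their columns in the design matrix are identical, so unpenalized least squares cannot separate them, and a ridge penalty simply splits the credit evenly between them.

```python
import numpy as np

# Hypothetical toy data: rows are games, columns are players (1 = played, 0 = missed).
# Players A and B never miss a game; player C misses four of ten.
X = np.array([
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
], dtype=float)

# Made-up scoring margins for those ten games.
y = np.array([8, 5, -2, 10, 1, 7, -4, 6, 3, 9], dtype=float)

# A and B have identical columns, so the matrix has rank 2 rather than 3:
# ordinary least squares has no unique answer for A versus B.
rank = np.linalg.matrix_rank(X)


def ridge(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam * I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


# lam = 1.0 is an arbitrary illustrative penalty, not a tuned value.
beta = ridge(X, y, lam=1.0)

print("rank of design matrix:", rank)
print("ridge estimates (A, B, C):", beta)
```

In this toy setup, all of the information separating the players comes from C's four missed games, which is the same concern raised about Hansen and Edwards above.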

Another example of this is the 1989-1993 timeframe for Jordan. He does fairly well in this timeframe, but what is the data based on? Here’s the total missed games of people on the Bulls who played 18 MPG in a given season in that timeframe:

Michael Jordan: 7
Scottie Pippen: 10
Horace Grant: 14
BJ Armstrong: 0
Bill Cartwright: 55
Scott Williams: 11
John Paxson: 7
Stacey King: 0
Craig Hodges: 33
Sam Vincent: 12
Brad Sellers: 2
Dave Corzine: 1

There’s virtually zero missed-game data there, except for what happened in a bunch of missed games from Bill Cartwright and Craig Hodges. Players like that don’t *really* affect games that much, but when they make up a huge portion of the team’s missed games, what randomly happens to occur in missed games by players like that can really skew a model like this. For instance, we see above that Craig Hodges missed 33 games in years he played 18+ MPG. This was all in the 1989 season. And, based on the charts provided, we actually see Jordan’s rating in this measure tank from the 1984-1988 time period to the 1985-1989 time period, and he didn’t get super high until 1989 was out of the time period, so it seems reasonably obvious that something happened in 1989 that tanked his rating. The only person that missed a lot of games that season was Craig Hodges. The Bulls happened to go 32-17 in the games Craig Hodges played and 15-18 in the games Hodges missed (and I’m sure the difference in average margin of victory is pretty significant too). So my guess is that the model thinks Craig Hodges was really impactful (and his missed games make up a significant portion of the entire set of missed games that’s being regressed), so what happened in those games has a significant impact on Jordan’s perceived impact in time periods that contain that year (and note that Pippen dropped that same year too, though a bit less, probably because he missed several games that Hodges missed too).
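As a quick back-of-the-envelope check on how big that Hodges split is, the two records quoted above can be turned into winning percentages. The only inputs are the 32-17 and 15-18 records from the post; scaling the gap to an 82-game season is an illustrative choice, and the variable names are made up.

```python
# Records quoted above for the 1988-89 Bulls, split by whether Craig Hodges played.
with_wins, with_losses = 32, 17
without_wins, without_losses = 15, 18


def win_pct(wins, losses):
    """Fraction of games won."""
    return wins / (wins + losses)


with_pct = win_pct(with_wins, with_losses)
without_pct = win_pct(without_wins, without_losses)

# Express the gap as wins over a full 82-game schedule (illustrative scaling only).
gap_in_wins = (with_pct - without_pct) * 82

print(f"with Hodges:    {with_pct:.3f}")
print(f"without Hodges: {without_pct:.3f}")
print(f"gap over 82 games: {gap_in_wins:.1f} wins")
```

A gap that large coming from a role player is the kind of raw split that could plausibly leak into his teammates' estimates, though how much of it survives the penalty depends on the actual model.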
OhayoKD wrote:Lebron contributes more to all the phases of play than Messi does. And he is of course a defensive anchor unlike messi.

Re: Penalized Regression of WOWY data 

Post#16 » by Moonbeam » Sun Jul 30, 2023 9:24 am

AEnigma wrote:
Moonbeam wrote:Here's a graph with Lanier instead of Parish with those 70s centers.

Image

As a responsible stat-user, I look forward to spamming this in my nomination/vote posts. :nod:

I'm happy to make graphs with other comparisons you or others may be interested in as well.

One bit I noticed was that your regression is to some extent “lower” on Sidney Moncrief than the regressions ElGee did. Some tradeoff there is that Paul Pressey looks a lot more impressive by your method, and separately, so does Dennis Johnson. Because of how a lot of these discussions go, a lot of people who cited those WOWYR values as support for this idea of Moncrief as a nearly unparalleled guard defender may find themselves considering Dennis Johnson and Paul Pressey instead.

(For my part, I have always been high on Pressey — although as more of a short peak/prime third option — and to whatever extent someone would argue Moncrief as being more impactful than DJ, I tended to attribute that more to the scoring disparity between the two.)

So with that and the top 100 project in mind, I would be curious to see the regression graph with Calvin Murphy, Don Buse, Gus Williams, and Micheal Ray Richardson — whenever is convenient for you.


No problem at all --- it just takes one line of code to make these graphs, so feel free to ask away!

Image

Note I've got box scores dating back to 1952 now. These results combine the ABA and NBA from the 1967-68 season through the 1975-76 season. I still want to compare the combined results to those using only the ABA or only the NBA, but a few initial player glimpses make it seem like the combined leagues don't produce results that are too wonky.

Re: Penalized Regression of WOWY data 

Post#17 » by Moonbeam » Sun Jul 30, 2023 9:26 am

Doctor MJ wrote:So cool! Very interested in more of this work.

Looking at graphs, and please do let me know folks if I read anything wrong.

- Magic & Bird look completely unlike the rest as outliers - particularly critical if I'm misinterpreting the data here because it does look so impressive, and I'd like to know how significant this looks to Moonbeam.

- The way Parish tops other '70s->80s bigs so consistently, and the way Kareem is just slightly below him, compared to Moses & Artis is interesting.

- Robinson with the most consistent impact, Olajuwon with comparable run, Ewing a bit under and not necessarily better longevity, and Daugherty surprisingly getting all the way up there with the rest at his best.

- Okay now Bobby Jones. Not exactly a shock when his impact looks big in any per-possession metric, but if I'm understanding your approach here, this isn't a per-possession thing (rather there's a minute threshold). This would seem to imply that it might be even more impressive that Jones was able to show these numbers despite playing fewer minutes.

- Lucas & Hayes kinda go like I'd expect.

- Um, the 80s-90s power forward graph is really something. McHale looks like I'd expect - great, but short. Barkley topping Malone for most of their careers is NOT what I would expect. Nance looking like he might be the best of the bunch makes me smile. Wow.

- 90s point guards seem about right. All peaking around the same levels, but with clear longevity advantage Stockton > Payton > KJ > Price. None are ever up there the way Magic (and Bird) consistently were.

- The '80s scorers graph reinforces certain ideas about King - high, short peak - English - less high, but longer - Dantley - capable of solid if not worldbeating impact, but often not really. Interesting seeing Marques. Thought he might look better at peak.

- Small forwards. I'm not surprised Pippen comes out on top, nor that Mullin has greater longevity than the other two, interesting the way Worthy & Wilkins seem to trade places.

- '80s-90s shooting guards gives a clear pecking order of Jordan > Drexler > Dumars, but it's interesting that the gap between them is typically small compared to that between Dumars and Richmond.

- Man look at DJ! Isiah still probably has the most area on his graph, but DJ seems to have the strongest prime here and he's doing this with so much team success. Very interesting for a guy who placed high on the 2006 & 2008 100's and then has been absent the last 4 iterations while his teammates Sikma & Williams have been regulars. I may feel a need to champion DJ in this 100.

- Moncrief looks legit like we'd expect - with limited longevity. Cheeks looks good and with a better longevity, but does feel like the Ringo of the 4.

- I'm really surprised that Erving looks so much stronger than Gervin. I'm a much bigger fan of Erving, believe his peak impact to be huge, and honestly would have expected Gervin > Erving over their NBA run.

I'm not sure what specific queries I'd have. I'd love to gain better visibility to the data beneath. Short of that I'd just like to see more of the stuff that stands out to you Moonbeam.


Yeah, Magic and Bird look outstanding here, particularly Magic. With some older box score data available now, there are a couple others who also jump out:

Image

You're also right that these aren't per-possession metrics, but per-game, so Bobby Jones standing out looks quite impressive.

Re: Penalized Regression of WOWY data 

Post#18 » by Moonbeam » Sun Jul 30, 2023 9:29 am

OhayoKD wrote:
Moonbeam wrote:In the Top 100 project, I made a post about some estimates I have calculated via penalized regression of WOWY data, which I’ve called RWOWY, RWOWY-Ridge, RWOWY-Lasso, and RWOWY-ENet: viewtopic.php?p=107785464#p107785464

A form of penalized regression is used in calculating RAPM, so the metric RWOWY-Ridge is analogous to it, except it is applied to WOWY data instead of +/- data.

Some posters were interested in the data, so I’ve put together a document explaining the methods here. This document walks through an example of calculating these estimates for a 5-year window from 1982-86 and provides a critical evaluation of the results including a comparison to 5-year RAPM, some ideas for possible extensions, and some graphs with player comparisons.

A few quick takeaways:

It’s challenging to determine which players to include in the sample due to the nature of the data. Including all players would likely make deep bench guys on good teams appear to be the most impactful players as they might tend to only play in blowouts their team won. Setting some minimum MPG threshold is one way to try to counter this, but it gives rise to other anomalies.

RWOWY-Ridge is modestly positively correlated with RAPM data for equivalent 5-year periods. The correlation is about 0.41 on average with players who played at least 5000 possessions over a 5-year period (roughly 92% of players who played at least 18 MPG in one season), but this correlation increases a little bit when looking at players with more consistent minute profiles and those who won league awards.

I’m still in the process of obtaining box scores, so I don’t have estimates for the entire history of the league yet, but I imagine I will in the next few weeks. I’d be happy to collate and share this data if there is interest.

I’m happy to take any feedback you may have or ideas for modifications I haven’t considered.

would it be greedy of me to ask for a graph charting magic, bird, hakeem, and jordan specifically?


No problem! Happy to add more graphs if you (or anyone else) would like.

Image

Re: Penalized Regression of WOWY data 

Post#19 » by Moonbeam » Sun Jul 30, 2023 9:36 am

lessthanjake wrote:Interesting stuff, and I think it is very similar to Thinking Basketball’s WOWYR.

It feels to me like there’s really just not a lot of data for a lot of this though, particularly in these past eras where players missed very few games. For instance, Doc mentioned being surprised about Karl Malone being below Charles Barkley, so I’ll use Karl Malone as an example. Karl Malone is well below Barkley in the 1989-1993 timeframe. But what is the data for Karl Malone’s 1989-1993 timeline based on? Well, here’s the list of number of missed games by players who played over 18 minutes a game in a season for the Jazz in that time period (in years they actually averaged 18+ minutes per game):

Karl Malone: 3
John Stockton: 4
Mark Eaton: 3
Thurl Bailey: 0
Darrell Griffith: 0
Bob Hansen: 37
Blue Edwards: 22
Jeff Malone: 17
Tyrone Corbin: 13
Mike Brown: 0
Jay Humphries: 4
David Benoit: 0
Larry Krystkowiak: 11

So, at least as I understand it (and I’ve admittedly not read through the actual paper, so sorry if I’m misinterpreting anything!), the model is basically trying to figure out Karl Malone’s impact essentially based on regressing what occurred in those missed games. There’s some missed games there, but nothing particularly substantial for anyone and the most substantial number of missed games are from relatively minor players. I don’t really see how a WOWY-based model can in any way accurately assess Karl Malone’s impact without almost any missed games from Malone, Stockton, or Eaton, and with several other relevant players having 0 missed games at all. It seems like the results would inevitably just be based on statistical noise centered largely around what randomly happened to occur when pretty inconsequential players like Bob Hansen and Blue Edwards were out.

Another example of this is the 1989-1993 timeframe for Jordan. He does fairly well in this timeframe, but what is the data based on? Here’s the total missed games of people on the Bulls who played 18 MPG in a given season in that timeframe:

Michael Jordan: 7
Scottie Pippen: 10
Horace Grant: 14
BJ Armstrong: 0
Bill Cartwright: 55
Scott Williams: 11
John Paxson: 7
Stacey King: 0
Craig Hodges: 33
Sam Vincent: 12
Brad Sellers: 2
Dave Corzine: 1

There’s virtually zero missed-game data there, except for what happened in a bunch of missed games from Bill Cartwright and Craig Hodges. Players like that don’t *really* affect games that much, but when they make up a huge portion of the team’s missed games, what randomly happens to occur in missed games by players like that can really skew a model like this. For instance, we see above that Craig Hodges missed 33 games in years he played 18+ MPG. This was all in the 1989 season. And, based on the charts provided, we actually see Jordan’s rating in this measure tank from the 1984-1988 time period to the 1985-1989 time period, and he didn’t get super high until 1989 was out of the time period, so it seems reasonably obvious that something happened in 1989 that tanked his rating. The only person that missed a lot of games that season was Craig Hodges. The Bulls happened to go 32-17 in the games Craig Hodges played and 15-18 in the games Hodges missed (and I’m sure the difference in average margin of victory is pretty significant too). So my guess is that the model thinks Craig Hodges was really impactful (and his missed games make up a significant portion of the entire set of missed games that’s being regressed), so what happened in those games has a significant impact on Jordan’s perceived impact in time periods that contain that year (and note that Pippen dropped that same year too, though a bit less, probably because he missed several games that Hodges missed too).


Thanks for the comment and getting into the detail. You're right in that the utility of these metrics is limited due to the nature of the data we have available --- I can't and wouldn't shy away from that. I wouldn't feel like I was being responsible if I just pushed out these numbers without some important caveats like that as well as others in the document I shared. I do think there's some value in this, though.

Speaking of your specific examples, there is a bit more to it than that. These models are still making use of data with all of the players healthy to form a sort of baseline, so it's not like the "With" data doesn't matter --- it still does. Moreover, when players leave a team, they would be considered "missing" for those seasons. In your Bulls example, for instance, Craig Hodges would be listed as missing for the entirety of the 1992-93 season with respect to being Jordan's teammate. Stacey King would be considered missing for all but the 1989-90 season due to the minutes threshold. Sam Vincent would be considered missing for all but the 1988-89 season. And so on and so forth. This extra missingness allows for transactions between seasons to inform the estimates a bit more than merely looking at missed games. I'll note that for these players, their time away from Chicago would inform their baseline impact. Sam Vincent's impact with Orlando for the 1989-90, 1990-91, and 1991-92 seasons will inform his baseline impact and therefore inform his contribution to the 1988-89 Bulls' scoring margins.
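The "extra missingness" idea can be sketched in a few lines. Everything here is a simplified illustration rather than the actual RWOWY code: the game log, margins, penalty value, and the `qualified` set of player-seasons are hypothetical, and it assumes one row per team-game with a 0/1 indicator for each qualifying player. A player is coded 0 when he sits out, when he is on another team, or when he falls below the minutes threshold that season.

```python
import numpy as np

# Hypothetical game log: (season, team, margin, players who appeared).
# Margins are invented for illustration; they are not real box-score data.
games = [
    (1989, "CHI", 6, {"Jordan", "Pippen", "Vincent"}),
    (1989, "CHI", -3, {"Jordan", "Pippen"}),        # Vincent misses a game
    (1990, "CHI", 9, {"Jordan", "Pippen", "King"}),  # Vincent has left the team
    (1990, "ORL", -8, {"Vincent"}),                  # his games elsewhere still count
    (1991, "CHI", 12, {"Jordan", "Pippen", "King"}), # King is below the threshold
    (1991, "ORL", -5, {"Vincent"}),
]

# Hypothetical player-seasons that clear the minutes threshold.
seasons = (1989, 1990, 1991)
qualified = {(name, s) for name in ("Jordan", "Pippen", "Vincent") for s in seasons}
qualified.add(("King", 1990))

players = sorted({name for name, _ in qualified})
col = {name: j for j, name in enumerate(players)}

# Build the indicator matrix: 1 only if the player appeared AND qualified that season.
X = np.zeros((len(games), len(players)))
y = np.zeros(len(games))
for i, (season, team, margin, appeared) in enumerate(games):
    y[i] = margin
    for name in appeared:
        if (name, season) in qualified:
            X[i, col[name]] = 1.0

# Closed-form ridge fit with an arbitrary illustrative penalty.
lam = 1.0
beta = np.linalg.solve(X.T @ X + lam * np.eye(len(players)), X.T @ y)

for name in players:
    print(f"{name:8s} {beta[col[name]]:+.2f}")
```

In this toy version, Vincent's column is informed by his Orlando rows as well as his Chicago ones, which is how time away from a team can feed back into the estimates for his former teammates.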

Hopefully this helps clear it up a bit.

I'm happy for you or anyone else to ask more questions or offer more critiques. They may help me improve these metrics going forward!
OhayoKD
Lead Assistant
Posts: 5,899
And1: 3,848
Joined: Jun 22, 2022
 

Re: Penalized Regression of WOWY data 

Post#20 » by OhayoKD » Sun Jul 30, 2023 9:48 am

Looking at these results...
Moonbeam wrote:
OhayoKD wrote:I’m happy to take any feedback you may have or ideas for modifications I haven’t considered.

would it be greedy of me to ask for a graph charting magic, bird, hakeem, and jordan specifically?


No problem! Happy to add more graphs if you (or anyone else) would like.

Image
Possible takeaways (not sure how much I should weigh this, but it seems like a promising approach):
-> Magic potentially the true "impact king"
-> MJ's era-relative impact peak might actually be during the second three-peat (expansion goes brr)
-> Delta between 80's Hakeem and "peak" Hakeem overplayed?

I imagine there is still some box-bias in these results but I'm guessing it's made up for by more stable adjustments...
Moonbeam wrote:
Thanks for the comment and getting into the detail. You're right in that the utility of these metrics is limited due to the nature of the data we have available --- I can't and wouldn't shy away from that. I wouldn't feel like I was being responsible if I just pushed out these numbers without some important caveats like that as well as others in the document I shared. I do think there's some value in this, though.

Speaking of your specific examples, there is a bit more to it than that. These models are still making use of data with all of the players healthy to form a sort of baseline, so it's not like the "With" data doesn't matter --- it still does. Moreover, when players leave a team, they would be considered "missing" for those seasons. In your Bulls example, for instance, Craig Hodges would be listed as missing for the entirety of the 1992-93 season with respect to being Jordan's teammate. Stacey King would be considered missing for all but the 1989-90 season due to the minutes threshold. Sam Vincent would be considered missing for all but the 1988-89 season. And so on and so forth. This extra missingness allows for transactions between seasons to inform the estimates a bit more than merely looking at missed games. I'll note that for these players, their time away from Chicago would inform their baseline impact. Sam Vincent's impact with Orlando for the 1989-90, 1990-91, and 1991-92 seasons will inform his baseline impact and therefore inform his contribution to the 1988-89 Bulls' scoring margins.

Hopefully this helps clear it up a bit.

I'm happy for you or anyone else to ask more questions or offer more critiques. They may help me improve these metrics going forward!

Correct me if I'm wrong, but aren't you also using the internal box-scaling of teammates to "stabilize" the samples similar to the way a LEBRON or EPM or RPM would?

Honestly, all things considered, I wouldn't be shocked if this graded out as "industry standard" for pre-data ball RAPM approximation. Someone should definitely share this with Ben.
