Penalized Regression of WOWY data

lessthanjake · Post #121 » by **lessthanjake** » Fri Aug 4, 2023 3:14 am

One thing I just want to flag is that we should look carefully at the years each data point on these charts covers, rather than just eyeballing them and drawing an immediate conclusion. A lot of these data points are for time periods in which someone didn’t actually play much, and that affects how this stuff looks at first glance.

For instance, these charts look a bit artificially good for people who retire before they decline. If you retire, there are several five-year windows extending past your retirement that include only prime years, so your score in them will still look really good. If you keep playing as you decline, those later five-year windows will look worse than they would have if you’d retired, since they include years where your impact has declined. This ends up making people who retired early (Magic Johnson being a great example) look at first glance like they were on top of the league for way longer than they were actually playing. If you just eyeball the charts, it looks like Magic dominated some time periods he hardly played in! So we should keep that sort of thing in mind.

Similarly, a first glance at the charts looks really hard on people who started out slow in terms of impact, even if it’s just for a year or two. Even just one season that’s weak impact-wise can prevent a given five-year time period from looking elite. And the charts go all the way back starting 4 years before someone started in the NBA (i.e. the first time period that has someone’s rookie season right at the end). So if you have a slow (or even just not stellar) rookie season impact-wise, it’ll end up being part of 5 different data points, and for all but one of them it’ll be of outsized importance (since there won’t be 5 actual full seasons played in those time periods). So a slow first season or two can tank a lot of data points, even if someone quickly got better. We see that with LeBron and others. On the other hand, people who started out great from the beginning (again, Magic Johnson being an example, and Larry Bird is another) end up looking great in those timeframes that go all the way back to years before they started playing, because their first year or two that have outsized importance in those early data points are good. Of course, starting out great is a good thing. But my point is that it ends up having an outsized effect at first glance on the graphs, since the early data points go back before someone started playing, so a really good rookie season ends up looking like a really good half-decade.

Magic Johnson is a bit of a perfect storm of both these factors. He was really good even as a rookie and he retired in his prime. And he even had that super brief comeback as a mostly bench player in 1996 that perpetuates things even further (though, to his credit, this is only because he did actually have good impact in that brief comeback). So, a first glance at the charts makes a guy who really only was a meaningful player from 1979-80 to 1990-91 look like he was dominating the league from 1976-2000.

I think the best way to look at these charts is probably to *mostly* just zero in on the timeframes in which these players were NBA players the entire time.
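To make the windowing effect concrete, here is a minimal sketch that counts how many seasons a player actually played in each rolling five-year window. The career span (1979-80 through 1990-91), the function name, and the choice to label windows by their final season are assumptions for illustration, not the code behind the charts.

```python
def seasons_played_per_window(first_season, last_season, window=5):
    """Map each window's final season to the number of seasons played in it.

    Seasons are labelled by the year they end in (1980 means 1979-80).
    Only windows that overlap the career are returned.
    """
    coverage = {}
    # The earliest overlapping window ends in the rookie season; the latest
    # one starts in the final season, so it ends window - 1 years later.
    for end in range(first_season, last_season + window):
        start = end - window + 1
        played = min(end, last_season) - max(start, first_season) + 1
        coverage[end] = played
    return coverage


# Hypothetical career: rookie season 1979-80, final season 1990-91.
coverage = seasons_played_per_window(1980, 1991)
partial = [end for end, n in coverage.items() if n < 5]
print(partial)  # [1980, 1981, 1982, 1983, 1992, 1993, 1994, 1995]
```

In this toy career the rookie season sits in five windows (those ending 1980 through 1984), and four of them contain fewer than five played seasons, which is where a single season carries outsized weight.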

homecourtloss · Post #122 » by **homecourtloss** » Fri Aug 4, 2023 4:49 am

Moonbeam wrote:
homecourtloss wrote:I feel like we’re getting an exclusive service not available anywhere else for free. Is it possible to run one for Drexler and Terry Porter?

And then one for Ewing, Pippen, Barkley, Reggie Miller, and Payton?

Here you go! Surprised Drexler outpaced Porter that much.

I am, too.

Pippen looking like a monster.

Post #123 » by **Moonbeam** » Fri Aug 4, 2023 6:09 am

Doctor MJ wrote:

At present, I only have data back to 1952 because of the minutes requirement I'm using. I've thought about modelling minutes based on available box score stats for earlier periods so we can maybe get something for Groza, Feerick, etc.

- Rochester Royals if we can get good numbers at least back to their joining of the BAA. (NBL back to '45-46 would be amazing, but the data is super sparse)
Key players: Bob Davies, Arnie Risen, Bobby Wanzer, Jack Coleman, Arnie Johnson

- Minneapolis Lakers ideally back to their joining of the BAA.
Key players: George Mikan, Jim Pollard, Herm Schaefer, Slater Martin, Vern Mikkelsen, Clyde Lovellette

Note: No Schaefer due to no MP.

- Syracuse Nationals
Key players: Dolph Schayes, Paul Seymour, Red Rocha, Earl Lloyd, George King, Red Kerr

- Philadelphia Warriors
Key players: Paul Arizin, Neil Johnston, Jack George, Tom Gola, Wilt Chamberlain

- Boston Celtics
Key players: Bob Cousy, Ed Macauley, Bill Sharman, Bill Russell, Tom Heinsohn, Frank Ramsey

- Boston Celtics
Key players: Bill Russell, Sam Jones, John Havlicek, KC Jones, Tom Sanders, Bailey Howell

- Boston Celtics
Key players: John Havlicek, Dave Cowens, Jo Jo White, Paul Silas, Don Chaney, Don Nelson

- St. Louis Hawks
Key players: Bob Pettit, Cliff Hagan, Lenny Wilkens, Clyde Lovellette, Zelmo Beaty, Lou Hudson

- Philadelphia 76ers
Key players: Wilt Chamberlain, Hal Greer, Chet Walker, Billy Cunningham, Luke Jackson, Wali Jones

- Los Angeles Lakers
Key players: Elgin Baylor, Jerry West, Dick Barnett, Rudy LaRusso, Wilt Chamberlain, Gail Goodrich

- New York Knicks
Key players: Walt Frazier, Willis Reed, Dave DeBusschere, Dick Barnett, Earl Monroe, Bill Bradley

- Milwaukee Bucks
Key players: Kareem Abdul-Jabbar, Oscar Robertson, Bob Dandridge, Jon McGlocklin, Greg Smith

Post #124 » by **Moonbeam** » Fri Aug 4, 2023 10:46 am

lessthanjake wrote:One thing I just want to flag is that we should look carefully at the years each data point on these charts covers, rather than just eyeballing them and drawing an immediate conclusion. A lot of these data points are for time periods in which someone didn’t actually play much, and that affects how this stuff looks at first glance.

For instance, these charts look a bit artificially good for people who retire before they decline. If you retire, then there’s several five-year windows after your retirement that only include prime years and therefore where your score will still look really good. If you keep playing as you decline, then those subsequent five-year time periods that include post-prime years will look worse than they’d look if you’d retired, since it’s including declined years in there where your impact has declined. This ends up making people who retired early (Magic Johnson being a great example) look at first glance like they were on top of the league for way longer than they were even actually playing. If you just eyeball the charts, it looks like Magic dominated some time periods he hardly played in! So we should keep that sort of thing in mind.

Similarly, a first glance at the charts looks really hard on people who started out slow in terms of impact, even if it’s just for a year or two. Even just one season that’s weak impact-wise can prevent a given five-year time period from looking elite. And the charts go all the way back starting 4 years before someone started in the NBA (i.e. the first time period that has someone’s rookie season right at the end). So if you have a slow (or even just not stellar) rookie season impact-wise, it’ll end up being part of 5 different data points, and for all but one of them it’ll be of outsized importance (since there won’t be 5 actual full seasons played in those time periods). So a slow first season or two can tank a lot of data points, even if someone quickly got better. We see that with LeBron and others. On the other hand, people who started out great from the beginning (again, Magic Johnson being an example, and Larry Bird is another) end up looking great in those timeframes that go all the way back to years before they started playing, because their first year or two that have outsized importance in those early data points are good. Of course, starting out great is a good thing. But my point is that it ends up having an outsized effect at first glance on the graphs, since the early data points go back before someone started playing, so a really good rookie season ends up looking like a really good half-decade.

Magic Johnson is a bit of a perfect storm of both these factors. He was really good even as a rookie and he retired in his prime. And he even had that super brief comeback as a mostly bench player in 1996 that perpetuates things even further (though, to his credit, this is only because he did actually have good impact in that brief comeback). So, a first glance at the charts makes a guy who really only was a meaningful player from 1979-80 to 1990-1991 look like he was dominating the league from 1976-2000.

I think the best way to look at these charts is probably to *mostly* just zero in on the timeframes in which these players were NBA players the entire time.

These are fair points. It's a tough balance between getting enough interconnectedness between players to have a reasonable amount of information to inform the models with a larger window size and the lingering effects of big changes (rookie years, final years, switching teams) lasting some time. It's tough to know what the best strategy is. For some 3-year windows I've looked at, some players actually have an NA as a Ridge score due to perfect multicollinearity. Using Minutes Played in a game would get around that a little bit, but there are other drawbacks I'll mention in a separate post.
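As a rough illustration of the collinearity issue, the sketch below builds a toy with/without matrix in which two players appear in exactly the same games. Their columns are identical, so game results alone cannot separate them. The players and games are made up, identical columns are only the simplest form of perfect multicollinearity, and this does not reproduce whatever fitting routine returned the NA values.

```python
from collections import defaultdict

# Toy with/without matrix: one row per game, 1 if the player appeared, else 0.
# Players and games are hypothetical.
games = [
    {"A": 1, "B": 1, "C": 0},
    {"A": 1, "B": 1, "C": 1},
    {"A": 0, "B": 0, "C": 1},
    {"A": 1, "B": 1, "C": 1},
]


def inseparable_groups(rows):
    """Group players whose columns are identical across every game."""
    by_column = defaultdict(list)
    for player in rows[0]:
        column = tuple(row[player] for row in rows)
        by_column[column].append(player)
    return [sorted(group) for group in by_column.values() if len(group) > 1]


print(inseparable_groups(games))  # [['A', 'B']]
```

Shorter windows make this more likely, since teammates have fewer games in which one plays and the other does not.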

Post #125 » by **Moonbeam** » Fri Aug 4, 2023 11:02 am

I'm looking at potential modifications to see if they may offer improvements. A couple main ones I've thought about for the WOWY matrix:

* Instead of a season-long MPG threshold (e.g. 18 MPG to be included), apply it game by game, so a player who plays >= 18 minutes in a game is counted, but one who plays fewer is not
* Instead of 1 or -1, include minutes played
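A minimal sketch of how the current encoding and the two proposed ones would differ for a single game. The sign convention (+1 for one team, -1 for its opponent), the function name, and the toy box score are assumptions for illustration rather than the actual WOWY matrix code.

```python
MINUTES_CUTOFF = 18  # the 18-minute threshold mentioned above


def wowy_row(box, season_mpg, mode):
    """Return one game's design-matrix row as {player: value}.

    box maps player -> (side, minutes), where side is +1 for one team and
    -1 for its opponent (an assumed sign convention).
    season_mpg maps player -> season minutes per game.
    """
    row = {}
    for player, (side, minutes) in box.items():
        if minutes <= 0:
            continue  # did not play, so the entry stays 0
        if mode == "season":
            # current approach: include by season-long MPG
            included = season_mpg[player] >= MINUTES_CUTOFF
            value = side
        elif mode == "game":
            # first proposal: include by minutes in this game
            included = minutes >= MINUTES_CUTOFF
            value = side
        elif mode == "minutes":
            # second proposal: signed minutes instead of 1 or -1
            included = True
            value = side * minutes
        else:
            raise ValueError(f"unknown mode: {mode}")
        if included:
            row[player] = value
    return row


# Hypothetical box score and season averages.
box = {"Starter": (1, 38), "Reserve": (1, 12), "Opponent": (-1, 40)}
season_mpg = {"Starter": 36.0, "Reserve": 19.0, "Opponent": 35.0}

for mode in ("season", "game", "minutes"):
    print(mode, wowy_row(box, season_mpg, mode))
```

The reserve counts under the season threshold (19 MPG) but drops out under the per-game threshold (12 minutes in this game).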

These are going to present some big challenges though. I've looked into the 1996-97 Utah Jazz as an example. John Stockton, Karl Malone, Antoine Carr, and Howard Eisley played all 102 games for Utah that season (82 regular season and 20 playoff games). Here's how their minutes played in games plots against Utah's margin of victory:

John Stockton: +8.7 on, +7.6 on-off

Correlation between MP and Utah's margin: -0.502

Karl Malone: +11.7 on, +21.9 on-off

Correlation between MP and Utah's margin: -0.558

Antoine Carr: -2.7 on, -14.5 on-off

Correlation between MP and Utah's margin: 0.179

Howard Eisley: +1.0 on, -7.8 on-off

Correlation between MP and Utah's margin: 0.531

So what's happening is that Utah being really good means they have more blowout wins than blowout losses, so Stockton and Malone see their fewest minutes when Utah has its biggest wins. Carr and especially Eisley benefit from this by tending to play more minutes in these blowouts.

A regression model using minutes played would likely think Malone and Stockton are negative-impact players because of this, and that Eisley is a positive-impact player. Their on-off scores tell the opposite story. Setting a minimum minute threshold of, say, 18 minutes would carve the more competitive games, where he played fewer minutes, out of Eisley's sample, potentially making this worse.

I could try to detect blowouts through the minute profile of the game and adjust thresholds and minutes accordingly, but it's going to take me a while to think about.
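The blowout effect can be reproduced with a handful of made-up games. The sketch below uses invented margins and minutes (not the Jazz data) for a strong team whose star sits in big wins while a reserve picks up the garbage time, then computes the Pearson correlation between minutes and margin for each.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented games: team margin, star minutes, reserve minutes.
margins = [25, 18, 9, 3, -4]
star_minutes = [28, 31, 36, 40, 41]
reserve_minutes = [30, 26, 18, 12, 11]

star_r = pearson(star_minutes, margins)
reserve_r = pearson(reserve_minutes, margins)
print(f"star: {star_r:.2f}, reserve: {reserve_r:.2f}")
```

The star's minutes correlate negatively with margin and the reserve's positively, even though the minutes are a consequence of the score here rather than a cause of it.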

WestGOAT · Post #126 » by **WestGOAT** » Fri Aug 4, 2023 12:46 pm

Moonbeam wrote:
Spoiler:
I'm looking at potential modifications to see if they may offer improvements. A couple main ones I've thought about for the WOWY matrix:

* Instead of a MPG for the season threshold (e.g. 18 MPG to be included), have it be a game-by-game thing, so a player who plays >= 18 minutes in a game is counted, but those who play fewer are not
* Instead of 1 or -1, include minutes played

These are going to present some big challenges though. I've looked into the 1996-97 Utah Jazz as an example. John Stockton, Karl Malone, Antoine Carr, and Howard Eisley played all 102 games for Utah that season (82 regular season and 20 playoff games). Here's how their minutes played in games plots against Utah's margin of victory:

John Stockton: +8.7 on, +7.6 on-off

Correlation between MP and Utah's margin: -0.502

Karl Malone: +11.7 on, +21.9 on-off

Correlation between MP and Utah's margin: -0.558

Antoine Carr: -2.7 on, -14.5 on-off

Correlation between MP and Utah's margin: 0.179

Howard Eisley: +1.0 on, -7.8 on-off

Correlation between MP and Utah's margin: 0.531

So what's happening is that Utah being really good means they have more blowout wins than blowout losses, so Stockton and Malone see their fewest minutes when Utah has its biggest wins. Carr and especially Eisley benefit from this by tending to play more minutes in these blowouts.

A regression model using minutes played would likely think Malone and Stockton are negative impact players because of this, and Eisley is a positive impact player. Their on-off scores tell the opposite story. Setting a minimum minute threshold of, say, 18 minutes will carve out the more competitive games where he played fewer minutes from Eisley's sample, potentially making this worse.

I could try to detect blowouts through the minute profile of the game and adjust thresholds and minutes accordingly, but it's going to take me awhile to think about.

Not sure if this will work as you intend, unless you also use the actual point margins that overlapped with the specific minutes played, and then you'd basically be doing something similar to RAPM (but with minutes instead of possessions?), right?

The rationale for taking MP into account is to better separate role players with limited minutes from big-minute players, right? Why not stick to the original WOWY matrix, but then scale the value you obtain by minutes played/48?

For example, Magic I believe had + points-margin/game (is this the right unit?) of 6? If he played 40 mpg then do 6*(40/48). In the case of Ed Nealy was it 4? If he averaged 15 mpg then 4*(15/48). If that makes sense.

edit:
okay maybe that doesn't make sense :lol:

now that I think more about it, if you want points-margin per minute in this case you have to do 6/40 for Magic and for Ed it would be 4/15. oops :lol:

eminence · Post #127 » by **eminence** » Fri Aug 4, 2023 1:19 pm

JE had similar issues with his 90s 'RAPM' and did his simulated box-score thing, but I really think that's getting too far into the weeds. I like it more as it currently is vs. going the estimation-within-a-simulation route.

How severe does the collinearity problem look at 3/4/5 year splits? I imagine below that it's extreme, and above that you're getting into career range.

Post #128 » by **Moonbeam** » Fri Aug 4, 2023 1:50 pm

WestGOAT wrote:
Moonbeam wrote:
Spoiler:
I'm looking at potential modifications to see if they may offer improvements. A couple main ones I've thought about for the WOWY matrix:

* Instead of a MPG for the season threshold (e.g. 18 MPG to be included), have it be a game-by-game thing, so a player who plays >= 18 minutes in a game is counted, but those who play fewer are not
* Instead of 1 or -1, include minutes played

These are going to present some big challenges though. I've looked into the 1996-97 Utah Jazz as an example. John Stockton, Karl Malone, Antoine Carr, and Howard Eisley played all 102 games for Utah that season (82 regular season and 20 playoff games). Here's how their minutes played in games plots against Utah's margin of victory:

John Stockton: +8.7 on, +7.6 on-off

Correlation between MP and Utah's margin: -0.502

Karl Malone: +11.7 on, +21.9 on-off

Correlation between MP and Utah's margin: -0.558

Antoine Carr: -2.7 on, -14.5 on-off

Correlation between MP and Utah's margin: 0.179

Howard Eisley: +1.0 on, -7.8 on-off

Correlation between MP and Utah's margin: 0.531

So what's happening is that Utah being really good means they have more blowout wins than blowout losses, so Stockton and Malone see their fewest minutes when Utah has its biggest wins. Carr and especially Eisley benefit from this by tending to play more minutes in these blowouts.

A regression model using minutes played would likely think Malone and Stockton are negative impact players because of this, and Eisley is a positive impact player. Their on-off scores tell the opposite story. Setting a minimum minute threshold of, say, 18 minutes will carve out the more competitive games where he played fewer minutes from Eisley's sample, potentially making this worse.

I could try to detect blowouts through the minute profile of the game and adjust thresholds and minutes accordingly, but it's going to take me awhile to think about.

Not sure if this will work as you intend, unless you also use the actual point margins that overlapped with the specific minutes played, and then you'd basically be doing something similar to RAPM (but with minutes instead of possessions?), right?

The rationale for taking MP into account is to better separate role players with limited minutes from big-minute players, right? Why not stick to the original WOWY matrix, but then scale the value you obtain by minutes played/48?

For example, Magic I believe had + points-margin/game (is this the right unit?) of 6? If he played 40 mpg then do 6*(40/48). In the case of Ed Nealy was it 4? If he averaged 15 mpg then 4*(15/48). If that makes sense.

edit:
okay maybe that doesn't make sense now that I think more about it, if you want points-margin per minute in this case you have to do 6/40 for Magic and for Ed it would be 4/15. oops

Ed Nealy GOAT arc confirmed!

Yeah, it’s going to be tricky coming up with a sensible approach to this. One relatively simple thing to do is to impose a penalty factor that is weighted by MPG (or total minutes) across the sample. That way the 18-20 MPG guys would have harsher penalties than the 40 MPG guys. They would cluster more around 0, which would likely make the higher-minute guys spread toward the extremes.
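One way to picture the MPG-weighted penalty is a ridge fit where each player gets his own penalty on the diagonal. The sketch below is a toy under stated assumptions: the scaling `base * 48 / mpg` is just one arbitrary choice, the two-player data are invented, and the small solver stands in for whatever library would be used in practice.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def weighted_ridge(X, y, penalties):
    """Minimise ||y - X b||^2 + sum_j penalties[j] * b[j]^2."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * target for row, target in zip(X, y)) for i in range(p)]
    for j in range(p):
        xtx[j][j] += penalties[j]  # per-player penalty on the diagonal
    return solve(xtx, xty)


# Invented data: two players with identical raw evidence (+10 in two games each).
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [10, 10, 10, 10]
mpg = [40, 20]
base = 2.0  # hypothetical base penalty
penalties = [base * 48 / m for m in mpg]  # lower MPG -> harsher penalty

beta = weighted_ridge(X, y, penalties)
print([round(b, 2) for b in beta])
```

With identical raw evidence, the 20 MPG player's coefficient ends up closer to 0 than the 40 MPG player's.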

Post #129 » by **Moonbeam** » Fri Aug 4, 2023 1:52 pm

eminence wrote:JE had similar issues with his 90s 'RAPM' and did his simulated box-score thing, but I really think that's getting too far into the weeds, I like it more as it is currently vs going the estimation within a simulation route.

How severe does the collinearity problem look at 3/4/5 year splits? I imagine below that it's extreme, and above that you're getting into career range.

Yeah, there is a sort of beauty in the simplicity of this as it currently stands. I’m still running the 3-year windows and have to see how they compare.

OhayoKD · Post #130 » by **OhayoKD** » Fri Aug 4, 2023 11:32 pm

Doctor MJ wrote:
eminence wrote:
Doctor MJ wrote:So, just wanted to have a post specifically for the 100th percentile surfers. Basically guys who regularly hit that 100th percentile in sustained runs in the 90s and above.

George Mikan
Bill Russell
Wilt Chamberlain
Oscar Robertson
Jerry West
Bill Walton
Larry Bird
Magic Johnson
Michael Jordan
Shaquille O'Neal

Honestly, seems about right. Curious who else is like that when we see more graphs.

My guesses would be Duncan/KG/Dirk/LeBron/CP3/Steph based off the more granular stuff, but who knows.

I would enjoy having some of this stuff in a spreadsheet/table to browse for sure.

Regardless, I do think the onus is finding arguments for the non-100th-percentile guys over the 100th-percentile guys.

Eh...not sure I agree with this.

First off, arguments have been made that involve much larger samples and which do not rely on data tied largely to when players happen to miss games:

Spoiler:

(Will circle back to this later)

More importantly, this (just like real RAPM) is not designed to distinguish between 71 or 72 Kareem or 96-98 Jordan. So sorting players by whether they hit the 99th or 100th percentile is kind of missing the point. If you want to compare the highest highs, RAPM and RAPM approximations are not designed for that, as they curve those highs down.

What matters here is frequency

Kareem is at or above the 90th percentile 12 times, scoring at the top level for nearly a decade. You might also note that he goes down when 72, 77, and 1980 are introduced: "peaks" which replace down-years in terms of on-court results, but where Kareem doesn't miss any time. There is also SRS suppression from 74 onward (you might notice that Jordan, by comparison, is suddenly skyrocketing when SRS for all the top teams goes up, after being well behind the pace for what is conventionally considered his prime).

And yet with all the above, Kareem still is constantly hovering around the top and then adds a bunch of value later.

Yet, applying a very arbitrary filter for one-offs, you've found a way to get him tiered below a shitton of players he looks as good as or better than when we do year-by-year analysis or focus on concentrated samples of off, and he is very clearly, "by impact," a much more clear-cut era #1.

I think the onus is on you to explain why --this-- matters more than all the other arguments/evidence people have made/offered, especially when we're sneaking in MJ, Bird, and Shaq alongside actual (empirical) impact kings like Magic and Russell and more consistent contenders (at least by this metric) like Wilt (who I do not think "seems right" according to your priors).

Also FWIW, I'm not sure putting all your stock in this does all that for Bird because even by the seasonal inputs of a guy who had him higher than Magic, he still fell down to 14th.

We literally have sourced sets for the metric this r-wowy is trying to emulate for Shaq (and Jordan during the years this metric says he peaked), and both fall considerably short of players you've excluded here like Duncan and KG. Shouldn't the onus be on you?

eminence · Post #131 » by **eminence** » Sat Aug 5, 2023 12:15 am

^Duncan/KG weren't in the doc yet when Doc made his post.

AEnigma · Post #132 » by **AEnigma** » Sat Aug 5, 2023 12:21 am

Garnett also does not fare especially well here relative to those raw Minnesota WOWY numbers, although again I wonder if that might be a consequence of how on-court results are being weighed (not saying that as a criticism; I imagine you have it set for whatever best correlates to RAPM).

OhayoKD · Post #133 » by **OhayoKD** » Sat Aug 5, 2023 12:32 am

AEnigma wrote:Garnett also does not fare especially well here relative to those raw Minnesota WOWY numbers, although again I wonder if that might be a consequence of how on-court results are being weighed (not saying that as a criticism; I imagine you have it set for whatever best correlates to RAPM).

isn't this supposed to be an rapm approximation?

though i guess i'm curious why rapm disagrees so much with moonbeam's method on shaq and kg

OhayoKD · Post #134 » by **OhayoKD** » Sat Aug 5, 2023 12:33 am

eminence wrote:^Duncan/KG weren't in the doc yet when Doc made his post.

fair enough

OhayoKD · Post #135 » by **OhayoKD** » Sat Aug 5, 2023 12:35 am

Moonbeam wrote:

Would it be possible to see a raw data chart like you did with the 80's for the other decades you have done?

Post #136 » by **Moonbeam** » Sat Aug 5, 2023 12:55 am

OhayoKD wrote:
AEnigma wrote:Garnett also does not fare especially well here relative to those raw Minnesota WOWY numbers, although again I wonder if that might be a consequence of how on-court results are being weighed (not saying that as a criticism; I imagine you have it set for whatever best correlates to RAPM).

isn't this supposed to be an rapm approximation?

though i guess i'm curious why rapm disagrees so much with moonbeam's method on shaq and kg

I think what might be happening here is some sort of winning bias. Garnett's teams generally capping out at good but not great might limit his ceiling. This could be offset in some cases if players miss a decent number of games to inform a "without" sample, but KG was an iron man in Minnesota for the most part.

Shaq, on the other hand, had better team success so his "on" baseline would generally be higher as a result. I think this might be part of what we are seeing with those dynasty teams Doc asked for --- there often is a bit of a high cluster for those players when those teams had lots of success.

I haven't done anything particularly special to weight the "on" vs. "off" sample. That it correlates moderately well to Cheema's stuff is pretty good, IMO, as Cheema's stuff has a few key differences from what I've done so far:

1. Cheema's RAPM is prior-informed, so it is not shrinking the coefficients toward 0 like I've done, but toward some other value that is informed by minutes per game adjusted for team quality. Doing this sort of puts a thumb on the scale in favor of players who play a good number of minutes on good teams.

2. Cheema's RAPM also assigns twice the weight to playoff games in comparison to regular season games, but I have weighted everything the same.

What I've put together at this stage is about as pure as it gets in terms of WOWY regression. I imagine if I similarly apply some sort of prior based on MPG adjusted for team quality and incorporated extra weight for playoff games, the results would correlate more strongly. Cheema has said that introducing the prior improves predictive performance, so it's certainly something worth considering, but I'd have to re-code everything as I've done mine using a frequentist approach instead of a Bayesian one.
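To show what the two differences do mechanically, here is a one-coefficient sketch: a ridge estimate shrunk toward a prior value instead of 0, with optional game weights so playoff games can count double. It is a simplification with invented margins, an invented prior, and an invented penalty, not Cheema's method or the code behind these results.

```python
def shrink_toward_prior(x, y, weights, lam, prior):
    """One-coefficient ridge.

    Minimises sum_i w_i * (y_i - x_i * b)^2 + lam * (b - prior)^2,
    whose solution is (sum w*x*y + lam*prior) / (sum w*x^2 + lam).
    """
    numerator = sum(w * xi * yi for w, xi, yi in zip(weights, x, y)) + lam * prior
    denominator = sum(w * xi * xi for w, xi in zip(weights, x)) + lam
    return numerator / denominator


# Invented margins for four games the player appeared in; the last two are playoffs.
x = [1, 1, 1, 1]
y = [4, 6, 10, 12]
lam = 4.0  # hypothetical penalty strength

equal = [1, 1, 1, 1]
playoffs_double = [1, 1, 2, 2]

toward_zero = shrink_toward_prior(x, y, equal, lam, prior=0.0)
toward_prior = shrink_toward_prior(x, y, equal, lam, prior=2.0)
playoff_weighted = shrink_toward_prior(x, y, playoffs_double, lam, prior=0.0)
print(toward_zero, toward_prior, playoff_weighted)  # 4.0 5.0 5.4
```

A positive prior pulls the estimate up relative to shrinking toward 0, and doubling playoff weight moves it toward the playoff results, which is the direction of the differences described above.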

The next thing I'm looking to do is compare predictive performance using withheld game data for these pure versions vs. some modifications I'm thinking up.

Post #137 » by **Moonbeam** » Sat Aug 5, 2023 12:56 am

OhayoKD wrote:
Moonbeam wrote:

Would it be possible to see a raw data chart like you did with the 80's for the other decades you have done?

Sure! Maybe what I can do in the short term is put together a spreadsheet with the top 30 or so players for the different 5-year windows. I've got 3-year windows now, too, if that might be of interest.

OhayoKD · Post #138 » by **OhayoKD** » Sat Aug 5, 2023 1:02 am

Moonbeam wrote:
OhayoKD wrote:
Moonbeam wrote:

Would it be possible to see a raw data chart like you did with the 80's for the other decades you have done?

Sure! Maybe what I can do in the short term is put together a spreadsheet with the top 30 or so players for the different 5-year windows. I've got 3-year windows now, too, if that might be of interest.

I believe it would be

Post #139 » by **Moonbeam** » Sat Aug 5, 2023 2:36 am

Here is a spreadsheet with up to 100 positive coefficients for each 5-year window for Ridge, Lasso, and ENet. I'll see if a spreadsheet with the full data is navigable and post separately if so.
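For anyone comparing the three columns, the sketch below shows how the penalties differ in the simplest possible case: a single coefficient with an unpenalized estimate `z`, minimising `0.5 * (z - b)**2 + l1 * abs(b) + 0.5 * l2 * b**2`. The function names and penalty values are assumptions for illustration; the real fits involve many correlated players, so this only conveys the general shape of the shrinkage.

```python
def soft_threshold(z, l1):
    """Move z toward 0 by l1, stopping at 0."""
    if z > l1:
        return z - l1
    if z < -l1:
        return z + l1
    return 0.0


def elastic_net_1d(z, l1, l2):
    """Minimiser of 0.5*(z - b)^2 + l1*|b| + 0.5*l2*b^2."""
    return soft_threshold(z, l1) / (1.0 + l2)


def ridge_1d(z, lam):
    return elastic_net_1d(z, 0.0, lam)  # proportional shrinkage only


def lasso_1d(z, lam):
    return elastic_net_1d(z, lam, 0.0)  # subtractive shrinkage, can hit 0


for z in (3.0, 0.4):
    print(z, ridge_1d(z, 1.0), lasso_1d(z, 1.0), elastic_net_1d(z, 0.5, 0.5))
```

Ridge scales every coefficient down but never to exactly 0, Lasso can set small ones to exactly 0, and the elastic net mixes the two.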

OhayoKD · Post #140 » by **OhayoKD** » Sat Aug 5, 2023 2:56 am

Moonbeam wrote:Here is a spreadsheet with up to 100 positive coefficients for each 5-year window for Ridge, Lasso, and ENet. I'll see if a spreadsheet with the full data is navigable and post separately if so.

So basically Russell and Magic look awesome and everyone else looks not so awesome :lol:

Also, uh:

Yikes!