Penalized Regression of WOWY data

Moderators: Clyde Frazier, Doctor MJ, trex_8063, penbeast0, PaulieWal

homecourtloss
RealGM
Posts: 11,322
And1: 18,729
Joined: Dec 29, 2012

Re: Penalized Regression of WOWY data 

Post#181 » by homecourtloss » Wed Aug 9, 2023 1:53 pm

eminence wrote:
ShaqAttac wrote:doesnt this mean colineary is overrating all of chicagos top players


Not what collinearity does. It makes us less sure of our results, but it can't just push the whole variable group up (or down) in the overall model.

In that particular case, I feel pretty confident saying it's likely Armstrong being brought along for the ride, not MJ/Pippen* - their numbers would be somewhat depressed by BJ's impressive result.

*nobody necessarily needs to be 'brought along' either, models can be reasonably accurate (in terms of telling us which variables are having what impact) in spite of collinearity

The overall model accuracy is not hurt by collinearity, though it does make it more likely to overfit your model.


It’s interesting looking at the three different models. I believe Moonbeam mentioned early in this thread that the Ridge version most resembles RAPM, since Ridge is the best method to deal with collinearity.

The LASSO charts are also interesting. Moon, I’m wondering which features in your data were selected out in LASSO, i.e., which coefficients were taken down to zero?
lessthanjake wrote:Kyrie was extremely impactful without LeBron, and basically had zero impact whatsoever if LeBron was on the court.

lessthanjake wrote: By playing in a way that prevents Kyrie from getting much impact, LeBron ensures that controlling for Kyrie has limited effect…
DraymondGold
Senior
Posts: 590
And1: 764
Joined: May 19, 2022

Re: Penalized Regression of WOWY data 

Post#182 » by DraymondGold » Wed Aug 9, 2023 5:20 pm

eminence wrote:
ShaqAttac wrote:doesnt this mean colineary is overrating all of chicagos top players


Not what collinearity does. It makes us less sure of our results, but it can't just push the whole variable group up (or down) in the overall model.

In that particular case, I feel pretty confident saying it's likely Armstrong being brought along for the ride, not MJ/Pippen* - their numbers would be somewhat depressed by BJ's impressive result.

*nobody necessarily needs to be 'brought along' either, models can be reasonably accurate (in terms of telling us which variables are having what impact) in spite of collinearity

The overall model accuracy is not hurt by collinearity, though it does make it more likely to overfit your model.
The idea that collinearity is bringing Armstrong up significantly (while slightly depressing MJ/Pippen) gains more credence when we look at the non-Bulls BJ sample.

Collinearity becomes an issue when players' with/without samples correlate strongly. BJ missed literally *1* Bulls game from 1990 to 1995. Pippen missed 14 games from 1990–1995, and Jordan missed 6 games from 1990–1993 (though he did miss more in 94/95). So there aren't many missed games to isolate BJ's value with.

Then we get to BJ's 1992–96 sample. Suddenly we go from having at most 1 missed game for BJ to a full season of missed games for BJ (he missed the full 1996 season for the Bulls, the full 1995 season for the Warriors, etc.). And BJ drops from a constant 90th percentile or better, often flirting with the 99th percentile, to a perennial 50th percentile player. Is BJ suddenly getting worse by trading his 23-year-old season for his 28-year-old season? Unlikely. Even more unlikely when we consider his box stats are all better as a 28-year-old than as a 23-year-old.

Instead it seems like having a larger off-sample makes it clear that BJ wasn't the main cause of the Bulls' success -- this would support the idea that collinearity was boosting BJ's numbers in the early 90s and downgrading the other Bulls' numbers more than the 'true' value.
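
To make the credit-splitting point concrete, here is a minimal sketch in Python. It is a toy, not the model used in this thread: plain unpenalized least squares with no intercept, two hypothetical players A and B, and made-up margins. A and B share every game except one that B played without A. The function name `split_credit` and all the numbers are assumptions for illustration only.

```python
def split_credit(together_margins, lone_game_margin):
    """Least-squares fit of margin = a * A_played + b * B_played (no intercept).

    A and B play together in every game of `together_margins`; B plays
    one extra game without A, with margin `lone_game_margin`.
    """
    n = len(together_margins)
    s = sum(together_margins)
    # Normal equations (X'X) beta = X'y, with design rows [1, 1] (n times)
    # and [0, 1] (once).
    xtx = [[n, n], [n, n + 1]]
    xty = [s, s + lone_game_margin]
    det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
    # Cramer's rule for the 2x2 system.
    a = (xtx[1][1] * xty[0] - xtx[0][1] * xty[1]) / det
    b = (xtx[0][0] * xty[1] - xtx[1][0] * xty[0]) / det
    return a, b


# Hypothetical point margins in five games A and B both played (average +8).
together = [10, 6, 8, 12, 4]

a_hi, b_hi = split_credit(together, lone_game_margin=5)    # B's lone game: +5
a_lo, b_lo = split_credit(together, lone_game_margin=-5)   # B's lone game: -5

print(a_hi, b_hi)  # 3.0 5.0
print(a_lo, b_lo)  # 13.0 -5.0
```

In this toy the combined estimate a + b always equals the average margin of the shared games (+8), which fits the earlier point that collinearity can't push the whole group up or down. How that total is divided between the two players, though, rests entirely on the single game that separates them.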

...

A similar argument could be made for Grant, though to a lesser extent. Grant looks like a strong positive over his full prime, and it makes sense that two of the Top 20 teams ever (91 and 92 Bulls) would require strong supporting players outside their GOAT-tier star. But exactly how positive was he in the late 80s and early 90s?

The issue with WOWY data is that we typically don’t have a large off-sample, which can lead to massive uncertainty bands. One thing you’ll notice in a lot of these ‘career curves’: many stars seem to look better in their first or second sample (when we have a sufficient off-sample for them) than they do in their fourth or fifth sample (when we have a much smaller off-sample). Shaq, Garnett, AD, Sam Jones, Wilt, Nate Thurmond, Kareem, Larry Bird, Jordan, Stockton, Rodman, and Drexler all have this shape. Are we to believe that all these players are getting worse after their first year? Doubtful. Instead, I think this is partially explained by a declining off-sample size making it harder to pin down their value accurately.
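
As a rough illustration of how quickly the uncertainty band widens when the off-sample shrinks, here is a small Python sketch. It uses the textbook standard error for a difference of two means (raw with/without, not the penalized regression from this thread), and it assumes independent games with a common spread. The 12-point single-game standard deviation and the two sample splits are made-up numbers, and `with_without_se` is a hypothetical helper name.

```python
import math


def with_without_se(n_on, n_off, game_sd):
    """Standard error of (mean margin with player) - (mean margin without),
    assuming independent games that share one standard deviation."""
    return game_sd * math.sqrt(1.0 / n_on + 1.0 / n_off)


GAME_SD = 12.0  # assumed spread of single-game point margins (hypothetical)

# Five 82-game seasons = 410 team games, split two hypothetical ways.
one_missed_game = with_without_se(n_on=409, n_off=1, game_sd=GAME_SD)
one_missed_season = with_without_se(n_on=328, n_off=82, game_sd=GAME_SD)

print(round(one_missed_game, 1), round(one_missed_season, 1))  # 12.0 1.5
```

Under these assumptions, a single missed game leaves the with/without gap uncertain by about as much as one game's noise, while a full missed season cuts that to roughly a point and a half.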

Other players seem to grade worse once we get a larger off-sample. Grant is one of those players. Grant’s first three samples hover around the ~80th percentile, when we have a large off-sample for him. He then jumps up when we trade 1986 (off sample) for 1991 (all-time-level team). Makes sense. Now some of this is likely Grant himself improving. But some of this may be boosted by collinearity with the other improving Bulls members, and a lack of an off-sample to effectively single out Grant’s contributions. Grant looks great from 1987–91 to 1990–94, when playing with a great cast and without much of an off-sample.

Then we get to 1991–1995, when Grant left the Bulls and we gain full-season-length off-samples for Grant… and suddenly Jordan looks significantly better than Grant (Jordan goes from 17–20% better in 89–93/90–94 to 57–63% better in 91–95/92–96). Is Grant suddenly getting worse at the age of 29, in the middle of a career that lasted until age 38? I’d say it’s more likely that the larger off-sample helps the model limit collinearity and lets us see that Jordan (and to a lesser extent Pippen) are significantly better than Grant, although Grant is still absolutely a positive contributor.
Moonbeam
Forum Mod - Blazers
Posts: 10,279
And1: 5,086
Joined: Feb 21, 2009
Location: Sydney, Australia
     

Re: Penalized Regression of WOWY data 

Post#183 » by Moonbeam » Sun Aug 13, 2023 12:20 am

homecourtloss wrote:
It’s interesting looking at the three different models. I believe Moonbeam mentioned early in this thread that the Ridge version most resembles RAPM, since Ridge is the best method to deal with collinearity.

The LASSO charts are also interesting. Moon, I’m wondering which features in your data were selected out in LASSO, i.e., which coefficients were taken down to zero?


The coefficients set to 0 are the impact coefficients for those players whose signals weren't strong enough to overcome the penalty. The same thing applies for Maxent. Roughly speaking, around 30-70% of the players in any of these 5-year windows have their coefficients zapped to 0.
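
For anyone who wants to see the mechanism, here is a minimal Python sketch of why an L1 penalty zeroes out weak signals while an L2 penalty only shrinks them. It covers the simplest textbook case (one coefficient at a time, as with an orthonormal design), not the actual fits behind the charts in this thread. The player labels, raw coefficients and penalty value are all made up, and `lasso_shrink` / `ridge_shrink` are hypothetical helper names.

```python
def lasso_shrink(ols_coef, penalty):
    """Soft-thresholding: minimizes 0.5 * (ols_coef - beta)**2 + penalty * abs(beta)."""
    magnitude = max(abs(ols_coef) - penalty, 0.0)
    if magnitude == 0.0:
        return 0.0  # signal too weak to overcome the penalty
    return magnitude if ols_coef > 0 else -magnitude


def ridge_shrink(ols_coef, penalty):
    """Minimizes 0.5 * (ols_coef - beta)**2 + 0.5 * penalty * beta**2."""
    return ols_coef / (1.0 + penalty)


# Hypothetical unpenalized impact estimates.
raw = {"star": 6.0, "starter": 2.5, "bench": 0.8, "deep_bench": -0.4}
penalty = 1.0

lasso = {name: lasso_shrink(coef, penalty) for name, coef in raw.items()}
ridge = {name: ridge_shrink(coef, penalty) for name, coef in raw.items()}

print(lasso)  # {'star': 5.0, 'starter': 1.5, 'bench': 0.0, 'deep_bench': 0.0}
print(ridge)
```

In this toy, every raw estimate smaller in size than the penalty lands exactly on 0 under the L1 rule, while the L2 rule pulls everything toward 0 without ever reaching it.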
eminence
RealGM
Posts: 16,863
And1: 11,698
Joined: Mar 07, 2015

Re: Penalized Regression of WOWY data 

Post#184 » by eminence » Sun Aug 13, 2023 12:25 am

Moonbeam wrote:

The coefficients set to 0 are the impact coefficients for those players whose signals weren't strong enough to overcome the penalty. The same thing applies for Maxent. Roughly speaking, around 30-70% of the players in any of these 5-year windows have their coefficients zapped to 0.


Is that 30-70% of guys who meet the minutes cut-off?
I bought a boat.
Moonbeam
Forum Mod - Blazers
Posts: 10,279
And1: 5,086
Joined: Feb 21, 2009
Location: Sydney, Australia
     

Re: Penalized Regression of WOWY data 

Post#185 » by Moonbeam » Sun Aug 13, 2023 12:26 am

DraymondGold wrote:
The idea that collinearity is bringing Armstrong up significantly (while slightly depressing MJ/Pippen) gains more credence when we look at the non-Bulls BJ sample.

Collinearity becomes an issue when players' with/without samples correlate strongly. BJ missed literally *1* Bulls game from 1990 to 1995. Pippen missed 14 games from 1990–1995, and Jordan missed 6 games from 1990–1993 (though he did miss more in 94/95). So there aren't many missed games to isolate BJ's value with.

Then we get to BJ's 1992–96 sample. Suddenly we go from having at most 1 missed game for BJ to a full season of missed games for BJ (he missed the full 1996 season for the Bulls, the full 1995 season for the Warriors, etc.). And BJ drops from a constant 90th percentile or better, often flirting with the 99th percentile, to a perennial 50th percentile player. Is BJ suddenly getting worse by trading his 23-year-old season for his 28-year-old season? Unlikely. Even more unlikely when we consider his box stats are all better as a 28-year-old than as a 23-year-old.

Instead it seems like having a larger off-sample makes it clear that BJ wasn't the main cause of the Bulls' success -- this would support the idea that collinearity was boosting BJ's numbers in the early 90s and downgrading the other Bulls' numbers more than the 'true' value.

...

A similar argument could be made for Grant, though to a lesser extent. Grant looks like a strong positive over his full prime, and it makes sense that two of the Top 20 teams ever (91 and 92 Bulls) would require strong supporting players outside their GOAT-tier star. But exactly how positive was he in the late 80s and early 90s?

The issue with WOWY data is that we typically don’t have a large off-sample, which can lead to massive uncertainty bands. One thing you’ll notice in a lot of these ‘career curves’: many stars seem to look better in their first or second sample (when we have a sufficient off-sample for them) than they do in their fourth or fifth sample (when we have a much smaller off-sample). Shaq, Garnett, AD, Sam Jones, Wilt, Nate Thurmond, Kareem, Larry Bird, Jordan, Stockton, Rodman, and Drexler all have this shape. Are we to believe that all these players are getting worse after their first year? Doubtful. Instead, I think this is partially explained by a declining off-sample size making it harder to pin down their value accurately.

Other players seem to grade worse once we get a larger off-sample. Grant is one of those players. Grant’s first three samples hover around the ~80th percentile, when we have a large off-sample for him. He then jumps up when we trade 1986 (off sample) for 1991 (all-time-level team). Makes sense. Now some of this is likely Grant himself improving. But some of this may be boosted by collinearity with the other improving Bulls members, and a lack of an off-sample to effectively single out Grant’s contributions. Grant looks great from 1987–91 to 1990–94, when playing with a great cast and without much of an off-sample.

Then we get to 1991–1995, when Grant left the Bulls and we gain full-season-length off-samples for Grant… and suddenly Jordan looks significantly better than Grant (Jordan goes from 17–20% better in 89–93/90–94 to 57–63% better in 91–95/92–96). Is Grant suddenly getting worse at the age of 29, in the middle of a career that lasted until age 38? I’d say it’s more likely that the larger off-sample helps the model limit collinearity and lets us see that Jordan (and to a lesser extent Pippen) are significantly better than Grant, although Grant is still absolutely a positive contributor.


Great breakdown of the nature of WOWY data and how team changes can have a significant impact on the percentiles. I'm working on a few variations which take into account minutes played in a game vs. MPG for a season. This might add a bit more nuance to the values, as B.J. Armstrong will likely get some games included from his rookie year this way, and changing minute profiles for Armstrong and Grant (at least from his rookie season) could also have some influence.
ShaqAttac
Rookie
Posts: 1,184
And1: 366
Joined: Oct 18, 2022
 

Re: Penalized Regression of WOWY data 

Post#186 » by ShaqAttac » Mon Aug 14, 2023 3:04 am

Moonbeam wrote:
Great breakdown of the nature of WOWY data and how team changes can have a significant impact on the percentiles. I'm working on a few variations which take into account minutes played in a game vs. MPG for a season. This might add a bit more nuance to the values, as B.J. Armstrong will likely get some games included from his rookie year this way, and changing minute profiles for Armstrong and Grant (at least from his rookie season) could also have some influence.

arent wowy samples bigger?

also im confused about zappin players to 0. what does that mean? are you not counting em
Moonbeam
Forum Mod - Blazers
Posts: 10,279
And1: 5,086
Joined: Feb 21, 2009
Location: Sydney, Australia
     

Re: Penalized Regression of WOWY data 

Post#187 » by Moonbeam » Mon Aug 14, 2023 4:28 am

ShaqAttac wrote:
arent wowy samples bigger?

also im confused about zappin players to 0. what does that mean? are you not counting em


The sample windows are bigger here, being 5 year samples, but as it is only relying on who played in a game, its "off" samples aren't necessarily that big, particularly for guys who didn't miss many games. That's why I've been using 5-year windows mostly. Adapting this to include some function of actual minutes played instead of in vs. out could help a bit with this.

The players whose coefficients are zapped to 0 are included in the model, but the model suggests that their impact coefficients aren't large enough to be distinguishable from 0.
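
To show the difference between the two encodings, here is a small Python sketch. It is only an illustration: the game log is made up, and dividing by 48 (capped at 1) is just one possible "function of actual minutes played", not necessarily the one that will end up in the model. `played_indicator` and `minutes_share` are hypothetical helper names.

```python
def played_indicator(minutes):
    """In vs. out encoding: 1 if the player appeared in the game, else 0."""
    return 1.0 if minutes > 0 else 0.0


def minutes_share(minutes, game_minutes=48.0):
    """Minutes-based encoding: share of a regulation game, capped at 1."""
    return min(minutes / game_minutes, 1.0)


# Hypothetical minutes played by one player over five team games.
game_log = [36, 0, 12, 41, 5]

in_out = [played_indicator(m) for m in game_log]
shares = [minutes_share(m) for m in game_log]

off_games = in_out.count(0.0)  # size of the "off" sample under in vs. out
print(in_out)     # [1.0, 0.0, 1.0, 1.0, 1.0]
print(off_games)  # 1
print(shares)
```

With in vs. out, the 5-minute and 12-minute games count as fully "on", so the off-sample is a single game. The minutes-based version treats those games as mostly "off", which gives the model more variation to work with.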
OhayoKD
Head Coach
Posts: 6,032
And1: 3,916
Joined: Jun 22, 2022
 

Re: Penalized Regression of WOWY data 

Post#188 » by OhayoKD » Thu Aug 17, 2023 8:35 am

Moonbeam wrote:
The sample windows are bigger here, being 5 year samples, but as it is only relying on who played in a game, its "off" samples aren't necessarily that big, particularly for guys who didn't miss many games. That's why I've been using 5-year windows mostly. Adapting this to include some function of actual minutes played instead of in vs. out could help a bit with this.

The players whose coefficients are zapped to 0 are included in the model, but the model suggests that their impact coefficients aren't large enough to be distinguishable from 0.

I believe they're talking per-season. Ex: you can theoretically get up to 82 games without a player if they miss a season. That advantage is largely neutered when you use longer time periods, though. Players usually don't miss that many games year after year, so longer time intervals end up muddying things.
its my last message in this thread, but I just admit, that all the people, casual and analytical minds, more or less have consencus who has the weight of a rubberized duck. And its not JaivLLLL
homecourtloss
RealGM
Posts: 11,322
And1: 18,729
Joined: Dec 29, 2012

Re: Penalized Regression of WOWY data 

Post#189 » by homecourtloss » Sun Aug 20, 2023 11:34 pm

Moon, with the activity going on in the other thread about Isiah Thomas, I was wondering if you could create a Ridge WOWY for Isiah Thomas, Joe Dumars, Dennis Rodman, Bill Laimbeer, John Salley, and Vinnie Johnson.
LA Bird
Analyst
Posts: 3,594
And1: 3,332
Joined: Feb 16, 2015

Re: Penalized Regression of WOWY data 

Post#190 » by LA Bird » Wed Oct 25, 2023 3:17 pm

Revisiting this thread for the top 100 project and was wondering if there is a version of this result table which filters out players with under 2000 min in any 5 year sample?

Moonbeam wrote:Here is a spreadsheet with up to 100 positive coefficients for each 5-year window for Ridge, Lasso, and ENet. I'll see if a spreadsheet with the full data is navigable and post separately if so.

Just so we don't have guys like Mac McClung cluttering up the leaderboard.
Moonbeam
Forum Mod - Blazers
Posts: 10,279
And1: 5,086
Joined: Feb 21, 2009
Location: Sydney, Australia
     

Re: Penalized Regression of WOWY data 

Post#191 » by Moonbeam » Fri Oct 27, 2023 5:46 am

LA Bird wrote:Revisiting this thread for the top 100 project and was wondering if there is a version of this result table which filters out players with under 2000 min in any 5 year sample?

Moonbeam wrote:Here is a spreadsheet with up to 100 positive coefficients for each 5-year window for Ridge, Lasso, and ENet. I'll see if a spreadsheet with the full data is navigable and post separately if so.

Just so we don't have guys like Mac McClung cluttering up the leaderboard.


I haven't put one together, but I'll see what I can do. :)
WestGOAT
Veteran
Posts: 2,594
And1: 3,518
Joined: Dec 20, 2015

Re: Penalized Regression of WOWY data 

Post#192 » by WestGOAT » Tue Oct 1, 2024 3:56 pm

Moonbeam wrote:
WestGOAT wrote:
Moonbeam wrote:


Still early stages, but I'm more than happy to share some details! Here is a link to a sample of the data I'm using:
https://docs.google.com/spreadsheets/d/15tkzunJ4S0t4USqn2I9C4G82m0JNOZBViMBjLrBvc04/edit?usp=sharing

Possessions in a game (Nan_POS) is indeed the y-variable. I used the formula provided by basketball-reference:
https://www.basketball-reference.com/about/glossary.html#:~:text=Poss%20%2D%20Possessions%20(available%20since%20the,FG)%20%2B%20Opp%20TOV)):

Nan_Phase, Nan_Season, Nan_Tm_ID, Nan_Opp are categorical variables that I have one-hot encoded for modelling purposes, so the number of predictor variables is actually larger than the number of columns in the Google Sheet.

I collected basic game log data for every team for the seasons of interest from bball-reference:
https://www.basketball-reference.com/teams/PHI/1986/gamelog/
https://www.basketball-reference.com/teams/PHI/1986_games.html (for total minutes played)

And supplemented it with regular season (RS) and playoff (PS) data:
https://www.basketball-reference.com/leagues/NBA_1986.html#totals-team
https://www.basketball-reference.com/leagues/NBA_1986.html#advanced-team

I have not decided which predictor variables to include in a final model. There is a lot of multicollinearity, as expected, and I haven't tried standardizing them yet. I also want to continue experimenting with different types of models, but a simple out-of-the-box multiple linear regression seems to be performing the best (on the validation dataset) so far :lol:



Thank you for those details! It is interesting that multiple linear regression is producing the best results so far. This probably deserves its own thread! If I do look to separate Offensive and Defensive versions of these RWOWY ratings, I would be keen to incorporate the estimates from your best models.

Perhaps someone like 70sFan or Squared2020, who have logged a lot of historical game data, might have tracked possessions as well that could be used for testing?


I finally came back to this and realized why the linear regression model was performing so well: the dependent (y) variable, possessions, is itself computed from the bball-reference formula (https://www.basketball-reference.com/about/glossary.html#:~:text=Poss%20-%20Possessions%20(available%20since%20the,FG)%20%2B%20Opp%20TOV)).), so it shouldn't be surprising that the best-fitting model would also be a linear regression :lol:.

Using actual possession counts from full games would be best; it would indeed be great if we had a representative sample of historic games. I'm guessing 300+ games pre-1985 would be a decent start.
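
For reference, here is a short Python sketch of the possession estimate as I understand it from the glossary linked above; please check it against that page before relying on it. The box-score numbers are made up, and the function names and dict layout are my own assumptions, not anything from the spreadsheet.

```python
def one_side(fga, fta, orb, opp_drb, fg, tov):
    """One team's possession estimate from its own box-score line."""
    missed_fg = fga - fg
    orb_rate = orb / (orb + opp_drb)
    return fga + 0.4 * fta - 1.07 * orb_rate * missed_fg + tov


def game_possessions(tm, opp):
    """Average of both teams' estimates.

    `tm` and `opp` are dicts with keys FGA, FTA, ORB, DRB, FG, TOV.
    """
    tm_est = one_side(tm["FGA"], tm["FTA"], tm["ORB"], opp["DRB"], tm["FG"], tm["TOV"])
    opp_est = one_side(opp["FGA"], opp["FTA"], opp["ORB"], tm["DRB"], opp["FG"], opp["TOV"])
    return 0.5 * (tm_est + opp_est)


# Hypothetical single-game box-score totals.
team = {"FGA": 90, "FTA": 25, "ORB": 12, "DRB": 30, "FG": 42, "TOV": 14}
opponent = {"FGA": 88, "FTA": 20, "ORB": 10, "DRB": 32, "FG": 40, "TOV": 15}

poss = game_possessions(team, opponent)
print(round(poss, 1))  # 99.1
```

Apart from the offensive rebound adjustment, every term is a fixed multiple of a box-score total, so a regression fed those same totals is mostly relearning the formula's weights.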
spotted in Bologna
