linear regression of wins in the 00s

qianlong · Post #1 » by **qianlong** » Mon Jun 21, 2010 10:26 pm

I have a statistics exam on Wednesday but I do not want to study too much, so I started working through NBA statistics in a more serious way. The data set is from www.databasebasketball.com and the program used is R.

I tried a linear regression of wins per season, pooling all teams together, for the years 2000 to 2008.

The original explanatory variables were: o_fgm o_fga o_ftm o_fta o_oreb o_dreb o_reb o_asts o_pf o_stl o_to o_blk o_3pm o_3pa o_pts d_fgm d_fga d_ftm d_fta d_oreb d_dreb d_reb d_asts d_pf d_stl d_to d_blk d_3pm d_3pa d_pts.

For reasons that at the moment are unknown to me the program did not like to use the data for points by defense or offense. Excluding the variables that were not significant one by one, I obtained this:

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.8628092  6.4222790   7.608 5.31e-13 ***
d_3pm       -0.0324696  0.0029758 -10.911  < 2e-16 ***
d_fgm       -0.0640980  0.0020655 -31.032  < 2e-16 ***
d_ftm       -0.0307657  0.0016568 -18.569  < 2e-16 ***
o_3pa        0.0127745  0.0006271  20.371  < 2e-16 ***
o_fga       -0.0048496  0.0017897  -2.710  0.00719 **
o_fgm        0.0700584  0.0018427  38.020  < 2e-16 ***
o_ftm        0.0314201  0.0015057  20.868  < 2e-16 ***
o_reb        0.0033790  0.0020020   1.688  0.09268 .
o_to        -0.0065929  0.0025324  -2.603  0.00977 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.763 on 256 degrees of freedom
Multiple R-squared: 0.9494, Adjusted R-squared: 0.9476
F-statistic: 533.8 on 9 and 256 DF, p-value: < 2.2e-16

qianlong · Post #2 » by **qianlong** » Mon Jun 21, 2010 10:50 pm

I would like to add some images but I don't know how to upload them directly from my computer.

Some quick comments on the results. The adjusted R-squared is very high, I mean really high. For those who don't know, it is a measure of the goodness of fit. The residuals behave properly; this is seen mostly from the plots.

On the variables.
From the long list of variables at the beginning, all seem important to basketball success, and for sure they are, but those that statistically contribute to winning are those that remain at the end. The most interesting thing is not what is still there but what is not. There is no assist variable. Steals, blocks and personal fouls are not significant; that is a lot of things.

Several variables are strongly correlated; offensive and defensive boards, for example, are summarized by total boards.

On the defensive side the only variables that matter are those that produce points: 3pm, fgm, ftm. On offense there is a similar trend, but rebounds and turnovers are added. This result really stresses the importance of turnovers for winning.

The signs of the variables are all in the right direction. One thing that looks very bad is the intercept: it should be around 41, not 48, and that needs some adjustment. There is a lot of room for improvement in the model, but I think this is already something.

All this analysis is not to say that those are the elements needed to win. I am not stating that you do not need assists; all I mean is that wins are better related to other variables.

If more explanation is needed I will be pleased to give it. I hope it is interesting to those who understand something about it, and not too boring for the others.

jicama · Post #3 » by **jicama** » Tue Jun 22, 2010 10:30 am

Assists by definition lead to FG, and so by counting the FG you are counting the value of the assist. Thus, no further correlation.

Box scores, by tradition, contain FG and FGA, FT and FTA. Here again, though, you're double dealing with a made FG or FT. If you convert to makes and misses -- FG and FGX, FT and FTX -- you're dealing with distinct events.
FGX = FGA - FG
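
A minimal sketch of that conversion, written in Python rather than R so it runs anywhere. The season totals are made-up numbers and the `to_makes_and_misses` helper is hypothetical, not something from the data set:

```python
def to_makes_and_misses(made, attempted):
    """Split attempts into two disjoint events: makes and misses."""
    if made > attempted:
        raise ValueError("made shots cannot exceed attempts")
    return made, attempted - made

# Made-up season totals, for illustration only.
o_fgm, o_fga = 3000, 6600
o_ftm, o_fta = 1500, 2000

fg, fgx = to_makes_and_misses(o_fgm, o_fga)   # FGX = FGA - FG
ft, ftx = to_makes_and_misses(o_ftm, o_fta)   # FTX = FTA - FT
print(fg, fgx, ft, ftx)
```

With makes and misses as separate columns, no made shot is counted in two variables at once.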

You've got a positive coefficient on 3FGA -- suggesting that just chucking threes is a good thing. Of course you've got to shoot some to make some, but if you had 3FGX (misses) as a variable, it would surely be negative.

Good luck with this.

pretender_2002 · Post #4 » by **pretender_2002** » Mon Aug 9, 2010 8:06 pm

I recently turned in a project comparing the 1990s Chicago Bulls using team statistics and individual player statistics. I learned several things about basketball and about trying to model wins with linear regression. In my linear regression the response variable was the point differential of each game.

A. "For reasons that at the moment are unknown to me the program did not like to use the data for points by defense or offense."

The reason the program probably doesn't like this is that if you use both the offensive and defensive points you have already explained who won. Let's say your response variable is the point differential, which is defined as offensive points - defensive points; then you would have no residuals and your r-squared value would be 1 (which is perfect). In fact the r-squared you reported should be very close to 1, as you can explain who won by D_FGM, D_FTM, D_3PM, O_FGM, O_FTM and O_3PM. This is because there is a formula for the point differential using just these stats, and that is:

Point Differential = [ 2*(O_FGM - O_3PM) + 3*O_3PM + O_FTM] - [ 2*(D_FGM - D_3PM) + 3*D_3PM + D_FTM]

That formula should give you a perfect fit line. All you did was change makes to attempts, which reduces the r-squared value but not by much. Really the only difference is what causes the opponent to miss, because that is the difference between 1 and the r-squared you got.
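
As a rough illustration of the formula, here is a small Python sketch (Python rather than R, purely for convenience). The totals are made-up numbers, the function names are hypothetical, and it assumes that FGM counts every made field goal, threes included, as the formula does:

```python
def total_points(fgm, tpm, ftm):
    """Points from made shots; fgm is assumed to include made threes."""
    return 2 * (fgm - tpm) + 3 * tpm + ftm

def point_differential(offense, defense):
    """Own points minus opponent points, each given as (fgm, 3pm, ftm)."""
    return total_points(*offense) - total_points(*defense)

# Made-up season totals (fgm, 3pm, ftm), for illustration only.
offense = (3000, 500, 1500)
defense = (2900, 450, 1400)

o_pts = total_points(*offense)
# The same total written as a plain linear combination of the three columns.
assert o_pts == 2 * offense[0] + offense[1] + offense[2]

print(o_pts, point_differential(offense, defense))
```

Because the points total is an exact linear combination of FGM, 3PM and FTM, a points column adds no new information once those three are in the model, whatever the response variable is. That may be why the program could not estimate a coefficient for it.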

B. Next, I am not sure why you say your intercept should be 41. If you use the point differential as your response variable, then ideally the intercept should be close to 0. Remember that the intercept is what you expect when all of the variables you are using are 0 (i.e. no rebounds, no 2pts., etc.).

C. Of course, if you have already included all these stats on 3PM, FGM, etc. for offense and defense, then nothing else should be important (rebounds don't count for points, etc.). This is not to say that rebounds are unimportant; it just means they do not add nearly as much as offensive points, etc. Also, there may be multicollinearity issues (which you should look up, but basically a team's steals and an opponent's turnovers may be highly correlated, and including both of them in your model causes weird results and inflates the standard errors of the coefficients). If you want to see whether there are any issues, you should generate a correlation matrix.

As far as advice on modeling the wins goes, I would use as few opponent statistics as possible. Also, I would stay away from things that can be used directly to calculate the total points (i.e. fgm); use fg% instead, which will lead to better conclusions anyway. Also, if you want to know whether rebounds are important, run a model with rebounds only (i.e. how well does the number of rebounds predict the point differential); if the p-value is under .05 then you know it was important. I also used statistics that I defined myself, like how many more defensive rebounds the team got, etc.
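
A minimal sketch of such a correlation matrix in plain Python (R's own `cor()` would do the same job on a data frame). The three columns are made-up numbers chosen so that steals and opponent turnovers move together, and the helper names are hypothetical:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}

# Made-up team-season totals, for illustration only.
stats = {
    "o_stl": [600, 650, 700, 580, 720],
    "d_to":  [1150, 1210, 1290, 1120, 1310],
    "o_reb": [3400, 3300, 3500, 3450, 3350],
}

matrix = correlation_matrix(stats)
for name, row in matrix.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```

Pairs with a correlation close to 1 or -1 are the ones to worry about; keeping only one variable of each such pair is a common first remedy.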

Step back and think for a moment about what your model says. If you can say that you can predict who wins using field goals made by both teams, three pointers made by both teams and free throws made by both teams, then I would say that really takes the meaning out of predicting.

qianlong · Post #5 » by **qianlong** » Thu Aug 12, 2010 1:22 pm

Thanks for the reply and the suggestions. I actually forgot about this post, so I'm probably not very precise in my reply; also, I don't know a lot of statistics.

Regarding the comments:

a) I think there is a misunderstanding about what the dependent variable is. It is wins per year, not point differential. For this reason I think that points scored and points allowed could be valuable stats, and they shouldn't correlate perfectly. I repeat myself to be clear: the record was the regressed variable. Moreover, I tried using points allowed and points scored separately and still the program couldn't calculate the parameters. I know there is something wrong, but I think it is something different from what you suggested.

b) For the same reason as before, since the explained variable is the record, I think the intercept should be 41. I agree that it should be zero if the variable were point differential.
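
One way to reconcile the two views on the intercept: in ordinary least squares the intercept is the predicted value when every predictor is 0, and it equals the mean of the response (41 wins on average in an 82-game season) only when the predictors are centered on their means. A minimal sketch in Python rather than R, with made-up numbers and a hypothetical `fit_line` helper for a single predictor:

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# Made-up data, for illustration only: made field goals and wins.
o_fgm = [2900, 3000, 3100, 3200, 3300]
wins = [30, 35, 42, 45, 53]          # mean is 41

raw_intercept, raw_slope = fit_line(o_fgm, wins)

# Center the predictor on its mean and fit again.
mean_fgm = sum(o_fgm) / len(o_fgm)
centered = [v - mean_fgm for v in o_fgm]
centered_intercept, centered_slope = fit_line(centered, wins)

print(raw_intercept, centered_intercept, raw_slope, centered_slope)
```

The slope is the same in both fits; only the intercept moves. With raw season totals as predictors, an intercept far from 41 is therefore not a sign of a problem by itself.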

I totally agree that there are some major problems of multicollinearity, as you suggested and as jicama did, and I agree that using only the team's own statistics should be better.
The model is really bad and, as you said, doesn't have any predictive power. I would have known how to do something better, or at least correct some big mistakes, but I was too lazy to manipulate the data in my spare time and go through it. I just put the data in and waited for the result. I know it is terrible practice.

For the future I'm hoping to do something more interesting, with more thinking behind it, and in that case I will ask for your opinions.