DQuinn1575 wrote:I get there is only one equation;
It is one equation for each possession. That makes roughly 230k equations per season, where the players are the independent variables and the result is the dependent variable. One equation in ordinary least squares regression (APM) looks like this:
Result = Const+x1P1+x2P2+x3P3+x4P4+x5P5-x6P6-x7P7-x8P8-x9P9-x10P10
P1 to P5 are the home players and x1 to x5 are their respective coefficients (APM values); P6 to P10 are the away players, with x6 to x10 as their APM values. The equation only holds if both groups of players are included, so an "adjustment" for the opponents happens automatically.
The const is the home-court advantage; this "adjustment" also happens automatically.
We could solve that iteratively, like Chicago76 suggests, but I prefer a faster way: turning it into a matrix algebra problem. Then we can solve it rather quickly, where the players are just 1 (home) or -1 (away) in a design matrix (0 if they are not on the court), and the results of the possessions are listed in a response vector.
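To make the single-possession equation concrete, here is a small sketch in Python. All numbers are made up for illustration (the coefficient values, the `const` of 0.015, and the player labels `H1`..`A5` are assumptions, not real APM results), and `expected_result` is just a hypothetical helper name.

```python
# Hypothetical APM values in points per possession (invented for illustration).
apm = {"H1": 0.03, "H2": 0.01, "H3": 0.00, "H4": -0.01, "H5": 0.02,
       "A1": 0.02, "A2": 0.00, "A3": -0.02, "A4": 0.01, "A5": 0.00}
const = 0.015  # assumed home-court advantage

home = ["H1", "H2", "H3", "H4", "H5"]
away = ["A1", "A2", "A3", "A4", "A5"]

def expected_result(home, away, apm, const):
    """Const + sum of home coefficients - sum of away coefficients."""
    return const + sum(apm[p] for p in home) - sum(apm[p] for p in away)

# Both lineups enter the same equation, so the opponents are
# accounted for without any separate adjustment step.
prediction = expected_result(home, away, apm, const)
```

Swapping the two lineups changes the sign of every player term but not of the constant, which is why the constant picks up the home-court effect.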
That would look like this:
β = (X^(T)X)^(-1) X^(T)y
with:
β is the coefficient vector (the APM values for each player)
X is the design matrix
X^T is the transpose of the design matrix
y is the response vector
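Here is a rough sketch of how the design matrix and the least squares solution could be set up with NumPy. The data is synthetic (12 players, 2000 random possessions, invented "true" values), so every number is an assumption. One practical note: because every row has five +1 and five -1 entries, the player columns always sum to zero, so X^(T)X cannot be inverted directly. The sketch therefore uses `np.linalg.lstsq`, which returns the minimum-norm solution of the same normal equations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_players, n_poss = 12, 2000                  # toy sizes, not a real season
true_beta = rng.normal(0.0, 0.05, n_players)  # invented player values
true_beta -= true_beta.mean()
hca = 0.02                                    # invented home-court advantage

# Design matrix: column 0 is the constant, then one column per player.
X = np.zeros((n_poss, n_players + 1))
X[:, 0] = 1.0
for i in range(n_poss):
    on_court = rng.choice(n_players, size=10, replace=False)
    X[i, 1 + on_court[:5]] = 1.0    # home players
    X[i, 1 + on_court[5:]] = -1.0   # away players

# Response vector: result of each possession (synthetic, with noise).
y = X @ np.concatenate(([hca], true_beta)) + rng.normal(0.0, 1.0, n_poss)

# Least squares solution; it satisfies X^T X beta = X^T y.
beta_ols, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
```

The reported `rank` is one less than the number of columns, which reflects that the player values are only determined up to a common shift.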
Now, the method actually used is ridge regression, which means a penalty term λ is added. That introduces a bias, but overall it helps keep the results in check (so to speak). In fact, it has been shown that for an ill-conditioned problem there is a λ > 0 for which ridge regression produces a lower mean squared error than OLS. And we have such a problem here. It is not for lack of data (roughly 230k equations, but only about 480 variables/players); it is because many players share the court most of the time, so their columns in X are highly collinear.
Now the ridge looks like this:
β = (X^(T)X + λI)^(-1) X^(T)y
As you can see, there is just a term λI added (that equation is without a prior, which could be added as another vector with selected values for each player or as a term forcing a specific distribution). Anyway, I is the identity matrix (basically just a bunch of zeros with ones on the main diagonal).
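A minimal sketch of the ridge solution with the λI term, again on synthetic data. The helper name `ridge`, the toy sizes, and the two λ values are assumptions for illustration. Like the formula, it penalizes every column, including the constant, and it has no prior.

```python
import numpy as np

def ridge(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y without forming the inverse."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Synthetic possessions (invented numbers, not real data).
rng = np.random.default_rng(1)
n_players, n_poss = 12, 500
true_beta = rng.normal(0.0, 0.05, n_players + 1)
X = np.zeros((n_poss, n_players + 1))
X[:, 0] = 1.0                       # constant (home-court) column
for i in range(n_poss):
    on_court = rng.choice(n_players, size=10, replace=False)
    X[i, 1 + on_court[:5]] = 1.0    # home players
    X[i, 1 + on_court[5:]] = -1.0   # away players
y = X @ true_beta + rng.normal(0.0, 1.0, n_poss)

beta_small = ridge(X, y, 1.0)       # light penalty
beta_large = ridge(X, y, 1000.0)    # heavy penalty, stronger shrinkage
```

With any λ > 0 the matrix X^(T)X + λI is invertible, so the system has a unique solution, and a larger λ pulls the coefficients closer to zero.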
The introduced bias would look like this:
bias(β) = -λUβ
with U = (X^(T)X + λI)^(-1)
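The bias formula can be checked numerically. In this sketch (synthetic data, invented `beta_true`, and λ = 50 chosen arbitrarily) the response is generated without noise, so the ridge estimate equals its expected value and the difference to the true coefficients is exactly the bias.

```python
import numpy as np

rng = np.random.default_rng(2)
n_players, n_poss = 12, 500                      # toy sizes
beta_true = rng.normal(0.0, 0.05, n_players + 1)  # invented coefficients

X = np.zeros((n_poss, n_players + 1))
X[:, 0] = 1.0                       # constant (home-court) column
for i in range(n_poss):
    on_court = rng.choice(n_players, size=10, replace=False)
    X[i, 1 + on_court[:5]] = 1.0    # home players
    X[i, 1 + on_court[5:]] = -1.0   # away players

lam = 50.0
U = np.linalg.inv(X.T @ X + lam * np.eye(X.shape[1]))

# Bias according to the formula: -lambda * U * beta
bias_formula = -lam * U @ beta_true

# Bias computed directly: ridge estimate on noise-free data minus the truth
expected_estimate = U @ X.T @ (X @ beta_true)
bias_direct = expected_estimate - beta_true
```

Both vectors agree up to floating-point error, since U X^(T)X β = (I - λU)β.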
Once we have the βs, we have the results. For OLS, the βs should average to 0 overall (weighted average); for ridge, the results need to be shifted so that the weighted average is 0. But no further "adjustment" is made.
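The shift could be sketched like this. Weighting by possessions played is an assumption here, and the ratings and possession counts are invented. `center_weighted` is a hypothetical helper name.

```python
import numpy as np

def center_weighted(values, weights):
    """Shift all values by the same amount so the weighted average is 0."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return values - np.average(values, weights=weights)

ratings = np.array([2.5, 1.0, -0.5, -1.5])        # hypothetical ridge output
possessions = np.array([6000, 4000, 3000, 1000])  # hypothetical weights

centered = center_weighted(ratings, possessions)
```

Every player is moved by the same constant, so the differences between players stay exactly as the regression produced them.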
DQuinn1575 wrote:1. Baseball basically ignores it and assumes things level out - they don't adjust for pitcher faced, defensive lineup faced, and only adjust park at a higher level.
2. It is easier to reverse test the results if opponents are held constant, as I can take the numbers into models it and see how accurate it is.
3. I have to believe once you adjust for team schedules the impact of opponents can't be that great. Within the same team, 2 regulars will see virtually the same average of opponents.
Even comparing Durant to LeBron, there can't be that big a difference in average opponents - I'm assuming once you adjust for schedule the impact can only be .2 to .3
1. I have no clue about baseball and have no desire to get into it. I simply don't care about that game, because it is probably the most boring thing I have ever witnessed.
2. That makes zero sense. The results can be tested out-of-sample with or without such an "adjustment". But given that it makes little sense to apply the regression to just some of the players on the court instead of all of them, I still have no idea what you want to accomplish anyway.
3. Again, that makes zero sense. The "adjustments" happen within the regression itself; no terms are added to the results afterwards or anything like that. It seems you are not really grasping what is done in the first place.