Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)
np.random.seed(1234) # for reproducibility
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
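Before fitting, it can help to confirm the design matrix has the intended layout. This quick sanity check is not part of the original notebook:

# Sanity check (not in the original notebook): the design matrix should have
# a constant column followed by x1, sin(x1) and (x1 - 5)**2.
print(X.shape)   # (50, 4)
print(X[:2])     # first two rows of the design matrix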
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.984
Model:                            OLS   Adj. R-squared:                  0.983
Method:                 Least Squares   F-statistic:                     956.6
Date:                Sat, 21 Sep 2024   Prob (F-statistic):           1.96e-41
Time:                        11:58:14   Log-Likelihood:                 1.2217
No. Observations:                  50   AIC:                             5.557
Df Residuals:                      46   BIC:                             13.20
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9654      0.084     59.175      0.000       4.796       5.134
x1             0.5088      0.013     39.314      0.000       0.483       0.535
x2             0.5651      0.051     11.109      0.000       0.463       0.668
x3            -0.0206      0.001    -18.144      0.000      -0.023      -0.018
==============================================================================
Omnibus:                        0.840   Durbin-Watson:                   2.269
Prob(Omnibus):                  0.657   Jarque-Bera (JB):                0.577
Skew:                          -0.263   Prob(JB):                        0.749
Kurtosis:                       2.972   Cond. No.                         221.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
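Beyond the printed summary, the fitted quantities are also available as attributes on the results object. A small illustration, not part of the original notebook:

# Access estimates programmatically instead of reading the summary text.
print(olsres.params)     # estimated coefficients (compare with beta above)
print(olsres.bse)        # standard errors of the coefficients
print(olsres.rsquared)   # in-sample R-squared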
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.44997408 4.96265684 5.43161584 5.82605136 6.12627912 6.32696437
6.4379984 6.4828734 6.49482276 6.51136093 6.56811991 6.69299498
6.90156162 7.1945165 7.55756297 7.96376004 8.37794864 8.76252818
9.08363424 9.31670235 9.45050393 9.48899104 9.45064714 9.36545023
9.26994764 9.20125134 9.19094055 9.25987342 9.41476003 9.64705998
9.93438555 10.24417991 10.53906618 10.78298825 10.9471348 11.01467283
10.98351335 10.86665452 10.69004615 10.4883262 10.29912985 10.15690615
10.08725815 10.10273638 10.20077685 10.36412228 10.56365745 10.76319271
10.92540989 11.01799353]
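For the estimation sample, predict(X) reproduces the fitted values stored on the results object, and the residuals give a rough check of the fit. A short check, not in the original notebook:

# In-sample predictions equal the stored fitted values.
print(np.allclose(ypred, olsres.fittedvalues))
# Residual standard deviation should be roughly the noise level sig = 0.25.
print(olsres.resid.std(ddof=1))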
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[11.00539682 10.84456472 10.55765996 10.1951886 9.82363439 9.50918122
9.30150901 9.2216303 9.25674569 9.36337756]
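predict returns point forecasts only. If interval estimates are needed, get_prediction returns an object whose summary_frame contains confidence and prediction intervals; a sketch, not part of the original notebook:

# Point forecasts plus confidence / prediction intervals for the new sample.
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05).head())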
Plot comparison¶
[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0xadde5de1e8ed>

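The comparison plot can also be extended with the out-of-sample confidence band from get_prediction; a hedged sketch (column names follow summary_frame's usual mean_ci_lower / mean_ci_upper layout), not part of the original notebook:

# Shade the 95% confidence band for the out-of-sample mean prediction.
frame = olsres.get_prediction(Xnew).summary_frame(alpha=0.05)
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1n, ynewpred, "r-", label="OLS forecast")
ax.fill_between(x1n, frame["mean_ci_lower"], frame["mean_ci_upper"],
                color="r", alpha=0.2, label="95% CI")
ax.legend(loc="best")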
Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier.
[8]:
from statsmodels.formula.api import ols
data = {"x1": x1, "y": y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use I() to indicate use of the Identity transform, i.e., we do not want any expansion magic from using **2.
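To confirm that the formula produced the intended columns, including the protected quadratic term, we can inspect the design matrix the model built; a quick check, not in the original notebook:

# Column names of the design matrix generated from the formula.
print(res.model.exog_names)
# expected: ['Intercept', 'x1', 'np.sin(x1)', 'I((x1 - 5) ** 2)']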
[9]:
res.params
[9]:
Intercept 4.965353
x1 0.508755
np.sin(x1) 0.565142
I((x1 - 5) ** 2) -0.020615
dtype: float64
Now we only have to pass the single variable x1 and we get the transformed right-hand side variables automatically.
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 11.005397
1 10.844565
2 10.557660
3 10.195189
4 9.823634
5 9.509181
6 9.301509
7 9.221630
8 9.256746
9 9.363378
dtype: float64
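As a consistency check (not in the original notebook), the formula-based forecasts should match the matrix-based ones computed earlier, and get_prediction can be used through the same interface if intervals are needed:

# Formula-based predictions agree with the earlier matrix-based predictions.
print(np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred))
# Interval estimates via the formula interface (assumes get_prediction accepts
# the same dict of new data that predict accepts).
print(res.get_prediction(exog=dict(x1=x1n)).summary_frame(alpha=0.05))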