org.apache.commons.math.stat.regression
Class SimpleRegression

java.lang.Object
  extended by org.apache.commons.math.stat.regression.SimpleRegression
All Implemented Interfaces:
java.io.Serializable

public class SimpleRegression
extends java.lang.Object
implements java.io.Serializable

Estimates an ordinary least squares regression model with one independent variable.

y = intercept + slope * x

Standard errors for intercept and slope are available as well as ANOVA, r-square and Pearson's r statistics.

Observations (x,y pairs) can be added to the model one at a time or they can be provided in a 2-dimensional array. The observations are not stored in memory, so there is no limit to the number of observations that can be added to the model.

Usage Notes:

Version:
$Revision: 348519 $ $Date: 2005-11-23 12:12:18 -0700 (Wed, 23 Nov 2005) $
See Also:
Serialized Form

Field Summary
private  long n
          number of observations
private static long serialVersionUID
          Serializable version identifier
private  double sumX
          sum of x values
private  double sumXX
          total variation in x (sum of squared deviations from xbar)
private  double sumXY
          sum of products
private  double sumY
          sum of y values
private  double sumYY
          total variation in y (sum of squared deviations from ybar)
private  double xbar
          mean of accumulated x values, used in updating formulas
private  double ybar
          mean of accumulated y values, used in updating formulas
 
Constructor Summary
SimpleRegression()
          Create an empty SimpleRegression instance
 
Method Summary
 void addData(double[][] data)
          Adds the observations represented by the elements in data.
 void addData(double x, double y)
          Adds the observation (x,y) to the regression data set.
 void clear()
          Clears all data from the model.
 double getIntercept()
          Returns the intercept of the estimated regression line.
private  double getIntercept(double slope)
          Returns the intercept of the estimated regression line, given the slope.
 double getInterceptStdErr()
          Returns the standard error of the intercept estimate, usually denoted s(b0).
 double getMeanSquareError()
          Returns the sum of squared errors divided by the degrees of freedom, usually abbreviated MSE.
 long getN()
          Returns the number of observations that have been added to the model.
 double getR()
          Returns Pearson's product moment correlation coefficient, usually denoted r.
 double getRegressionSumSquares()
          Returns the sum of squared deviations of the predicted y values about their mean (which equals the mean of y).
private  double getRegressionSumSquares(double slope)
          Computes SSR from b1.
 double getRSquare()
          Returns the coefficient of determination, usually denoted r-square.
private  double getRSquare(double b1)
          Computes r-square from the slope.
 double getSignificance()
          Returns the significance level of the slope (equivalently, of the correlation).
 double getSlope()
          Returns the slope of the estimated regression line.
 double getSlopeConfidenceInterval()
          Returns the half-width of a 95% confidence interval for the slope estimate.
 double getSlopeConfidenceInterval(double alpha)
          Returns the half-width of a (100-100*alpha)% confidence interval for the slope estimate.
 double getSlopeStdErr()
          Returns the standard error of the slope estimate, usually denoted s(b1).
 double getSumSquaredErrors()
          Returns the sum of squared errors (SSE) associated with the regression model.
private  double getSumSquaredErrors(double b1)
          Returns the sum of squared errors associated with the regression model, using the slope of the regression line.
private  TDistribution getTDistribution()
          Uses distribution framework to get a t distribution instance with df = n - 2
 double getTotalSumSquares()
          Returns the sum of squared deviations of the y values about their mean.
 double predict(double x)
          Returns the "predicted" y value associated with the supplied x value, based on the data that has been added to the model at the time this method is called.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

serialVersionUID

private static final long serialVersionUID
Serializable version identifier

See Also:
Constant Field Values

sumX

private double sumX
sum of x values


sumXX

private double sumXX
total variation in x (sum of squared deviations from xbar)


sumY

private double sumY
sum of y values


sumYY

private double sumYY
total variation in y (sum of squared deviations from ybar)


sumXY

private double sumXY
sum of products


n

private long n
number of observations


xbar

private double xbar
mean of accumulated x values, used in updating formulas


ybar

private double ybar
mean of accumulated y values, used in updating formulas

Constructor Detail

SimpleRegression

public SimpleRegression()
Create an empty SimpleRegression instance

Method Detail

addData

public void addData(double x,
                    double y)
Adds the observation (x,y) to the regression data set.

Uses updating formulas for means and sums of squares defined in "Algorithms for Computing the Sample Variance: Analysis and Recommendations", Chan, T.F., Golub, G.H., and LeVeque, R.J. 1983, American Statistician, vol. 37, pp. 242-247, referenced in Weisberg, S. "Applied Linear Regression". 2nd Ed. 1985

Parameters:
x - independent variable value
y - dependent variable value
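
The updating approach described above can be illustrated with a minimal sketch. It assumes the usual one-pass recurrences for running means and sums of squared deviations; the class name RunningRegressionSketch and its members are hypothetical and are not part of this library, and the library's own implementation may differ in detail.

```java
/** Hypothetical illustration class; not part of Commons Math. */
class RunningRegressionSketch {
    long n;        // number of observations seen so far
    double xbar;   // running mean of x
    double ybar;   // running mean of y
    double sumXX;  // sum of squared deviations of x from xbar
    double sumYY;  // sum of squared deviations of y from ybar
    double sumXY;  // sum of cross-products of the x and y deviations

    /** Folds one observation into the running statistics without storing it. */
    void add(double x, double y) {
        if (n == 0) {
            xbar = x;
            ybar = y;
        } else {
            double dx = x - xbar;
            double dy = y - ybar;
            double w = (double) n / (n + 1);  // weight applied to the new deviation
            sumXX += dx * dx * w;
            sumYY += dy * dy * w;
            sumXY += dx * dy * w;
            xbar += dx / (n + 1);
            ybar += dy / (n + 1);
        }
        n++;
    }

    /** Least squares slope; NaN when no observations have been added. */
    double slope() {
        return sumXY / sumXX;
    }

    /** Least squares intercept derived from the slope and the running means. */
    double intercept() {
        return ybar - slope() * xbar;
    }
}
```

Because only a handful of running totals are kept, memory use stays constant no matter how many observations are added.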

addData

public void addData(double[][] data)
Adds the observations represented by the elements in data.

(data[0][0],data[0][1]) will be the first observation, then (data[1][0],data[1][1]), etc.

This method does not replace data that has already been added. The observations represented by data are added to the existing dataset.

To replace all data, use clear() before adding the new data.

Parameters:
data - array of observations to be added
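
The row layout and the append-versus-replace behaviour described above can be sketched as follows. AppendingAccumulatorSketch is a hypothetical stand-in that only counts and sums observations; it mirrors the method names documented here but is not the library class.

```java
/** Hypothetical stand-in that mirrors the documented add/clear semantics. */
class AppendingAccumulatorSketch {
    private long n;
    private double sumX;
    private double sumY;

    /** Adds a single (x, y) observation. */
    void addData(double x, double y) {
        sumX += x;
        sumY += y;
        n++;
    }

    /** Adds every row of data, where row[0] is x and row[1] is y. */
    void addData(double[][] data) {
        for (double[] row : data) {
            addData(row[0], row[1]);
        }
    }

    /** Discards everything added so far. */
    void clear() {
        n = 0;
        sumX = 0;
        sumY = 0;
    }

    long getN() {
        return n;
    }

    double getSumX() {
        return sumX;
    }

    double getSumY() {
        return sumY;
    }
}
```

Calling addData twice accumulates both batches; calling clear() between the two calls leaves only the second batch.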

clear

public void clear()
Clears all data from the model.


getN

public long getN()
Returns the number of observations that have been added to the model.

Returns:
the number of observations that have been added

predict

public double predict(double x)
Returns the "predicted" y value associated with the supplied x value, based on the data that has been added to the model at the time this method is called.

predict(x) = intercept + slope * x

Preconditions:

Parameters:
x - input x value
Returns:
predicted y value

getIntercept

public double getIntercept()
Returns the intercept of the estimated regression line.

The least squares estimate of the intercept is computed using the normal equations. The intercept is sometimes denoted b0.

Preconditions:

Returns:
the intercept of the regression line

getSlope

public double getSlope()
Returns the slope of the estimated regression line.

The least squares estimate of the slope is computed using the normal equations. The slope is sometimes denoted b1.

Preconditions:

Returns:
the slope of the regression line

getSumSquaredErrors

public double getSumSquaredErrors()
Returns the sum of squared errors (SSE) associated with the regression model.

Preconditions:

Returns:
sum of squared errors associated with the regression model

getTotalSumSquares

public double getTotalSumSquares()
Returns the sum of squared deviations of the y values about their mean.

This is defined as SSTO here.

If n < 2, this returns Double.NaN.

Returns:
sum of squared deviations of y values

getRegressionSumSquares

public double getRegressionSumSquares()
Returns the sum of squared deviations of the predicted y values about their mean (which equals the mean of y).

This is usually abbreviated SSR or SSM. It is defined as SSM here.

Preconditions:

Returns:
sum of squared deviations of predicted y values
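
The three sums of squares documented above are related by the standard identity SSTO = SSR + SSE for a least squares line with an intercept. The following is a minimal two-pass sketch of that decomposition. It works on stored arrays rather than on running totals, and the class and method names are hypothetical.

```java
/** Hypothetical illustration class; not part of Commons Math. */
class SumsOfSquaresSketch {

    /** Returns {SSTO, SSR, SSE} for the least squares line fitted to (x[i], y[i]). */
    static double[] decompose(double[] x, double[] y) {
        int n = x.length;

        // First pass: means of x and y.
        double xbar = 0;
        double ybar = 0;
        for (int i = 0; i < n; i++) {
            xbar += x[i];
            ybar += y[i];
        }
        xbar /= n;
        ybar /= n;

        // Second pass: deviations about the means.
        double sxx = 0;
        double sxy = 0;
        double ssto = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - xbar;
            double dy = y[i] - ybar;
            sxx += dx * dx;
            sxy += dx * dy;
            ssto += dy * dy;
        }
        double slope = sxy / sxx;
        double intercept = ybar - slope * xbar;

        // Third loop: split the variation into fitted and residual parts.
        double ssr = 0;
        double sse = 0;
        for (int i = 0; i < n; i++) {
            double fitted = intercept + slope * x[i];
            ssr += (fitted - ybar) * (fitted - ybar);
            sse += (y[i] - fitted) * (y[i] - fitted);
        }
        return new double[] {ssto, ssr, sse};
    }
}
```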

getMeanSquareError

public double getMeanSquareError()
Returns the sum of squared errors divided by the degrees of freedom, usually abbreviated MSE.

If there are fewer than three data pairs in the model, or if there is no variation in x, this returns Double.NaN.

Returns:
sum of squared errors divided by the degrees of freedom (MSE)

getR

public double getR()
Returns Pearson's product moment correlation coefficient, usually denoted r.

Preconditions:

Returns:
Pearson's r

getRSquare

public double getRSquare()
Returns the coefficient of determination, usually denoted r-square.

Preconditions:

Returns:
r-square
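
For a single-predictor model, r-square and Pearson's r are conventionally tied to the sums of squares and to the sign of the slope. The sketch below assumes those textbook relationships (r-square = SSR / SSTO, and r carries the sign of the slope). The names are hypothetical, and the NaN handling of the library methods is not reproduced.

```java
/** Hypothetical illustration class; not part of Commons Math. */
class CorrelationSketch {

    /** Coefficient of determination as the share of total variation explained. */
    static double rSquare(double ssr, double ssto) {
        return ssr / ssto;
    }

    /** Pearson's r: the square root of r-square, signed like the slope. */
    static double r(double slope, double rSquare) {
        double root = Math.sqrt(rSquare);
        return slope < 0 ? -root : root;
    }
}
```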

getInterceptStdErr

public double getInterceptStdErr()
Returns the standard error of the intercept estimate, usually denoted s(b0).

If there are fewer than three observations in the model, or if there is no variation in x, this returns Double.NaN.

Returns:
standard error associated with intercept estimate

getSlopeStdErr

public double getSlopeStdErr()
Returns the standard error of the slope estimate, usually denoted s(b1).

If there are fewer than three data pairs in the model, or if there is no variation in x, this returns Double.NaN.

Returns:
standard error associated with slope estimate
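
The mean square error and the two standard errors follow the usual textbook formulas for simple regression, with n - 2 degrees of freedom. The sketch below assumes those formulas; the class and parameter names are hypothetical. Only the fewer-than-three-observations case is guarded here, so the library's handling of zero variation in x is not reproduced.

```java
/** Hypothetical illustration class; not part of Commons Math. */
class StdErrSketch {

    /** MSE = SSE / (n - 2); NaN when fewer than three observations are available. */
    static double meanSquareError(double sse, long n) {
        return n < 3 ? Double.NaN : sse / (n - 2);
    }

    /** s(b1) = sqrt(MSE / sumXX), where sumXX is the total variation in x. */
    static double slopeStdErr(double mse, double sumXX) {
        return Math.sqrt(mse / sumXX);
    }

    /** s(b0) = sqrt(MSE * (1/n + xbar^2 / sumXX)). */
    static double interceptStdErr(double mse, long n, double xbar, double sumXX) {
        return Math.sqrt(mse * (1.0 / n + xbar * xbar / sumXX));
    }
}
```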

getSlopeConfidenceInterval

public double getSlopeConfidenceInterval()
                                  throws MathException
Returns the half-width of a 95% confidence interval for the slope estimate.

The 95% confidence interval is

(getSlope() - getSlopeConfidenceInterval(), getSlope() + getSlopeConfidenceInterval())

If there are fewer than three observations in the model, or if there is no variation in x, this returns Double.NaN.

Usage Note:
The validity of this statistic depends on the assumption that the observations included in the model are drawn from a Bivariate Normal Distribution.

Returns:
half-width of 95% confidence interval for the slope estimate
Throws:
MathException - if the confidence interval cannot be computed.
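
Since the method returns a half-width rather than the interval itself, the caller assembles the bounds. The sketch below shows that step, together with the conventional construction of the half-width as a critical t value (n - 2 degrees of freedom) multiplied by the standard error of the slope. The critical value is supplied by the caller, and all names are hypothetical.

```java
/** Hypothetical illustration class; not part of Commons Math. */
class SlopeIntervalSketch {

    /** Half-width as a caller-supplied critical t value times s(b1). */
    static double halfWidth(double tCritical, double slopeStdErr) {
        return tCritical * slopeStdErr;
    }

    /** Returns {lower, upper} bounds of the interval centred on the slope estimate. */
    static double[] bounds(double slope, double halfWidth) {
        return new double[] {slope - halfWidth, slope + halfWidth};
    }
}
```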

getSlopeConfidenceInterval

public double getSlopeConfidenceInterval(double alpha)
                                  throws MathException
Returns the half-width of a (100-100*alpha)% confidence interval for the slope estimate.

The (100-100*alpha)% confidence interval is

(getSlope() - getSlopeConfidenceInterval(alpha), getSlope() + getSlopeConfidenceInterval(alpha))

To request, for example, a 99% confidence interval, use alpha = .01

Usage Note:
The validity of this statistic depends on the assumption that the observations included in the model are drawn from a Bivariate Normal Distribution.

Preconditions:

Parameters:
alpha - the desired significance level
Returns:
half-width of (100-100*alpha)% confidence interval for the slope estimate
Throws:
MathException - if the confidence interval cannot be computed.

getSignificance

public double getSignificance()
                       throws MathException
Returns the significance level of the slope (equivalently, of the correlation).

Specifically, the returned value is the smallest alpha such that the slope confidence interval with significance level equal to alpha does not include 0. On regression output, this is often denoted Prob(|t| > 0)

Usage Note:
The validity of this statistic depends on the assumption that the observations included in the model are drawn from a Bivariate Normal Distribution.

If there are fewer than three observations in the model, or if there is no variation in x, this returns Double.NaN.

Returns:
significance level for slope/correlation
Throws:
MathException - if the significance level cannot be computed.

getIntercept

private double getIntercept(double slope)
Returns the intercept of the estimated regression line, given the slope.

Will return NaN if slope is NaN.

Parameters:
slope - current slope
Returns:
the intercept of the regression line

getSumSquaredErrors

private double getSumSquaredErrors(double b1)
Returns the sum of squared errors associated with the regression model, using the slope of the regression line.

Returns NaN if the slope is NaN.

Parameters:
b1 - current slope
Returns:
sum of squared errors associated with the regression model

getRSquare

private double getRSquare(double b1)
Computes r-square from the slope.

Returns NaN if the slope is NaN.

Parameters:
b1 - current slope
Returns:
r-square

getRegressionSumSquares

private double getRegressionSumSquares(double slope)
Computes SSR from b1.

Parameters:
slope - regression slope estimate
Returns:
sum of squared deviations of predicted y values

getTDistribution

private TDistribution getTDistribution()
Uses distribution framework to get a t distribution instance with df = n - 2

Returns:
t distribution with df = n - 2