'''Multivariate adaptive regression splines (MARS)''' is a form of [[regression analysis]] introduced by [[Jerome H. Friedman]] in 1991.<ref>{{cite doi|10.1214/aos/1176347963}}</ref> It is a non-parametric regression technique
and can be seen as an extension of [[linear model]]s that
automatically models non-linearities and interactions between variables.
 
The term "MARS" is trademarked and licensed to Salford Systems. In order to avoid trademark infringements, many open source implementations of MARS are called "Earth".<ref>[http://cran.r-project.org/web/packages/earth/index.html CRAN Package earth]</ref><ref>[http://orange.biolab.si/blog/2011/12/20/earth-multivariate-adaptive-regression-splines/ Earth - Multivariate adaptive regression splines in Orange (Python machine learning library)]</ref>
 
== The basics ==
 
This section introduces MARS using a few examples.  We start with a set of data: a matrix of input variables ''x'', and a vector of the observed responses ''y'', with a response for each row in ''x''. For example, the data could be:
 
{|
! ''x''    !!  ''y''   
|-
| 10.5 ||  16.4
|-
| 10.7 ||  18.8
|-
| 10.8 ||  19.7
|-
| ...  ||  ... 
|-
| 20.6 ||  77.0
|}
 
Here there is only one [[Dependent and independent variables|independent variable]], so the ''x'' matrix is just a single column. Given these measurements, we would like to build a model which predicts the expected ''y'' for a given ''x''.
 
[[File:Friedmans mars linear model.png|frame|right|A linear model]]
A [[linear model]] for the above data is
 
: <math>
\hat{y} = -37 + 5.1 x
</math>
The hat on the <math>\hat{y}</math> indicates that <math>\hat{y}</math> is estimated from the data. The figure on the right shows a plot of this function:
a line giving the predicted <math>\hat{y}</math> versus ''x'', with the original values of ''y'' shown as red dots.
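
For illustration, such a line can be fitted by ordinary least squares. The following is a minimal Python sketch using NumPy; it uses only the four data rows shown in the table above (the full dataset is not reproduced here), so the recovered coefficients are illustrative and will not match the &minus;37 and 5.1 above exactly.
<syntaxhighlight lang="python">
import numpy as np

# The four (x, y) rows shown in the table above; the full dataset is
# not reproduced here, so the coefficients recovered below are
# illustrative and will not match -37 + 5.1x exactly.
x = np.array([10.5, 10.7, 10.8, 20.6])
y = np.array([16.4, 18.8, 19.7, 77.0])

# Ordinary least-squares fit of y_hat = b0 + b1 * x.
b1, b0 = np.polyfit(x, y, 1)
print("y_hat = %.1f + %.1f * x" % (b0, b1))
</syntaxhighlight>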
 
The data at the extremes of ''x'' indicate that the relationship between ''y'' and ''x'' may be non-linear (look at the red dots relative to the regression line at low and high values of ''x'').  We thus turn to MARS to automatically build a model that takes the non-linearity into account. MARS software constructs a model from the given ''x'' and ''y'' as follows:
 
: <math>
\begin{align}
\hat{y} = &\ 25 \\
& + 6.1 \max(0, x  - 13) \\
& - 3.1 \max(0, 13 - x) \\
\end{align}
</math>
 
[[File:Friedmans mars simple model.png|frame|right|A simple MARS model of the same data]]
 
The figure on the right shows a plot of this function: the predicted <math>\hat{y}</math> versus ''x'', with the original values of ''y'' once again shown as red dots. The predicted response is now a better fit to the original ''y'' values.
 
MARS has automatically produced a kink
in the predicted ''y'' to take into account non-linearity.
The kink is produced by ''hinge functions''.
The hinge functions are the expressions starting with <math>\max</math>
(where <math>\max(a,b)</math>
is <math>a</math> if <math>a > b</math>, else <math>b</math>).
Hinge functions are described in more detail below.
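
For concreteness, the model above can be evaluated directly. The following minimal Python sketch transcribes the formula from the text; the helper names <code>hinge</code> and <code>mars_simple</code> are ours, not part of any MARS package.
<syntaxhighlight lang="python">
def hinge(z):
    """The positive part max(0, z), the building block of MARS hinge functions."""
    return max(0.0, z)

def mars_simple(x):
    """The two-hinge model above, transcribed from the text."""
    return 25 + 6.1 * hinge(x - 13) - 3.1 * hinge(13 - x)

print(mars_simple(10.5))  # below the knot: 25 - 3.1 * (13 - 10.5) = 17.25
print(mars_simple(20.6))  # above the knot: 25 + 6.1 * (20.6 - 13) = 71.36
</syntaxhighlight>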
 
In this simple example, we can easily see from the plot that
''y'' has a non-linear relationship with ''x''
(and might guess that ''y'' varies with the square of ''x'').
However, in general there will be multiple
[[Dependent and independent variables|independent variables]],
and the relationship between ''y'' and these variables will be unclear
and not easily visible by plotting.
We can use MARS to discover that non-linear relationship.
 
An example MARS expression with multiple variables is
 
: <math>
\begin{align}
  \mathrm{ozone} = &\ 5.2 \\
&      +    0.93 \max(0, \mathrm{temp} - 58)  \\
&      -  0.64 \max(0, \mathrm{temp} - 68)  \\
&      -  0.046 \max(0, 234 - \mathrm{ibt})  \\
&      -  0.016 \max(0, \mathrm{wind} - 7) \max(0, 200 - \mathrm{vis})\\
\end{align}
</math>
[[File:Friedmans mars ozone model.png|frame|right|Variable interaction in a MARS model]]
 
This expression models air pollution (the ozone level)
as a function of the temperature and a few other variables.
Note that the last term in the formula (on the last line)
incorporates an interaction between <math>\mathrm{wind}</math>
and <math>\mathrm{vis}</math>.
 
The figure on the right plots the predicted
<math>\mathrm{ozone}</math> as <math>\mathrm{wind}</math> and
<math>\mathrm{vis}</math> vary,
with the other variables fixed at their median values.
The figure shows that wind does not affect the ozone
level unless visibility is low.
We see that MARS can build quite flexible regression surfaces
by combining hinge functions.
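
The ozone expression can likewise be transcribed directly into code. The sketch below is a literal Python rendering of the formula above; the comment on the interaction term shows why wind has no effect at high visibility.
<syntaxhighlight lang="python">
def hinge(z):
    return max(0.0, z)

def ozone_model(temp, ibt, wind, vis):
    """Literal transcription of the example MARS expression above."""
    return (5.2
            + 0.93  * hinge(temp - 58)
            - 0.64  * hinge(temp - 68)
            - 0.046 * hinge(234 - ibt)
            - 0.016 * hinge(wind - 7) * hinge(200 - vis))

# The last term is non-zero only when wind > 7 AND vis < 200, which is
# why the surface shows no wind effect on ozone at high visibility.
print(ozone_model(temp=75, ibt=120, wind=10, vis=150))
</syntaxhighlight>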
 
To obtain the above expression, the MARS model building procedure
automatically selects which variables to use (some variables are
important, others not), the positions of the kinks in the hinge
functions, and how the hinge functions are combined.
 
== The MARS model ==
 
MARS builds models of the form
 
: <math>\hat{f}(x) = \sum_{i=1}^{k} c_i B_i(x) </math>.
 
The model is a weighted sum of basis functions
<math>B_i(x)</math>.
Each <math>c_i</math> is a constant coefficient.
For example, each line in the formula for ozone above is one basis function
multiplied by its coefficient.
 
Each [[basis function]]
<math>B_i(x)</math>
takes one of the following three forms:
 
1) a constant 1. There is just one such term, the intercept.
In the ozone formula above, the intercept term is 5.2.
 
2) a ''hinge'' function.
A hinge function has the form
<math> \max(0, x - const) </math>
or
<math> \max(0, const - x) </math>.
MARS automatically selects variables
and values of those variables for knots of the hinge functions.
Examples of such basis functions can be seen
in the middle three lines of the ozone formula.
 
3) a product of two or more hinge functions.
These basis functions can model interaction between two or more variables.
An example is the last line of the ozone formula.
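
The three forms can be written out as code. The following Python sketch is illustrative only: the variable indices and knot values are invented for the example, and <code>mars_model</code> simply computes the weighted sum <math>\sum_i c_i B_i(x)</math> defined above.
<syntaxhighlight lang="python">
def hinge(z):
    return max(0.0, z)

# The three forms of basis function, as plain Python functions of an
# input row x (a sequence of predictor values). Indices and knots are
# invented for illustration.
B_intercept = lambda x: 1.0                                  # form 1: the constant
B_hinge     = lambda x: hinge(x[0] - 13)                     # form 2: one hinge, knot 13
B_product   = lambda x: hinge(x[1] - 7) * hinge(200 - x[2])  # form 3: interaction

def mars_model(x, coeffs, basis):
    """The weighted sum of basis functions defined above."""
    return sum(c * B(x) for c, B in zip(coeffs, basis))

print(mars_model([15, 10, 150], [25.0, 6.1, -0.016],
                 [B_intercept, B_hinge, B_product]))
</syntaxhighlight>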
 
== Hinge functions ==
 
[[File:Friedmans mars hinge functions.png|frame|right|A mirrored pair of hinge functions with a knot at x=3.1]]
 
Hinge functions are a key part of MARS models.
A hinge function takes the form
: <math>\max(0,x-c)</math>
or
: <math>\max(0,c-x)</math>
where <math>c</math> is a constant, called the ''knot''.
The figure on the right shows a mirrored pair of hinge functions with a knot at 3.1.
 
A hinge function is zero for part of its range, so
can be used to partition the data into disjoint regions,
each of which can be treated independently.
Thus for example
a mirrored pair of hinge functions in the expression
: <math>
6.1 \max(0, x  - 13)
- 3.1 \max(0, 13 - x)
</math>
creates the [[piecewise]] linear graph shown for the
simple MARS model in the previous section.
 
One might assume that only piecewise
linear functions can be formed from hinge functions, but
hinge functions can be multiplied together to form non-linear functions.
 
Hinge functions are also called [[Ice hockey stick|hockey stick]] functions.
Instead of the <math>\max</math> notation used in this article,
hinge functions are often represented by
<math>[\pm(x_i - c)]_+</math>
where <math>[]_+</math> means take the positive part.
 
== The model building process ==
 
MARS builds a model in two phases:
the forward and the backward pass.
This two stage approach is the same as that used by
[[recursive partitioning]] trees.
 
=== The forward pass ===
 
MARS starts with a model which consists of just the intercept term
(which is the mean of the response values).
 
MARS then repeatedly adds basis functions in pairs to the model.
At each step it finds the pair of basis functions that
gives the maximum reduction in sum-of-squares
[[Errors and residuals in statistics|residual]] error
(it is a [[greedy algorithm]]).
The two basis functions in the pair
are identical except that a different
side of a mirrored hinge function is used for each function.
Each new basis function consists of
a term already in the model
(which could be the intercept, i.e. the constant 1)
multiplied by a new hinge function.
A hinge function is defined by a variable and a knot,
so to add a new basis function, MARS must search over
all combinations of the following:
 
1) existing terms (called ''parent terms'' in this context)
 
2) all variables (to select one for the new basis function)
 
3) all values of each variable (for the knot of the new hinge function).
 
This process of adding terms continues until
the change in residual error is too small to continue
or until the maximum number of terms is reached.
The maximum number of terms
is specified by the user before model building starts.
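
The following Python sketch outlines the forward pass in its simplest form. It is deliberately naive: it refits each candidate model from scratch by ordinary least squares, whereas real implementations use the fast update technique discussed next, and it stops only at the term limit, omitting the small-improvement stopping test.
<syntaxhighlight lang="python">
import numpy as np

def hinge(z):
    return np.maximum(0.0, z)

def forward_pass(X, y, max_terms):
    """Naive sketch of the MARS forward pass. Each candidate model is
    refit from scratch by ordinary least squares; real implementations
    use a fast least-squares update technique, and also stop when the
    improvement in residual error becomes too small."""
    n, p = X.shape
    B = [np.ones(n)]          # columns of the basis matrix; intercept first
    terms = [None]            # bookkeeping: (parent, variable, knot) per pair

    def rss(cols):
        Bmat = np.column_stack(cols)
        coef, *_ = np.linalg.lstsq(Bmat, y, rcond=None)
        return np.sum((y - Bmat @ coef) ** 2)

    while len(B) + 2 <= max_terms:
        best = None
        # brute-force search: every parent term x variable x knot value
        for parent in range(len(B)):
            for var in range(p):
                for knot in np.unique(X[:, var]):
                    pair = [B[parent] * hinge(X[:, var] - knot),
                            B[parent] * hinge(knot - X[:, var])]
                    r = rss(B + pair)
                    if best is None or r < best[0]:
                        best = (r, parent, var, knot, pair)
        _, parent, var, knot, pair = best
        B += pair             # add the mirrored pair of basis functions
        terms.append((parent, var, knot))
    return B, terms
</syntaxhighlight>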
 
The search at each step is done in a [[Brute-force search|brute force]] fashion,
but a key aspect of MARS is that
because of the nature of hinge functions
the search can be done relatively
quickly using a fast least-squares update technique.
The search can be sped up further with a [[Heuristics|heuristic]]
that reduces the number
of parent terms considered at each step
("Fast MARS"<ref>[[Friedman, J. H.]] (1993)
''Fast MARS'',
Stanford University Department of Statistics, Technical Report 110
</ref>).
 
=== The backward pass ===
 
The forward pass usually builds an [[overfit]] model.
(An overfit model has a good fit to the data used to build
the model but will not generalize well to new data.)
To build a model with better generalization ability,
the backward pass prunes the model.
It removes terms one by one,
deleting the least effective term at each step
until it finds the best submodel.
Model subsets are compared using the GCV criterion described below.
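
A sketch of the backward pass in the same spirit as the forward-pass sketch above: here <code>B</code> is the list of basis-function columns built by the forward pass, and <code>gcv(rss, n_obs, n_terms)</code> is a function implementing the criterion defined in the next subsection (a matching sketch is given there).
<syntaxhighlight lang="python">
import numpy as np

def backward_pass(B, y, gcv):
    """Sketch of the backward pass. B is the list of basis-function
    columns built by the forward pass (intercept first);
    gcv(rss, n_obs, n_terms) is the criterion defined below."""
    n = len(y)

    def score(subset):
        Bmat = np.column_stack([B[i] for i in subset])
        coef, *_ = np.linalg.lstsq(Bmat, y, rcond=None)
        rss = np.sum((y - Bmat @ coef) ** 2)
        return gcv(rss, n, len(subset))

    current = list(range(len(B)))
    best_subset, best_score = list(current), score(current)
    while len(current) > 1:
        # try deleting each non-intercept term; keep the least damaging deletion
        trials = [[t for t in current if t != drop] for drop in current[1:]]
        scores = [score(sub) for sub in trials]
        current = trials[int(np.argmin(scores))]
        if min(scores) < best_score:
            best_subset, best_score = current, min(scores)
    return best_subset
</syntaxhighlight>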
 
The backward pass has an advantage over the forward pass:
at any step it can choose any term to delete,
whereas the forward pass
at each step can only see the next pair of terms.
 
The forward pass adds terms in pairs,
but the backward pass typically discards one side of the pair
and so terms are often not seen in pairs in the final model.
A paired hinge can be seen in
the equation for <math>\hat{y}</math> in the
first MARS example above;
there are no complete pairs retained in the ozone example.
 
=== Generalized cross validation (GCV) ===
 
The backward pass uses GCV to compare the performance of model subsets in order to choose the best subset: lower values of GCV are better.
The GCV is a form of
[[Regularization (machine learning)|regularization]]:
it trades off goodness-of-fit against model complexity.
 
(We want to estimate how well a model performs on ''new'' data, not on the training data.  Such new data is usually not available at the time of model building, so instead we use GCV to estimate what performance would be on new data. The raw [[Residual sum of squares|residual sum-of-squares]] (RSS) on the training data is inadequate for comparing models, because the RSS always increases as MARS terms are dropped. In other words, if the RSS were used to compare models, the backward pass would always choose the largest model—but the largest model typically does not have the best generalization performance.)
 
The formula for the GCV is
 
'''GCV = RSS / (N * (1 - EffectiveNumberOfParameters / N)^2)'''
 
where '''RSS''' is the residual sum-of-squares
measured on the training data and '''N''' is the
number of observations (the number of rows in the '''x''' matrix).
 
The '''EffectiveNumberOfParameters''' is defined in
the MARS context as
 
'''EffectiveNumberOfParameters = NumberOfMarsTerms + Penalty * (NumberOfMarsTerms - 1) / 2'''
 
where '''Penalty''' is about 2 or 3 (the
MARS software allows the user to preset Penalty).
 
Note that
'''(NumberOfMarsTerms - 1) / 2'''
is the number of hinge-function knots,  
so the formula penalizes the addition of knots.
Thus the GCV formula adjusts (i.e. increases) the training RSS to take into
account the flexibility of the model.
We penalize flexibility because models that are too flexible will model the specific realization of noise in the data instead of just the systematic structure of the data.
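
In code, the formula is a one-liner. The following Python sketch matches the <code>gcv</code> callable assumed by the backward-pass sketch above; <code>penalty</code> defaults to 3, within the usual range of 2 to 3.
<syntaxhighlight lang="python">
def gcv(rss, n_obs, n_terms, penalty=3.0):
    """GCV as defined above. penalty is the user-settable knot penalty,
    typically about 2 or 3."""
    effective_params = n_terms + penalty * (n_terms - 1) / 2.0
    return rss / (n_obs * (1.0 - effective_params / n_obs) ** 2)
</syntaxhighlight>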
 
Generalized Cross Validation is so named because
it uses a formula to approximate the error
that would be determined by leave-one-out validation.
It is just an approximation but works well in practice.
The GCV criterion was introduced by Craven and
[[Grace Wahba|Wahba]] and extended by Friedman for MARS.
 
=== Constraints ===
 
One constraint has already been mentioned: the user
can specify the maximum number of terms in the forward pass.
 
A further constraint can be placed on the forward pass
by specifying a maximum allowable degree of interaction.
Typically only one or two degrees of interaction are allowed,
but higher degrees can be used when the data warrants it.
The maximum degree of interaction in the first MARS example
above is one (i.e. no interactions, giving an ''additive model'');
in the ozone example it is two.
 
Other constraints on the forward pass are possible.
For example, the user can specify that interactions are allowed
only for certain input variables.
Such constraints could make sense because of knowledge
of the process that generated the data.
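
As a concrete example, such constraints are typically exposed as model parameters. The sketch below uses the <code>py-earth</code> package listed under Software; the parameter names <code>max_terms</code>, <code>max_degree</code>, and <code>penalty</code> follow that package's interface as we understand it, and should be checked against its documentation.
<syntaxhighlight lang="python">
import numpy
from pyearth import Earth  # py-earth; parameter names below follow its docs

# Illustrative data: y depends non-linearly on the first predictor only.
rng = numpy.random.RandomState(0)
X = rng.uniform(0, 20, size=(200, 3))
y = (X[:, 0] - 10) ** 2 + rng.normal(0, 1, size=200)

# max_terms bounds the forward pass, max_degree bounds interactions
# (max_degree=1 forces an additive model), penalty is the GCV knot penalty.
model = Earth(max_terms=21, max_degree=1, penalty=3.0)
model.fit(X, y)
print(model.summary())
</syntaxhighlight>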
 
== Pros and cons ==
 
No regression modeling technique is best for all situations.
The guidelines below are intended to give an idea of the pros and cons of MARS,
but there will be exceptions to the guidelines.
It is useful to compare MARS to [[recursive partitioning]] and this is done below.
(Recursive partitioning is also commonly called ''regression trees'',
''decision trees'', or [[Predictive_analytics#Classification_and_regression_trees|CART]];
see the [[Decision tree learning|recursive partitioning]] article for details).
 
*MARS models are more flexible than [[linear regression]] models.
 
*MARS models are simple to understand and interpret. Compare the equation for ozone concentration above to, say, the innards of a trained [[Artificial neural network|neural network]] or a [[random forest]].
 
*MARS can handle both continuous and categorical data.<ref>[[Friedman, J. H.]] (1993) ''Estimating Functions of Mixed Ordinal and Categorical Variables Using Adaptive Splines'', New Directions in Statistical Data Analysis and Robustness (Morgenthaler, Ronchetti, Stahel, eds.), Birkhauser</ref> MARS tends to be better than recursive partitioning for numeric data because hinges are more appropriate for numeric variables than the piecewise constant segmentation used by recursive partitioning.
 
*Building MARS models often requires little or no data preparation. The hinge functions automatically partition the input data, so the effect of outliers is contained.  In this respect MARS is similar to [[recursive partitioning]] which also partitions the data into disjoint regions, although using a different method.  (Nevertheless, as with most statistical modeling techniques, known outliers should be considered for removal before training a MARS model.)
 
*MARS (like recursive partitioning) does ''automatic variable selection'' (meaning it includes important variables in the model and excludes unimportant ones). However, bear in mind that variable selection is not a clean problem and there is usually some arbitrariness in the selection, especially in the presence of [[Multicollinearity|collinearity]] and 'concurvity'.
 
*MARS models tend to have a good bias-variance trade-off.  The models are flexible enough to model non-linearity and variable interactions (thus MARS models have fairly low bias), yet the constrained form of MARS basis functions prevents too much flexibility (thus MARS models have fairly low variance).
 
*MARS is suitable for handling fairly large datasets.  It is a routine matter to build a MARS model from an input matrix with, say, 100 predictors and 10<sup>5</sup> observations. Such a model can be built in about a minute on a 1&nbsp;GHz machine, assuming the maximum degree of interaction of MARS terms is limited to one (i.e. additive terms only). A degree two model with the same data on the same 1&nbsp;GHz machine takes longer—about 12 minutes.  Be aware that these times are highly data dependent. Recursive partitioning is much faster than MARS.
 
*With MARS models, as with any non-parametric regression, parameter confidence intervals and other checks on the model cannot be calculated directly (unlike [[linear regression]] models). [[Cross-validation (statistics)|Cross-validation]] and related techniques must be used for validating the model instead.
 
*MARS models do not give as good fits as [[Boosting (meta-algorithm)|boosted]] trees, but can be built much more quickly and are more interpretable. (An 'interpretable' model is in a form that makes it clear what the effect of each predictor is.)
 
*The <code>earth</code>, <code>mda</code>, and <code>polspline</code> implementations do not allow missing values in predictors, but free implementations of regression trees (such as <code>rpart</code> and <code>party</code>) do allow missing values using a technique called surrogate splits.
 
*MARS models can make predictions quickly.  The prediction function simply has to evaluate the MARS model formula.  Compare that to making a prediction with, say, a [[Support Vector Machine]], where every variable has to be multiplied by the corresponding element of every support vector. That can be a slow process if there are many variables and many support vectors.
 
== See also ==
 
* [[Linear regression]]
* [[Segmented regression]]
* [[Generalized linear model]]s (GLMs) can be incorporated into MARS models by applying a link function after the MARS model is built. Thus, for example, MARS models can incorporate [[logistic regression]] to predict probabilities.
* [[Nonlinear regression|Non-linear regression]] is used when the underlying form of the function is known and regression is used only to estimate the parameters of that function. MARS, on the other hand, estimates the functions themselves, albeit with severe constraints on the nature of the functions. (These constraints are necessary because discovering a model from the data is an [[inverse problem]] that is not [[Well-posed problem|well-posed]] without constraints on the model.)
* [[Recursive partitioning]] (commonly called CART). MARS can be seen as a generalization of recursive partitioning that allows the model to better handle numerical (i.e. non-categorical) data.
* [[Generalized additive model]]s. From the user's perspective GAMs are similar to MARS but (a) fit smooth [[Local regression|loess]] or polynomial [[Spline (mathematics)|splines]] instead of MARS basis functions, and (b) do not automatically model variable interactions. The fitting method used internally by GAMs is very different from that of MARS. For models that do not require automatic discovery of variable interactions GAMs often compete favorably with MARS.
* [[Rational function modeling]]
* [[Spline interpolation]]
* [[TSMARS]]. Time series MARS is the term used when MARS models are applied in a time series context. Typically in this setup the predictors are lagged time series values, resulting in autoregressive spline models. These models, and extensions that include moving-average spline models, are described in "Univariate Time Series Modelling and Forecasting using TSMARS: A study of threshold time series autoregressive, seasonal and moving average models using TSMARS", http://www.amazon.com/Univariate-Series-Modelling-Forecasting-TSMARS/dp/3838335953
 
== Software ==
 
=== Free ===
* Several [[R (programming language)|R]] packages fit MARS-type models:
** <code>earth</code> function in the <code>[http://cran.r-project.org/web/packages/earth/index.html earth]</code> package
** <code>mars</code> function in the <code>[http://cran.r-project.org/web/packages/mda/index.html mda]</code> package
** <code>polymars</code> function in the <code>[http://cran.r-project.org/web/packages/polspline/index.html polspline]</code> package.  Not Friedman's MARS.
* Matlab code:
** [http://www.cs.rtu.lv/jekabsons/regression.html ARESLab: Adaptive Regression Splines toolbox for Matlab]
* Python
** [http://orange.biolab.si/blog/2011/12/20/earth-multivariate-adaptive-regression-splines/ Earth - Multivariate adaptive regression splines]
** [https://github.com/jcrudy/py-earth/ py-earth]
 
=== Commercial ===
* [http://www.salfordsystems.com/mars.php MARS] from Salford Systems. Based on Friedman's implementation.
* [http://statsoft.com/products/data-mining-solutions/ STATISTICA Data Miner] from StatSoft
* [http://support.sas.com/documentation/cdl/en/statug/65328/HTML/default/viewer.htm#statug_adaptivereg_overview.htm ADAPTIVEREG] from SAS.
 
== References ==
{{reflist}}
 
== Further reading ==
* Hastie T., Tibshirani R., and Friedman J.H. (2009) [http://www-stat.stanford.edu/~tibs/ElemStatLearn ''The Elements of Statistical Learning''], 2nd edition. Springer, ISBN 978-0-387-84857-0 (has a section on MARS)
* Faraway J. (2005) [http://www.maths.bath.ac.uk/~jjf23 ''Extending the Linear Model with R''], CRC, ISBN 978-1-58488-424-8 (has an example using MARS with R)
* Heping Zhang and Burton H. Singer (2010) [http://www.amazon.com/Recursive-Partitioning-Applications-Springer-Statistics/dp/1441968237 ''Recursive Partitioning and Applications''], 2nd edition. Springer, ISBN 978-1-4419-6823-4 (has a chapter on MARS and discusses some tweaks to the algorithm)
* Denison D.G.T., Holmes C.C., Mallick B.K., and Smith A.F.M. (2004) [http://www.stat.tamu.edu/~bmallick/wileybook/book_code.html ''Bayesian Methods for Nonlinear Classification and Regression''], Wiley, ISBN 978-0-471-49036-4
* Berk R.A. (2008) ''Statistical learning from a regression perspective'', Springer, ISBN 978-0-387-77500-5
 
[[Category:Regression analysis]]
[[Category:Machine learning]]
