# Description of methods

## Pros and cons

**Several methods are available for determining the mathematical relationship between the environmental data and the response variable, i.e. for finding your predictive model. PREHAB has evaluated four of them.**

We have evaluated four statistical methods for predictive modelling: Generalized additive models (GAM), Multivariate adaptive regression splines (MARS), Random Forest (RF) and Maximum entropy (MaxEnt). MaxEnt was evaluated for predicting distribution only, the other three for both distribution and abundance.

## Short description of evaluated methods

**Generalized additive models (GAM)**

GAMs are semi-parametric extensions of generalized linear models that are useful for fitting non-linear relationships without prior assumptions about the shape of the response (Wood, 2006). Several methods are available for estimating the amount of smoothing applied to continuous environmental variables (predictors), so as to find the best fit to the data while keeping the models ecologically interpretable (Wood and Augustin, 2002). GAMs include an integrated model selection that can penalize predictor variables completely out of the model. The relative importance of each predictor can be determined from Chi-square values. When the model contains categorical variables, relative importance can instead be assessed manually by calculating the absolute difference in the Akaike information criterion (AIC) between the full model and a model without each predictor in turn. If the assumptions of equal variances and independence are violated, GAMs can be extended to generalized additive mixed models (Zuur et al., 2009). GAMs are available in R in the packages “gam” and “mgcv”.
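The manual AIC comparison described above can be sketched in a few lines. This is only an illustration of the bookkeeping, not a GAM fit: the AIC values and predictor names below are hypothetical, and in practice they would come from the fitted full and reduced models (for a GAM, the parameter count is usually the effective degrees of freedom).

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: AIC = 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def importance_by_aic(full_aic, reduced_aics):
    """Rank predictors by |AIC(full) - AIC(model without that predictor)|.

    reduced_aics maps each predictor name to the AIC of the model
    refitted without it. Larger differences suggest a more important predictor.
    """
    deltas = {name: abs(full_aic - value) for name, value in reduced_aics.items()}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Hypothetical values, for illustration only (not PREHAB results).
full_model_aic = 410.2
aic_without = {"depth": 468.9, "substrate": 431.5, "wave_exposure": 412.0}

ranking = importance_by_aic(full_model_aic, aic_without)
for name, delta in ranking:
    print(f"{name}: delta AIC = {delta:.1f}")
```

Dropping a predictor that matters makes the reduced model fit clearly worse, so its AIC moves far from that of the full model.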

**Multivariate adaptive regression splines (MARS)**

MARS is a flexible, data-driven, non-parametric method for fitting regression models that incorporates the benefits of the recursive partitioning approach while producing a continuous model (Friedman, 1991). Models can be built using the generalized linear model approach. The relative importance of environmental variables is determined from the decrease in the generalized cross-validation (GCV) criterion. MARS performs automatic variable selection, although there is usually some arbitrariness in the selection, especially when the independent variables are collinear. MARS is available in R in the packages “earth”, “mda” and “polspline”.
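To illustrate how MARS can partition a predictor and still give a continuous model, the sketch below evaluates a model built from hinge functions of the form max(0, x - knot) and max(0, knot - x), and computes the GCV criterion in the form given by Friedman (1991). The coefficients, the knot and the numbers passed to `gcv` are hypothetical; a real fit (for example with the “earth” package) selects them from the data.

```python
def hinge(x, knot, direction=+1):
    """Hinge basis function: max(0, x - knot) if direction is +1,
    max(0, knot - x) if direction is -1."""
    return max(0.0, direction * (x - knot))


def mars_predict(x, intercept, terms):
    """Evaluate an additive model of hinge terms.

    terms is a list of (coefficient, knot, direction) tuples.
    """
    return intercept + sum(c * hinge(x, k, d) for c, k, d in terms)


def gcv(rss, n_obs, effective_params):
    """Generalized cross-validation: (RSS / N) / (1 - C / N)^2,
    where C is the effective number of parameters."""
    return (rss / n_obs) / (1.0 - effective_params / n_obs) ** 2


# Hypothetical model with a single knot at x = 5:
# the slope is 0.5 below the knot and 1.5 above it.
terms = [(1.5, 5.0, +1), (-0.5, 5.0, -1)]
for x in (3.0, 5.0, 7.0):
    print(x, mars_predict(x, intercept=2.0, terms=terms))
```

Both hinge terms are zero at the knot, so the two linear pieces meet there and the fitted curve has no jump, unlike the step function produced by a regression tree.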

**Random forest or random forests (RF)**

RF is an ensemble method in which a large number of decision trees (typically 500-1000) are built and responses are predicted by majority vote (classification) across all trees (Breiman, 2001; Cutler et al., 2007). Compared with traditional classification trees (Breiman et al., 1993), its main advantages are that it produces more accurate predictions and is easier to use because it requires no pruning. Previous studies have shown that its performance compares favourably with other novel prediction methods (e.g. Prasad et al., 2006; Knudby et al., 2010; Collin et al., 2011). However, RF has been observed to overfit on some datasets with noisy classification tasks. It can handle thousands of input environmental variables without variable deletion and gives estimates of which predictors are important in the classification. RF also has methods for balancing error in data sets with unbalanced class populations. RF is biased in favour of categorical environmental variables with more levels, so the predictor importance scores are not reliable for this type of data. Generated forests can be saved for future use on other data. RF is available in R in the packages “randomForest”, “party” and “obliqueRF”.
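The ensemble idea (bootstrap samples, a random choice of predictor, and a majority vote) can be sketched in plain Python. This is a deliberately simplified illustration, not the algorithm implemented in “randomForest”: each "tree" here is a one-split stump on a single randomly chosen predictor, whereas a real random forest grows full trees and draws a random subset of predictors at every split. The data set and variable names are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Find the single threshold on one feature with the fewest
    misclassifications. Returns (feature, threshold, left_label, right_label)."""
    best = None
    for t in sorted(set(r[feature] for r in rows)):
        left = [y for r, y in zip(rows, labels) if r[feature] <= t]
        right = [y for r, y in zip(rows, labels) if r[feature] > t]
        if not left or not right:
            continue
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    if best is None:
        # No valid split (feature constant in this sample): predict the majority class.
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), majority, majority)
    return (feature, best[1], best[2], best[3])


def fit_forest(rows, labels, n_trees=500, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feature = rng.randrange(n_features)
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Classify one row by majority vote over all stumps."""
    votes = Counter(l if row[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]


# Hypothetical data: (depth in m, wave exposure index) -> species presence.
rows = [(1.0, 0.1), (2.0, 0.2), (3.0, 0.3), (4.0, 0.4),
        (8.0, 0.6), (9.0, 0.7), (10.0, 0.8), (12.0, 0.9)]
labels = ["present"] * 4 + ["absent"] * 4

forest = fit_forest(rows, labels)
print(predict(forest, (0.5, 0.05)), predict(forest, (15.0, 0.95)))
```

Individual stumps fitted to an unlucky bootstrap sample may vote wrongly, but the vote across many trees averages such errors out, which is the reason no single tree needs pruning.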

**Maximum entropy (MaxEnt)**

MaxEnt is a general-purpose method for making predictions or inferences from incomplete information (Phillips, 2006). It estimates a target’s probability distribution by finding the distribution of maximum entropy (i.e. the most spread out, or closest to uniform) under the constraint that the expected value of each environmental variable under this estimated distribution matches its empirical mean. This is equivalent to finding the maximum-likelihood distribution of a species. MaxEnt offers two main advantages over other species distribution modelling methods. Firstly, it is a generative approach rather than a discriminative one, which can be an inherent advantage when the amount of training data is limited (Elith et al., 2006; Hernandez et al., 2006; Wisz et al., 2008; Peterson et al., 2007). Secondly, it can work with presence-only data, together with environmental information for the whole study area (Phillips, 2006). During model calibration, MaxEnt takes a sample of 10,000 pixels from the study region in order to characterize the “background” of environments available to the species; hence it is termed a presence–background technique (Heumann et al., 2011). To reduce the tendency to produce overfitted models, MaxEnt employs regularization, which penalizes strong weights given to features.
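The mean-matching constraint can be made concrete with a small sketch. It assumes a single environmental variable scaled to the range 0-1, eleven hypothetical background pixels instead of 10,000, and no regularization, so it is far simpler than the MaxEnt software. The maximum-entropy solution under a mean constraint has the form p(x) proportional to exp(lambda * f(x)), and lambda is found here by simple gradient ascent on the likelihood.

```python
import math


def gibbs_distribution(background, lam):
    """Distribution over background pixels with p(x) proportional to exp(lam * f(x))."""
    weights = [math.exp(lam * f) for f in background]
    total = sum(weights)
    return [w / total for w in weights]


def expected_value(probs, background):
    """Expected value of the environmental variable under the distribution."""
    return sum(p * f for p, f in zip(probs, background))


def fit_lambda(background, presence, steps=2000, learning_rate=1.0):
    """Adjust lam until the expected value under the model matches the
    empirical mean at the presence sites. The update direction is the
    gradient of the log-likelihood: (empirical mean - model mean)."""
    target = sum(presence) / len(presence)
    lam = 0.0  # lam = 0 gives the uniform (maximum-entropy) distribution
    for _ in range(steps):
        probs = gibbs_distribution(background, lam)
        lam += learning_rate * (target - expected_value(probs, background))
    return lam


# Hypothetical values of one scaled environmental variable.
background = [i / 10 for i in range(11)]   # pixels available to the species
presence = [0.6, 0.7, 0.8, 0.9]            # pixels where it was recorded

lam = fit_lambda(background, presence)
probs = gibbs_distribution(background, lam)
print(round(lam, 3), round(expected_value(probs, background), 3))
```

The fitted distribution stays as close to uniform over the background as the constraint allows, while shifting probability towards the environmental conditions recorded at the presence sites.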

**Reference** to sources/articles describing additional methods?

Author: Martynas

All of the selected and evaluated modelling methods have strengths and weaknesses that are important to be aware of.