# PREHAB quality procedures for modelling

### Quality control using routines in R

### External validation - routines for splitting data

### Optimisation of models


The modelling in PREHAB was done by different people using data from several case-study areas, but with a common set of routines. These routines involved (a) initial quality control of predictor and response data prior to modelling, using procedures described in Zuur, A.F., Ieno, E.N. & Elphick, C.S. (2010) A protocol for data exploration to avoid common statistical problems. *Methods in Ecology and Evolution*, 1, 3-14; (b) routines for splitting data into training and test data; and (c) optimisation of models.

- Correlation matrices and box-plots were used to explore outliers and correlations among predictors graphically, using the routine scatterplot.matrix from library(car). Seriously skewed data were transformed to improve normality, mainly using log-transformations. Identification of outliers prompted additional checks of the data and transformation to improve distributional properties.
- Tests for collinearity among predictors were done using the routine corvif from library(AED). Variables with VIF values above 3 were removed.
- Fitted models were tested for spatial autocorrelation of residuals using the function spline.correlog from library(ncf).
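
The collinearity check can be illustrated with a small sketch. It is written in Python purely for illustration and is not the corvif routine itself: it uses the textbook definition VIF = 1 / (1 - R²), where R² comes from regressing each predictor on all the others. The rule of dropping the predictor with the highest VIF and recomputing until all values are at most 3 is an assumption (the text only states that variables with VIF above 3 were removed), and the predictor names and values are hypothetical.

```python
def _solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(y, xs):
    """R^2 of an ordinary least-squares regression of y on the columns in xs, with intercept."""
    n = len(y)
    cols = [[1.0] * n] + xs
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    beta = _solve(xtx, xty)
    fitted = [sum(beta[k] * cols[k][i] for k in range(len(cols))) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(data):
    """data maps predictor name -> list of values; returns predictor name -> VIF."""
    out = {}
    for name, y in data.items():
        others = [v for k, v in data.items() if k != name]
        out[name] = 1.0 / (1.0 - r_squared(y, others))
    return out


def drop_high_vif(data, threshold=3.0):
    """Assumed rule: repeatedly drop the predictor with the largest VIF above the threshold."""
    data = dict(data)
    while len(data) > 1:
        values = vif(data)
        worst = max(values, key=values.get)
        if values[worst] <= threshold:
            break
        del data[worst]
    return data


# Hypothetical predictors: "light" is almost a linear function of "depth".
predictors = {
    "depth": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "slope": [2.0, 1.0, 4.0, 3.0, 1.0, 4.0, 2.0, 3.0],
    "light": [8.1, 6.9, 6.2, 4.8, 4.1, 2.9, 2.2, 0.9],
}
kept = drop_high_vif(predictors)
print(sorted(kept))
```

In this made-up example one of the two nearly collinear predictors is removed, while slope, which is only weakly related to the others, is retained.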

The data were split into (1) a training set for fitting the model and (2) a test set for external validation. 70% of the total data set was assigned as training data, while the remaining 30% was used for testing models. Both categorical and quantitative response variables were split so as to maximise homogeneity in frequency between the sets, i.e. homogeneous frequencies in training and test data.
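
A minimal sketch of such a split is given here in Python, for illustration only; it is not the routine used in PREHAB. It assumes that "homogeneous frequencies" can be achieved by stratified sampling: each class of a categorical response is split 70/30 separately. For a quantitative response it assumes the values are first grouped into equal-count classes, which is one possible reading of the criterion. The function names, the number of classes and the example data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.7, seed=0):
    """Return (train_idx, test_idx), splitting every class at about train_frac."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)  # random assignment within the class
        n_train = round(train_frac * len(idx))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return sorted(train_idx), sorted(test_idx)


def quantile_bins(values, n_bins=4):
    """Assumed approach for quantitative data: assign values to n_bins equal-count classes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


# Hypothetical presence/absence response: 60 absences and 40 presences.
presence = [0] * 60 + [1] * 40
train, test = stratified_split(presence)
print(len(train), len(test))
print(sum(presence[i] for i in train) / len(train),
      sum(presence[i] for i in test) / len(test))
```

For a quantitative response, the classes returned by `quantile_bins` can be passed to `stratified_split` in place of the category labels, so that low, intermediate and high values are represented in similar proportions in both sets.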

In general the different techniques were applied using “default” settings for their respective R-routines. However, initial modelling showed that each method needed an individual definition with respect to (1) model selection, (2) inclusion of interactions and (3) measures of variable importance.

**GAM, library (mgcv)**

- Model selection: automatic, using gam() from library(mgcv).
- Interactions: no interaction terms.
- Variable importance: the chi-square value was used as the measure of importance for each predictor.

**Random forest, library (randomForest)**

- Model selection: none needed.
- Interactions: part of the technique itself.
- Variable importance: Varimp()

**MARS, library (earth)**

- Model selection: the MARS model optimises itself by selecting the most appropriate predictors as part of model fitting. It also uses generalised cross-validation (GCV) to penalise overly large models, which in effect plays the same role as AIC.
- Interactions: used up to level 2.
- Variable importance: evimp()
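
The GCV penalty mentioned above can be made concrete with a short sketch, again in Python and for illustration only. It uses the usual form of the criterion, GCV = RSS / (N (1 - C/N)²), where RSS is the residual sum of squares, N the number of observations and C the effective number of parameters of the model. How C is derived from the number of model terms is left out here, and the numbers in the example are hypothetical.

```python
def gcv(rss, n_obs, effective_params):
    """Generalised cross-validation score: RSS / (N * (1 - C / N)^2)."""
    if effective_params >= n_obs:
        raise ValueError("effective_params must be smaller than n_obs")
    return rss / (n_obs * (1.0 - effective_params / n_obs) ** 2)


# Hypothetical comparison on 100 observations: the larger model fits slightly
# better (lower RSS) but pays for its additional effective parameters.
small_model = gcv(rss=52.0, n_obs=100, effective_params=5)
large_model = gcv(rss=50.0, n_obs=100, effective_params=15)
print(small_model < large_model)
```

As with AIC, a small gain in fit does not outweigh a large increase in model size, so the smaller model is preferred in this made-up comparison.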

**Maxent (using Maxent software)**

- Model selection: regularization was used for variables of importance; otherwise, variables were removed.
- Interactions: no interaction terms.
- Variable importance: provided automatically as part of the output.