PREHAB quality procedures for modelling

The modelling in PREHAB was done by different people using data from several case-study areas, but with a common set of routines. These routines involved a) initial quality control of predictor and response data prior to modelling, using procedures described in Zuur, A.F., Ieno, E.N. & Elphick, C.S. (2010) A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution, 1, 3-14; b) routines for splitting data into training and test data; and c) optimisation of models.

Quality control using routines in R

  1. Correlation matrices and box-plots were used to explore outliers and correlations among predictors graphically, using the routine scatterplot.matrix from library(car). Seriously skewed data were transformed to improve normality, mainly using log-transformations. Identification of outliers prompted additional checks of the data and transformation to improve distributional properties.
  2. Tests for collinearity among predictors were done using the routine corvif from library(AED). Variables with VIF values above 3 were removed.
  3. Fitted models were tested for spatial autocorrelation of residuals using the function spline.correlog from library(ncf).
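
The collinearity check in step 2 can be illustrated with a small sketch. The routines named above are R functions; the Python below is only an illustration of the idea, not the corvif implementation. It assumes that the VIF of each predictor is read off the diagonal of the inverse correlation matrix, and that variables are dropped one at a time (highest VIF first) until all remaining values are at or below 3. The stepwise order, the function names and the example predictors (depth, depth_sonar, slope) are assumptions made for this illustration.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long numeric columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centred = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centred]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centred[i], centred[j])) / (norms[i] * norms[j])
             for j in range(k)] for i in range(k)]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vif(predictors):
    """VIF per predictor, taken from the diagonal of the inverse correlation matrix."""
    names = list(predictors)
    inv = invert(correlation_matrix([predictors[n] for n in names]))
    return {name: inv[i][i] for i, name in enumerate(names)}


def drop_collinear(predictors, threshold=3.0):
    """Assumed stepwise rule: drop the highest-VIF predictor until all are <= threshold."""
    kept = dict(predictors)
    while len(kept) > 1:
        scores = vif(kept)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        del kept[worst]
    return kept


# Hypothetical predictors: two near-duplicate depth measurements and a slope.
example = {
    "depth": [1, 2, 3, 4, 5, 6, 7, 8],
    "depth_sonar": [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.0, 8.1],
    "slope": [3, 1, 4, 1, 5, 9, 2, 6],
}
print(sorted(drop_collinear(example)))
```

In this made-up example one of the two depth variables is removed, because each is almost fully explained by the other, while slope is retained.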

External validation - routines for splitting data

The data were split into (1) a training set for fitting the model and (2) a test set for external validation. 70% of the total data set was assigned as training data, while the remaining 30% was used for testing models. Both categorical and quantitative response variables were split so as to maximise homogeneity in frequency between the two sets, i.e. the training and test data should have similar frequency distributions of the response.
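
A minimal Python sketch of such a split is given below. It is not the routine used in PREHAB. It assumes that "homogeneous frequencies" can be approximated by sampling 70% within each class of a categorical response, and that a quantitative response is first cut into equal-count bins that are then treated as classes. The function names, the number of bins and the fixed random seed are all assumptions made for the illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx) with similar class frequencies in both sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_train = round(train_fraction * len(idx))
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return sorted(train), sorted(test)


def quantile_bins(values, n_bins=5):
    """Assign each quantitative value to one of n_bins equal-count bins (by rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


# Categorical response: hypothetical presence/absence records.
labels = ["present"] * 30 + ["absent"] * 70
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))

# Quantitative response: bin first, then split on the bin labels.
cover = [0.1 * i for i in range(50)]
train_q, test_q = stratified_split(quantile_bins(cover))
print(len(train_q), len(test_q))
```

Splitting within classes (or bins) rather than over the whole data set keeps rare classes represented in both the training and the test data.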

Optimisation of models

In general, the different techniques were applied using “default” settings for their respective R routines. However, initial modelling showed that each method needed an individual definition with respect to (1) model selection, (2) inclusion of interactions and (3) measures of variable importance. The numbered items under each technique below refer to these three points.

GAM, library(mgcv)

  1. Automatic model selection using gam() from library(mgcv).
  2. No interaction terms.
  3. The chi-square value was used as the measure of variable importance for each predictor.


Random forest, library(randomForest)

  1. No model selection needed.
  2. Interactions are an inherent part of the technique.
  3. Variable importance was obtained using Varimp().


MARS, library(earth)

  1. The MARS model optimises itself by selecting the most appropriate predictors as part of model fitting. It also uses generalised cross-validation (GCV) to penalise overly large models, which in effect plays the same role as AIC.
  2. Interactions were allowed up to level 2.
  3. Variable importance was obtained using evimp().
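
How GCV penalises model size can be sketched as follows. This is an illustration in Python, not code from the earth package. It assumes the GCV form described in the earth documentation, GCV = RSS / (N * (1 - C/N)^2) with an effective number of parameters C = nterms + penalty * (nterms - 1) / 2, and a penalty of 3 as an assumed value for models with interactions. The function name and the example numbers are hypothetical.

```python
def gcv(rss, n_obs, n_terms, penalty=3.0):
    """Generalised cross-validation score: RSS inflated by an effective-parameter penalty.

    Assumed form: RSS / (N * (1 - C / N) ** 2),
    with C = n_terms + penalty * (n_terms - 1) / 2.
    """
    n_params = n_terms + penalty * (n_terms - 1) / 2.0
    if n_params >= n_obs:
        # The model uses up all degrees of freedom; treat it as infinitely bad.
        return float("inf")
    return rss / (n_obs * (1.0 - n_params / n_obs) ** 2)


# Hypothetical comparison on 100 observations: the larger model fits better
# (lower RSS) but can still score worse once the penalty is applied.
small_model = gcv(rss=100.0, n_obs=100, n_terms=1)
large_model = gcv(rss=90.0, n_obs=100, n_terms=5)
print(small_model, large_model)
```

Like AIC, the score trades goodness of fit against model size, so adding terms only pays off when the reduction in residual sum of squares outweighs the penalty.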


Maxent (using Maxent software)

  1. Regularization was used for variables of importance; otherwise, variables were removed.
  2. No interaction terms.
  3. Variable importance is part of the automatic output.


© University of Gothenburg, Sweden Box 100, S-405 30 Gothenburg