# Sample size

**1. Sample size within a cell (n)**

**2. Number of independent samples (N)**

**Conclusions from PREHAB clearly show that the design of sampling programmes has a very strong influence on the success of models. One important aspect of the sampling design is the sample size. This involves defining an appropriate spatial and temporal extent and resolution for the predictive mapping. Once these have been defined, sampling needs to be designed so that it achieves the desired precision within the available resources.**

Regarding sample size we can distinguish between two types:

- Number of samples within a grid cell (n)
- Number of independent grid cells (sites, stations, locations, etc.) that are used in the model (N)

If the whole grid cell is completely surveyed for the occurrence or abundance of a species, there is of course no sampling error at individual cells. If, however, occurrence or abundance is estimated from a proportion of the area, we can always expect a sampling error, which is dependent on how common the species is within the area.

**A. SAMPLING (n) FOR DISTRIBUTION**

If the aim of sampling is to estimate and model the **occurrence** of a species, the important problem is that of false absences, i.e. concluding that a species is not present when in fact it is. Assuming that we use an appropriate sampling method, the probability of committing such errors is larger if n is small, so that only a small proportion of the area is sampled, and if the species is rare (Fig. 1). As a rule of thumb, it appears that sampling approximately 5 % of the area is sufficient for an 80 % probability of finding a species (i.e. a 0.2 probability of error), if the frequency of the species in the area is larger than 0.25. To achieve a similar degree of success for species found at a frequency of 0.1, approximately 10 samples are needed.

**Figure 1. Small sample size increases the probability of false absences (concluding that a present species is not present), especially if the species is rare.** Figure shows the probability of false absences in 5 x 5 m grid cells using samples of 0.5 x 0.5 m as a function of the number of samples (incidentally equal to the per cent area sampled). “p” is the probability of finding the species in one sample. Results are averages from 1000 repeated simulations of sampling in a 5 x 5 m area divided into 100 quadrats of 0.5 x 0.5 m.
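The relationship in Figure 1 can be approximated with a short calculation. The sketch below is not the PREHAB simulation: it assumes that the n samples are independent and that each one finds the species with the same probability p, so the probability of a false absence is simply (1 - p)^n. The function names are illustrative only.

```python
def false_absence_probability(p: float, n: int) -> float:
    """Probability that all n samples miss a species that is present.

    p: probability of finding the species in a single sample.
    Assumption: samples are independent with the same p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return (1.0 - p) ** n


def samples_needed(p: float, max_error: float) -> int:
    """Smallest n whose false-absence probability is at most max_error."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 < max_error <= 1.0:
        raise ValueError("max_error must be in (0, 1]")
    n = 0
    while false_absence_probability(p, n) > max_error:
        n += 1
    return n


if __name__ == "__main__":
    # Tabulate the error for a few detection probabilities and sample sizes.
    for p in (0.05, 0.1, 0.25, 0.5):
        row = ", ".join(
            f"n={n}: {false_absence_probability(p, n):.2f}" for n in (1, 5, 10, 20)
        )
        print(f"p={p}: {row}")
```

Under this simplification, p = 0.25 and n = 5 give a false-absence probability of about 0.24, close to the 0.2 quoted in the rule of thumb. Values read from the simulated curves may differ somewhat, because the simulation samples quadrats within a finite grid cell.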

**B. SAMPLING (n) FOR ABUNDANCE**

If the aim is to estimate **abundance**, i.e. the average percent cover of a species, the deviation between the true mean and the estimated mean (“the standard error”) is a measure of error. Figure 2a (below) illustrates a "worst-case scenario" of what would happen if we took an increasing number of samples at different mean percent cover, assuming that each individual sample can only have either 0 or 100% cover. Again, the error depends on the sample size (n) and the abundance of the species. Maximum uncertainty occurs at an average cover of 50%, where five samples result in an average error of 15-20% cover. At abundances of 10 and 25%, the error is 11 and 16% cover, respectively. Thus, on theoretical grounds we can conclude that five samples (or 5% of the area) should in most cases be sufficient to achieve an error of 15% cover, and that 10 samples should suffice for an error of 10% cover.

**Figure 2a. Estimation error depends on sample size and the abundance of the species - simulated data.** Figure shows the average deviation (standard error) for estimates of percent cover in 5 x 5 m grid cells using samples of 0.5 x 0.5 m. The lines show different average cover as a function of the number of samples (incidentally equal to the per cent area sampled). Results are averages from 1000 repeated **simulations** of sampling in a 5 x 5 m area divided into 100 quadrats of 0.5 x 0.5 m.
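The worst-case scenario can also be worked out without simulation. The sketch below rests on two assumptions that are not stated in the text: that "average deviation" means the mean absolute difference between the estimated and the true cover, and that samples are drawn independently, so that the number of fully covered samples follows a binomial distribution. The function name is illustrative.

```python
from math import comb


def mean_abs_deviation(mean_cover: float, n: int) -> float:
    """Expected absolute deviation of the estimated cover from the true mean.

    Worst case: every sample has either 0 or 100 % cover, so the number of
    covered samples is binomial(n, mean_cover).
    mean_cover and the returned value are proportions between 0 and 1.
    """
    if not 0.0 <= mean_cover <= 1.0:
        raise ValueError("mean_cover must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    total = 0.0
    for k in range(n + 1):
        # Probability that exactly k of the n samples are fully covered.
        prob = comb(n, k) * mean_cover**k * (1.0 - mean_cover) ** (n - k)
        total += prob * abs(k / n - mean_cover)
    return total


if __name__ == "__main__":
    for cover in (0.10, 0.25, 0.50):
        print(cover, [round(mean_abs_deviation(cover, n), 3) for n in (5, 10, 20)])
```

With five samples this gives roughly 0.12, 0.16 and 0.19 at 10, 25 and 50 % cover, in line with the values quoted above for Figure 2a.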

**A Swedish study example**

In practice, however, the precision is likely to be better than this. One practical example is a study of algal cover in 10x10 m quadrats sampled with 0.25x0.25 m photos on offshore banks in Sweden (Fig. 2b). Modelling the precision of the mean from observed estimates of spatial variability reveals that all types of vegetation (means ranging from 2 to 23%) have an average deviation of 5 % cover or less at a sample size of 5. As expected, this error is considerably smaller than in the “worst-case scenario” presented in Figure 2a above.

**Figure 2b. Estimation error depends on sample size and the abundance of the species - real data.** Average deviation (standard error) for estimates of vegetation coverage in samples in a Swedish study containing 189 plots, each sampled by 5 quadrats (0.25x0.25 m).

The requirements in terms of the number of independent sites to be sampled (N) depend on the type of species, how common it is and in which environment it lives. Therefore, it is difficult to give simple recommendations on the number of sample sites needed to predict either occurrence or abundance. In general, a larger number of independent sites will produce better models. This is not the same as saying that more data will always produce better models - samples must also be representative of the area to be mapped, see Figure 3 below.

**Figure 3. Unrepresentative sampling (N) design may lead to severely biased models.** The figure illustrates an example where a large number of samples are taken within one particular area (circled), which may not be representative of the rest of the area. If we treat the samples from this area as independent, and combine them in a model with samples randomly distributed over the whole area, there is a possibility that the resulting model will be strongly biased and influenced by the particular conditions in the extensively sampled area. If models are to be based on existing data with such characteristics, it is important to evaluate the potential effects of sampling bias and to use biased samples sensibly, possibly even by sacrificing some data. (See also "Distribution of sampling efforts")
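One way of "sacrificing some data" is to thin the samples spatially before modelling. The sketch below illustrates a generic grid-thinning approach, which is not necessarily what was done in PREHAB: the area is divided into square blocks and at most a fixed number of randomly chosen samples is kept per block. The function name, the `(x, y, value)` layout and the block size are all hypothetical.

```python
import random
from collections import defaultdict


def thin_samples(samples, block_size, max_per_block=1, seed=0):
    """Keep at most max_per_block samples in each square block.

    samples: list of (x, y, value) tuples in map units (hypothetical layout).
    block_size: side length of the thinning blocks, in the same units.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    rng = random.Random(seed)  # fixed seed so the thinning is repeatable
    blocks = defaultdict(list)
    for sample in samples:
        x, y = sample[0], sample[1]
        key = (int(x // block_size), int(y // block_size))
        blocks[key].append(sample)
    kept = []
    for key in sorted(blocks):
        members = blocks[key]
        if len(members) > max_per_block:
            # Densely sampled block: keep a random subset only.
            members = rng.sample(members, max_per_block)
        kept.extend(members)
    return kept


if __name__ == "__main__":
    # 50 samples clustered in one block plus 5 samples spread along a transect.
    clustered = [(1.0 + 0.01 * i, 1.0, i) for i in range(50)]
    spread = [(100.0 * j + 50.0, 500.0, -j) for j in range(1, 6)]
    print(len(thin_samples(clustered + spread, block_size=100.0)))
```

Comparing models fitted to the full and the thinned data is one simple way to evaluate how much the densely sampled area influences the result.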

**A. SAMPLING (N) FOR DISTRIBUTION**

If the aim of sampling is to estimate and model the

**Conclusion from PREHAB on sampling for models predicting biodiversity distribution**

In PREHAB, the distribution of benthic biodiversity was modelled with existing data from four case-study areas, using sample sizes (N) ranging from 125 (Vinga) to almost 1 500 (Östergötland). Samples were collected using a range of video approaches and diving.

Two main conclusions emerge from this Baltic-wide summary:

- the distribution of vegetation, invertebrates and fish could be usefully modelled in all areas, at all sample sizes and with data from all sampling methods evaluated in PREHAB

- the consistency of model performance is improved at larger sample sizes. At small sample sizes there are some good models but also some which are not performing well. At large sample sizes all models performed well, see figure 4.

**Figure 4. Larger sample sizes (N) provide better model performance and thus more reliable maps of predicted biodiversity.** Figure shows model performance in relation to sample size (N) for predictions of vegetation distribution in PREHAB. AUC is a measure of model performance. A general rule of thumb is that 0.8 (dashed line) is a threshold for acceptable model performance.
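For readers unfamiliar with AUC, the sketch below shows one standard way of computing it for presence/absence predictions: AUC equals the proportion of presence-absence pairs in which the presence site receives the higher predicted score, with ties counting as one half. The function name and the example data are illustrative and are not taken from PREHAB.

```python
def auc(observed, predicted):
    """Area under the ROC curve computed by pairwise comparison.

    observed: 1 for presence, 0 for absence.
    predicted: model scores, higher meaning presence is more likely.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    presences = [s for o, s in zip(observed, predicted) if o == 1]
    absences = [s for o, s in zip(observed, predicted) if o == 0]
    if not presences or not absences:
        raise ValueError("need at least one presence and one absence")
    wins = 0.0
    for p in presences:
        for a in absences:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(presences) * len(absences))


if __name__ == "__main__":
    observed = [1, 1, 1, 0, 0, 0]
    predicted = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
    print(round(auc(observed, predicted), 3))
```

A value of 0.5 corresponds to a model that ranks sites no better than chance, and 1.0 to a model that ranks every presence above every absence.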

**B. SAMPLING (N) FOR ABUNDANCE**

If the aim of sampling is to estimate and model the *abundance* of a species or habitat in a large area, the number of independent samples can be expected to influence the precision of models greatly. Sampling requirements for models of abundance depend on the type of species, how common it is and in which environment it lives.

Firstly, in order to design a sampling programme for such models, it may be useful to consider the overall precision of the sampling programme based on samples alone. Using standard methods, the number of samples necessary to achieve a certain level of precision depends strongly on how abundant the species is (Figure 5). For example, at a mean cover of 30%, approximately 100 samples are needed in order to get an estimate that deviates by 20% or less. At 3% cover, 1000 samples are needed. To get estimates that are 5% or less off the true mean, we need 2000 and 20 000 samples, respectively!

These rules of thumb can be used as a baseline against which the overall precision achieved with models can be assessed. In the (likely) event that useful models can be developed, these will only improve the precision of the estimated overall abundance of the species and provide more detailed information on its abundance in specific locations.

**Figure 5. The number of samples (N) necessary to achieve a certain level of precision depends strongly on how abundant the species is.** Figure shows the number of independent samples (N) needed to achieve a desired precision of 5 or 20% deviation from the overall mean. Estimates are based on data collected by video in different parts of Swedish coastal areas.
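The text does not specify which "standard methods" were used for Figure 5. As an illustration, the sketch below uses one common textbook formula, N = (z · CV / d)², where CV is the coefficient of variation between samples (standard deviation divided by the mean), d is the accepted relative deviation from the mean and z is the normal quantile for the chosen confidence level. The function name and the CV value in the example are hypothetical.

```python
from math import ceil


def required_samples(cv: float, relative_error: float, z: float = 1.96) -> int:
    """Number of independent samples for a given relative precision.

    cv: coefficient of variation between samples (sd / mean).
    relative_error: accepted deviation as a fraction of the mean, e.g. 0.2.
    z: normal quantile; 1.96 corresponds to 95 % confidence.
    Assumption: normal approximation, N = (z * cv / relative_error) ** 2.
    """
    if cv < 0:
        raise ValueError("cv must be non-negative")
    if relative_error <= 0:
        raise ValueError("relative_error must be positive")
    return ceil((z * cv / relative_error) ** 2)


if __name__ == "__main__":
    cv = 2.0  # hypothetical value; rare, patchy species tend to have a high CV
    for d in (0.20, 0.05):
        print(f"relative error {d:.0%}: N = {required_samples(cv, d)}")
```

Because N scales with 1/d², tightening the target from 20 % to 5 % multiplies the required number of samples by about 16, which is the same order of magnitude as the jump from 100 to 2000 samples described above. Rarer species usually have a higher CV, which is why they require more samples for the same relative precision.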

**Conclusion from PREHAB on sampling for models predicting biodiversity abundance**

In PREHAB, the abundance of invertebrates and vegetation was modelled with existing data from four case-study areas, using sample sizes (N) ranging from 125 (Vinga) to almost 450 (Lithuania). As with models of distribution, the performance of models predicting abundance was more variable at smaller sample sizes, but even at sample sizes around 100 some models performed relatively well, i.e. RMSE values were 10-20% of the range.

**Figure 6. At small sample sizes (N) model performance is more variable and thus predictions are less reliable.** Figure shows model performance in relation to sample size (N) for predicted abundance of vegetation and invertebrates in PREHAB. NRMSE is a measure of model performance and represents the average deviation (RMSE) standardised by the observed range of the variable. Models performing well have small deviations.
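The NRMSE described in the caption can be computed as follows. This is a minimal sketch that assumes the observed range is the maximum minus the minimum of the observed values; the function name and the example data are illustrative.

```python
from math import sqrt


def nrmse(observed, predicted):
    """Root mean squared error divided by the observed range (max - min)."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    value_range = max(observed) - min(observed)
    if value_range == 0:
        raise ValueError("observed values have zero range")
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return sqrt(mse) / value_range


if __name__ == "__main__":
    observed = [0, 20, 40, 60, 80]    # e.g. percent cover at five sites
    predicted = [10, 10, 50, 50, 90]  # every prediction is off by 10 units
    print(nrmse(observed, predicted))
```

In this example the RMSE is 10 and the observed range is 80, so the NRMSE is 0.125, i.e. the average deviation is 12.5 % of the range.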