Cambridge University Press
978-1-108-48038-3 — Biostatistics with R
Jan Lepš, Petr Šmilauer
Frontmatter
More Information
www.cambridge.org
© in this web service Cambridge University Press
Biostatistics with R
An Introductory Guide for Field Biologists
Biostatistics with R provides a straightforward introduction to how to analyse data from the wide field of biological research, including nature protection and global change monitoring. The book is centred around traditional statistical approaches, focusing on those prevailing in research publications. The authors cover t tests, ANOVA and regression models, but also the advanced methods of generalised linear models and classification and regression trees. Chapters usually start with several useful case examples, describing the structure of typical datasets and proposing research-related questions. All chapters are supplemented by example datasets and thoroughly explained, step-by-step R code demonstrating the analytical procedures and interpretation of results. The authors also provide examples of how to appropriately describe statistical procedures and results of analyses in research papers. This accessible textbook will serve a broad audience of interested readers, from students, researchers or professionals looking to improve their everyday statistical practice, to lecturers of introductory undergraduate courses. Additional resources are provided at www.cambridge.org/biostatistics.
Jan Lepš is Professor of Ecology in the Department of Botany, Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic, and senior researcher in the Biology Centre of the Czech Academy of Sciences in České Budějovice. His main research interests include plant functional ecology, particularly the mechanisms of species coexistence and stability, and ecological data analysis. He has taught many ecological and statistical courses and supervised more than 80 student theses, from undergraduate to PhD.
Petr Šmilauer is Associate Professor of Ecology in the Department of Ecosystem Biology, Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic. His main research interests are multivariate statistical analysis, modern regression methods and the role of arbuscular mycorrhizal symbiosis in the functioning of plant communities. He is co-author of the multivariate analysis software Canoco 5, CANOCO for Windows 4.5 and TWINSPAN for Windows.
'We will never have a textbook of statistics for biologists that satisfies everybody. However, this book may come closest. It is based on many years of field research and the teaching of statistical methods by both authors. All useful classic and advanced statistical concepts and methods are explained and illustrated with data examples and R programming procedures. Besides traditional topics that are covered in the premier textbooks of biometry/biostatistics (e.g. R. R. Sokal & F. J. Rohlf, J. H. Zar), two extensive chapters on multivariate methods in classification and ordination add to the strength of this book. The text was originally published in Czech in 2016. The English edition has been substantially updated and two new chapters, 'Survival Analysis' and 'Classification and Regression Trees', have been added. The book will be essential reading for undergraduate and graduate students, professional researchers, and informed managers of natural resources.'
Marcel Rejmánek,
Department of Evolution and Ecology, University of California, Davis, CA, USA
Biostatistics with R
An Introductory Guide for Field Biologists
JAN LEPŠ
University of South Bohemia, Czech Republic
PETR ŠMILAUER
University of South Bohemia, Czech Republic
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
79 Anson Road, #06-04/06, Singapore 079906
Cambridge University Press is part of the University of Cambridge.
It furthers the University's mission by disseminating knowledge in the pursuit of
education, learning, and research at the highest international levels of excellence.
www.cambridge.org
Information on this title: www.cambridge.org/9781108480383
DOI: 10.1017/9781108616041
© Jan Lepš and Petr Šmilauer 2020
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2020
Printed in the United Kingdom by TJ International Ltd, Padstow Cornwall
A catalogue record for this publication is available from the British Library.
ISBN 978-1-108-48038-3 Hardback
ISBN 978-1-108-72734-1 Paperback
Additional resources for this publication at www.cambridge.org/biostatistics
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
Contents
Preface page xiii
Acknowledgements xvii
1 Basic Statistical Terms, Sample Statistics 1
1.1 Cases, Variables and Data Types 1
1.2 Population and Random Sample 3
1.3 Sample Statistics 4
1.4 Precision of Mean Estimate, Standard Error of Mean 9
1.5 Graphical Summary of Individual Variables 10
1.6 Random Variables, Distribution, Distribution Function,
Density Distribution 10
1.7 Example Data 13
1.8 How to Proceed in R 13
1.9 Reporting Analyses 17
1.10 Recommended Reading 18
2 Testing Hypotheses, Goodness-of-Fit Test 19
2.1 Principles of Hypothesis Testing 19
2.2 Possible Errors in Statistical Tests of Hypotheses 21
2.3 Null Models with Parameters Estimated from the Data:
Testing Hardy–Weinberg Equilibrium 26
2.4 Sample Size 26
2.5 Critical Values and Significance Level 27
2.6 Too Good to Be True 29
2.7 Bayesian Statistics: What is It? 30
2.8 The Dark Side of Significance Testing 32
2.9 Example Data 35
2.10 How to Proceed in R 35
2.11 Reporting Analyses 37
2.12 Recommended Reading 37
3 Contingency Tables 39
3.1 Two-Way Contingency Tables 39
3.2 Measures of Association Strength 44
3.3 Multidimensional Contingency Tables 46
3.4 Statistical and Causal Relationship 47
3.5 Visualising Contingency Tables 49
3.6 Example Data 50
3.7 How to Proceed in R 50
3.8 Reporting Analyses 54
3.9 Recommended Reading 54
4 Normal Distribution 55
4.1 Main Properties of a Normal Distribution 55
4.2 Skewness and Kurtosis 56
4.3 Standardised Normal Distribution 57
4.4 Verifying the Normality of a Data Distribution 58
4.5 Example Data 60
4.6 How to Proceed in R 60
4.7 Reporting Analyses 63
4.8 Recommended Reading 64
5 Student's t Distribution 65
5.1 Use Case Examples 65
5.2 t Distribution and its Relation to the Normal Distribution 66
5.3 Single Sample Test and Paired t Test 67
5.4 One-Sided Tests 70
5.5 Confidence Interval of the Mean 72
5.6 Test Assumptions 73
5.7 Reporting Data Variability and Mean Estimate Precision 74
5.8 How Large Should a Sample Size Be? 77
5.9 Example Data 79
5.10 How to Proceed in R 79
5.11 Reporting Analyses 82
5.12 Recommended Reading 83
6 Comparing Two Samples 84
6.1 Use Case Examples 84
6.2 Testing for Differences in Variance 85
6.3 Comparing Means 87
6.4 Example Data 88
6.5 How to Proceed in R 88
6.6 Reporting Analyses 91
6.7 Recommended Reading 91
7 Non-parametric Methods for Two Samples 92
7.1 Mann–Whitney Test 93
7.2 Wilcoxon Test for Paired Observations 95
7.3 Using Rank-Based Tests 97
7.4 Permutation Tests 97
7.5 Example Data 99
7.6 How to Proceed in R 99
7.7 Reporting Analyses 102
7.8 Recommended Reading 103
8 One-Way Analysis of Variance (ANOVA) and Kruskal–Wallis Test 104
8.1 Use Case Examples 104
8.2 ANOVA: A Method for Comparing More Than Two Means 104
8.3 Test Assumptions 105
8.4 Sum of Squares Decomposition and the F Statistic 106
8.5 ANOVA for Two Groups and the Two-Sample t Test 108
8.6 Fixed and Random Effects 108
8.7 F Test Power 109
8.8 Violating ANOVA Assumptions 110
8.9 Multiple Comparisons 111
8.10 Non-parametric ANOVA: Kruskal–Wallis Test 115
8.11 Example Data 116
8.12 How to Proceed in R 117
8.13 Reporting Analyses 127
8.14 Recommended Reading 128
9 Two-Way Analysis of Variance 129
9.1 Use Case Examples 129
9.2 Factorial Design 130
9.3 Sum of Squares Decomposition and Test Statistics 132
9.4 Two-Way ANOVA with and without Interactions 134
9.5 Two-Way ANOVA with No Replicates 135
9.6 Experimental Design 135
9.7 Multiple Comparisons 137
9.8 Non-parametric Methods 138
9.9 Example Data 139
9.10 How to Proceed in R 139
9.11 Reporting Analyses 149
9.12 Recommended Reading 150
10 Data Transformations for Analysis of Variance 151
10.1 Assumptions of ANOVA and their Possible Violations 151
10.2 Log-transformation 153
10.3 Arcsine Transformation 156
10.4 Square-Root and Box–Cox Transformation 156
10.5 Concluding Remarks 157
10.6 Example Data 158
10.7 How to Proceed in R 158
10.8 Reporting Analyses 163
10.9 Recommended Reading 163
11 Hierarchical ANOVA, Split-Plot ANOVA, Repeated Measurements 164
11.1 Hierarchical ANOVA 164
11.2 Split-Plot ANOVA 167
11.3 ANOVA for Repeated Measurements 169
11.4 Example Data 171
11.5 How to Proceed in R 171
11.6 Reporting Analyses 181
11.7 Recommended Reading 182
12 Simple Linear Regression: Dependency Between Two Quantitative
Variables 183
12.1 Use Case Examples 183
12.2 Regression and Correlation 184
12.3 Simple Linear Regression 184
12.4 Testing Hypotheses 187
12.5 Confidence and Prediction Intervals 190
12.6 Regression Diagnostics and Transforming Data in Regression 190
12.7 Regression Through the Origin 195
12.8 Predictor with Random Variation 197
12.9 Linear Calibration 197
12.10 Example Data 198
12.11 How to Proceed in R 198
12.12 Reporting Analyses 204
12.13 Recommended Reading 205
13 Correlation: Relationship Between Two Quantitative Variables 206
13.1 Use Case Examples 206
13.2 Correlation as a Dependency Statistic for Two Variables on an Equal
Footing 206
13.3 Test Power 209
13.4 Non-parametric Methods 212
13.5 Interpreting Correlations 212
13.6 Statistical Dependency and Causality 213
13.7 Example Data 216
13.8 How to Proceed in R 216
13.9 Reporting Analyses 218
13.10 Recommended Reading 218
14 Multiple Regression and General Linear Models 219
14.1 Use Case Examples 219
14.2 Dependency of a Response Variable on Multiple Predictors 219
14.3 Partial Correlation 223
14.4 General Linear Models and Analysis of Covariance 224
14.5 Example Data 225
14.6 How to Proceed in R 226
14.7 Reporting Analyses 237
14.8 Recommended Reading 238
15 Generalised Linear Models 239
15.1 Use Case Examples 239
15.2 Properties of Generalised Linear Models 240
15.3 Analysis of Deviance 242
15.4 Overdispersion 243
15.5 Log-linear Models 243
15.6 Predictor Selection 244
15.7 Example Data 245
15.8 How to Proceed in R 246
15.9 Reporting Analyses 250
15.10 Recommended Reading 251
16 Regression Models for Non-linear Relationships 252
16.1 Use Case Examples 252
16.2 Introduction 253
16.3 Polynomial Regression 253
16.4 Non-linear Regression 255
16.5 Example Data 256
16.6 How to Proceed in R 256
16.7 Reporting Analyses 259
16.8 Recommended Reading 260
17 Structural Equation Models 261
17.1 Use Case Examples 261
17.2 SEMs and Path Analysis 261
17.3 Example Data 265
17.4 How to Proceed in R 265
17.5 Reporting Analyses 272
17.6 Recommended Reading 272
18 Discrete Distributions and Spatial Point Patterns 274
18.1 Use Case Examples 274
18.2 Poisson Distribution 274
18.3 Comparing the Variance with the Mean to Measure Spatial
Distribution 276
18.4 Spatial Pattern Analyses Based on the K-function 279
18.5 Binomial Distribution 280
18.6 Example Data 283
18.7 How to Proceed in R 283
18.8 Reporting Analyses 289
18.9 Recommended Reading 289
19 Survival Analysis 290
19.1 Use Case Examples 290
19.2 Survival Function and Hazard Rate 291
19.3 Differences in Survival Among Groups 293
19.4 Cox Proportional Hazard Model 293
19.5 Example Data 295
19.6 How to Proceed in R 295
19.7 Reporting Analyses 302
19.8 Recommended Reading 302
20 Classification and Regression Trees 303
20.1 Use Case Examples 303
20.2 Introducing CART 304
20.3 Pruning the Tree and Crossvalidation 306
20.4 Competing and Surrogate Predictors 307
20.5 Example Data 308
20.6 How to Proceed in R 309
20.7 Reporting Analyses 316
20.8 Recommended Reading 316
21 Classification 317
21.1 Use Case Examples 317
21.2 Aims and Properties of Classification 317
21.3 Input Data 319
21.4 Similarity and Distance 319
21.5 Clustering Algorithms 320
21.6 Displaying Results 320
21.7 Divisive Methods 321
21.8 Example Data 322
21.9 How to Proceed in R 322
21.10 Other Software 324
21.11 Reporting Analyses 325
21.12 Recommended Reading 325
22 Ordination 326
22.1 Use Case Examples 327
22.2 Unconstrained Ordination Methods 327
22.3 Constrained Ordination Methods 330
22.4 Discriminant Analysis 331
22.5 Example Data 333
22.6 How to Proceed in R 333
22.7 Alternative Software 340
22.8 Reporting Analyses 341
22.9 Recommended Reading 341
Appendix A: First Steps with R Software 343
A.1 Starting and Ending R, Command Line, Organising Data 343
A.2 Managing Your Data 349
A.3 Data Types in R 351
A.4 Importing Data into R 357
A.5 Simple Graphics 359
A.6 Frameworks for R 360
A.7 Other Introductions to Work with R 362
Index 363
Preface
Modern biology is a quantitative science. A biologist weighs, measures and
counts, whether she works with aphid or fish individuals, with plant communities
or with nuclear DNA. Every number obtained in this way, however, is affected by
random variation. Aphid counts repeatedly obtained from the same plant individual will differ. The counts of aphids obtained from different plants will differ
more, even if those plants belong to the same species, and samples coming from
plants of different species are likely to differ even more. Similar differences will
be found in the nuclear DNA content of plants from the same population, in
nitrogen content of soil samples taken from the same or different sites, or in the
population densities of copepods across repeated samplings from the same lake.
We say that our data contain a random component: the values we obtain are
random quantities, with a part of their variation resulting from randomness.
But what actually is this randomness? In posing such a question, we
move into the realm of philosophy or to axioms of probability theory. But what is
probability? A biologist is usually happy with a pragmatic concept: we consider
an event to be random if we do not have a causal explanation for it. Statistics is a
research field that provides recipes for how to work with data containing
random components, and how to distinguish deterministic patterns from random
variation. Popular wisdom says that statistics is a branch of science where precise
work is carried out with imprecise numbers. But the term statistics has multiple
meanings. The layman sees it as an assorted collection of values (football league
statistics of goals and points, statistics of MP voting, statistics of cars passing
along a highway, etc.). Statistics is also a research field (often called mathematical
statistics) providing tools for obtaining useful information from such datasets. It is
a separate branch of science, to a certain extent representing an application of probability
theory. The term statistic (note the singular form) is also used in another sense: a numerical
characteristic computed from data. For example, the well-known arithmetic average is a
statistic characterising a given data sample.
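As a trivial illustration of ours (not part of the original text), computing such sample statistics in R takes a single function call:

```r
# Hypothetical aphid counts from five plants
aphids <- c(12, 7, 30, 15, 9)
mean(aphids)   # arithmetic average, a sample statistic: 14.6
sd(aphids)     # another statistic: the sample standard deviation
```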
In scientific thinking, we can distinguish deductive and inductive approaches. The
deductive approach leads us from known facts to their consequences. Sherlock Holmes may
use the facts that a room is locked, has no windows and is empty to deduce that the room must
have been locked from the outside. Mathematics is a typical example of a deductive system:
based on axioms, we can use a purely logical (deductive) path to derive further statements,
which are always correct if the initial axioms are also correct (unless we made a mistake in the
derivation). Using the deductive approach, we proceed in a purely logical manner and do not
need any comparison with the real-world situation.
The inductive approach is different: we try to find general rules based on many
observations. If we tread upon 1-cm-thick ice one hundred times and the ice breaks each time,
we can conclude that ice of this thickness is unable to carry the weight of a grown person. We
conclude this using inductive thinking. We could, however, also employ the deductive
approach by using known physical laws, strength measurements of ice and the known weight
of a grown person. But usually, when treading on thin ice, we do not know its exact thickness
and sometimes the ice breaks and sometimes it does not. Usually we find, only after breaking
through it, that the ice was quite thin. Sometimes even thicker ice breaks, but such an event is
affected by many circumstances we are not able to quantify (ice structure, care in treading, etc.)
and we therefore consider it random. Using many observations, however, we can estimate
the probability of breaking through ice based on its thickness by using the methods of
mathematical statistics. Statistics is therefore a tool of inductive thinking in such cases, where
the outcome of an experiment (or observation) is affected by random variability.
Thanks to advances in computer technology, statistics is now available to all
biologists. Statistical analysis of data is a necessary prerequisite of manuscript acceptance
in most biological journals. These days, it is impossible to fully understand most of the
research papers in biological journals without understanding the basic principles of statistics.
All biologists must plan their observations and experiments, as only correctly collected data
can be useful when answering their questions with the aid of statistical methods. To collect
your data correctly, you need to have a basic understanding of statistics.
A knowledge of statistics has therefore become essential for successful enquiry in
almost all fields of biology. But statistics are also often misused. Some even say that there are three kinds of lies: a non-intentional lie, an intentional lie and statistics. We can 'adorn' bad data by employing a complex statistical method so that the result looks like a substantial contribution to our knowledge (even finding its way into prestigious journals). Another common case of statistical misuse is interpreting statistical ('correlational') dependency as causal. In this way, one can 'prove' almost anything. A knowledge of statistics also allows
biologists to differentiate statements which provide new and useful information from those
where statistics are used to simply mask a lack of information, or are misused to support
incorrect statements.
The way statistics are used in the everyday practice of biology changed substantially
with the increased availability of statistical software. Today, everyone can evaluate her/his
data on a personal computer; the results are just a few mouse clicks away. While your
computer will (almost) always offer some results, often in the form of a nice-looking graph,
this rather convenient process is not without its dangers. There are users who present the
results provided to them by statistical programs without ever understanding what was
computed. Our book therefore tries not only to teach you how to analyse your data, but also
how to understand what the results of statistical processing mean.
What is biostatistics? We do not think that this is a separate research field. In using
this term, we simply imply a focus on the application of statistics to biological problems.
Alternatively, the term biometry is sometimes used in a similar sense. In our book, we place
an emphasis on understanding the principles of the methods presented and the rules of their
use, not on the mathematical derivation of the methods. We present individual methods in a
way that we believe is convenient for biologists: we first show a few examples of biological
problems that can be solved by a given method, and only then do we present its principles and
assumptions. In our explanations we assume that the reader has attended an introductory
undergraduate mathematical course, including the basics of the theory of probability. Even so,
we try to avoid complex mathematical explanations whenever possible.
This book provides only basic information. We recommend that all readers continue with a more detailed exploration of those methods of interest to them. The three most recommended textbooks for this are Quinn & Keough (2002), Sokal & Rohlf (2012) and Zar (2010). The first and last of these more closely reflect the mind of the biologist, as their authors have themselves participated in ecological research. In this book, we adopt some ideas from Zar's textbook about the sequence in which to present selected topics. After every chapter, we give page ranges for these three textbooks, each containing additional information about the particular methods. Our book is only a slight extension of a one-term course (2 hours of lectures + 2 hours of practicals per week) in Biostatistics, and therefore sufficient detail
is lacking on some of the statistical methods useful for biologists. This primarily concerns the
use of multivariate statistical analysis, traditionally addressed in separate textbooks and
courses.
We assume that our readers will evaluate their data using a personal computer and we
illustrate the required steps and the format of results using two different types of software. The
program R lacks some of the user-friendliness provided by alternative statistical packages, but
offers practically all known statistical methods, including the most modern ones, for free
(more details at cran.r-project.org), and so it has become the de facto standard tool, prevailing in
published biological research papers. We assume that the reader will have a basic working
knowledge of R, including working with its user interface, importing data or exporting results.
The knowledge required is, however, summarised in Appendix A of this book, which can be
found after the last chapter. The program Statistica represents software for the less demanding
user, with a convenient range of menu choices and extensive dialogue boxes, as well as an
easily accessible and modifi able graphical presentation of results. Instructions for its use are
available to the reader at the textbook's website: www.cambridge.org/biostatistics.
Example data used throughout this book are available at the same website, but also
from our own university's web address: www.prf.jcu.cz/biostat-data-eng.xlsx.
Note that in most of our ' use case examples' (and often also in the example data), the
actual (or suggested) number of replicates is very low, perhaps too low to provide reasonable
support for a real-world study. This is just to make the data easily tractable while we
demonstrate the computation of test statistics. For real-world studies, we recommend the
reader strive to obtain more extensive datasets. If there is no citation for one of our example datasets, the data are not real.
In each chapter, we also show how the results derived from statistical software can be presented in research papers, and how to describe the particular statistical methods there.
In this book, we will most frequently refer to the following three statistical textbooks
providing more details about the methods:
• J. H. Zar (2010) Biostatistical Analysis, 5th edn. Pearson, San Francisco, CA.
• G. P. Quinn & M. J. Keough (2002) Experimental Design and Data Analysis for Biologists. Cambridge University Press, Cambridge.
• R. R. Sokal & F. J. Rohlf (2012) Biometry, 4th edn. W. H. Freeman, San Francisco, CA.
Other useful textbooks include:
• R. H. Green (1979) Sampling Design and Statistical Methods for Environmental Biologists. Wiley, New York.
• R. H. G. Jongman, C. J. F. ter Braak & O. F. R. van Tongeren (1995) Data Analysis in Community and Landscape Ecology. Cambridge University Press, Cambridge.
• P. Šmilauer & J. Lepš (2014) Multivariate Analysis of Ecological Data Using Canoco 5, 2nd edn. Cambridge University Press, Cambridge.
More advanced readers will find the following textbook useful:
• R. Mead (1990) The Design of Experiments: Statistical Principles for Practical Application. Cambridge University Press, Cambridge.
Where appropriate, we cite additional books and papers at the end of the corresponding chapter.
Acknowledgements
Both authors are thankful to their wives Olina and Majka for their ceaseless
support and understanding. Our particular thanks go to Petr's wife Majka (Marie Šmilauerová), who created all the drawings which start and enliven each chapter.
We are grateful to Conor Redmond for his careful and efficient work in improving our English grammar and style.
The feedback of our students was of great help when writing this book,
particularly the in-depth review from a student point of view provided by Václava
Hazuková. We appreciate the revision of Section 2.7, kindly provided by Cajo
ter Braak.
1 Basic Statistical Terms, Sample Statistics
1.1 Cases, Variables and Data Types
In our research, we observe a set of objects (cases) of interest and record some information for each of them. We call all of this collected information the data. If plants are our cases, for example, then the data might contain information about flower colour, number of leaves, height of the plant stem or plant biomass. Each characteristic that is measured or estimated for our cases is called a variable. We can distinguish several data types, each differing in their properties and consequently in the way we handle the corresponding variables during statistical analysis.
Data on a ratio scale, such as plant height, number of leaves, animal weight, etc., are usually quantitative (numerical) data, representing some measurable amount: mass, length, energy. Such data have a constant distance between any adjacent unit values (e.g. the difference between lengths of 5 and 6 cm is the same as between 8 and 9 cm) and a naturally defined zero value. We can also think about such data as ratios, e.g. a length of 8 cm is twice the length of 4 cm. Usually, these data are non-negative (i.e. their value is either zero or positive).
Data on an interval scale, such as temperature readings in degrees Celsius, are again quantitative data with a constant distance (interval) between adjacent unit values, but there is no naturally defined zero. When we compare e.g. the temperature scales of Celsius and Fahrenheit, both have a zero value at different temperatures, which are defined rather
arbitrarily. For such scales it makes no sense to consider ratios of their values: we cannot say that 8 °C is twice as high a temperature as 4 °C. These scales usually cover negative, zero, as well as positive values. On the contrary, temperature values in Kelvin (K) can be considered a variable on a ratio scale.
A special case of data on an interval scale are circular scale data: time of day, days in a year, compass bearing – azimuth, used often in field ecology to describe the exposition of a slope. The maximum value for such scales is usually identical with (or adjacent to) the minimum value (e.g. 0° and 360°). Data on a circular scale must be treated in a specific way and thus there is a special research area developing the appropriate statistical methods to do so (so-called circular statistics).
Data on an ordinal scale can be exemplified by the state of health of some individuals: excellent health, lightly ill, heavily ill, dead. A typical property of such data is that there is no constant distance between adjacent values, as this distance cannot be quantified. But we can order the individual values, i.e. comparatively relate any two distinct
values (greater than, equal to, less than). In biological research, data on an ordinal scale are
employed when the use of quantitative data is generally not possible or meaningful, e.g. when
measuring the strength of a reaction in ethological studies. Measurements on an ordinal scale are
also often used as a surrogate when the ideal approach to measuring a characteristic (i.e. in a
quantitative manner, using ratio or interval scale) is simply too laborious. This happens e.g. when
recording the degree of herbivory damage on a leaf as none, low, medium, high. In this case it
would of course be possible to attain a more quantitative description by scanning the leaves and
calculating the proportion of area lost, but this might be too time-demanding.
Data on a nominal scale (also called categorical variables, or factors). To give some examples, a nominal variable can describe colour, species identity, location, identity of experimental block or bedrock type. Such data define membership of
a particular case in a class, i.e. a qualitative characteristic of the object. For this scale, there
are no constant (or even quantifiable) differences among categories, neither can we order the
cases based on such a variable. Categorical data with just two possible values (very often yes
and no ) are often called binary data . Most often they represent the presence or absence of a
character (leaves glabrous or hairy, males or females, organism is alive or dead, etc.).
Ordinal as well as categorical variables are often coded in statistical software as
natural numbers. For example, if we are sampling in multiple locations, we would naturally
code the first location as 1, the second as 2, the third as 3, etc. The software might not know
that these values represent categorical data (if we do not tell it in some way) and be willing
to compute e.g. an arithmetic average of the location identity, quite a nonsensical value.
So beware, some operations can only be done with particular types of data.
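This pitfall is easy to demonstrate in R, where the factor data type marks a variable as categorical. A minimal sketch (the location codes are made up for illustration):

```r
# Sampling locations coded as numbers -- R happily treats them as quantitative:
location <- c(1, 1, 2, 2, 3, 3)
mean(location)         # 2, a nonsensical "average location identity"

# Declaring the variable as a factor tells R the data are categorical:
location.f <- factor(location)
is.factor(location.f)  # TRUE; mean(location.f) now warns and returns NA
```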
Quantitative data (on an interval or a ratio scale) can be further distinguished into
discrete vs. continuous data. For continuous data (such as weights), between any two
measurement values there may typically lie another. In contrast we have discrete data, which
are most often (but not always) counts (e.g. number of leaves per plant), that is non-negative
integer numbers. In biological research, the distinction between discrete and continuous data
is often blurred. For example, the counts of algal cells per 1 ml of water can be considered as a
continuous variable (usually the measurement precision is less than 1 cell). In contrast, when
we estimate tree height in the field using a hypsometer (an optical instrument for measuring
tree height quickly), measurement precision is usually 0.5 m (modern devices using lasers
may be more precise), despite the fact that tree height is a continuous variable. So even when
the measured variable is continuous, the obtained values have a discrete nature. But this is an
artefact of our measurement method, not a property of the measured characteristic: although
the recorded values of tree height will be repeated across the dataset, the probability of finding
two trees in a forest with identical height is close to zero.
1.2 Population and Random Sample
Our research usually refers to a large (potentially even infinitely large) group of cases, the
statistical population (or statistical universe), but our conclusions are based on a smaller
group of cases, representing collected observations. This smaller group of observations is called
the random sample, or often simply the sample. Even when we do not use the word random,
we assume randomness in the choice of cases included in our sample. The term (statistical)
population is often not related to what a biologist calls a population. In statistics this word has a
more general meaning. The process of obtaining the sample is called sampling.
To obtain a random sample (as is generally assumed by statistical methods), we must
follow certain rules during case selection: each member (e.g. an individual) in the statistical
population must have the same and independent chance of being selected. The randomness of
our choice should be assured by using random numbers. In the simplest (but often not workable)
approach, we would label all cases in the sampled population with numbers from 1 to N .We
then obtain the required sample of size n by choosing n random whole numbers from the
interval (1, N )insuchawaythateachnumberinthatintervalhasthesamechanceofbeing
selected and we reject the random numbers suggested by the software where the same choice is
repeated. We then proceed by measuring the cases labelled with the selected n numbers.
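In R, this label-and-draw procedure is a one-liner: sample() draws without replacement by default, so repeated labels are rejected automatically. A sketch with a hypothetical population of N = 500 labelled cases:

```r
set.seed(42)   # only to make the example reproducible
N <- 500       # number of labelled cases in the statistical population
n <- 20        # required sample size
chosen <- sample(1:N, size = n)   # n distinct random labels from 1..N
length(chosen)            # 20
any(duplicated(chosen))   # FALSE -- no case is selected twice
```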
In field studies estimating e.g. the aboveground biomass in an area, we would proceed by selecting several sample plots in the area in which the biomass is being collected. Those plots are chosen by defining a system of rectangular coordinates for the whole area and then generating random coordinates for the centres of individual plots. Here we assume that the sampled area has a rectangular shape¹ and is large enough so that we can ignore the possibility that the sample plots will overlap.
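Random plot centres can be generated with runif(); the rejection trick for a non-rectangular area is equally short. The area dimensions and the triangular shape below are made-up examples, not from the book:

```r
set.seed(1)
n.plots <- 10
# Plot centres within a rectangular study area of 100 m x 50 m:
x <- runif(n.plots, min = 0, max = 100)
y <- runif(n.plots, min = 0, max = 50)

# Non-rectangular area: generate coordinates in the enclosing rectangle
# and reject those falling outside (here a hypothetical triangular area,
# defined by the condition x + y <= 100):
keep <- (x + y) <= 100
x.in <- x[keep]
y.in <- y[keep]
```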
It is much more difficult to select e.g. the individuals from a population of freely
living organisms, because it is not possible to number all existing individuals. For this, we
typically sample in a way that is assumed to be close to random sampling, and subsequently
work with the sample as if it were random, while often not appreciating the possible dangers
of our results being affected by sampling bias. To give an example, we might want to study a
dormouse population in a forest. We could sample them using traps without knowing the size
of the sampled population. We can consider the individuals caught in traps as a random
sample, but this is likely not a correct expectation. Older, more experienced individuals are
probably better at avoiding traps and therefore will be less represented in our sample. To
adequately account for the possible consequences of this bias, and/or to develop a better
sampling strategy, we need to know a lot about the life history of the dormouse.
But even sampling sedentary organisms is not easy. Numbering all plant individuals
in an area of five acres and then selecting a truly random sample, while certainly possible in
principle, is often unmanageable in practical terms. We therefore require a sampling method
¹ But if not, we can still use a rectangular envelope enclosing the more complex area and simply reject the random coordinates falling outside the actual area.
suitable for the target objects and their spatial distribution. It is important to note that a
frequently used sampling strategy in which we choose a random location in the study area
(by generating point coordinates using random values) and then select an individual closest to
this point is not truly random sampling. This is because solitary individuals have a higher
chance of being sampled than those growing in a group. If individuals growing in groups are
smaller (as is often the case due to competition), our estimates of plant characteristics based on
this sampling procedure will be biased.
Stratified sampling represents a specific group of sampling strategies. In this approach, the statistical population is first split into multiple, more homogeneous subsets and then each subset is randomly sampled. For example, in a morphometric study of a spider species
we can randomly sample males and females to achieve a balanced representation of both sexes.
To take another example, in a study examining the effects of an invasive plant species on the
richness of native communities, we can randomly sample within different climatic regions.
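A sketch of the spider example in R (the dataset, with 120 males and 80 females, is invented): the population is split by sex and each stratum is then sampled at random.

```r
set.seed(7)
# Hypothetical records of 200 spiders with known sex:
spiders <- data.frame(id  = 1:200,
                      sex = rep(c("male", "female"), times = c(120, 80)))
# Stratified sample: 15 individuals drawn at random within each sex
chosen <- unlist(lapply(split(spiders$id, spiders$sex), sample, size = 15))
table(spiders$sex[match(chosen, spiders$id)])   # 15 of each sex
```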
Subjectively choosing individuals, either considered typical for the subject or
seemingly randomly chosen (e.g. following a line across a sampling location and
occasionally picking an individual), is not random sampling and therefore is not recommended for defining a dataset for subsequent statistical analysis.
The sampled population can sometimes be defined solely in a hypothetical manner.
For example, in a glasshouse experiment with 10 individuals of meadow sweetgrass (Poa
pratensis), the reference population is a potential set of all possible individuals of this species,
grown under comparable conditions, in the same season, etc.
1.3 Sample Statistics
Let us assume we want to describe the height for a set of 50 pine (Pinus sp.) trees. Fifty values
of their height would represent a complete, albeit somewhat complex, view of the trees. We
therefore need to simplify (summarise) this information, but with a minimal loss of detail.
This type of summarisation can be achieved in two general ways: we can transform our
numerical data into a graphical form (visualise them) or we can describe the set of values with
a few descriptive statistics that summarise the most important properties of the whole dataset.
Among the choice of graphical summaries we have at our disposal, one of the most
often used is the frequency histogram (see Fig. 1.2 later). We can construct a frequency
histogram for a particular numerical variable by dividing the range of values into several
classes (sub-ranges) of the same width and plotting (as the vertical height of each bar) the
count of cases in each class. Sometimes we might want to plot the relative frequencies of cases rather than simple counts, e.g. as the percentage of the total number of cases in the whole sample (the histogram's shape, and the information it portrays, does not change, only the scale used on the vertical axis). When we have a sufficient number of cases and sufficiently narrow classes (intervals), the shape of the histogram approaches a characteristic of the variable's distribution called probability density (see Section 1.6 and Fig. 1.2 later). Further information about graphical summaries is provided in a separate section on graphical data summaries (Section 1.5).
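In R, hist() produces the frequency histogram; with freq = FALSE the vertical axis shows density rather than counts. The pine heights below are simulated, not real data:

```r
set.seed(3)
heights <- rnorm(50, mean = 990, sd = 110)   # hypothetical pine heights (cm)
# Frequency histogram (counts on the vertical axis):
hist(heights, breaks = 8, xlab = "Height [cm]", main = "")
# The same classes, but scaled as probability density:
hist(heights, breaks = 8, freq = FALSE, xlab = "Height [cm]", main = "")
# hist() also returns the classes and counts, without plotting if asked:
h <- hist(heights, breaks = 8, plot = FALSE)
sum(h$counts)   # 50 -- each case falls into exactly one class
```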
Alternatively, we can summarise our data using descriptive statistics. Using our
pine heights example, we are interested primarily in two aspects of our dataset: what is the
typical ('mean') height of the trees and how much do the individual heights in our sample
differ. The first aspect is quantified using the characteristics of position (also called central tendency), the second by the characteristics of variability. The characteristics of a finite set of values (of a random sample or a finite statistical population) can be determined precisely. In contrast, the characteristics of an infinitely large statistical population (or of a population for which we have not measured all the cases) must be estimated using a random sample. As a formal rule, the characteristics of a statistical population are labelled by Greek letters, while we label the characteristics of a random sample using standard (Latin) letters. The counts of cases represent an exception: N is the number of cases in a statistical population, while n is the number of cases (size) of a random sample.
1.3.1 Characteristics of Position
Example questions: What is the height of pine trees in a particular valley? What is the pH of
water in the brooks of a particular region? For trees, we can either measure all of them or be
happy with a random sample. For water pH, we must rely on a random sample, measuring its
values at certain places within certain parts of the season.
Both examples demonstrate how important it is to have a well-defined statistical population (universe). In the case of our pine trees, we would probably be interested in mature individuals, because mixing the height of mature individuals with that of seedlings and saplings will not provide useful information. This means that in practice, we will need an operational definition of a 'mature individual' (e.g. at least 20 years old, as estimated by coring at a specific height).
Similarly, for water pH measurements, we would need to specify the type of
streams we are interested in (and then, probably using a geographic information system –
GIS, we select the sampling sites in a way that will correspond to random sampling). Further,
because pH varies systematically during each day, and around the year, we will also need
to specify some time window when we should perform our measurements. In each case,
we need to think carefully about what we consider to be our statistical population with
respect to the aims of study. Mixing pH of various water types might blur the information we
want to obtain. It might be better to have a narrow time window to avoid circadian variability,
but we must consider how informative, say, the morning pH is for the whole ecosystem.
It is probably not reasonable to pool samples from various seasons. In any case, all these
decisions must be specified when reporting the results. Saying that the average pH of streams in an area is 6.3 without further specification is not very informative, and might be misleading if
we used a narrow subset of all possible streams or a narrow time window. Both of these
examples also demonstrate the difficulty of obtaining a truly random sample; often we must
simply try our best to select cases that will at least resemble a random sample.
Generally, we are interested in the 'mean' value of some characteristic, so we ask what
the location of values on the chosen measurement scale is. Such an intuitively understood mean
value can be described by multiple characteristics. We will discuss some of these next.
1.3.1.1 Arithmetic Mean (Average)
The arithmetic mean of the statistical population is

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \qquad (1.1)$$
while the arithmetic mean of a random sample, $\bar{X}$, is

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \qquad (1.2)$$
Example calculation: The height of five pine trees (in centimetres, measured with a precision of 10 cm) was 950, 1120, 830, 990, 1060. The arithmetic average is then (950 + 1120 + 830 + 990 + 1060)/5 = 990 cm. The mean is calculated in exactly the same way whether the five individuals represent our entire population (i.e. all individuals which we are interested in, say for example if we planted these five individuals 20 years ago and wish to examine their success) or whether these five individuals form our random sample representing all of the individuals in the study area, this being our statistical population. In the first case, we will denote the mean by μ, and this is an exact value. In the second scenario (much more typical in biological sciences), we will never know the exact value of μ, i.e. the mean height of all the individuals in the area, but we use the sample mean $\bar{X}$ to estimate its value (i.e. $\bar{X}$ is the estimate of μ).
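In R, the sample mean of the five pine heights is a direct call to mean():

```r
heights <- c(950, 1120, 830, 990, 1060)   # tree heights in cm
mean(heights)                    # 990
# equivalent to the defining formula, Eq. (1.2):
sum(heights) / length(heights)   # 990
```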
Be aware that the arithmetic mean (or any other characteristic of location) cannot be used for raw data measured on a circular scale. Imagine we are measuring the geographic exposition of tree trunks bearing a particular lichen species. We obtain the following values in degrees (where both 0 and 360 degrees represent north): 5, 10, 355, 350, 15, 345. Applying Eq. (1.2), we obtain an average value of 180, suggesting that the mean orientation is facing south, but actually most trees have a northward orientation. The correct approach to working with circular data is outlined e.g. in Zar (2010, pp. 605–668).
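A sketch of the problem in R, using six bearings that are symmetric about north, together with the standard circular-statistics remedy (average the sine and cosine components and convert the mean vector back to an angle). This snippet is an illustration, not the book's own code:

```r
bearings <- c(5, 10, 355, 350, 15, 345)   # degrees; 0 (= 360) is north
mean(bearings)                            # 180 -- misleadingly suggests south

# Circular (vector) mean: average the unit vectors, then convert back
rad <- bearings * pi / 180
mean.deg <- atan2(mean(sin(rad)), mean(cos(rad))) * 180 / pi
(mean.deg + 360) %% 360   # 0 up to rounding error -- due north, as expected
```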
1.3.1.2 Median and Other Quantiles
The median is defined as a value which has an identical number of cases, both above and below this particular value. Or we can say (for an infinitely large set) that the probability of the
value for a randomly chosen case being larger than the median (but also smaller than the
median) is identical, i.e. equal to 0.5. For theoretical data distributions (see Section 1.6 later in
this chapter), the median is the value of a random variable with a corresponding distribution
function value equal to 0.5. We can use the median statistic for data on ratio, interval or
ordinal scales. There is no generally accepted symbol for the median statistic.
Besides the median, we can also use other quantiles. The most frequently used are the two quartiles – the upper quartile, defined as the value that separates one-quarter of the highest-value cases, and the lower quartile, defined as the value that separates one-quarter of the lowest-value cases. The other quantiles can be defined similarly, and we will return to this topic when describing the properties of distributions.
In our pine heights example (see Section 1.3.1.1), the median value is equal to 990 cm (which is equal to the mean, just by chance). We estimate the median by first sorting the values according to their size. When the sample size (n) is odd, the median is equal to $X_{(n+1)/2}$, i.e. to the value in the centre of the list of sorted cases. When n is even, the median is estimated as the centre of the interval between the two middle observations, i.e. as $(X_{n/2} + X_{n/2+1})/2$. For example, if we are dealing with animal weights equal to 50, 52, 60, 63, 70, 94 g, the median estimate is 61.5 g. The median is sometimes calculated in a special way when its location falls among multiple cases with identical values (tied observations), see Zar (2010, p. 26).
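In R, median() and quantile() implement these estimates; quantile() offers several estimation methods, of which type = 2 averages order statistics exactly as described above:

```r
weights <- c(50, 52, 60, 63, 70, 94)   # animal weights (g); n is even
median(weights)                        # 61.5 = (60 + 63)/2

heights <- c(950, 1120, 830, 990, 1060)   # n is odd
median(heights)                           # 990

# Lower and upper quartiles:
quantile(weights, probs = c(0.25, 0.75), type = 2)   # 52 and 70
```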
As we will see later, the population median value is identical to the value of the
arithmetic mean if the data have a symmetrical distribution. The manner in which the
arithmetic mean and median differ in asymmetrical distributions (see also Fig. 1.1) is shown
below. In this example we are comparing two groups of organisms which differ in the way
they obtain their food, with each group comprising 11 individuals. The amount of food
(transformed into grams of organic C per day) obtained by each individual was as follows:
Group 1: 15, 16, 16, 17, 17, 18, 18, 19, 19, 20, 21
Group 2: 5, 5, 6, 6, 7, 8, 9, 15, 35, 80, 120
In the first group, the arithmetic average of consumed C is 17.8 g, while the average for the second group is 26.9 g. The average consumption is therefore higher in the second group. But if we use medians, the value for the first group is 18, but just 8 in the second group. A typical individual (characterised by the fact that half of the individuals consume more and the other half less) consumes much more in the first group.
1.3.1.3 Mode
The mode is defined as the most frequent value. For data with a continuous distribution, this is the variable value corresponding to the local maximum (or local maxima) of the probability density. There might be more than one mode value for a particular variable, as a distribution can also be bimodal (with two mode values) or even polymodal. The mode is defined for all data types. For continuous data it is usually estimated as the centre of the value interval for the
highest bar in a frequency histogram. If this is a polymodal distribution, we can use the bars
with heights exceeding the height of surrounding bars. It is worth noting that such an estimate
depends on our choice of intervals in the frequency histogram. The fact that we can obtain a
sample histogram that has multiple modes (given the choice of intervals) is not sufficient
evidence of a polymodal distribution for our sampled population values.
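A simple sketch of this estimation procedure in R (the data are simulated): take the centre of the highest histogram bar, and note how the estimate can shift when the class width changes.

```r
set.seed(5)
x <- rnorm(200, mean = 10, sd = 2)    # hypothetical continuous data
h <- hist(x, breaks = 12, plot = FALSE)
# Mode estimate: centre of the interval with the highest bar
h$mids[which.max(h$counts)]
# A different choice of intervals may give a (slightly) different estimate:
h2 <- hist(x, breaks = 25, plot = FALSE)
h2$mids[which.max(h2$counts)]
```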
1.3.1.4 Geometric Mean
The geometric mean is defined as the n-th root of the product (the Π operator represents multiplication) of the n values in our sample:

$$GM = \sqrt[n]{\prod_{i=1}^{n} X_i} = \left( \prod_{i=1}^{n} X_i \right)^{1/n} \qquad (1.3)$$
The geometric mean of our five pines example will be (950 × 1120 × 830 × 990 × 1060)^{1/5} = 984.9. The geometric mean is generally used for data on a ratio scale which do not contain zeros, and its value is never larger than the arithmetic mean.
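Base R has no geometric-mean function; the usual idiom computes it on the logarithmic scale, which is numerically safer than multiplying the raw values directly:

```r
heights <- c(950, 1120, 830, 990, 1060)
gm <- exp(mean(log(heights)))   # n-th root of the product, via logs
round(gm, 1)                    # 984.9
mean(heights)                   # 990 -- the arithmetic mean is never smaller
```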
Figure 1.1 Frequency histograms idealised into probability density curves, with marked locations of the mean, median and mode. Data values are plotted along the horizontal axis and frequency (probability) on the vertical axis. The distribution in plot A is symmetrical (mean, median and mode coincide), while in plot B it is positively skewed and in plot C it is negatively skewed.
1.3.2 Characteristics of Variability (Spread)
Besides the 'mean value' of the characteristic under observation, we are often interested in the
extent of differences among individual values in the sample, i.e. how variable they are. This is
addressed by the characteristics of variability.
Example question: How variable is the height of our pine trees?
1.3.2.1 Range
The range is the difference between the largest (maximum) and the smallest (minimum)
values in our dataset. In the tree height example the range is 290 cm. Please note that the range
of values grows with increasing sample size. Therefore, the range estimated from a random
sample is not a good estimate of the range in the sampled statistical population.
1.3.2.2 Variance
The variance and the statistics derived from it are the most often used characteristics of variability. The variance is defined as the average value of the second powers (squares) of the deviations of individual observed values from their arithmetic average. For a statistical population, the variance is defined as follows:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \qquad (1.4)$$
For a sample, the variance is defined as

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} \qquad (1.5)$$

The s² term is sometimes replaced with var or VAR. The variance of a sample is the best (unbiased) estimate of the variance of the sampled population.
Example calculation: For our pine trees, the variance is defined (if we consider the five trees as the whole population) as ((950 − 990)² + (1120 − 990)² + (830 − 990)² + (990 − 990)² + (1060 − 990)²)/5 = 9800. However, it is more likely that these values would represent a random sample, so the proper estimate of variance is calculated as ((950 − 990)² + (1120 − 990)² + (830 − 990)² + (990 − 990)² + (1060 − 990)²)/4 = 12,250. Comparing Eqs (1.4) and (1.5), we can see that the difference between these two estimates diminishes with increasing n: for five specimens the difference is relatively large, but it is more or less negligible for large n. The denominator value, i.e. n − 1 and not n, is used in the sample because we do not know the real mean and thus must estimate it. Naturally, the larger our n is, the smaller the difference is between the estimate $\bar{X}$ and the (unknown) real value of the mean μ.
1.3.2.3 Standard Deviation
The standard deviation is the square root of the variance (for both a sample and a population). Besides being denoted by an s, it is often marked as s.d., S.D. or SD. The standard deviation of a statistical population is defined as

$$\sigma = \sqrt{\sigma^2} \qquad (1.6)$$

The standard deviation of a sample is defined as

$$s = \sqrt{s^2} \qquad (1.7)$$

When we consider the five tree heights as a random sample, s = √12,250 cm² = 110.68 cm.
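In R, var() and sd() use the sample formulas (denominator n − 1); the population variance has to be computed by hand. For the five pine heights:

```r
heights <- c(950, 1120, 830, 990, 1060)
# Population variance, Eq. (1.4), with denominator N:
sum((heights - mean(heights))^2) / length(heights)   # 9800
# Sample variance, Eq. (1.5) -- this is what var() computes:
var(heights)                                         # 12250
# Sample standard deviation, Eq. (1.7):
sd(heights)                                          # 110.68 (rounded)
```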
1.3.2.4 Coefficient of Variation
In many variables measured on a ratio scale, the standard deviation is scaled with the mean (sizes of individuals are a typical example). We can ask whether the height of individuals is more variable in a population of the plant species Impatiens glandulifera (with a typical height of about 2 m) or in a population of Impatiens noli-tangere (with a typical height of about 30 cm). We must therefore relate the variation to the average height of both groups. In these and other similar cases, we characterise variability by the coefficient of variation (CV, sometimes also CoV), which is the standard deviation estimate divided by the arithmetic mean:

$$CV = \frac{s}{\bar{X}} \qquad (1.8)$$
The coefficient of variation is meaningful for data on a ratio scale. It is used when we
want to compare the variability of two or more groups of objects differing in their
mean values.
In contrast, it is not possible to use this coefficient for data on an interval scale, such as
comparing the variation in temperature among groups differing in their average temperature.
There is no natural zero value and hence the coefficient of variation gives different results
depending on the chosen temperature scale (e.g. degrees Celsius vs. degrees Fahrenheit).
Similarly, it does not make sense to use the CV for log-transformed data (including pH). In
many cases the standard deviation of log-transformed data provides information similar to CV .
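There is no built-in CV function in base R, but one line suffices. The two height samples below are invented to mimic the Impatiens comparison:

```r
tall  <- c(195, 210, 180, 220, 205)   # hypothetical I. glandulifera heights (cm)
small <- c(28, 33, 25, 35, 30)        # hypothetical I. noli-tangere heights (cm)
cv <- function(x) sd(x) / mean(x)     # Eq. (1.8)
cv(tall)    # about 0.075
cv(small)   # about 0.131 -- relatively more variable despite the smaller SD
```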
1.3.2.5 Interquartile Range
The interquartile range – calculated as the difference between the upper and lower quartiles –
is also a measure of variation. It is a better characteristic of variation than the range, as it is not
systematically related to the size of our sample. The interquartile range as a measure of
variation (spread) is a natural counterpart to the median as a measure of position (location).
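In R the two statistics are diff(range(x)) and IQR(x); note that IQR() relies on quantile()'s default estimation method (type 7), so its value can differ slightly from other textbook definitions of the quartiles. Using the animal weights from above:

```r
weights <- c(50, 52, 60, 63, 70, 94)
diff(range(weights))   # 44 -- the range (maximum minus minimum)
IQR(weights)           # 14.25 -- upper minus lower quartile (type 7 quantiles)
```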
1.4 Precision of Mean Estimate, Standard Error of Mean
The sample arithmetic mean is also a random variable (while the arithmetic mean of a
statistical population is not). So this estimate also has its own variation: if we sample a
statistical population repeatedly, the means calculated from individual samples will differ.
Their variation can be estimated using the variance of the statistical population (or of its
estimate, as the true value is usually not available). The variance of the arithmetic average is
$$s_{\bar{X}}^2 = \frac{s_X^2}{n} \qquad (1.9)$$
The square root of this variance is the standard deviation of the mean's estimate and is typically called the standard error of the mean. It is often labelled as $s_{\bar{X}}$, SEM or s.e.m., and is the most commonly employed characteristic of precision for an estimate of the arithmetic mean. Another often-used statistic is the confidence interval, calculated from the standard error and discussed later in Chapter 5. Based on Eq. (1.9), we can obtain a formula for directly computing the standard error of the mean:

$$s_{\bar{X}} = \frac{s_X}{\sqrt{n}} \qquad (1.10)$$
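Base R has no built-in standard-error function, but Eq. (1.10) translates directly. For the five pine heights:

```r
heights <- c(950, 1120, 830, 990, 1060)
sem <- sd(heights) / sqrt(length(heights))   # Eq. (1.10)
round(sem, 1)                                # 49.5
```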
Do not confuse the standard deviation and the standard error of the mean: the standard deviation describes the variation in sampled data and its estimate is not systematically dependent on the sample size; the standard error of the mean characterises the precision of our estimate and its value decreases with increasing sample size – the larger the sample, the greater the precision of the mean's estimate.
1.5 Graphical Summary of Individual Variables
Most research papers present the characteristics under investigation using the arithmetic mean
and standard deviation, and/or the standard error of the mean estimate. In this way, however,
we lose a great deal of information about our data, e.g. about their distribution. In general,
a properly chosen graph summarising our data can provide much more information than just
one or a couple of numerical statistics.
To summarise the shape of our data distribution, it is easiest to plot a frequency
histogram (see Figs 1.2 and 1.3 below). Another type of graph summarising variable distribution is the box-and-whisker plot (see Fig. 1.4 explaining individual components of this plot
type and Fig. 1.5 providing an example of its use). Some statistical software packages (this
does not concern R) use the box-and-whisker plot (by default) to present an arithmetic mean
and standard deviation. Such an approach is suitable only if we can assume that the statistical population for the visualised variable's values has a normal (Gaussian) distribution (see Chapter 4). But generally, it is more informative to plot such a graph based on the median and quartiles, as this shows clearly any existing peculiarities of the data distribution and possibly also identifies unusual values included in our sample.
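In R, boxplot() is based on the median and quartiles by default, and boxplot.stats() exposes the numbers behind the boxes. Using the food-consumption data from Section 1.3.1.2:

```r
group1 <- c(15, 16, 16, 17, 17, 18, 18, 19, 19, 20, 21)
group2 <- c(5, 5, 6, 6, 7, 8, 9, 15, 35, 80, 120)
boxplot(list(group1 = group1, group2 = group2),
        ylab = "Food consumption [g C per day]")
# Whisker ends, quartiles (hinges) and median behind the second box:
boxplot.stats(group2)$stats
# Values flagged as unusually extreme (beyond the whiskers):
boxplot.stats(group2)$out   # 80 and 120
```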
1.6 Random Variables, Distribution, Distribution Function,
Density Distribution
All the equations provided so far can be used only for datasets and samples of finite size. As an example, to calculate the mean for a set of values, we must measure all cases in that set and this is possible only for a set of finite size. Imagine now, however, that our sampled statistical population is infinite, or we are observing some random process which can be repeated any number of times and which results in producing a particular value – a particular random entity. For example, when studying the distribution of plant seeds, we can release each seed using a tube at a particular height above the soil surface and subsequently measure its speed at the end of the tube.² Such a measurement process can be repeated an infinite number of times.³ Measured speed can be considered a random variable and the measured values are the realisations of that random variable. Observed values of a random variable are actually a random sample from a potentially infinite set of values – in this case all possible speeds of the seeds. This is true for almost all variables we measure in our research, whether in the field or in the lab.
² So-called terminal velocity, considered to be a good characteristic of a seed's ability to disperse in the wind.
³ In practice this is not so simple. When we aim to characterise the dispersal ability of a plant species we should vary the identity of the seeds, with the tested seeds being a random sample from all the seeds of the given species.