The next section briefly reviews a much broader range of modeling frameworks, and gives some starting points in the modeling literature in case you want to learn more about other kinds of ecological models. To help fit statistical models into the larger picture, Table 1 contrasts several qualitative styles of modeling. The discussion of these dichotomies starts to draw in some of the statistical, mathematical and ecological concepts I suggested you should know.
Part of the challenge of learning the material in this book is a chicken-and-egg problem: in order to know why certain technical details are important, you need to know the big picture, but the big picture itself involves knowing some of those technical details. Iterating, or cycling, is the best way to handle this problem. Most of the material introduced in this chapter will be covered in more detail in later chapters.
Each column contrasts a different qualitative style of modeling. The loose association of descriptors in each column gets looser as you work downwards. Theoretical models are often mathematically difficult and ecologically oversimplified, which is the price of generality. Paradoxically, although theoretical models are defined in terms of precise numbers of individuals, because of their simplicity they are usually only used for qualitative predictions.
Applied models are often mathematically simpler (although they can require complex computer code), but tend to capture more of the ecological complexity and quirkiness needed to make detailed predictions about a particular place and time. Because of this complexity their predictions are often less general. The dichotomy of mathematical vs. statistical models shows up in the typical products of each discipline. A mathematician is more likely to produce a deterministic, dynamic process model without thinking very much about noise and uncertainty. A statistician, on the other hand, is more likely to produce a stochastic but static model, one that treats noise and uncertainty carefully but focuses more on static patterns than on the dynamic processes that produce them.
The important difference between phenomenological pattern and mechanistic process models will be with us throughout the book. As usual, there are shades of gray; the same function could be classified as either phenomenological or mechanistic depending on why it was chosen. All other things being equal, mechanistic models are more powerful since they tell you about the underlying processes driving patterns.
They are more likely to work correctly when extrapolating beyond the observed conditions. Finally, by making more assumptions, they allow you to extract more information from your data — with the risk of making the wrong assumptions. Further reading: books on ecological modeling overlap with those on ecological theory listed on p. Other good sources include Nisbet and Gurney (a well-written but challenging classic), Gurney and Nisbet (a lighter version), Haefner (broader, including physiological and ecosystem perspectives), Renshaw (good coverage of stochastic models), Wilson (simulation modeling in C), and Ellner and Guckenheimer (dynamics of biological systems in general).
An analytical model is made up of equations solved with algebra and calculus. A computational model consists of a computer program which you run for a range of parameter values to see how it behaves. Most mathematical models and a few statistical models are dynamic; the response variables at a particular time (the state of the system) feed back to affect the response variables in the future.
Integrating dynamical and statistical models is challenging, a topic taken up in a later chapter. Most statistical models are static; the relationship between predictor and response variables is fixed.
One can specify how models represent the passage of time or the structure of space (both can be continuous or discrete); whether they track continuous population densities (or biomass or carbon densities) or discrete individuals; whether they consider individuals within a species to be equivalent or divide them by age, size, genotype, or past experience; and whether they track the properties of individuals (individual-based or Lagrangian) or the number of individuals within different categories (population-based or Eulerian).
Deterministic models represent only the average, expected behavior of a system in the absence of random variation, while stochastic models incorporate noise or randomness in some way. A purely deterministic model allows only for qualitative comparisons with real systems; since the model will never match the data exactly, how can you tell if it matches closely enough? For example, a deterministic food-web model might predict that introducing pike to a lake would cause a trophic cascade, decreasing the density of phytoplankton (because pike prey on sunfish, which eat zooplankton, which in turn consume phytoplankton); it might even predict the expected magnitude of the change.
In order to test this prediction with real data, however, you would need some kind of statistical model to estimate the magnitude of the average change in several lakes (and the uncertainty), and to distinguish between observed changes due to pike introduction and those due to other causes (measurement error, seasonal variation, weather, nutrient dynamics, population cycles). Most ecological models incorporate stochasticity crudely, by simply assuming that there is some kind of (perhaps normally distributed) variation, arising from a combination of unknown factors, and estimating the magnitude of that variation from the variation observed in the field.
We will go beyond this approach, specifying different sources of variability and something about their expected distributions. More sophisticated models of variability enjoy some of the advantages of mechanistic models: models that make explicit assumptions about the underlying causes of variability can both provide more information about the ecological processes at work and can get more out of your data.
Such variation is usually modeled by the standard approach of adding normally distributed variability around a mean value. Likewise, the number of tadpoles out of an initial cohort of 20 eaten by predators in a set amount of time will vary between experiments. Even if we controlled everything about the environment and genotype of the predators and prey, we would still see different numbers dying in each run of the experiment.
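The tadpole example is easy to simulate. This is a minimal sketch: the per-capita predation probability of 0.3 and the random seed are invented for illustration, not taken from any real experiment.

```r
## Demographic stochasticity: each "experiment" exposes a cohort of 20
## tadpoles to predators with a (hypothetical) per-capita predation
## probability of 0.3; the number eaten differs between runs even
## though the conditions are identical.
set.seed(101)
killed <- rbinom(10, size = 20, prob = 0.3)
killed        ## a different count in each of the 10 runs
mean(killed)  ## close to the expected value, 20 * 0.3 = 6
```

Running the block a few times (with different seeds) shows the run-to-run variation that no amount of experimental control can eliminate.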
Suppose we expect to find three individuals on an isolated island. If we make a measurement error and measure zero instead of three, we may go back at some time in the future and still find them. If an unexpected predator eats all three individuals (process variability), and no immigrants arrive, any future observations will find no individuals. The conceptual distinction between process and measurement error is most important in dynamic models, where the process error has a chance to feed back on the dynamics.
The distinctions between stochastic and deterministic effects, and between demographic and environmental variability, are really a matter of definition. What determines whether a tossed coin will land heads-up? Its starting orientation and the number of times it turns in the air, which depends on how hard you toss it (Keller). What determines exactly which and how many seedlings of a cohort die?
The amount of energy with which their mother provisions the seeds, their individual light and nutrient environments, and encounters with pathogens and herbivores. Variation that drives mortality in seedlings (differences in their light and nutrient environments, for example) might be treated as deterministic by an ecologist who measures those environments, but as random by one who does not. Climatic variation is random to an ecologist (at least on short time scales) but might be deterministic, although chaotically unpredictable, to a meteorologist. Similarly, the distinction between demographic variation, internal to the system, and environmental variation, external to the system, varies according to the focus of a study. Is the variation in the number of trees that die every year an internal property of the variability in the population, or does it depend on an external climatic variable that is modeled as random noise?
One could quantify simplicity vs. complexity easily enough; crudity and sophistication are harder to recognize. They represent the conceptual depth, or the amount of hidden complexity, involved in a model or statistical approach. For example, a computer model that picks random numbers to determine when individuals give birth and die and keeps track of the total population size, for particular values of the birth and death rates and starting population size, is simple and crude. Even simpler, but far more sophisticated, is the mathematical theory of random walks (Okubo), which describes the same system but — at the cost of challenging mathematics — predicts its behavior for any birth and death rates and any starting population sizes.
A statistical model that searches at random for the line that minimizes the sum of squared deviations of the data is crude and simple; the theory of linear models, which involves more mathematics, does the same thing in a more powerful and general way. Computer programs, too, can be either crude or sophisticated.
One can pick numbers from a binomial distribution by virtually flipping the right number of coins and seeing how many come up heads, or by using numerical methods that arrive at the same result far more efficiently. A simple R command like rbinom, which picks random binomial deviates, hides a lot of complexity. The value of sophistication is generality, simplicity, and power; its costs are opacity and conceptual and mathematical difficulty.
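The crude coin-flipping approach takes only a couple of lines of R, and gives draws from the same distribution that rbinom produces efficiently. The sample size and probability below are arbitrary illustrations.

```r
## Two ways to draw one binomial deviate with size = 10, prob = 0.5:
set.seed(1)

## crude but transparent: flip 10 virtual coins and count the heads
flips <- runif(10) < 0.5
crude <- sum(flips)

## sophisticated: let rbinom's efficient internal algorithm do it
slick <- rbinom(1, size = 10, prob = 0.5)

## both crude and slick are legitimate draws from Binomial(10, 0.5)
```

For a single draw the difference is invisible; for millions of draws with large size, rbinom's hidden numerical machinery is dramatically faster.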
While the differences among these frameworks are sometimes controversial, most modern statisticians know them all and use whatever tools they need to get the job done; this book will teach you the details of those tools, and the distinctions among them. The two species (actually the smallest- and largest-seeded species of a set of eight) are Polyscias fulva (pol) and Pseudospondias microcarpa (psd).
For a specific experimental procedure (such as drawing cards or flipping coins), you calculate the probability of a particular outcome, which is defined as the long-run average frequency of that outcome in a sequence of repeated experiments.
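A quick simulation illustrates this long-run-frequency definition; the number of flips here is arbitrary.

```r
## Probability as long-run frequency: simulate many fair coin flips and
## watch the running proportion of heads settle near 0.5.
set.seed(2)
flips   <- rbinom(10000, size = 1, prob = 0.5)
running <- cumsum(flips) / seq_along(flips)
running[c(10, 100, 10000)]  ## wobbly at first, settling near 0.5
```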
Next you calculate a p-value, defined as the probability of that outcome or any more extreme outcome given a specified null hypothesis. If this so-called tail probability is small, then you reject the null hypothesis; otherwise, you fail to reject it. The frequentist approach to statistics (due to Fisher, Neyman and Pearson) is useful and very widely used, but it has some serious drawbacks, which are repeatedly pointed out by proponents of other statistical frameworks (Berger and Berry). Probably the most criticized aspect of frequentist statistics is their reliance on p-values, which when misused (as frequently occurs) are poor tools for scientific inference.
We could also reject the null hypothesis, in cases where we have lots of data, even though the results are biologically insignificant — that is, if the estimated effect size is ecologically irrelevant. Working statisticians will tell you that it is better to focus on estimating the values of biologically meaningful parameters and finding their confidence limits rather than worrying too much about whether p is greater or less than 0.05. Do you think the answer is likely to be statistically significant? How about biologically significant?
What assumptions or preconceptions does your answer depend on? The total probability that as many or more pol stations had seeds taken, or that the difference was as extreme in the other direction, is the two-tailed frequentist p-value. The top axis shows the equivalent in seed predation probability ratios. Note: I put the y-axis on a log scale because the tails of the curve are otherwise too small to see, even though this change means that the area under the curve no longer represents the total probability.
In terms of probability ratios, this example gives 2. The odds ratio and its logarithm (the logit or log-odds ratio) have nice statistical properties. Most modern statistics uses an approach called maximum likelihood estimation, or approximations to it. For a particular statistical model, maximum likelihood finds the set of parameters (e.g., the predation probabilities) that make the observed data most likely to have occurred. Based on a model for both the deterministic and stochastic aspects of the data, we can compute the likelihood — the probability of the observed outcome — given a particular choice of parameters. We then find the set of parameters that makes the likelihood as large as possible, and take the resulting maximum likelihood estimates (MLEs) as our best guess at the parameters.
For mathematical convenience, we often work with the logarithm of the likelihood (the log-likelihood) instead of the likelihood itself; the parameters that give the maximum log-likelihood also give the maximum likelihood. We will see that the likelihood lets us both estimate confidence limits for parameters and choose between competing models. Bayesians also use the likelihood; it is part of the recipe for computing the posterior distribution. How would one apply maximum likelihood estimation to the seed predation example?
The likelihood L is the probability that seeds were taken at 51 out of the total number of observations. This probability varies as a function of the predation probability (Figure 1). This likelihood is small, but that just means that the probability of any particular outcome — seeds being taken in 51 trials rather than 50 or 52 — is small. To answer the questions that really concern us about the different predation probabilities for different species, we need to allow different probabilities for each species, and see how much better we can do (how much higher the likelihood is) with this more complex model.
If I define the model in terms of the probability for psd and the ratio of the probabilities, I can plot a likelihood profile, the maximum likelihood I can get for a given value of the ratio (Figure 1). The maximum-likelihood estimate equals the observed ratio of the probabilities, 3. The likelihood L is the probability of observing the complete data set (i.e., the observations for both species). Log-likelihoods are based on natural (log e, or ln) logarithms. The null value (ratio equal to 1) is just below the lower limit of the graph.
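The single-species likelihood calculation can be sketched in a few lines of R. The 51 successes come from the text, but the total number of observations is not given here, so the 500 below is a placeholder rather than the real sample size.

```r
## Binomial likelihood for a single predation probability.
## k = 51 successes is from the text; N = 500 trials is a placeholder.
k <- 51
N <- 500
p <- seq(0.05, 0.2, length = 200)                 ## candidate probabilities
loglik <- dbinom(k, size = N, prob = p, log = TRUE)  ## log-likelihood curve
p.mle <- p[which.max(loglik)]                     ## grid-search MLE
p.mle   ## essentially the analytical MLE, k/N
k / N   ## = 0.102
```

Because the binomial log-likelihood is concave in p, the grid maximum sits right next to the analytical answer k/N; a finer grid (or optimize()) would sharpen it further.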
Likelihood analysis is really a particular flavor of frequentist analysis, one that focuses on writing down a likelihood model and then testing for significant differences in the likelihood ratio rather than applying frequentist statistics directly to the observed outcomes. The Bayesian framework says instead that the experimental outcome — what we actually saw happen — is the truth, while the parameter values or hypotheses have probability distributions.
The Bayesian framework solves many of the conceptual problems of frequentist statistics: answers depend on what we actually saw and not on a range of hypothetical outcomes, and we can legitimately make statements about the probability of different hypotheses or parameter values. The major fly in the ointment of Bayesian statistics is that in order to make it work we have to specify our prior beliefs about the probability of different hypotheses, and these prior beliefs actually affect our answers!
For better or worse, Bayesian statistics operates in the same way as we typically do science: we down-weight observations that are too inconsistent with our current beliefs, while using those in line with our current beliefs to strengthen and sharpen those beliefs (statisticians are divided on whether this is good or bad).
The only big disadvantage (besides the problem of priors) is that problems of small to medium complexity are actually harder with Bayesian approaches than with frequentist approaches — at least in part because most statistical software is geared toward classical statistics. How would a Bayesian answer our question about predation rates? Any discrepancy between the answers reflects the difference in perspective between frequentists, who believe that the true value is a fixed number and uncertainty lies in what you observe (or might have observed), and Bayesians, who believe that observations are fixed numbers and the true values are uncertain.
Then they might define a parameter, the ratio of the two proportions, and ask questions about the posterior distribution of that parameter — our best estimate of the probability distribution given the observed data and some prior knowledge of its distribution (see Chapter 4). What is the mode (most probable value) of that distribution? What is its expected value, or mean?
The Bayesian answers, in a nutshell: using a flat prior distribution, the mode is 3. The mean is 3. (The most probable value is the mode; the expected value is the mean; Chapter 4 will explain this distinction more carefully.) Ecological statisticians are still hotly debating which framework is best, or whether there is a single best framework. My own approach is eclectic, agreeing with the advice of Crome and Stephens et al.
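For a single binomial proportion with a flat prior, the posterior has a closed form (a Beta distribution), so the mode, mean, and a credible interval can be computed directly. The counts below are placeholders, not the actual seed predation data.

```r
## Flat (Beta(1,1)) prior + binomial likelihood with k successes in N
## trials gives a Beta(k + 1, N - k + 1) posterior for the proportion.
## k and N here are placeholder values.
k <- 51
N <- 500
post.mode <- k / N                 ## mode of Beta(k+1, N-k+1) is k/N
post.mean <- (k + 1) / (N + 2)     ## posterior mean
ci <- qbeta(c(0.025, 0.975), k + 1, N - k + 1)  ## 95% credible interval
```

Note that the mode matches the frequentist MLE with a flat prior, while the mean is pulled slightly toward 1/2, a small illustration of how the prior affects Bayesian answers.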
We will revisit these frameworks in more detail later. Textbooks like Dalgaard cover classical frequentist approaches very well. We will use a system called R that is both a statistics package and a computing language. This awkward phrase gets at the idea that R is more than just a statistics package. It is a dialect of the S computing language, which was written at Bell Labs as a research tool in statistical computing.
MathSoft, Inc. sells S-PLUS, a commercial implementation of S; R is an open-source dialect. R is an extremely powerful tool. It is a full-fledged modern computer language with sophisticated data structures; it supports a wide range of computations and statistical procedures; it can produce graphics ranging from exploratory plots to customized publication-quality graphics. R is also free. This cheapness is vital, rather than convenient, for teachers, independent researchers, people in less-developed countries, and students who are frustrated with limited student versions or pirated versions of commercial software.
R is the choice of many academic and industrial statisticians, who work to improve it and to write extension packages. There are only tiny, mostly cosmetic differences in the way that R runs on different machines. You can nearly always move data files and code between operating systems and get the same answers. R is rapidly gaining popularity. The odds are good that someone in your organization is using R, and there are many resources on the Internet (including a very active mailing list). There is a growing number of introductory books using R (Dalgaard; Verzani; Crawley), books of examples (Maindonald and Braun; Heiberger and Holland; Everitt and Hothorn), more advanced and encyclopedic books covering a range of statistical approaches (Venables and Ripley; Crawley), and books on specific topics such as regression analysis (Fox; Faraway), mixed-effect models (Pinheiro and Bates), phylogenetics (Paradis), generalized additive models (Wood), etc.
However, for most of what we will be doing in this book a GUI would not be very useful. Unlike SAS, for which you can buy voluminous manuals that tell you the details of various statistical procedures and how to run them in SAS, R typically assumes that you have a general knowledge of the procedure you want to use and can figure out how to make it work in R by reading the on-line documentation or a separately published book (including this one). R is slower than so-called lower-level languages like C and FORTRAN because it is an interpreted language that processes strings of commands typed at the command line or stored in a text file, rather than a compiled language that first translates commands into machine code.
For most problems you will encounter, the limiting factor will be how fast and easily you can write and debug the code, not how long the computer takes to process it. Interpreted languages make writing and debugging faster. R is memory-hungry. Unlike SAS, which was developed with a metaphor of punch cards being processed one at a time, R tries to operate on the whole data set at once. On the other hand, the community of researchers who build and use R are among the best in the world, and R compares well with commercial software (Keeling and Pavur). While every piece of software has bugs, the core components of R have been used so extensively by so many people that the chances of your finding a bug in R are about the same as the chances of finding a bug in a commercial software package like SAS or SPSS — and if you do find one and report it, it will probably be fixed within a few days.
Practice switching back and forth between these two levels. Ultimately, knowing how to ask good questions is one of the fundamental skills for any ecologist, or indeed any scientist, and unfortunately there is no recipe telling you how to do it. The deterministic part is the average, or expected pattern in the absence of any kind of randomness or measurement error.
Chapter 3 will remind you of, or introduce you to, a broad range of mathematical models that are useful building blocks for a deterministic model, and provide general tools for getting acquainted with the mathematical properties of deterministic models. Typically, you describe the stochastic model by specifying a reasonable probability distribution for the variation. For example, we often assume that variation that comes from measurement error is normally distributed, while variation in the number of plants found in a quadrat of a specific size is Poisson distributed.
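These two standard stochastic choices can be sketched in a few lines of R; the true mass, standard deviation, density, and sample sizes below are all arbitrary illustrations.

```r
## Two standard stochastic building blocks:
set.seed(3)
## (1) measurement error, often taken to be normally distributed
true.mass <- 10
measured  <- true.mass + rnorm(100, mean = 0, sd = 0.5)
## (2) counts of plants per quadrat, often taken to be Poisson
counts <- rpois(100, lambda = 4)   ## mean density of 4 plants per quadrat
mean(measured)  ## near the true value, 10
var(counts)     ## near mean(counts): a signature of the Poisson
```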
Ecologists tend to be less familiar with stochastic building blocks than with deterministic ones.
This book, Ecological Models and Data in R (Princeton University Press), is about combining models with data to answer ecological questions.
The former are frequently covered in the first week of introductory statistics courses and then forgotten as you learn standard statistical methods. Chapter 4 will reintroduce some basics of probability as well as a wide range of probability distributions useful in building stochastic models. This step is a purely technical exercise in figuring out how to get the computer to fit the model to the data.
Unlike the previous steps, it provides no particular insight into the basic ecological questions. The fitting step does require ecological insight both as input (for most fitting procedures, you must start with some order-of-magnitude idea of reasonable parameter values) and output (the fitted parameters are essentially the answers to your ecological question). Chapters 6 and 7 will go into great detail about the practical aspects of fitting: the basic methods, how to make them work in R, and troubleshooting tips.
Without some measurement of uncertainty, such estimates are meaningless. By quantifying the uncertainty in the fit of a model, you can estimate confidence limits for the parameters. You can also test ecological hypotheses, from both an ecological and a statistical point of view e. You also need to quantify uncertainty in order to choose the best out of a set of competing models, or to decide how to weight the predictions of different models. All of these procedures — estimating confidence limits, testing the differences between parameters in two models or between a parameter and a null-hypothesis value such as zero, and testing whether one model is significantly better than another — are closely related aspects of the modeling process that we will discuss in Chapter 6.
You may have answered your questions with a single pass through steps 1–5, but it is far more likely that estimating parameters and confidence limits will force you to redefine your models (changing their form or complexity, or the ecological covariates they take into account) or even to redefine your original ecological questions. You may need to ask different questions, or collect another set of data, to further understand how your system works. Like the first step, this final step is a bit more free-form and general, but there are tools (likelihood ratio testing, model selection) that will help (Chapter 6).
I use this approach for modeling ecological systems every day. It answers ecological questions and, more importantly, it shapes the way I think about data and about those ecological questions. There is a growing number of studies in ecology that use simple but realistic statistical models that do not fit easily into classical statistical frameworks (Butler and Burns; Ribbens et al.). They are most useful for ecological systems where you want to test among a well-defined set of plausible mechanisms, and where you have measured a few potentially important predictor and response variables.
For this largely conceptual chapter, the notes are about how to get R and how to get it working on your computer. Find the latest version for your operating system, download it, and follow the instructions to install it. The installation file is moderately large. It should be fine to accept all the defaults in the installation process.
R should work well on any reasonably modern computer. I developed and ran all the code in the book with R version 2, so details may vary with other versions. After you have played with R a bit, you may want to take a moment to install extra packages (see below). To start R, use the menus your operating system provides to find it; if you are on a Unix system, you can probably just type R on the command line. If you use an equals sign to assign a value to a variable, then R will silently do what you asked. Variable names are case-sensitive, so x and X are different variables.
If you type ?topic, R will display the help page for topic. If you type help.search("topic"), R will search the installed help system for the topic. If you type RSiteSearch("topic"), R will do a search in an on-line database for information on the topic.
Try out one or more of these aspects of the help system. You may be able to install new packages from a menu within R. Otherwise, pick a folder where you do have appropriate permissions and install your R packages there. Graphical interfaces such as JGR (cross-platform) or SciViews (Windows) include similar editors and have extra functions such as a workspace browser for looking at all the variables you have defined.
All of these interfaces, which are designed to facilitate R programming, are in a different category from Rcmdr, which tries to simplify basic statistics in R. If you are using Windows or Linux I would strongly recommend that, once you have tried R a little bit, you download at least an R-aware editor and possibly a GUI to make your life easier.
Create a second variable tadpoles (the density of tadpoles in each population) by generating 20 normally distributed random numbers, each with twice the mean of the corresponding frogs value and a standard deviation of 0.5. Sometimes the continuation character means that you forgot to close parentheses or quotes. This rule of printing a variable that is entered on a line by itself also explains why typing q rather than q() prints out R code rather than exiting R. Plot tadpoles against frogs (frogs on the x axis, tadpoles on the y axis) and add a straight line with intercept 0 and slope 2 to the plot (the result should appear in a new window, looking like Figure 1).
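One possible solution to this exercise; since the original definition of frogs is not reproduced here, the frogs values below are assumed purely for illustration.

```r
## The frogs variable is assumed here to be 20 evenly spaced densities
## (its original definition is not shown in this excerpt).
set.seed(4)
frogs    <- seq(1, 20)                          ## assumed values
tadpoles <- rnorm(20, mean = 2 * frogs, sd = 0.5)
plot(frogs, tadpoles)    ## tadpoles against frogs
abline(a = 0, b = 2)     ## straight line: intercept 0, slope 2
```

Note that rnorm happily takes a vector of means, so each tadpoles value gets its own mean of 2 * frogs without an explicit loop.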
Color specification, col, also applies in many other contexts; all colors are set to gray scales here. Use log10(tadpoles) to get the logarithm base 10 (Figure 1). The summary statistics are only displayed to three significant digits, which can occasionally cause confusion. Data-dredging is a serious problem. Most statisticians are leery of procedures like stepwise regression that search for the best predictors or combinations of predictors from among a large range of options, even though some have elaborate safeguards to avoid overestimating the significance of observed patterns (Whittingham et al.).
The worst part about using such techniques is that in order to use them you must be conservative and discard real patterns (patterns that you originally had in mind) because you are screening your data indiscriminately (Nakagawa). But these injunctions may be too strict for ecologists. Unexpected patterns in the data can inspire you to ask new questions, and it is foolish not to explore your hard-earned data. EDA was developed in the late 1970s, when computer graphics first became widely available. Critics have pointed out that similar procedures will also detect hidden messages in War and Peace or Moby Dick (McKay et al.).
There are procedures (e.g., TukeyHSD and the multcomp package in R) that correct for the fact that you are testing a pattern that was not suggested in advance; however, even these procedures only apply corrections for a specific set of possible comparisons, not all possible patterns that you could have found in your data.
Most of the rest of this book will focus on models that, in contrast to EDA, are parametric i. These methods are more powerful and give more ecologically meaningful answers, but are also susceptible to being misled by unusual patterns in the data. The big advantages of EDA are that it gets you looking at and thinking about your data whereas stepwise approaches are often substitutes for thought , and that it may reveal patterns that standard statistical tests would overlook because of their emphasis on specific models.
Only common sense and caution can keep you in the zone between ignoring interesting patterns and over-interpreting them. The rest of this chapter describes how to get your data into R and how to make some basic graphs in order to search for expected and unexpected patterns. The text covers both philosophy and some nitty-gritty details. The supplement at the end of the chapter gives a sample session and more technical details. Data come in a variety of formats — in ecology, most are either plain text files (space- or comma-delimited) or Excel files.
Text files are less structured and may take up more disk space than more specialized formats, but they are the lowest common denominator of file formats and so can be read by almost anything and, if necessary, examined and adjusted in any text editor. R is platform-agnostic.
While text files do have very slightly different formats on Unix, Microsoft Windows, and Macintosh operating systems, R handles these differences. Many ecologists keep their data in Excel spreadsheets. The read.csv function reads comma-separated text files directly, so saving your spreadsheet as a .csv file is usually the easiest way to move it into R. If your data are in some more exotic form, you will need to convert or export them to text first. If you have trouble exporting data, or you expect to have large quantities of data, you may want to consider using a relational database instead. Metadata is the information that describes the properties of a data set: the names of the variables, the units they were measured in, when and where the data were collected, etc.
R does not have a structured system for maintaining metadata, but it does allow you to include a good deal of this metadata within your data file, and it is good practice to keep as much of this information as possible associated with the data file.
Use underscores or dots to separate words in variable names, not spaces. I also use comments before, or at the ends of, particular lines in the data set that might need annotation, such as the circumstances surrounding questionable data points. You can reference the file in your comments, keep a separate file that lists the location of data and metadata, or use a system like Morpho from ecoinformatics.
Whatever you do, make sure that you have some workable system for maintaining your metadata. Eventually, your R scripts (which document how you read in your data, transformed it, and drew conclusions from it) will also become a part of your metadata.

Shape

Just as important as electronic or paper format is the organization or shape of your data. Most of the time, R prefers that your data have a single record (typically a line of data values) for each individual observation.
During the first two weeks of the experiment no seeds of psd or uva were taken by predators, so the number of seeds remained at the initial value of 5. Long format takes up more room, especially if you have data such as dist above (the distance of the station from the edge of the forest) that apply to each station independent of sample date or species, and which therefore have to be repeated many times in the data set.
It is possible to read data into R in wide format and then convert it to long format. In the first case (wide to long) we specify that the time variable in the new long-format data set should be date and that columns 4-5 are the variables to collapse. In the second case (long to wide) we specify that date is the variable to expand and that station, dist and species should be kept fixed as the identifiers for an observation. Alternatively, you can manipulate your data in Excel, either with pivot tables or by brute force cutting and pasting.
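The two conversions just described can be sketched with the base reshape function (the station and seed-count data below are made up for illustration):

```r
# Wide format: one row per station, one count column per sampling date
wide <- data.frame(station = 1:2, dist = c(5, 10), species = c("psd", "uva"),
                   seeds.1 = c(5, 5), seeds.2 = c(4, 5))
# Wide to long: collapse columns 4-5 into one 'seeds' column indexed by date
long <- reshape(wide, direction = "long", varying = 4:5, v.names = "seeds",
                timevar = "date", times = 1:2,
                idvar = c("station", "dist", "species"))
# Long to wide: expand 'date' back into separate columns per date
wide2 <- reshape(long, direction = "wide", timevar = "date",
                 idvar = c("station", "dist", "species"), v.names = "seeds")
```

Each long-format row holds one observation (station, distance, species, date, count), which is the shape most R analyses prefer.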
In the long run, learning to reshape data will pay off, but for a single project it may be quicker to use brute force. If there are no complications in your data, you should simply be able to read the file in with a single command such as read.csv. There are several potential complications to reading in files, which are more fully covered in the R supplement: (1) finding your data file on your computer system (i.e., making sure R is looking in the right directory).
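For the first complication, finding the file, a quick check of where R is looking usually resolves the problem (the paths in the comments are placeholders):

```r
getwd()        # the directory R searches by default
list.files()   # is your data file actually in that directory?
# If not, either change R's working directory or give read.csv a full path:
# setwd("~/myproject/data")
# dat <- read.csv("~/myproject/data/mydata.csv")
```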
Computer packages vary in how they deal with data. Some lower-level languages like C are strongly typed: they insist that you specify exactly what type every variable should be, and require you to convert variables between types (say, integer and floating-point) explicitly. Languages or packages like R or Excel are looser: they try to guess what you have in mind and convert (coerce) variables between types automatically as appropriate.
R makes similar guesses as it reads in your data. By default, if every entry in a column is a valid number, R recognizes the column as numeric. Otherwise, it makes it a factor: an indexed list of values used to represent categorical variables, which I will describe in more detail shortly. Thus, any error in a numeric variable (an extra decimal point, an included letter, etc.) will lead R to classify that column as a factor rather than as numeric.
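A small sketch of this pitfall (note that since R 4.0, read.csv leaves text columns as character unless stringsAsFactors = TRUE; either way, the numeric interpretation is lost):

```r
# Column "b" contains a stray extra decimal point in one entry:
txt <- "a,b\n1,2.5\n2,3..7\n3,4.1\n"
d <- read.csv(text = txt, stringsAsFactors = TRUE)
class(d$a)   # numeric, as expected
class(d$b)   # factor: the typo "3..7" poisoned the whole column
# Recovering the numbers yields NA (with a warning) for the bad entry:
as.numeric(as.character(d$b))
```

Checking the class of every column immediately after reading a file catches this kind of error early.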
R also has a detailed set of rules for dealing with missing values (internally represented as NA, for Not Available). At the most basic level, R organizes data into vectors of one of these types, which are just ordered sets of data. Also, R can often do the right things with your data automatically if it knows what types they are.
If you want to analyze variation in population density among sites designated with integer codes, you must first convert the site codes to a factor so that R treats them as categories rather than as numbers. R likewise treats dates as a distinct type: for example, R can automatically plot date axes with appropriate labels. To repeat, data types are a form of metadata; the more information about the meaning of your data that you can retain in your analysis, the better. A data frame is a table of data that combines vectors (columns) of different types (e.g. character, factor, and numeric). Data frames are a hybrid of two simpler data structures: lists, which can mix arbitrary types of data but have no other structure, and matrices, which have rows and columns but usually contain only one data type (typically numeric).
You can also treat the data frame as a matrix and use square brackets to extract (e.g.) rows, columns, or individual elements. There are a few operations, such as transposing or calculating a variance-covariance matrix, that you can only do with a matrix (not with a data frame); R will usually convert (coerce) the data frame to a matrix automatically when it makes sense to, but you may sometimes have to use as.matrix explicitly. For a numeric variable, summary will list the minimum, first quartile, median, mean, third quartile, and maximum. For a factor it will list the numbers of observations in each of the first six factor levels, then the number of remaining observations.
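A brief sketch of the data frame / matrix distinction (toy data):

```r
d <- data.frame(x = c(1, 2, 3), y = c(4, 6, 9))
d$x                  # extract a column by name
d[2, "y"]            # matrix-style indexing works on data frames too
m <- as.matrix(d)    # explicit coercion to a numeric matrix
t(m)                 # transposing requires a matrix
var(d)               # variance-covariance: var() coerces the frame for you
```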
Use table on a factor to see the numbers of observations at all levels. It will list the number of NAs for all types. If x is a data frame, either colnames(x) or names(x) will tell you the column names. If x is a matrix, you must use colnames(x) to get the column names and x[,"a"] to retrieve a column (the other commands will give errors).
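These checks can be sketched on a toy data frame:

```r
d <- data.frame(seeds = c(5, 4, NA, 2),
                species = factor(c("psd", "uva", "psd", "psd")))
summary(d)        # quartiles for numeric columns, level counts for factors,
                  # and the number of NAs in each
table(d$species)  # counts of observations at every factor level
names(d)          # column names of the data frame
colnames(as.matrix(d))  # for a matrix you must use colnames
```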
Use is.na to locate missing values. Is there the right number of observations in each level for factors? Are the minimum and maximum values about what you expected? If not (especially if you have extra mostly-NA columns), you may want to go back a few steps and look at the file again, e.g. with count.fields, to see whether R is reading the number of columns you expect.

Datasets for ecological forecasting may include, but are not limited to, inventory data, laboratory measurements, FLUXNET databases, or data from long-term ecological networks (Baldocchi et al.).
Such data contain information related to environmental forcing (e.g., meteorological drivers) as well as the ecosystem states to be modeled. Datasets in EcoPAD v1.0 are first described and stored with appropriate metadata via either manual operation or scheduled automation from sensors. Each project has a separate folder where data are stored. Data are generally separated into two categories.
One is used as boundary conditions for modeling, and the other comprises observations that are used for data assimilation. Scheduled sensor data are appended to existing data files with a prescribed frequency. Attention is then given to how the particular dataset varies over space (x, y) and time (t). Once the spatiotemporal variability is understood, it is recorded in metadata records that allow queries through the scientific workflow.
Linkages among the workflow, data assimilation system, and ecological model are based on messaging. For example, the data assimilation system generates parameters that are passed to the ecological models, and the state variables simulated by the models are passed back to the data assimilation system. Models may have different formulations. The common practice uses observations to develop or calibrate a model that then makes predictions; in EcoPAD v1.0, by contrast, data and model are iteratively integrated through its data assimilation systems to improve forecasting.
Its near-real-time forecasting results are shared among research groups through its web interface to guide new data collection. The scientific workflow enables web-based data transfer from sensors, model simulation, data assimilation, forecasting, result analysis, visualization, and reporting, encouraging broad user-model interaction, especially for experimenters and the general public with a limited background in modeling.
The original TECO model has four major submodules (canopy, soil water, vegetation dynamics, and soil carbon and nitrogen) and is further extended to incorporate methane biogeochemistry and snow dynamics (Huang et al.). Leaf photosynthesis and stomatal conductance are based on the common scheme from Farquhar et al.
Transpiration and associated latent heat losses are controlled by stomatal conductance, soil water content, and the rooting profile. Evaporative water losses are balanced between the soil water supply and the atmospheric demand, based on the difference between the saturation vapor pressure and the actual atmospheric vapor pressure. Soil moisture in different soil layers is regulated by water influxes (e.g., precipitation) and effluxes (e.g., transpiration and runoff). The vegetation dynamics module tracks processes such as growth, allocation, and phenology. The soil carbon and nitrogen module tracks carbon and nitrogen through processes such as litterfall, soil organic matter (SOM) decomposition, and mineralization.
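A minimal sketch of the first-order pool dynamics such a soil carbon module uses; the pool sizes, turnover times, and environmental scalars below are hypothetical, not TECO's values:

```r
# Each pool loses carbon at a base rate k (the inverse of its turnover time),
# modified by temperature (fT) and moisture (fW) scalars, Century-style:
som_step <- function(C, k, fT, fW, dt = 1) {
  C - k * fT * fW * C * dt        # one explicit Euler step, dt in days
}
C0 <- c(fast = 200, slow = 2000, passive = 8000)            # g C m^-2
k  <- 1 / (c(fast = 0.5, slow = 25, passive = 1000) * 365)  # day^-1
C1 <- som_step(C0, k, fT = 1.2, fW = 0.8)                   # one warm, dryish day
```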
SOM decomposition modeling follows the general form of the Century model (Parton et al.): SOM is divided into pools with different turnover times (the inverse of decomposition rates), which are modified by environmental factors such as soil temperature and moisture. Data assimilation is growing in importance because process-based ecological models, although they greatly simplify real systems, still need to be complex enough to address sophisticated ecological issues, which involve an enormous number of interacting biotic and abiotic factors. Data assimilation techniques provide a framework for combining models with data to estimate model parameters (Shi et al.).
Under the Bayesian paradigm, data assimilation techniques treat the model structure and the initial and parameter values as priors that represent our current understanding of the system. As new information from observations becomes available, model parameters and state variables can be updated accordingly. The posterior distributions of estimated parameters or state variables carry information from both the model and the data, as the chosen parameters act to reduce mismatches between observations and model simulations.
Future predictions benefit from such constrained posterior distributions through forward modeling (Fig. S1 in the Supplement). As a result, the probability density function of future states predicted with data assimilation normally has a narrower spread than that predicted without data assimilation, everything else being equal (Niu et al.). MCMC is a class of sampling algorithms that draws samples from a probability distribution by constructing Markov chains whose equilibrium distribution approximates the target distribution.
The Bayesian-based MCMC method takes into account various uncertainty sources that are crucial in interpreting and delivering forecasting results (Clark et al.). In the application of MCMC, the posterior distribution of a parameter given the observations is proportional to the product of the prior distribution of that parameter and the likelihood function, which measures the fit (or cost) between model simulations and observations. For simplicity, we assume uniform distributions for the priors and Gaussian or multivariate Gaussian distributions for the observational errors; these can be expanded to other distributional forms depending on the available information.
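The scheme just described can be illustrated with a minimal Metropolis sampler; the toy exponential-decay model and all numbers below are hypothetical, not EcoPAD's code:

```r
# Toy example: estimate decay rate k in y = exp(-k t) from noisy data,
# using a uniform prior on [0, 2] and a Gaussian likelihood.
set.seed(1)
t_obs <- 1:10
y_obs <- exp(-0.3 * t_obs) + rnorm(10, sd = 0.05)   # synthetic observations
log_post <- function(k) {                           # log prior + log likelihood
  if (k < 0 || k > 2) return(-Inf)                  # outside the uniform prior
  sum(dnorm(y_obs, mean = exp(-k * t_obs), sd = 0.05, log = TRUE))
}
n_iter <- 5000
chain <- numeric(n_iter)
chain[1] <- 1                                       # arbitrary starting value
for (i in 2:n_iter) {
  prop <- chain[i - 1] + rnorm(1, sd = 0.05)        # random-walk proposal
  accept <- log(runif(1)) < log_post(prop) - log_post(chain[i - 1])
  chain[i] <- if (accept) prop else chain[i - 1]
}
mean(chain[-(1:1000)])                              # posterior mean, near 0.3
```

Discarding the first 1000 iterations as burn-in, the remaining samples approximate the posterior; its spread narrows as more observations are assimilated.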
A detailed description is available in Xu et al. Workflow is a relatively new concept in the ecology literature but is essential to realize real- or near-real-time forecasting; thus, we describe it in detail below. The essential components of the scientific workflow of EcoPAD v1.0 are described in the following paragraphs. Datasets can be placed in and queried from EcoPAD v1.0 through its workflow system. Calls for good management of current large and heterogeneous ecological datasets are common (Vitolo et al.).
Kepler (Ludascher et al.) is one example of such a workflow system. Similarly to these systems, EcoPAD v1.0 manages its datasets through a metadata catalog built on MongoDB. Through MongoDB, measured datasets can easily be fed into ecological models for various purposes, such as to initialize the model, calibrate model parameters, evaluate model structure, and drive model forecasts. For datasets from real-time ecological sensors that are constantly updating, EcoPAD v1.0 appends new records to the existing data files with a prescribed frequency.
Once a user makes a request, such as by clicking a button in a web browser, the request is passed through the RESTful API to trigger specific tasks. Hence, a user can incorporate summary data from EcoPAD v1.0 into external analyses through the same API. Simplicity, ease of use, and interoperability are among the main advantages of this API, which enables web-based modeling. The workflow wraps ecological models and data assimilation algorithms with the docker containerization platform. Tasks are managed through the asynchronous task queue, Celery. Tasks can be executed concurrently on one or more worker servers across different scalable IT infrastructures.
The task queue (i.e., Celery) communicates through messages, and EcoPAD v1.0 uses RabbitMQ as its message broker. These messages may trigger different tasks, which include but are not limited to pulling data from a remote server where original measurements are located, accessing data through a metadata catalog, running model simulations with user-specified parameters, conducting data assimilation that recursively updates model parameters, forecasting future ecosystem status, and post-processing model results for visualization.
The broker inside Celery receives task messages and hands out tasks to available Celery workers that perform the actual tasks (Fig.). Celery workers are in charge of receiving messages from the broker, executing tasks, and returning task results. A worker can be a local or remote computation resource (e.g., a local server or a remote computing cluster). Each worker can perform different tasks depending on the tools installed in it.
One task can also be distributed to different workers. In this way, the EcoPAD v1.0 workflow can scale with the available computational resources. Another key feature that makes EcoPAD v1.0 portable is docker containerization. Docker can run many applications that rely on different libraries and environments on a single kernel with its lightweight containerization. Each docker container embeds the ecosystem model into a complete file system that contains everything needed to run it: the source code, model input, run time, system tools, and libraries.
Docker containers are both hardware-agnostic and platform-agnostic, and they are not confined to a particular language, framework, or packaging system. Docker containers can be run from a laptop, workstation, virtual machine, or any cloud compute instance. This supports the wide variety of ecological models written in various languages.
In addition to wrapping the ecosystem model into a docker container, the software applied in the workflow, such as Celery, RabbitMQ, and MongoDB, is also encapsulated in lightweight, portable docker containers. Therefore, EcoPAD v1.0 is straightforward to deploy and migrate. Upon the completion of a model task, the model wrapper code calls a post-processing callback function. This callback function allows model-specific data requirements to be added to the model result repository. Each task is associated with a unique task ID, and model results are stored within the local repository, where they can be queried by that task ID.
Researchers are authorized to review and download model results and the parameters submitted for each model run through a web-accessible URL link. All current and historical model inputs and outputs are available to download, including the aggregated results produced for graphical web applications. Such structured result storage and access make sharing, tracking, and referring to modeling studies instantaneous and clear.
SPRUCE is an ongoing project that focuses on long-term responses of a northern peatland to climate warming and increased atmospheric CO2 concentration (Hanson et al.). At SPRUCE, ecologists measure various aspects of the responses of organisms (from microbes to trees) and ecological functions (carbon, nutrient, and water cycles) to a warming climate. Together with elevated atmospheric CO2 treatments, SPRUCE provides a platform for exploring mechanisms controlling the vulnerability of organisms, biogeochemical processes, and ecosystems in response to future novel climatic conditions.
The SPRUCE peatland is especially sensitive to future climate change and, because it stores a large amount of soil organic carbon, also plays an important role in feedbacks to future climate through greenhouse gas emissions. The studied peatland also has an understory that includes ericaceous and woody shrubs.
There are also a limited number of herbaceous species. SPRUCE datasets come from multiple sources, including half-hourly automated sensor records, species surveys, laboratory measurements, and laser-scanning images. The involvement of both modeling and experimental studies in the SPRUCE project creates the opportunity for data-model communication. From the web portal, users can check our current near- and long-term forecasting results; conduct model simulation, data assimilation, and forecasting runs; and analyze and visualize model results.
Detailed information about the interactive web portal is provided in the Supplement. We set up the system to automatically pull new data streams every Sunday from the SPRUCE FTP site that holds observational data and update the forecasting results based on new data streams.
At the same time, these results are sent back to SPRUCE communities and displayed together with near-term observations for experimenter reference. The initial methane model, constrained by static-chamber methane measurements, was used to predict the relative contributions of the three methane emission pathways (i.e., ebullition, diffusion, and plant-mediated transport). After extensive discussion, the model structure was adjusted and field observations were reevaluated.
A second round of forecasting yielded more reliable predictions. EcoPAD-SPRUCE provides a platform to stimulate interactive communication between modelers and experimenters through the loop of prediction-question-discussion-adjustment-prediction (Fig.). We illustrate how this cycle and the resulting modeler-experimenter communication improve ecological predictions through one episode during the study of the relative contribution of different pathways to methane emissions.
An initial methane model was built upon information from previous studies. The model was used to predict the relative contributions of different pathways to overall methane emissions under different warming treatments, after being constrained by measured surface methane fluxes. Initial forecasting results, which indicated a strong contribution from ebullition under high warming treatments, were sent back to the SPRUCE group.
Experimenters doubted such a high contribution from the ebullition pathway, and a discussion was stimulated. It is difficult to accurately distinguish the three pathways from field measurements. Field experimenters proposed potential avenues for extracting measurement information related to these pathways, while modelers examined model structure and parameters that may not be well constrained by the available field information. After extensive discussion, several adjustments were adopted as a first step forward. For example, the three-porosity model that had been used to simulate the diffusion process was replaced by the Millington-Quirk model to more realistically represent methane diffusion in peat soil; the measured static-chamber methane fluxes were also questioned and scrutinized more carefully, clarifying that they did not capture episodic ebullition events.
Measurements such as pore water gas data may provide additional inference about ebullition. The updated forecasting is more reasonable than the initial results, although more studies are needed to ultimately quantify methane fluxes from the different pathways. Initial results indicated a reduction in both the CH4:CO2 ratio and the temperature sensitivity of methane production, based on their posterior distributions (Fig.). The mean CH4:CO2 ratio decreased under warming. Such shifts quantify the potential acclimation of methane production to warming, and future climate warming is likely to have a smaller impact on emissions than most current predictions that do not take acclimation into account.
Despite the fact that these results are preliminary, as more relevant datasets are collected from the ongoing warming manipulations and measurements, assimilating observations through EcoPAD v1.0 will further constrain these estimates. Similar data-model integration has been applied to warming experiments (Melillo et al.; Shi et al.), revealing a reduction in the allocation of GPP to shoots and in the turnover rates of the shoot and root carbon pools, and an increase in litter and fast carbon turnover, in response to warming treatments.
Uncertainties in ecological studies can come from the observations (including the forcing that drives the model), from the different model structures used to represent the real world, and from the specified model parameters (Luo et al.). Previous studies (e.g., Keenan et al.; Ahlstrom et al.) tended to focus on one uncertainty source instead of disentangling the contributions of different sources. By focusing on multiple sources of uncertainty instead of one, ecologists can allocate resources to the areas that cause relatively high uncertainty.
The attribution of uncertainties in EcoPAD v1.0 has been explored by Jiang et al. Combined with the stochastically generated climate forcing (e.g., precipitation and air temperature), their analyses indicate that external forcing contributes substantially to forecast uncertainty. Therefore, more effort is required to improve forcing measurements for studies that focus on carbon fluxes. Carbon cycling studies can also benefit from EcoPAD v1.0 forecasts of the soil physical environment.
The soil environmental condition is an important regulator of belowground biological activities and also feeds back to aboveground vegetation growth. Biophysical variables, such as soil temperature, soil moisture, ice content, and snow depth, are key predictors of ecosystem dynamics. A soil thermal forecasting study at SPRUCE emphasized the importance of accurate climate forcing in providing robust thermal forecasts.
In addition, Huang et al. found that soil temperature responded more strongly to air warming during summer than during winter, and that soil temperature increased more in shallow soil layers than in deep soils in summer. Therefore, extrapolating manipulative experiments based on air warming alone may not reflect the real temperature sensitivity of SOM if soil temperature is not monitored. As a robust quantification of environmental conditions is a first step towards a better understanding of ecological processes, improvements in soil thermal prediction through EcoPAD v1.0 benefit a broad range of ecological studies.
Because the model and external forcing are constantly adjusted according to observations, and the model parameters, model structure, external forcing, and forecasting results are archived weekly, the contributions of model and data updates can be tracked by comparing forecasted and observed values (Fig.). We use stochastically generated forcing to represent future meteorological conditions.
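The idea of stochastically generated forcing can be sketched with a simple first-order autoregression in place of the actual vector autoregression (all parameters here are hypothetical):

```r
# One synthetic year of daily air temperature around a long-term mean of 10 C:
set.seed(42)
n_days <- 365
temp <- numeric(n_days)
temp[1] <- 10                     # start at the long-term mean
for (d in 2:n_days) {
  # tomorrow = mean + persistence * today's anomaly + random weather shock
  temp[d] <- 10 + 0.8 * (temp[d - 1] - 10) + rnorm(1, sd = 2)
}
summary(temp)
```

Repeating such draws yields an ensemble of plausible future forcings, so forecasts carry forcing uncertainty rather than a single assumed weather trajectory.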
Future precipitation and air temperature were generated by vector autoregression using a historical dataset monitored by the weather station. Photosynthetically active radiation (PAR), relative humidity, and wind speed were randomly sampled from the joint frequency distribution at a given hour each month. Detailed information on weather forcing is available in Jiang et al. For demonstration purposes, an example is shown in Fig.

However, ecological systems are inherently complex. Statistical models coupled with empirical data and simulations provide a means of exploring the complexity of ecological systems to better inform environmental decisions.
This class will introduce students to a variety of ecological models while instilling an appreciation for the types of uncertainty that can shroud models, so that students can better understand the inferences made from them. To see the full listing of biology courses being offered, visit the department page. Visit the Cool Classes homepage to see more of these stories from across a range of programs and disciplines at Bryn Mawr.