# Modeling Input Processes

Discrete event simulations are typically both dynamic (they change over time) and stochastic (they are random). So far, we have concentrated on learning to model the dynamics of a system. In this section, we look at some of the issues and techniques for modeling randomness. Random numbers, trace-driven simulations, parametric input distributions, and empirical input distributions are discussed. The exponential autoregressive process is used to illustrate dependence in the input process. Various sources of data with which to model input processes are presented. A discussion on reusing random number seeds to reduce variance in model output is also included. The section closes with some techniques for generating random variates, including the generation of non-homogeneous Poisson processes. These processes are useful models for generating exogenous events to drive a simulation model.

## Randomness

Although there is much literature in this area, here we will be brief. This is partly due to the fact that simulation modelers rarely write code for this part of their programs. Algorithms for imitating random sampling are well developed, and reliable codes are readily available. Furthermore, while the dynamics of most simulation models are unique, the stochastic logic tends to be the same. Almost all of the simulations we have discussed have had at least one random process as an input to the model. For instance, our carwash model was driven by customers arriving at randomly spaced intervals. We modeled the car arrival process by assuming that the intervals between arrivals were independent and had a particular uniform probability distribution. Realistic situations can be quite a bit more complicated: cars might arrive at a higher rate during rush hour, on weekends, or on days with nice weather. In this section we will review some of the considerations and techniques for generating random input processes to drive a simulation.

Broadly speaking, there are three popular approaches to modeling input processes: using pre-recorded data, using sample probability distributions, and using mathematical probability models. Pre-recorded data is also called a process "trace", a sample probability distribution is also called an "empirical" distribution, and mathematical probability models are sometimes referred to as theoretical or "parametric" models. All but trace-driven simulations require the use of random numbers.

## Trace Driven Simulations

In a trace-driven simulation, whenever a value for a random variable is needed by the simulation, it is read from a data file. When it is practical, this input file contains actual historical records. In our carwash example, the trace might be a file of the intervals between successive car arrivals recorded while watching the system. Sometimes only a portion of the input is trace driven. In a fire department simulation, the times and locations of calls might be read from a file with data from a dispatcher's log book while other inputs, such as equipment repair status and travel times, might be generated as they are needed.
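As a minimal sketch of the idea, assume the trace is a whitespace-separated text file of interarrival times. The in-memory `io.StringIO` stand-in and the names `trace_reader` and `arrival_times` are illustrative only and are not part of SIGMA.

```python
import io

def trace_reader(stream):
    """Yield successive numeric values from a whitespace-separated trace."""
    for line in stream:
        for token in line.split():
            yield float(token)

# Stand-in for a recorded trace; a real study would open the recorded file instead.
trace_file = io.StringIO("0.3 0.42 0.2\n0.54 0.79\n")

clock = 0.0
arrival_times = []
for gap in trace_reader(trace_file):
    clock += gap                 # advance simulated time by the recorded interval
    arrival_times.append(clock)  # the next car arrives at this time
```

The run ends when the trace is exhausted, which is one reason trace-driven runs are hard to replicate or extend.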

Further information: Trace Driven Simulations

## Random Number Generators

At the heart of stochastic modeling are random numbers. We define random numbers as positive fractions whose values are assumed to be independent of each other and equally likely to occur anywhere between zero and one. Random number generators are algorithms that imitate the sampling of random numbers. In SIGMA, RND has its values generated by such an algorithm. The algorithm simply takes an integer that you supply as the "seed" and recursively multiplies it by a fixed constant, divides by another constant, and uses the remainder as the next "seed." This remainder is also scaled to lie strictly between zero and one and used as the current value for RND. The random number generator we are using originated with Lewis, Goodman, and Miller (1969).
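The recursion described above can be sketched in a few lines. This sketch assumes the multiplier 16807 and modulus 2^31 - 1 that are commonly associated with the Lewis, Goodman, and Miller generator; the constants actually used inside SIGMA are not stated in this section, and the names `next_seed` and `rnd` are hypothetical.

```python
MULTIPLIER = 16807        # 7**5; assumed constant, not stated in the text
MODULUS = 2**31 - 1       # 2147483647; assumed constant

def next_seed(seed):
    """One step of the recursion: multiply, divide, keep the remainder."""
    return (MULTIPLIER * seed) % MODULUS

def rnd(seed):
    """Return (new_seed, value), with the value strictly between 0 and 1."""
    seed = next_seed(seed)
    return seed, seed / MODULUS

seed = 12345              # any integer from 1 to MODULUS - 1
values = []
for _ in range(5):
    seed, u = rnd(seed)
    values.append(u)
```

Because the remainder is never 0 for a seed in the valid range, the scaled value never reaches 0 or 1.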

We will forgo a discussion of the philosophy of random number generation. Suffice it to say that, by most common notions of what we mean by randomness, it is impossible to "generate" random numbers. Indeed, there have been some widely used algorithms that generate numbers that look far from random. Perhaps with the exception of the seed you give it, there is nothing whatsoever truly random in the values of RND or the outputs from any other random number generator; they just "look" random if you are not overly particular. Any random number generator that has passed all statistical tests for randomness simply has not been tested enough. Nevertheless, modeling the output from many random number generators as being true random numbers has been amazingly successful.

It is very easy to modify the SIGMA-generated source code in C to include multiple random number input streams. This is discussed when we introduce correlation induction techniques used in Variance Reduction.

## Using Empirical Input Distributions

With this approach to input modeling, a sample of observations of an input variable is used to estimate an empirical probability distribution for the population from which the sample was taken. The customary estimate of the empirical probability distribution is to assign equal probability to each of the observed values in the sample. If the sample contains N observations, then the empirical probability distribution will assign a weight of 1/N to each of these observations.

Suppose that a trace of interarrival times of customers to our carwash is in the data file ARRIVALS.DAT. Sampling from the empirical distribution is equivalent to reading a randomly chosen value from this file. This is like shuffling and drawing from a data trace. In SIGMA this is done by making the index of the DISK function a random integer from 1 to N. For example, the file, ARRIVALS.DAT, may look like the following:

```
.3   .42   .2   .54   .79
```

A delay time of

```
DISK{ARRIVALS.DAT;1+5*RND}
```

will result in one of these five numbers (chosen at random) being used. Wrapping around the data file will not occur here since the index is never greater than 5. A considerably more efficient but perhaps less flexible approach is to place the data in an array and generate the index of the array uniformly. If the data is in the array, X, then

```
X[1+5*RND]
```

would select one of these values with equal probability. In SIGMA the index is automatically rounded down to the nearest integer.
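The same index calculation can be sketched outside SIGMA. The function name `sample_empirical` is hypothetical, and the explicit `math.floor` stands in for SIGMA's automatic rounding down of the index.

```python
import math
import random

observations = [0.3, 0.42, 0.2, 0.54, 0.79]   # the five values shown above

def sample_empirical(data, u):
    """Pick one observation; u is a random number in [0, 1)."""
    index = math.floor(1 + len(data) * u)      # 1-based index, rounded down
    return data[index - 1]                     # Python lists are 0-based

rng = random.Random(2024)                      # arbitrary seed
draws = [sample_empirical(observations, rng.random()) for _ in range(10000)]
frequencies = {x: draws.count(x) / len(draws) for x in observations}
```

Each observed value should turn up in roughly one fifth of the draws, matching the weight of 1/N assigned by the empirical distribution.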

Similar to using historical data traces as input, the big advantage to using sample distributions to generate input is that there is less concern over validity. However, with sample distributions we can replicate and compress time. The disadvantages to using the empirical input distributions are similar to the disadvantages to using trace input: the data might not be valid, sensitivity analysis to changes in the input process is difficult, we cannot generalize the results to other systems, and it is hard to model rare events. The one major advantage trace input has over empirical distribution sampling comes in modeling dependencies in the input processes. The trace will capture these dependencies whereas the empirical distributions will not.

## Using Parametric Input Distributions

Efficient algorithms have been developed for imitating the sampling from a large number of parametric families of probability models. Some SIGMA functions are presented for artificially generating samples that behave very much as though they were actually drawn from specific parametric distributions.

The values of parameters for these models determine the particular characteristics of the sample. This ability to easily change the nature of the input by changing a few parameter values is the primary advantage of using these models to drive a simulation. The variate generation algorithms in common use are fast and require very little memory. Furthermore, you can easily run replications, compress time, and generalize the results to other systems having the same structure. The major drawback to using parametric input distributions is that they can be difficult to explain and justify to people who have no background in probability and statistics.
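To illustrate how a single parameter changes the nature of the input, here is a sketch that samples exponential interarrival times by inverting the distribution function. Inversion is one common method and is not necessarily the algorithm SIGMA uses; the function name `exponential` is hypothetical.

```python
import math
import random

def exponential(mean, u):
    """Inverse-transform sample with the given mean; u is in [0, 1)."""
    return -mean * math.log(1.0 - u)

rng = random.Random(7)   # arbitrary seed
fast = [exponential(0.5, rng.random()) for _ in range(100000)]   # busy period
slow = [exponential(2.0, rng.random()) for _ in range(100000)]   # quiet period
fast_mean = sum(fast) / len(fast)
slow_mean = sum(slow) / len(slow)
```

Changing only the `mean` argument turns a busy arrival stream into a quiet one, which is the kind of sensitivity analysis that is awkward with trace or empirical input.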

Further information: Using Parametric Input Distributions

## Modeling Dependent Input

The book by Johnson (1987) along with the articles by Lewis (1981) and McKenzie (1985) are devoted primarily to the generation of dependent input processes. To illustrate the critical importance of recognizing and modeling dependence in the input processes for a simulation, we will use a simple process called the exponential autoregressive (EAR) process (Lewis, 1981).

Successive values of an EAR process, X, with mean, M, and correlation parameter, R, are generated from the recursion

```
X = R*X+M*ERL{1}*(RND>R)
```

with an initial value of X given by the exponential M*ERL{1}. The values of this process will have an exponential distribution, but they are not independent. The correlation between two values of an EAR process that are K observations apart is R^K (R raised to the power K). At one extreme, when R=1, the Boolean variable (RND>R) is always equal to zero and the above expression reduces to

```
X=X
```

The process never changes value, so the serial correlation is a perfect 1. When R=0, (RND>R) is always equal to 1, and independent (zero correlation) exponential random variables are generated as

```
X=M*ERL{1}
```

The EAR process is easy to use since its serial dependency can be controlled with the single parameter, R. Although histograms of the values of this process look like a simple exponentially distributed sample, the line plots of successive values of an EAR process look rather strange. As is obvious from the EAR process equation, the value of X takes large randomly-spaced jumps and then decreases for a while.
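The recursion can be checked numerically with a short sketch. It assumes that ERL{1} returns an exponential variate with mean 1; the function names `ear_series` and `lag_correlation` are hypothetical.

```python
import random

def ear_series(mean, r, n, rng):
    """Generate n successive values of an EAR process."""
    x = mean * rng.expovariate(1.0)              # initial value, M*ERL{1}
    series = []
    for _ in range(n):
        # Mirrors X = R*X + M*ERL{1}*(RND>R); the Boolean counts as 0 or 1.
        x = r * x + mean * rng.expovariate(1.0) * (rng.random() > r)
        series.append(x)
    return series

def lag_correlation(xs, k):
    """Sample correlation between values that are k observations apart."""
    n = len(xs)
    avg = sum(xs) / n
    var = sum((x - avg) ** 2 for x in xs) / n
    cov = sum((xs[i] - avg) * (xs[i + k] - avg) for i in range(n - k)) / n
    return cov / var

series = ear_series(mean=1.0, r=0.7, n=200000, rng=random.Random(1))
sample_mean = sum(series) / len(series)   # should be near the mean, 1.0
lag1 = lag_correlation(series, 1)         # should be near 0.7
lag2 = lag_correlation(series, 2)         # should be near 0.7**2 = 0.49
```

The estimated correlations should fall off geometrically with the lag, while the sample mean stays close to M.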

To see the effects of dependent input, consider our simple queueing model, CARWASH.MOD, where we change service times. We will use an EAR process, with mean, M, and a correlation parameter of R. This model is called EAR_Q.MOD. If you run EAR_Q.MOD with the same M but very different values of R, you will see a radical difference in the output series. Dependence in the service times has made the two systems behave very differently. When building this model, if we had looked only at the histograms of service times and ignored the serial dependence on service times, we might have had a very poor model.
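The effect can be reproduced outside SIGMA with a rough sketch of a single-server queue. This is not EAR_Q.MOD itself: it assumes exponential interarrival times with mean 1.0, a mean service time of 0.8, and that ERL{1} is an exponential variate with mean 1, and it tracks waiting times with Lindley's recursion. The function name `mean_wait` is hypothetical.

```python
import random

def mean_wait(r, n, seed, arrival_mean=1.0, service_mean=0.8):
    """Average wait in queue over n customers with EAR service times."""
    rng = random.Random(seed)
    service = service_mean * rng.expovariate(1.0)    # initial EAR value
    wait = 0.0
    total = 0.0
    for _ in range(n):
        total += wait
        gap = arrival_mean * rng.expovariate(1.0)    # time until the next arrival
        wait = max(0.0, wait + service - gap)        # Lindley's recursion
        # Next service time from the EAR recursion.
        service = r * service + service_mean * rng.expovariate(1.0) * (rng.random() > r)
    return total / n

independent = mean_wait(r=0.0, n=200000, seed=11)    # uncorrelated service times
dependent = mean_wait(r=0.9, n=200000, seed=11)      # strongly correlated service times
```

Both runs have the same mean service time and the same service-time histogram in the long run, yet the correlated case should show a much longer average wait.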

## Sources of Data

One of the most common excuses given for not successfully completing a simulation study is the lack of "real-world" data with which to model the input processes. While real data is valuable in establishing the credibility of a simulation, lack of data is not a good excuse for not proceeding with the study. You should be trying a wide range of reasonable input processes to assess system performance sensitivity to changes in the environment. Furthermore, there are many sources of data that should not be overlooked when planning and conducting a simulation study. Each source of data has different associated costs, risks, and benefits.

Further information: Sources of Data

## Variance Reduction

It is often possible to obtain significantly better results by using the same random number streams for different simulation runs. For example, you might want to compare the performance of two different systems. When doing so, it is a good general experimental technique to make "paired" runs of each system under the same conditions. In simulations you do this by using the same random number seed in a run of each system. This technique, called using common random numbers, extends to more than two alternative systems: you would re-use the same seed for a run of each system. To replicate, choose another seed and run each system again. This technique reduces the variance of estimated differences between the systems.
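A small experiment makes the benefit visible. The sketch below compares two hypothetical single-server queues that differ only in mean service time (0.80 versus 0.75), once with paired seeds and once with unrelated seeds. The queue model, the parameter values, and the function name `mean_wait` are assumptions made for illustration.

```python
import random
import statistics

def mean_wait(service_mean, seed, n=2000, arrival_mean=1.0):
    """Average wait in a single-server queue (Lindley's recursion)."""
    rng = random.Random(seed)
    wait = total = 0.0
    for _ in range(n):
        total += wait
        gap = arrival_mean * rng.expovariate(1.0)
        service = service_mean * rng.expovariate(1.0)
        wait = max(0.0, wait + service - gap)
    return total / n

seeds = range(1, 61)
# Paired runs: both systems get the same seed, hence the same random numbers.
paired = [mean_wait(0.80, s) - mean_wait(0.75, s) for s in seeds]
# Unpaired runs: the second system gets an unrelated seed.
unpaired = [mean_wait(0.80, s) - mean_wait(0.75, s + 1000) for s in seeds]

var_paired = statistics.variance(paired)
var_unpaired = statistics.variance(unpaired)
```

With common random numbers both systems experience the same busy and quiet spells, so the estimated difference should vary far less from replication to replication.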

Another example where re-using random number seeds can help reduce the variance of the output applies when making two runs of the same system. As before, you use the same seeds for both runs in the pair. However, for the second run, use 1-RND where RND was used before. If RND is a random number, then 1-RND is also. Furthermore, there is a perfect negative correlation between RND and 1-RND. If RND is a small random number, 1-RND will be large; if RND is large, 1-RND will be small. The pair of runs using RND and 1-RND are called antithetic replications. The hope is that the negative correlation between the input streams will carry over to the output. If one run produces an output that is unusually high, its antithetic run will have an output that is unusually low. When the antithetic replicates are averaged, a run with an unusually high result is canceled by its antithetic replicate having an unusually low outcome and vice versa.
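The following sketch pairs each run of a hypothetical single-server queue with its antithetic replicate. It assumes exponential interarrival and service times generated by inversion, so that replacing RND with 1-RND turns long times into short ones; the names `uniform` and `mean_wait` and the parameter values are illustrative.

```python
import math
import random

def uniform(rng, antithetic):
    """Return RND, or 1 - RND for the antithetic run, strictly inside (0, 1)."""
    u = rng.random()
    while u == 0.0:            # avoid log(0) below
        u = rng.random()
    return 1.0 - u if antithetic else u

def mean_wait(seed, antithetic, n=500, arrival_mean=1.0, service_mean=0.5):
    """Average wait in a single-server queue (Lindley's recursion)."""
    rng = random.Random(seed)  # the same seed is used for both runs in a pair
    wait = total = 0.0
    for _ in range(n):
        total += wait
        gap = -arrival_mean * math.log(uniform(rng, antithetic))
        service = -service_mean * math.log(uniform(rng, antithetic))
        wait = max(0.0, wait + service - gap)
    return total / n

pairs = [(mean_wait(s, False), mean_wait(s, True)) for s in range(1, 301)]
mean_a = sum(a for a, _ in pairs) / len(pairs)
mean_b = sum(b for _, b in pairs) / len(pairs)
covariance = sum((a - mean_a) * (b - mean_b) for a, b in pairs) / (len(pairs) - 1)
pair_averages = [(a + b) / 2 for a, b in pairs]
```

A negative covariance between the two runs of a pair is what makes the averaged pair less variable than two independent runs would be.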

Re-using random number seeds falls under the general category of variance reduction techniques. For a discussion of common and antithetic random numbers as well as some other techniques (which tend to be much less successful in practice), see the text by Bratley, Fox, and Schrage (1987).

The chances that the beneficial results of correlated input streams carry over to the output are greater if the runs can be synchronized as much as possible. That is to say, we want any unusual sequence of random numbers in a run also to be used in the same manner in its commonly seeded replicate(s). For example, if one run in a queueing simulation has an unusual sequence of long service times that causes the system to become very congested, we would like its antithetic replicate to have an unusual sequence of short service times that reduces congestion in the system. We would like all systems using common random numbers to have the same experiences. Synchronization of runs is generally improved if we use different random number streams exclusively for different stochastic components of our simulation. For example, in a queue we might use one stream to generate interarrival times and another stream to generate service times. Thus, we will want to use more than one sequence of random numbers in our simulation.

## Using Multiple Random Number Streams (Development licensees only)

Function definitions in C make it very easy to change your SIGMA random number stream to a "vector" of random number streams. We discuss how to generate C simulation programs in Linking. In your SIGMA-generated C code, replace RND with RND(I), where I is an integer indicating which stream you want to use. For example, to draw a random number from stream 3, replace RND with RND(3). Then "vectorize" your library functions by replacing rndsd with rndsd[i] and RND with RND(I) in SIGMALIB.C (for development licensees only). Note that RND is now a function and rndsd becomes an array. If you are using three different random number streams in your model, you would make two changes in the library header file for your C compiler (SIGMALIB.H, SIGMAFNS.H). The two replacement lines would be:

```
long rndsd[3];  /* makes rndsd a vector of three seeds */
#define RND(j) ((float) (rndsd[j] = lcg(rndsd[j]))*4.656612875e-10)
```

As before, you still would have to read in the seeds for each stream when you run your model. You can always substitute the random number generator that comes with your compiler for RND.
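The idea of one seed per stream can be sketched independently of the C library. The class below is a hypothetical stand-in, not SIGMA code: it assumes the generator step is the multiplicative recursion with multiplier 16807 and modulus 2^31 - 1, and it reuses the scaling constant from the macro above.

```python
MULTIPLIER = 16807          # assumed generator constants, not stated in the text
MODULUS = 2**31 - 1

class Streams:
    """A vector of independent seeds, one per random number stream."""

    def __init__(self, seeds):
        self.rndsd = list(seeds)

    def rnd(self, j):
        """Advance stream j only and return its next random number."""
        self.rndsd[j] = (MULTIPLIER * self.rndsd[j]) % MODULUS
        return self.rndsd[j] * 4.656612875e-10

streams = Streams([12345, 67890, 13579])   # arbitrary seeds, one per stream
interarrival_u = streams.rnd(0)            # stream 0 drives arrivals
service_u = streams.rnd(1)                 # stream 1 drives service times
```

Because each call touches only its own seed, drawing extra numbers for arrivals never shifts the sequence used for service times, which is what keeps paired runs synchronized.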

See also: Replacing the Default Random Number Generator with a Multiple Stream Generator

## Methods for Generating Random Variates

Techniques for generating random variates are presented. Included among the techniques is the generation of non-homogeneous Poisson processes.
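As one illustration, a non-homogeneous Poisson process can be generated by thinning: candidate events are produced at a constant maximum rate and each is kept with probability equal to the current rate divided by that maximum. Thinning is a standard method, but this section does not say which method SIGMA uses, and the rate function and names below are hypothetical.

```python
import math
import random

def nhpp_thinning(rate, rate_max, horizon, rng):
    """Event times on [0, horizon] for a Poisson process with rate(t) <= rate_max."""
    t = 0.0
    events = []
    while True:
        t += rng.expovariate(rate_max)          # candidate from a rate_max process
        if t > horizon:
            return events
        if rng.random() <= rate(t) / rate_max:  # keep with probability rate(t)/rate_max
            events.append(t)

def hourly_rate(t):
    """Hypothetical arrival rate (per hour) with a 24-hour cycle."""
    return 2.0 + 1.5 * math.sin(2.0 * math.pi * t / 24.0)

arrivals = nhpp_thinning(hourly_rate, rate_max=3.5, horizon=24.0 * 50,
                         rng=random.Random(3))
```

Over 50 full cycles the sine term averages out, so the number of arrivals should be close to 2.0 × 1200 = 2400.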

Further information: Methods of Generating Random Variates

# Graphical and Statistical Output Analysis

In keeping with our philosophy of utilizing pictures whenever possible, several graphical methods of presenting simulation results have been incorporated into the SIGMA software. The graphical output charts available to you while in SIGMA are line plots, step plots, scatter plots, array plots, histograms, autocorrelations, and standardized time series. This section discusses these graphical output plots and concludes with an explanation of standardized time series.

## Keeping a Perspective

It is easy to become overwhelmed by the information produced by a simulation model. Different types of simulation output are most useful during different stages of a simulation study. Animations, graphs, and statistics all have their appropriate roles to play.

During the initial development and testing of the simulation model, animating the simulation logic while running SIGMA in Single Step or Graphics run mode is the most valuable. The logical animation in SIGMA is different from the physical animation of a system. Physical animations are useful in selling the simulation to prospective users and for catching gross logic errors in the simulation model. Physical animations using SIGMA are discussed in Animations.

The Translate to English feature of SIGMA (found under the File menu) is an extremely effective tool for catching program logic errors or communicating the details of a model to persons not familiar with simulation. If the SIGMA-generated English description of your simulation is nonsense, it is likely that the logic in your model will not make sense either.

When evaluating alternative system designs at a high level, charts of the output are most useful. Here we are comparing the performance of very different systems (e.g., manual operations versus automated ones). Plots of the output not only offer information on overall system performance but also on the dynamics of the system. We can see, for example, if the manner in which we initialized the variables in our simulation has an inappropriate influence on the output. We will say more about the bias caused by initializing variables in Advanced Graphical Analysis.

Once a particular design has been tentatively selected, it is important to do detailed sensitivity analysis and design optimization before a final recommendation is proposed. Here we are going to run a great many replications with different settings of the factors and parameters in our model. It is neither fun nor particularly informative to watch hundreds of different animations or look at hundreds of output plots. In the detailed design phase of a simulation study, numerical summaries of system performance in the form of output statistics are the most appropriate form of simulation output.

Finally, once a design has been finalized, the most effective form of output is a physical animation that lets people more fully understand the changes being suggested. Charts, statistical summaries, and the English description of your model can also be effective in helping you sell your ideas. The above discussion is summarized in the following table.

Typical Phases of a Simulation Project and Predominant Form of Output

```
Phase of the Study               Predominant Form of the Output
Model building and validation    Logical animations and English descriptions
System evaluation                Charts and plots
Detailed design                  Statistics
Implementation                   Physical animations
```

## Elementary Output Charts

The Output Plot dialog box is discussed in detail in Running Models. In SIGMA there are five basic output plots and two plots for advanced analysis. The basic charts are:

1. Step plots, which show the values of traced variables during a simulation run.
2. Line plots, which are similar to step plots except straight lines are drawn between successive data points.
3. Array plots, which show values of each element in an array.
4. Scatter plots, which show the relationship between pairs of traced output variables.
5. Histograms, which count the relative frequencies that different values of a variable occur.

More advanced output analysis is possible using the following charts:

6. Autocorrelation functions, which show dependencies in the output.
7. Standardized time series (STS), which can be used to detect trends.

Further information: Elementary Output Charts

Standardized time series (STS) plots and the autocorrelation function are two important tools for detecting trends and measuring dependencies.

## Using Statistics

With a statistical technique known as "batching", output data can often be made more independent, less erratic, and approximately normally distributed.
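In its simplest form, batching replaces the raw output series with the averages of consecutive, non-overlapping groups of observations. The sketch below shows only that step; the function name `batch_means` and the toy data are illustrative.

```python
def batch_means(series, batch_size):
    """Average consecutive, non-overlapping batches; drop any leftover tail."""
    n_batches = len(series) // batch_size
    return [
        sum(series[i * batch_size:(i + 1) * batch_size]) / batch_size
        for i in range(n_batches)
    ]

output = [1, 2, 3, 4, 5, 6, 7]     # toy output series
means = batch_means(output, 3)     # [2.0, 5.0]; the final 7 is left over
```

Statistics are then computed from the batch means rather than from the individual observations.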

Further information: Using Statistics

## Standardized Time Series

A statistical guide to standardizing a time series.

Further information: Standardized Time Series