Sources of Data

One of the most common excuses given for not successfully completing a simulation study is the lack of "real-world" data with which to model the input processes. While real data is valuable in establishing the credibility of a simulation, lack of data is not a good excuse for not proceeding with the study. You should be trying a wide range of reasonable input processes to assess system performance sensitivity to changes in the environment. Furthermore, there are many sources of data that should not be overlooked when planning and conducting a simulation study. Each source of data has different associated costs, risks, and benefits.

At the least detailed level, there are the physical constraints of the system being modeled, such as space limitations for a waiting line. This is reliable, low-cost information that gives design insight: it tells a great deal about the rationale for the way a system was designed. Unfortunately, it is static information and is of little help in modeling the system dynamics.

At the next level of detail, there are the subjective opinions of persons involved with the system. This is low-cost (but unreliable), static data that provides behavioral insights about the people involved with designing, operating, and managing the system.

Increasingly detailed information can be obtained from aggregate reports on system operations, such as labor, production, and scrap reports. This is low-cost, verifiable, static data that can provide performance insight.

Information that is useful in modeling the dynamics of the system can sometimes be obtained from artificial data produced with the classical Industrial Engineering MTM (Methods-Time Measurement) methodology. This approach uses detailed motion analysis to give standard times (with allowances for fatigue) for performing different manual operations in a system. The cost of this information is moderate, and you do not need access to the actual system to obtain it. However, its validity depends on the skill and experience of the person doing the analysis. This data provides policy insights into how the system managers and designers intend people to perform their jobs.

Finally, the most expensive source of data is direct observation. Direct observations of a system's operations might be collected manually with time studies or mechanically with sensors. The validity of this data depends on the skill of the observer and the relationship between the observer and the person being observed. This data provides the operational insights needed to accurately model system dynamics.

Alternative sources of information are sometimes overlooked. For example, part routing sheets can be used to verify traced job flows in a factory. Production records might be used to augment and validate data on the reliability of machines. Knowing the number of machine cycles in a particular time period from production records, along with the total number of failures from maintenance records, permits you to estimate the probability that a machine will fail on a given operation cycle. It is unlikely that these failures are independent; however, you at least have a starting place for your sensitivity analysis and a potential consistency check for verifying more detailed machine failure testing data.
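As a rough sketch of the estimate described above, the following Python divides the failure count by the cycle count. The function names and the record counts are hypothetical, and the mean-cycles figure relies on the independence assumption that is unlikely to hold in practice, so treat the result as a starting point for sensitivity analysis rather than a validated reliability figure.

```python
import math


def failure_probability_per_cycle(total_cycles: int, total_failures: int) -> float:
    """Point estimate of the chance that a machine fails on any one cycle."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    if not 0 <= total_failures <= total_cycles:
        raise ValueError("total_failures must lie between 0 and total_cycles")
    return total_failures / total_cycles


def mean_cycles_between_failures(p_fail: float) -> float:
    """Mean of a geometric distribution; valid only if cycles fail independently."""
    return math.inf if p_fail == 0 else 1.0 / p_fail


# Hypothetical counts: cycles from production records, failures from maintenance records.
p_fail = failure_probability_per_cycle(total_cycles=120_000, total_failures=24)
print(f"estimated failure probability per cycle: {p_fail:.6f}")
print(f"implied mean cycles between failures: {mean_cycles_between_failures(p_fail):.0f}")
```

The same estimate can serve as the consistency check mentioned above: if detailed failure testing implies a per-cycle probability far from this ratio, one of the two data sources deserves a closer look.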

When deciding what types of data and how much data are needed for a simulation, sensitivity analysis is of great value. Change the value of an input parameter to your simulation. If the measured system output does not change significantly, you do not need a better estimate of that parameter. On the other hand, if the output is highly sensitive to variations in a particular input parameter, you should devote some effort to estimating the true value or range of that parameter. Sensitivity is only one of two key factors in determining the need for more information about an input process. The other is the degree of control that you have over the process. If the process is essentially beyond your control, detailed data collection on its current behavior is probably not worthwhile. For example, customer arrivals at your carwash are not under your direct control, so knowing a great deal about the current demand is not critical. Ideally, the demand for your carwash will increase dramatically from its present level once the improvements from your simulation study are implemented. You should therefore run any prospective design against both high and low demand.
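One way to picture this kind of sensitivity check is the minimal Python sketch below. It assumes a single-bay carwash modeled as a single-server queue with exponential interarrival and service times; the model, the function names, and every rate are illustrative assumptions rather than anything prescribed here. Each run reuses the same random seed so that differences in output come from the changed input and not from sampling noise.

```python
import random


def mean_wait(arrival_rate: float, service_rate: float,
              n_customers: int = 20_000, seed: int = 1) -> float:
    """Average wait in queue for a single-server FIFO queue (Lindley recursion)."""
    rng = random.Random(seed)  # same seed for every run: common random numbers
    wait = 0.0
    total_wait = 0.0
    for _ in range(n_customers):
        total_wait += wait
        service = rng.expovariate(service_rate)
        interarrival = rng.expovariate(arrival_rate)
        # The next customer waits for whatever work remains when they arrive.
        wait = max(0.0, wait + service - interarrival)
    return total_wait / n_customers


def relative_change(low_output: float, high_output: float, base_output: float) -> float:
    """Output swing across the tested input range, as a fraction of the base output."""
    return (high_output - low_output) / base_output


# Hypothetical demand levels (cars per minute) against one design (1.0 washes per minute).
low, base, high = (mean_wait(rate, service_rate=1.0) for rate in (0.5, 0.7, 0.9))
print(f"mean wait at low/base/high demand: {low:.2f} / {base:.2f} / {high:.2f} minutes")
print(f"relative swing in mean wait: {relative_change(low, high, base):.2f}")
```

A large swing suggests the parameter deserves a better estimate, or, for an input such as demand that you do not control, that each candidate design should be evaluated at both ends of the range.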

Finally, one needs to be alert for communication problems when collecting data. You might think that the data is about one thing when it is really about another. Or the data might have been translated, scaled, or simply recorded incorrectly. Fortunately, these are not fatal problems in a carefully conducted simulation study, because the input data will be varied during the sensitivity analysis experiments anyway.

To help keep the relative importance of real-world data in perspective, it may be useful to remember the following Five Dastardly D's of Data. Data can be:

1. Distorted: The values of some observations may be changed or not consistently defined. For
   example, travel times may include loading and unloading times, which would tend to overestimate
   the value of a faster vehicle, or product demand data may include only backlogged orders,
   ignoring customers who refused to wait.
2. Dated: The data may be relevant to a system that has been or will be changed. Perhaps factory
   data was collected for an older process or using last year's product mix.
3. Deleted: Observations may be missing from the data set. This might be because the data was
   collected over an interval of time, and events such as machine failures simply did not occur
   during the study period. Medical trial data might be censored by patients dropping out of the
   study for various and unknown reasons.
4. Dependent: Data may be summarized (e.g., only daily averages are reported). This may remove
   critical cycles or other trends in a data sequence or hide relationships between different
   sequences. For example, data from a surgical unit might give very accurate estimates of the
   distributions of preparation, operation, and recovery times. However, it may fail to capture the
   fact that some procedures will tend to have large values for all three times while other
   procedures may tend to have small values for all three.
5. Deceptive: Any of the first four data problems might be intentional.
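The surgical-unit example under "Dependent" can be illustrated with a small Python sketch. The records below are invented purely for illustration, and the two sampling functions are hypothetical names. One resamples whole procedures, which preserves the link between the three times. The other draws each time separately from its own marginal distribution, which is all that separately summarized data would support.

```python
import random
import statistics

# Hypothetical (preparation, operation, recovery) times in minutes, one tuple per procedure.
RECORDS = [
    (10, 30, 20), (12, 35, 25), (11, 32, 22),       # minor procedures: all three short
    (30, 120, 90), (35, 140, 100), (33, 130, 95),   # major procedures: all three long
]


def sample_joint(rng: random.Random) -> tuple:
    """Resample a whole record, keeping the dependence among the three times."""
    return rng.choice(RECORDS)


def sample_independent(rng: random.Random) -> tuple:
    """Draw each time from its own marginal, ignoring which procedure it came from."""
    return tuple(rng.choice(RECORDS)[i] for i in range(3))


def spread_of_totals(sampler, n: int = 5_000, seed: int = 0) -> float:
    """Sample standard deviation of total time per patient under a given sampler."""
    rng = random.Random(seed)
    return statistics.stdev(sum(sampler(rng)) for _ in range(n))


print(f"std. dev. of total time, joint sampling:       {spread_of_totals(sample_joint):.1f}")
print(f"std. dev. of total time, independent sampling: {spread_of_totals(sample_independent):.1f}")
```

Both samplers reproduce the same three marginal distributions, yet with these made-up records the independent version understates the variability of total time per patient, which is the kind of relationship that summarized data can hide.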

