We very often use computer-based simulation here at StrategyWise to address questions regarding business and manufacturing processes. Simulation typically needs more expertise than other modeling techniques, but it has a lot to recommend it:

  1. Simulation can often address questions that are mathematically intractable or that would otherwise demand considerable mathematical skill.
  2. It can give insights into problems for which there is little data but where we can make good guesses about how the components of the modeled system behave.
  3. It allows experiments to be conducted quickly, cheaply and safely.
  4. Simulation can be one of the most convincing forms of analysis for the non-analyst, particularly when visualization is involved.
  5. In our experience, simulation is particularly good at explaining complex behaviors and confirming intuitions.

If you’re considering using this technique, please read through the following tips. Doing so might save you from making a few costly mistakes. And we’re fairly sure you’ll see that we really know what we’re talking about when it comes to simulation!

First Consider Each Simulation Process in Isolation

Complex simulations typically involve a lot of parameters. A simulation of a fast-food restaurant, for instance, will have parameters for customer queueing, ordering, food preparation and food presentation. So many parameter combinations are possible that only a small fraction can be simulated. And looking at an entire simulated system at once can be overwhelming. It can be very difficult to see where the bottlenecks are.

This problem can be addressed by simulating component processes in isolation. For example, in a fast-food restaurant simulation, first simulate just the customer queueing process. That should require few enough parameters that it can be understood well. Then simulate the ordering process. Then the food preparation process. And so on. Flood each process with demand in order to see which parameter combinations turn each of these processes into a bottleneck.
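
To make this concrete, here is a minimal sketch (all numbers hypothetical) of flooding one process with unlimited demand to measure its standalone capacity:

```python
import random

# A minimal sketch of studying one process in isolation: flood it with
# unlimited demand and measure its capacity. Service times and server
# counts below are invented for illustration.
def isolated_throughput(service_time_fn, sim_hours=10, servers=2):
    seconds = sim_hours * 3600
    free_at = [0.0] * servers      # when each server next becomes free
    served = 0
    while True:
        i = free_at.index(min(free_at))       # earliest-free server
        if free_at[i] >= seconds:
            break
        free_at[i] += service_time_fn()       # serve the next customer
        served += 1
    return served / sim_hours                 # customers per hour

random.seed(42)
# Hypothetical ordering process: 50-90 seconds per order, two order-takers.
rate = isolated_throughput(lambda: random.uniform(50, 90), servers=2)
print(f"isolated ordering capacity: {rate:.0f} customers/hour")
```

Sweeping the service-time range or server count in a loop then shows which parameter combinations turn this one process into a bottleneck.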

Once you understand the processes in isolation, run the simulation in its entirety and note how the processes interact. You’ll see that bottleneck processes are particularly likely to affect other processes. For example, if food preparation is the bottleneck in a restaurant, expect to see the staff taking customer orders do their bit to help in the kitchen. This interaction between processes will slow the ordering process while speeding up the food-preparation process.

Be careful when your analysis indicates that a simulated process is almost the bottleneck. For example, suppose that a fast-food restaurant simulation gives the following average process throughputs:

Process              Customer Throughput per Hour
Queueing             300
Ordering             100
Paying               110
Food Presentation    350

Clearly ordering, with the lowest throughput (100 customers), is the bottleneck process. Given that, what is your estimate of per-hour restaurant throughput?

Give up? Well, if you guessed 100 customers, the lowest throughput, you’re probably wrong. Note that the paying process, with a 110-customer throughput, is almost the bottleneck. That means that there will likely be periods in any given hour when paying indeed becomes the bottleneck and causes the ordering process to become backed up. This interaction forces average throughput for the entire system to fall below that of the slowest process. So, we estimate per-hour restaurant throughput to be less than 100 customers.

Graph the Simulated Process Over Time

Simulation is complex. To verify that a simulation is behaving correctly, you’ll need a graph that shows how the simulated process changes over time. At the very least, you should graph:

  1. How each agent (customer, car, etc.) in the simulation changes its state over time; and
  2. When each resource (payment terminal, order taker, etc.) is claimed by an agent and when it is then released.

The resulting graphs will likely be essential when discussing your simulation with those who have a deep understanding of whatever process is being simulated. They are also useful for spotting process bottlenecks.

Consider the example below, showing the cooking process in a fast-food restaurant. Bar colors indicate how platters of food change their state: from prepping to cooking and then to cooked. The vertical lines show when each of the platters claims a cooker. Incomprehensible, perhaps, to a CEO, but essential for a process analyst.

Incorporate Animation

A simulation of any complexity should use animation. For example, if you’re writing an elevator simulation, you’ll want to see elevators moving up and down elevator shafts and passengers waiting for elevators. If it’s a phone bank simulation, you should see the call-waiting queue and the state of each operator.

Animations help you verify that the agents in your simulation (e.g., the elevators and the passengers) are behaving as expected. They show why bottlenecks occur. And they sell the results of your simulation.

Let us explain that with an anecdote. Some months ago, we were helping a client convince city council that his proposed parking deck would not cause traffic jams. We came to a council meeting equipped with spreadsheets and slide decks; we explained our calculations and model assumptions. But nothing we said was getting through. And then we played a two-minute animated simulation showing cars exiting the parking deck. It was quite clear that the exiting cars didn’t interfere with road traffic. The objections we were hearing melted away.

That wasn’t a one-off. Time and again, we’ve won over a difficult, often non-technical audience using simulation animations. And each time we do, we hear the same thing: Now I get it.

Consider a Slot-Based Approach

(Pictured: a handheld Donkey Kong game from 1982. It was slot-based, with only 15 slots into which Mario could be placed. Yes, this was a thing.)

Many simulations move their agents (cars, customers, etc.) in a continuous manner. Consider, for example, a car in a simulation of a restaurant drive-through. If movement is continuous, the car smoothly accelerates when the car in front of it starts moving. And in the animation it moves forward in tiny pixel-by-pixel jumps. 

This sort of simulation can be difficult to write. Tricky mathematics may be needed, for instance, for acceleration in an arc. And with each tiny movement of an agent, nearby agents may have to recalculate their actions. For instance, if a car starts to cross lanes in a drive-through, the cars behind it will have to reassess whether to move forward, stop or cross lanes. Collisions must be guarded against. And deadlocks can arise where two cars are stuck waiting for each other to move. It gets complex.

We instead advise having agents move in a discrete manner. In this setup, there are a limited number of slots into which each agent may move—think here of how Mario moves in the ancient handheld Donkey Kong game. This slot-based approach is much easier to code.

Slot-based animations are typically faster than non-slot-based animations. (Although blitting can speed things up for the latter case.) This is in part because the agent sprites can be drawn on the canvas in all their possible positions at the start of the animation; animation then becomes a matter of turning the visibility of these sprites on and off. Another reason slot-based is faster: Agents are not second-guessing their decisions with each tiny movement they make.
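
A minimal sketch of slot-based movement, using a hypothetical drive-through lane represented as a list of slots (names and rules invented for illustration):

```python
# A lane is a list of slots; each slot holds one car or None. Cars may only
# advance into an empty slot, so no collision or acceleration math is needed.
def advance(lane):
    """One tick: the head car (slot 0) exits, then each remaining car moves
    up one slot if the slot ahead is now free (front to back, so gaps
    ripple backward through the queue)."""
    lane = lane[:]          # work on a copy
    lane[0] = None          # car at the head of the lane exits
    for i in range(1, len(lane)):
        if lane[i] is not None and lane[i - 1] is None:
            lane[i - 1] = lane[i]
            lane[i] = None
    return lane

lane = ["car_a", None, "car_b", "car_c", None]
print(advance(lane))   # → [None, 'car_b', 'car_c', None, None]
```

Each agent's whole decision space collapses to "is the next slot free?", which is exactly why the slot-based approach is so much easier to code.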

Generate Videos Rather Than On-The-Fly Animations

So, you’ve decided to incorporate animation in your simulation. But how will you do that?

One option is to show the animation as the simulation runs. But when the simulation hits a processor-intensive state, you’ll typically see the animation slow down. The result can be a jerky, poor quality animation.

We suggest that you instead have your code create videos that become available after the code has completed. One way is to generate a JSON logfile, as described below, and then have a separate code module read in that file and generate the video.

Generate Comprehensive Log Files

Log files are very important in complex simulations. They’re helpful for bug fixing and for understanding why simulation agents behave in the ways they do. But they’re no substitute for proper visualization. You need both.

Consider generating two logfiles: A human-readable one and a JSON logfile that completely describes the state of the simulation at each tick of the animation clock. This JSON file can be used to separate the task of animation from the simulation proper. So, if you feel you’re not up to creating the animation, you can pass that file across to someone who is (or, rather, to the code that they write).
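
A sketch of what such a JSON logfile might look like, one JSON object per tick (JSON Lines); the field names here are illustrative, not a standard:

```python
import io
import json

def log_tick(logfile, t, agents):
    """Append one JSON object per tick, capturing the full state an
    animation module would need to replay the run."""
    record = {"time": round(t, 3),
              "agents": [{"id": a["id"], "state": a["state"], "slot": a["slot"]}
                         for a in agents]}
    logfile.write(json.dumps(record) + "\n")

# In a real run this would be open("sim_log.jsonl", "w"); StringIO keeps
# the sketch self-contained.
buf = io.StringIO()
log_tick(buf, 0.0, [{"id": "car_1", "state": "queueing", "slot": 3}])
log_tick(buf, 5.6, [{"id": "car_1", "state": "ordering", "slot": 0}])

# A separate animation module can later replay the log line by line:
frames = [json.loads(line) for line in buf.getvalue().splitlines()]
print(frames[1]["agents"][0]["state"])   # → ordering
```

Because each line is a complete snapshot, the animation code never needs to touch the simulation code at all.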

Extra tip: Examine your logfile using a text editor—such as Notepad++—that automatically highlights in the file all occurrences of a word when that word is double-clicked on. This simple little feature is very useful. It makes it easier, for instance, to track the behavior of agents by their IDs and to work with timestamps.

Base Your Iteration Loop on a Non-Fixed Time Interval

The most obvious way of modeling time in a simulation is via a loop, each iteration of which represents the passage of a unit of time—e.g., a second or a day. Alternatively, each iteration may represent the time it takes for a standard action to occur—e.g., the 1.45 seconds it takes for a customer to walk forward one slot in a queue. In this case, the time it takes to complete any other action must be a multiple of the time required for that standard action.

We advise against these approaches. Each iteration should not represent the passage of a fixed quantity of time. It should instead represent time-to-the-next-simulation-agent action, however long or short that be.

For example, consider the agents in an elevator simulation:

  • Elevator One will reach the first floor in 5.6 seconds.
  • Elevator Two will reach the first floor in 9.2 seconds.
  • Person One will reach the door to Elevator One on the first floor in 7.1 seconds.
  • It takes 1.8 seconds to enter a waiting elevator.

The first loop iteration of the simulation will represent 5.6 seconds—the time it takes for Elevator One to reach the first floor. The second iteration is 1.5 seconds—the time it takes for Person One to reach Elevator One. The next is 1.8 seconds—the time it takes for Person One to enter Elevator One. And the next is 0.3 seconds—the time it takes for Elevator Two to reach the first floor.

This approach has several advantages:

  1. It accommodates actions that take less than one time unit.
    For example, it accommodates the fraction of a second it takes for a customer to notice that the next customer in a queue is moving forward.
  2. It does not force the time taken by all actions to be a multiple of the iteration time unit.
    This is particularly important when simulations involve queues. If using a one-second counter, say, the time it takes to move forward one spot in a queue will have to be rounded to the nearest second. Passage to the front of the queue will accumulate rounding errors, which, in aggregate, could be significant.
  3. It speeds up simulation runtimes.
    Why? Because there are no iteration loops in which nothing happens other than the clock ticking forward.

In simulation parlance, we’re advocating here for next-event time progression over fixed-increment time progression.
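
As a minimal sketch, next-event time progression can be driven by a priority queue of pending events; the times below are the elevator figures from above:

```python
import heapq

# Pending events sit on a priority queue keyed by time; the clock jumps
# straight to the next event, however near or far it is.
events = [(5.6, "Elevator One reaches first floor"),
          (9.2, "Elevator Two reaches first floor"),
          (7.1, "Person One reaches Elevator One")]
heapq.heapify(events)

clock = 0.0
timeline = []
while events:
    t, what = heapq.heappop(events)
    step = t - clock                  # iteration length varies per event
    clock = t
    timeline.append((round(step, 1), what))
    print(f"t={clock:4.1f}s (+{step:.1f}s): {what}")
    if what == "Person One reaches Elevator One":
        # Entering the elevator takes 1.8 s; schedule the follow-up event.
        heapq.heappush(events, (clock + 1.8, "Person One enters Elevator One"))
```

The printed iteration lengths are 5.6 s, 1.5 s, 1.8 s and 0.3 s, matching the walkthrough above.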

Use Object Orientation

Um, you do know this, don’t you? If you’ve got coding skills enough to be writing complex simulations, you surely know how to write your own classes. And you know that each simulation agent (a car, a customer, etc.) is best thought of as a class instance—an object. But you do know that… right?

As an aside, we suggest that the class of each simulation agent be derived from an agent base class. That base class should include a member variable that gives the time of the next scheduled action by that agent and another member variable describing that intended action. Determining the time of the next event in the simulation is then just a matter of polling all agents to see which is the next to act. Once that agent has acted, you then adjust the next-actions and next-action times of all affected agents.
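
A minimal sketch of that base-class design (all names illustrative):

```python
import math

# Every agent records when it will next act and what that action is;
# the engine then polls all agents for the earliest actor.
class Agent:
    def __init__(self, name):
        self.name = name
        self.next_action_time = math.inf   # nothing scheduled yet
        self.next_action = None

    def schedule(self, when, action):
        self.next_action_time = when
        self.next_action = action

class Elevator(Agent): pass
class Person(Agent): pass

e1 = Elevator("Elevator One"); e1.schedule(5.6, "arrive at first floor")
e2 = Elevator("Elevator Two"); e2.schedule(9.2, "arrive at first floor")
p1 = Person("Person One");     p1.schedule(7.1, "reach elevator door")

agents = [e1, e2, p1]
next_actor = min(agents, key=lambda a: a.next_action_time)
print(next_actor.name, "->", next_actor.next_action)
```

After the earliest actor acts, you reschedule it and any affected agents, then poll again.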

Try Not to Hardcode the Simulation Process

A lot of simulations look remarkably alike. For example, a common pattern is this:

  1. An agent (e.g., a food tray) enters a new state (e.g., food-prepping).
  2. The agent stays in that state for a period of time determined by a specified distribution.
  3. The agent then proceeds to its next state (e.g., frying) or it waits on a resource (e.g., a fryer) before advancing its state.

An example using the above pattern is a simple fry station comprising two food platters and one fryer:

Agent/Resource        State     Min. Seconds  Max. Seconds  Resource Needed  Next State
Fries Platter         Prepping        60            90                       Frying
Fries Platter         Frying         240           300      Fryer            Prepping
Fish Fingers Platter  Prepping        90           110                       Frying
Fish Fingers Platter  Frying         300           360      Fryer            Prepping
Fryer

To simulate this example, we believe the best approach is not to hardcode things. Rather, write a simple simulation engine that reads in this table and processes it using the above algorithm. The result will be more extensible and, in our experience, simpler and more robust.
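
A sketch of such an engine’s core (illustrative, not a full framework): the table becomes a plain data structure, and one generic function handles every state transition:

```python
import random

# Each row of the table above becomes one entry: a duration range, an
# optional resource held during the state, and the next state.
TRANSITIONS = {
    # (agent, state): (min_s, max_s, resource_needed, next_state)
    ("Fries Platter", "Prepping"):        (60, 90, None, "Frying"),
    ("Fries Platter", "Frying"):          (240, 300, "Fryer", "Prepping"),
    ("Fish Fingers Platter", "Prepping"): (90, 110, None, "Frying"),
    ("Fish Fingers Platter", "Frying"):   (300, 360, "Fryer", "Prepping"),
}

def step(agent, state):
    """One state visit: returns (duration, resource needed, next state)."""
    lo, hi, resource, nxt = TRANSITIONS[(agent, state)]
    return random.uniform(lo, hi), resource, nxt

random.seed(1)
state = "Prepping"
for _ in range(3):
    dur, res, nxt = step("Fries Platter", state)
    need = f" (holds {res})" if res else ""
    print(f"Fries Platter: {state} for {dur:.0f}s{need} -> {nxt}")
    state = nxt
```

Adding the fish-fingers platter, a second fryer, or an entirely new product then means editing the table, not the engine.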

Incorporate A Warmup Period

You’re probably interested in simulating the operation of a process under peak activity. So give your simulation a lead-in run time during which statistics are not collected. For example, if simulating a fast-food restaurant, allow the customer queues and the food-waiting queues to fill up before starting the simulation proper. That way your collected statistics will better reflect peak-demand activity.
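
A toy sketch of a warmup window (all numbers illustrative), with the boundary-spanning question left as a configurable flag rather than hardcoded:

```python
import random

# Completions during the first WARMUP_S seconds are discarded. Whether a
# service that *starts* in warmup but *completes* afterward counts is a
# judgment call, so it lives in a flag, not in the logic.
WARMUP_S = 600
RUN_S = 3600
COUNT_BOUNDARY_SPANNERS = True

random.seed(7)
clock = 0.0
service_times = []                        # only post-warmup samples land here
while clock < WARMUP_S + RUN_S:
    service = random.uniform(50, 90)      # hypothetical service time
    start, clock = clock, clock + service
    if clock >= WARMUP_S and (COUNT_BOUNDARY_SPANNERS or start >= WARMUP_S):
        service_times.append(service)

print(f"{len(service_times)} services recorded after warmup")
```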

One more tip. You’ll probably face disagreement over agents that enter the simulation in the warmup period but exit in the non-warmup period. Some folks will say they should contribute to your simulation results; others will disagree. Make a decision on this, but don’t hardcode it. You might change your mind, or your client may disagree with your decision.

Know How to Use Poisson Processes

Suppose you’re writing a drive-through simulation. You know that on average 100 cars enter the drive-through each hour. And you believe that these entrances may reasonably be assumed to follow a Poisson process. In particular, you believe that the probability of a car entering the drive-thru in any small time interval is a constant.

Suppose also that your iteration loop represents a fixed period of time: one second. The probability of k cars entering the drive-thru in any tick of your simulation clock is:

P(k) = λ^k e^(−λ) / k!

where λ = 100/3600, the average number of cars entering the drive-thru in a one-second period. In a table:

Number of Cars in a One-Second Interval   Probability   Expected Occurrences per Hour
0                                         0.972604      3,501.38
1                                         0.027017         97.26
2                                         0.000375          1.35
3                                         0.000003          0.01


As you can see, it almost never happens that more than one car enters the drive-thru in any given second. In fact, the probability of this is 1 − e^(−λ)(1 + λ) = 0.000379. So in a given hour we may expect only 0.000379 × 3600 ≈ 1.36 one-second intervals in which more than one car enters.

So why bother with this Poisson nonsense here? At each iteration loop, just take a random number between 0 and 1. A car attempts to enter in that second if and only if that number is less than λ (a good approximation, since the probability of at least one arrival is 1 − e^(−λ) ≈ λ for small λ), and no more than one car may attempt to enter in the same one-second interval. Much simpler.
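
A quick check of the table’s numbers, plus the simpler per-tick Bernoulli draw, in Python:

```python
import math
import random

lam = 100 / 3600          # expected cars per one-second tick

# Reproduce the table's Poisson probabilities for k arrivals in one tick.
for k in range(4):
    p = math.exp(-lam) * lam**k / math.factorial(k)
    print(f"P({k} cars) = {p:.6f}  (~{p * 3600:.2f} occurrences/hour)")

# The simpler Bernoulli approximation: at most one arrival per tick.
random.seed(3)
arrivals = sum(random.random() < lam for _ in range(3600))
print(f"{arrivals} cars arrived in one simulated hour (expected ~100)")
```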

Now suppose you’re instead using a non-fixed time interval—as recommended above. Each time a car enters the drive-thru you’ll need the number of seconds until the next car enters. To do this, set p to be a random number between 0 and 1 (i.e., a random sample from the unit uniform distribution). The time (in seconds) to the next entry is then:

t = −ln(p) / λ
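
In code, this inverse-transform sampling looks like the following sketch (we use 1 − p so the argument to the logarithm stays strictly positive):

```python
import math
import random

lam = 100 / 3600          # arrival rate in cars per second

def next_arrival_gap(rng=random):
    # Inverse-transform sample of an exponential inter-arrival time.
    # 1 - random() lies in (0, 1], which avoids log(0).
    return -math.log(1.0 - rng.random()) / lam

random.seed(11)
gaps = [next_arrival_gap() for _ in range(100_000)]
print(f"mean gap: {sum(gaps) / len(gaps):.1f}s (theory: {1 / lam:.1f}s)")
```

Python’s standard library also provides this directly as random.expovariate(lam).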

Think Beyond the Triangular Distribution (Or: Lognormal is Your Friend)

In our experience, simulations mostly need to sample from distributions that are:

  1. Bounded
    Very extreme values will typically need to be guarded against by setting realistic bounds on unbounded distributions. And in many cases, distributions must be constrained to be strictly positive.
  2. Asymmetric
    Business data rarely follows symmetric distributions. For example, a few customers will make huge purchases while the rest make small purchases; a few orders will take a long time to serve while the rest can be quickly dealt with. To model this, distributions need to allow for extreme asymmetry.
  3. Parameterized in an easily understandable manner
    The lognormal distribution, for example, is typically parameterized by the mean and standard deviation of its underlying normal distribution. For non-statisticians, this can be cryptic. A parameterization using the mode and mean, or the median and mean, is much more easily understood. And don’t forget that to put bounds on an unbounded distribution, at least one extra parameter must be supplied. For example, with a lognormal distribution, one might set an upper bound at the 99th percentile, meaning that a sample from the lognormal is thrown away if it falls above that percentile.
  4. A reasonable approximation of reality
    This is where the triangular distribution falls flat. The triangular distribution is very widely used in simulations because it is bounded, (optionally) asymmetric and very simply parameterized. But we’ve never seen it provide a good fit to real-world data.

Unfortunately, no distribution that we know of meets all these requirements. The triangular distribution, as mentioned, rarely approximates reality well. The normal distribution is unbounded and symmetric. The exponential distribution and the lognormal distribution are unbounded. Etc.

But let’s give a little love here to the lognormal distribution. Being asymmetric, it often fits right-skew sales data and process-time data very well indeed. It is bounded below by zero. And it may be parameterized in a fairly simple way (see above). We use it a lot in our simulations.
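
As a sketch, here is a lognormal parameterized by mode and mean (recovering the underlying normal’s μ and σ), truncated at the 99th percentile by resampling; the 45 s/60 s service times are invented for illustration:

```python
import math
import random

# Mode and mean are friendlier parameters than mu and sigma. For a
# lognormal: mode = exp(mu - sigma^2), mean = exp(mu + sigma^2 / 2),
# which solves to the two lines inside lognormal_from_mode_mean.
def lognormal_from_mode_mean(mode, mean):
    sigma = math.sqrt((2.0 / 3.0) * math.log(mean / mode))
    mu = math.log(mode) + sigma**2
    return mu, sigma

def sample(mode, mean):
    mu, sigma = lognormal_from_mode_mean(mode, mean)
    upper = math.exp(mu + 2.3263 * sigma)   # 99th pct: z_0.99 ≈ 2.3263
    while True:                             # throw away samples past the bound
        x = random.lognormvariate(mu, sigma)
        if x <= upper:
            return x

random.seed(5)
# Hypothetical service times: most take about 45 s (mode), 60 s on average.
times = [sample(45, 60) for _ in range(50_000)]
print(f"sample mean: {sum(times) / len(times):.1f}s")
```

Note that truncation pulls the sample mean slightly below the nominal 60 s, since the long right tail is what it removes.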

Understand the Cumulating Effect of Agent Efficiency on Queue Time

Consider, again, a simulation of a drive-thru. The simulation must decide: which order-taker should be assigned to each car; whether an order-taker should walk to the assigned car or let that car come to them; whether an order-taker should cross a lane to reach a car; etc. Improvements in agent behavior might shave a few seconds off the time it takes for an order-taker to service a car. Not very important, right?

Wrong. During busy periods, it might be common to have as many as 20 cars banked up waiting for service. If every car in front of the car at the back of the queue is processed a few seconds more quickly, the car at the back will get through the queue a minute faster. That’s not insignificant. Customer satisfaction improves and customer throughput rises. Small changes can have big effects in long queues.

Anticipate the Need to Use Multiprocessing

If you expect your simulation to receive a lot of use or to need to run through many possible parameter setups, write your code in a manner that makes multiprocessing easy. This typically means making sure, among other things, that your logging is thread-safe and that each thread has its own random-number generator.

It can then be easy to take advantage of multiprocessing. For example, if you run each simulation 10 times, you might do each run on its own thread to take advantage of multiple CPUs. We’ve seen this reduce processing time by over 75%.
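
A sketch of that structure with a thread pool and a per-run random-number generator (seeds and workloads are illustrative):

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Each run gets its own random.Random instance rather than the shared
# module-level generator, so runs stay reproducible and thread-safe.
def one_run(seed, n_customers=1000):
    rng = random.Random(seed)              # per-run generator
    total = sum(rng.uniform(50, 90) for _ in range(n_customers))
    return total / n_customers             # mean service time for this run

with ThreadPoolExecutor(max_workers=4) as pool:
    means = list(pool.map(one_run, range(10)))   # 10 runs, distinct seeds

print(f"{len(means)} runs, grand mean {sum(means) / len(means):.1f}s")
```

In CPython, the GIL limits the gains threads give CPU-bound code; swapping ThreadPoolExecutor for ProcessPoolExecutor keeps the same structure while spreading runs across CPUs.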

Final Thoughts on Simulations

Simulations can help companies solve a wide range of problems – from routing and logistics optimization to internal bottleneck identification and resolution. At StrategyWise, we help our clients leverage their data through a wide array of tools like these to unlock value and create competitive advantage in the marketplace. We would love to talk to you about how we can build customized solutions for your company.

GET YOUR FREE CONSULTATION ON CUSTOM SIMULATIONS TODAY!