The 80% solution
Horseshoes and hand grenades...and simulation,
We always look for the lowest-risk choice, the highest return on investment or the shortest distance between two or more points. But when traditional, exact computational models break down, it's time for approximation and simulation. You may not arrive at the optimal answer, but you can develop a high degree of confidence in your result. This article discusses a variety of simulation and estimation techniques designed to make your solutions time-sensitive. (5,400 words)
The risk of being too binary is that we sacrifice timeliness for precision. What good is the optimal answer if it arrives an hour, a day or a month too late to be turned into a management decision? What happens if your algorithm produces a result based on data that has changed dramatically during the computation cycle? In the worst case, what if the problem becomes so large that your algorithm runs for months or years, never producing a useful answer? At the other extreme, what about the cases for which there simply is no algorithm? When you have to make estimates or predictions about the future, there are few easily coded paths to a solution. Our ability to present a black-and-white, perfect answer every time is under pressure as well.
When traditional, exact computational models break down, it's time to employ approximation and simulation techniques. You may not arrive at the optimal answer, but you can develop a high degree of confidence in your result. We're going to look at a variety of simulation and estimation techniques designed to make your solutions time-sensitive. Or as Sun's Chief Information Officer Bill Raduchel says frequently, "The right data at the right place at the right time."
We'll open with an overview of the truly "hard" problems in computer science, those for which a suitably large input sends your server off into infinite-compute land. We'll characterize them as simply as possible, and describe some of the approximation techniques used for solutions to these practical problems with impractical algorithms. From there, we'll turn to the broader area of Monte Carlo simulation and its variants, applying it to some financial problems. Our goal is to equip you with criteria and motivations for choosing a simulation and to differentiate the methods you'll apply to get 80 percent (or better) results.
The traveling salesman's problem (TSP) is so common it seems trivial: Given a list of cities, and the cost to travel between pairs of those cities, compute the lowest-cost, round-trip circuit that visits each city once and only once. Ask any weathered theoretical computer scientist for the solution, and you'll hear the phrase "NP-complete" uttered in between shakes of the head. (See the sidebar Completely Unsolved for an explanation of NP-completeness.)
Algorithms for solving NP-complete problems require compute time that increases exponentially with the number of inputs. In the case of the TSP, the optimal solution can only be computed by evaluating every possible round-trip, requiring calculations on the order of 2^N where N is the number of cities.
For small problems of four or five cities, you'll get an answer, and you might even sketch out the solution on a napkin without too much effort. Now try doing this for "only" 40 cities on a machine that gives you a handy 100 Mflops -- you'll evaluate about 1 million routes a second with 100 additions or so needed to cost out each route.
A million routes is only 2^20 cases, and you'll need to look at 2^40 samples before you're done -- a million seconds, or almost 12 days later. Add another city, and the compute time doubles again. Exponential growth is a dangerous thing to embed in an algorithm; even Einstein marveled at it when he proclaimed compound interest to be the strongest force in the universe.
If you find the theory behind these computations esoteric, you're not alone. Language and computation theory is a make-or-break proposition only for those pursuing higher degrees in computer science. So how does it fit into the real world? Many of the problems we solve simply for small cases explode into endless summers of computing for slightly larger sets. Consider what's involved in proving that a section of code operates correctly -- you have to show that expected, bounded results occur for every possible input combination and every conditional branch. For a small piece of code, you can do this by inspection. Five variables and three tests are easy, but try doing this for something with tens of variables and thousands of lines of code.
This problem is related to another NP-complete headache known as the "decidability problem." The decidability problem uses an abstraction known as a Turing machine, a simple model of computing in which operations are described as transitions from one state to another.
(Let's define a bit of jargon: State A state is simply an input (or set of inputs), and the mapping of input values to the next state you'll enter. For example, a telephone can be defined with three states: quiet, ringing and in use. With no input, the phone stays in the quiet state. When a call arrives (new input), the phone transitions to the ringing state. Without any further input, it goes back to the quiet state, unanswered; an input of "lift handset" makes the phone go into the in use state. From there, it always goes back to the quiet state.)
We want better answers,
based on more data,
with higher precision and a shorter turnaround.
Given any arbitrary input, you want to known if the Turing machine will reach a "final," or terminal state, or if it will loop endlessly between intermediate states. If you want the answer, you have to run through the cases one by one. The longer the input, the larger the problem gets.
Another example is integer linear programming, a favorite tool of operations researchers and other people who completed first-year MBA course work. Linear programming takes constraints in the form of linear equations and optimizes for some goal, for instance, lowest cost or maximum profit.
Here's a typical problem: Your local bakery has 100 pounds each of flour and sugar on hand. They make cookies with 1 pound of flour and 1 1/2 pounds of sugar, selling each batch for $5.49, or they can make a large cake with 2 pounds of flour and 3 pounds of sugar and sell it for $12.99. What should they bake to maximize their revenue? If you've played with Lotus 1-2-3 or Excel, you've seen the Solver function that evaluates linear programming problems like this one fairly easily.
However, standard linear programming solves this problem too easily and exactly, telling you to bake some irrational number of cakes and cookies. Unfortunately, customers like to buy integer quantities of goods, so you have to add constraints to keep the outputs in whole numbers. Adding those constraints makes linear programming much more difficult because you have added an arbitrary Boolean expression to the set of constraints.
Determining the variable values that make a boolean expression true is another NP-complete problem known as the "satisfiability problem." In short, if you want the answer, you have to try all of the combinations of input variables to find it. The same is true for integer linear programming, leading to rather long run times. If you look at the "Options" dialog box for the Excel Solver, it allows you to limit the run time, the precision of the answer and the number of iterations attempted.
Again, changes in business practices can push you from the small to the large case -- someone who is evaluating a product mix based on three variables may have no problems, but when the number of supplier options, cost points, and product components increases, the problem difficulty escalates at an alarming rate.
Doing the impossible before breakfast
So what's a practical software developer to do? First and foremost, keep your eyes open for problems that look tantalizingly easy to solve but in fact are variations on some popular NP-complete problems. Theoreticians call this "reducing to an NP-complete problem;" we'll call it saving time and money later on. Evaluate your algorithms to ensure that they won't go off on unbounded compute runs.
Let's return to the immensely practical world again. Salesmen obviously travel, but not always at lowest cost, especially if there are convenient detours for golf or nice lunches. The cost of not traveling -- not solving the problem at all -- far exceeds the minor variations in cost sorted out by a TSP solution. When time is money, you need to consider approximate solutions.
Researchers at AT&T Bell Labs developed an approximation technique for TSP called 2-opt that makes an initial guess at a solution, and then tries swapping the order in which pairs of cities are visited to see if the total cost can be reduced. After the point of diminishing returns is met, the algorithm terminates. The differentiator between theory and practice here is "diminishing returns" The answer may not be perfect, but it's close enough to book the flights. If you're trying to make a million-dollar decision, variations of a few dollars in cost probably won't show up in the big picture.
Turning to approximation introduces a new problem: how do you determine how much precision is required? Medical data, such as X-ray images should never be approximated, since "Cut somewhere near the left ear" doesn't inspire patient confidence. Transferring money between parties requires precision ranging from hundredths of a dollar (for your paycheck) to thousandths of a cent (for high-volume currency transactions). But when there is a potential trade-off between time and cost, and the final result is being used primarily to formulate a decision, approximation works well.
Defeat from the jaws of victory
Just how well can you do? To a large extent, it depends on how you engineer the input data used to form the starting point and initial goal to beat. Careful selection of the input points can lead to good results quickly. Here's an example: Software testing is primarily an NP-complete discipline, as hinted above, because it requires complete coverage of every variable value and branch direction taken. Instead of trying to prove every function correct, a good first step is to demonstrate that your code never ends up in an unexpected state due to unexpected input, or with an out of bounds output.
The PrimaVera project at Sun Microsystems Laboratories took this approach to software verification and test data design, producing the Architectural Design Language (ADL) to define the code's intended behavior. The behavioral description is used to generate test data that quickly identify the possibility of an unexpected outcome. The ADL environment is available via ftp for exploration.
Software testing shows that sometimes you don't need to solve a problem precisely at all, but merely prove that a wrong answer is possible. The "negative solution" approach was used by Matt Blaze (also of AT&T Bell Labs) when he demonstrated flaws in the design of the Clipper chip. Most encryption algorithms rely on NP-complete problems. Breaking an encryption algorithm by exploring the entire key space requires time exponential in the number of bits in the key -- so breaking a 512 or 1024 bit key is considered practically impossible.
Blaze looked at the effectiveness of the Law Enforcement Access Field (LEAF) in the Clipper chip, the "backdoor" wired in to allow the police to decrypt any transmission after retrieving the master key from electronic escrow. (More on the Clipper chip and the discussion surrounding it.) While everyone was worried about providing an electronic wiretapping feature, Blaze demonstrated a mechanism for rendering it useless by filling it with seemingly valid, but unexpected and undecipherable values. The encryption wasn't broken, but the "backdoor" access was eliminated. (Blaze offers a PostScript-formatted paper describing his findings.)
Each new wave of computing brings a new suite of approximation techniques with it. Some people have postulated that a small (40-bit) DES key space could be searched with a massively parallel compute engine in a fraction of a day, making DES encryption unsuitable for long-lived transactions. The first cross-over between molecular biologists and computer scientists gave us Jurassic Park; the current combining of the fields is preaching DNA-based supercomputers. With an exponentially large number of compute "engines," it's much easier to map exhaustive searches onto the computing equipment. Leonard Adelman (the "A" in "RSA") first described molecular computing in Science, and later published a PostScript-formatted paper about the topic.
Princeton University's Richard Lipton designed a coding scheme for setting computation problems in DNA. Lipton has proposed that NP-complete problems, including cracking DES, will no longer live on the outer limits of computation when run on a DNA-based machine. The August issue of Wired magazine discusses DNA supercomputers in a feature-length article. Immovable complex problems meet unstoppable genetic horsepower, and the compute side wins.
Games people play
Pure, procedure-oriented solutions assume that there's some algorithm corresponding to the problem you want to solve. For some optimizations, that's true, but in many actual cases you will be dealing with non-linear, non-deterministic, unstable, or poorly defined systems. How can you predict the future movements of stock prices or weather patterns? You can't know with certainty, but you can use historical models and event probabilities to make an educated guess. The event probabilities tell you the likelihood of any particular outcome and are used to generate "trial runs." Iterative modeling is a process in which repeated trials are used to give you a feel for the expected output. Iterative modeling of a complex system is part of a broad class of solutions known as Monte Carlo simulation.
The typical Monte Carlo simulation involves three steps:
Price Moves: Probability Cumulative Probability more than 10% 1% 1.0 8-10% 3% .99 5-8% 21% .96 4% 20% .75 3% 18% .55 2% 12% .37 1% 10% .25 0.5% 10% .15 less than 0.5% 5% .05
The relationship between the input range and the probability of each of the possible outcomes is known as a probability distribution.
John von Neumann is credited with giving Monte Carlo simulation its name. He developed the method while at Los Alamos National Laboratory during the Manhattan project, using it to predict the behavior of subatomic particles. The name derives from the famous casino of Monte Carlo, headquarters of similar pragmatic probability studies and games of chance.
Here's an extremely simple example of a physical Monte Carlo simulation: Someone brings you a sheet of paper with an irregular outline drawn on it, and wants to know the area inside the boundary. You could draw a grid on the paper and count the number of unit squares inside, or you could estimate by throwing 100 darts at the paper and counting the number that landed inside the shape. Assuming your dart-throwing is random, you'll get a fairly coarse estimate of the area in return for a small amount of work.
Of course, calculating the area could just as easily be done with the grid approach, and with more accuracy. But what if there is no exact measurement tool available? Consider the case of a bank that is evaluating its exposure to a drop in interest rates. Based on historical data, the bank can estimate the number of customers who will refinance their mortgages if rates drop a quarter or an eighth of a point.
But now we have to add complexity: The historical data reflects a different set of economic conditions and current customers might require a smaller change in interest rates to make refinancing attractive. Furthermore, each customer has a different mortgage amount and current balance, affecting the total loan volume to be refinanced. Finally, not all customers may refinance at the same institution, so the bank must be careful to remain competitive if the change in rates actually transpires.
Each of these variables -- loan amount, balance, rate sensitivity, customer loss, and interest rate changes -- can be modeled as a series of probability distributions. Using Monte Carlo simulation, the bank can simulate 10,000 or 100,000 scenarios, and take an average over the results to determine their expected cash flow, financing activity and customer retention.
Historical data helps form the model, but it doesn't take the place of an oracle that can predict the future. Simulation is far from perfect, but it does form a foundation for decisions in the face of imperfect, incomplete and invalid data points. For an actual, practical application, look at JP Morgan's RiskMetrics package. It's a Monte Carlo simulation of exposures in a variety of investment vehicles. JP Morgan has made the code, data, and documentation available.
Glass houses and glass sheets
Simulation is especially useful when the sequence of events contains dependencies. In our mortgage example, customers who refinance may choose to take a smaller or larger mortgage, or to go to another lender. Customers who don't refinance don't enter the "potential lost-customer pool." An accurate simulation of these scenarios requires feedback mechanisms that link the appropriate actions; the random numbers generated will be used to determine which path the customer takes.
Closer to the machine room, simulation with state information is commonly used to derive Mean Time Between Failure (MTBF) data for fully configured servers and network services. You can get the MTBF numbers for each component from your vendor, but how do you predict the MTBF for the whole system, accounting for reliability features such as RAID arrays, disk mirroring, and redundant network connections? Just computing the expected time for a single component to fail doesn't tell you much. Instead, you need to evaluate a sequence of events that leads to a failure.
For example, if you simulate until you hit a single disk failure, you still haven't found the system MTBF, because a redundant disk from a RAID array may provide the same data. You need to keep simulating until you get multiple failures in the array that result in data loss, or until you lose a component (like the backplane) that isn't redundant. To model the event sequence, you need to evaluate each of the possible paths (state transitions) you can take from any configuration: What can fail, and with what probability, and where can you go from that new state?
Create a state transition diagram, assign probabilities to each transition based on the MTBF figures, and then simulate away until you end up with a virtual failure.
Simulation allows you to compress time,
so that you can mimic large-grain effects. . .
One of the great advantages of simulation is that you can compress actual time into simulation time to cover events that may be beyond the normal testing horizon. If you want to test for a failure that occurs once every 500,000 hours, you don't need 500,000 hours of clock time. Run a simulation with each iteration covering a virtual hour, and let it cycle one million times -- you'd expect to see two failures in a simulation that extends over a few hours or days.
Let's cast the TSP as a Monte Carlo simulation. Generate a random permutation of the integers 1 to N, and use that as the order in which you visit the cities. Do this a few hundred times, and use your best result. The risk of pure Monte Carlo simulation is that you never come close to the ideal solution. If you don't record the improvement made with each progressively better solution to identify a point of diminishing returns, you may end the simulation too early. The 2-opt algorithm swaps pairs of cities, looking for some improvement, bringing you closer to the optimal answer. However, 2-opt can get stuck in a "local well," bringing you to an ideal answer that ignores a better solution using a wildly different ordering of the cities. What you need is a "hill climbing" function to lift you out of the local well so can you explore other, possibly deeper, points on the cost surface.
There's a good example of hill climbing in the physical world -- glass annealing. During the annealing process, the glass is repeatedly heated and then allowed to cool. Heating the glass takes it out of its initial state, so that it may enter a stronger, more stable state during the next cooling sequences. Combining this "whack" to the system with Monte Carlo simulation is called simulated annealing.
Bill Rosenblatt (SunWorld Online's Client/Server columnist) describes a TSP solution that relies on simulated annealing: Break the TSP problem up into small subproblems, each of which can be solved easily. Randomly choose routes between cities in the subproblems to complete the solution. The annealing process comes into play when you evaluate whether you made the best initial partition of the cities into subproblems. What if you only find a local minimum, and have missed a much better solution that requires a different partitioning of the cities? Swap pairs of cities between subproblems (raise the temperature by introducing randomness), and then calculate a "best case." Repeat the process of swapping cities (heating) and solving the subproblems/joining the groups (cooling) until you are getting only minimal return from the annealing process.
Where is simulation a good fit? When you're dealing with inputs that are hard to characterize, where you have irregularities and discontinuities in the input functions, or when the optimal solution involves an NP-complete problem, simulation is sometimes the only solution that produces plausible results. Running a few million iterations of a Monte Carlo simulation is trivial compared to the 2^40 calculations required for a 40-city TSP solution. Simulation run times grow slowly, while complex problems consume CPU cycles at an exponential rate.
Computation-driven systems handle linear systems, but only simulation is going to deal with real-world problems that go non-linear. It's easy to develop estimates based purely on past performance and historical data, although these tend to be expensive in terms of compute cycles. However, history doesn't reflect the full range of possibilities:
Till Guldimann, managing director of research at JP Morgan and principal contributor to the RiskMetrics effort, says that simulation lets you learn from real-world events and then imagine new ones. The imaginary events could be profitable or disastrous; determining your exposure and risk tolerance in each case will improve the quality of management decisions. Simulation allows you to compress time, so that you can mimic large-grain effects like building Web popularity or retail channel penetration that require several days or months to occur.
Information dissemination on the World Wide Web can turn your site from unknown to truly hip in a matter of days, burying your site in thousands of hits per day. What if you are suddenly "found" through Yahoo or Lycos? What if What's Cool swamps your server? It's better to be prepared for success than to be caught understaffed with an underpowered server.
Of course, simulation has its downsides as well. It requires a large compute farm to generate a sufficient quantity of input points and output values. Each simulation is usually done on a custom, non-recurring engineering basis, making them hard to re-use. You may not get as optimal an answer as you can with linear programming, because you're essentially making guesses about incomplete data. The key question is: how much error can you tolerate in the result? Do you need 80 percent confidence level in the answer, or is a 50-50 shot sufficient? How much time can you devote to improving the quality of the simulation, and how much return will you see for the time spent?
Evaluate your simulation model carefully to be sure it's delivering valid and useful information. Some common concerns, taken from Raj Jain's book Computer Systems Performance Measurement and from Guldimann's comments include:
Cheap hardware opens up new avenues to non-traditional solutions. Simulation requires significant computing horsepower to be viable, but it also requires careful planning and design to make it beneficial. We're always tempted to solve the next larger problem, a slightly more complex version of what we can accomplish today. Recognize those problems that can't be solved via traditional algorithms as they scale up. Employ simulation and approximation where you want an answer that is just as valid as the data from which it was derived.
If you have technical problems with this magazine, contact firstname.lastname@example.org
NP-complete is shorthand for "nondeterministic polynomial complete", a description of how a solution is reached. Problems in the deterministic polynomial, or P class, can be solved with algorithms that loop or iterate a finite number of times. Each iteration is completely deterministic, that is, there are no options or subchoices that can be made while computing one step or loop. If that sounds like the way "normal" applications work, it is. If you code it, it's deterministic, right down to the exact behavior of bugs. The upper limit on the execution time for a P-class problem is related to the number of inputs (n) by a polynomial expression like n^3 or n^5. Even for large values of n, the algorithm will complete in some finite, reasonable time period.
NP, or nondeterministic polynomial problems, rely on a subtle variation of "normal" computation: Instead of each step taking you to exactly one next step, you can go to one of several next steps. There's still a polynomial expression for the number of iterations required as a function of the number of inputs, but each iteration now contains subproblems and sub-subproblems. To be NP-complete, the algorithm must have to examine every combination of choices -- a problem that grows exponentially in the number of inputs. Instead of n^3 or n^5, the problem requires 3^n or 5^n steps. As n grows, the execution time becomes unbounded. As a result, NP-complete problems are considered intractable -- there's no solution that could be computed on available hardware in a practical timeframe such as a human life span.
There are problems that fit the description of the NP class but don't require exhaustive search of the choices in each iteration. These so-called NP-hard problems may not be solvable, but it's also not possible to prove that they require time exponential to the number of inputs.
Sample states for mortgage customers. A current, upstanding customer (A) may be hit with a natural disaster like an earthquake (I), resulting in a default on the loan (J). Alternatively, the customer may move and sell the house, paying off the mortgage (H), and then finance a new loan with the same bank (D) or go elsewhere with their business (B). Simulating the effect of rate drops, changes in mortgage products or the probability of disasters involves changing the probabilities that a customer moves from one state to another. If interest rates drop, the probability that the (A)->(C) transition occurs increases; if rates go up customers tend to finance smaller amounts (G). In each state, you calculate the expected revenue stream, and the probabilities that the customer will move between states. For example, after taking a lower rate (F), smaller loan (G), the probability of a default (J) is usually decreased if economic conditions stay the same.
Sample state transitions for calculating system MTBF. Each state transition arrow is labeled with the event and probability of that event, calculated from the MTBF for the components involved. When a single disk fails, the system moves to state (C). Another failure of a disk in a different array leaves the system in state (C), but the second failure of a disk in the same array group takes the system to the terminal state (D).