此课程是为影响转变数据成为更好的决定的想法而设计。最近在数据采集技术上的显著提升改变了公司进行有效决定的方式。

Loading...

來自 University of Pennsylvania 的課程

运营分析

個評分

此课程是为影响转变数据成为更好的决定的想法而设计。最近在数据采集技术上的显著提升改变了公司进行有效决定的方式。

從本節課中

Predictive Analytics, Risk

How can you evaluate and compare decisions when their impact is uncertain? In this module you will learn how to build and interpret simulation models that can help you to evaluate complex business decisions in uncertain settings. During the week, you will be introduced to some common measures of risk and reward, you’ll use simulation to estimate these quantities, and you’ll learn how to interpret and visualize your simulation results.

- Senthil VeeraraghavanAssociate Professor of Operations, Information and Decisions

The Wharton School - Sergei SavinAssociate Professor of Operations, Information and Decisions

The Wharton School - Noah GansAnheuser-Busch Professor of Management Science, Professor of Operations, Information and Decisions

The Wharton School

In session 2, we learned how to set up and run a simulation.

This time, we'll look closer at the simulation results.

Since the simulation provides us with estimates of reward and

risk measures, we need to understand how precise those estimates are,

if we were to use simulation results for comparing alternative decisions.

We'll set up a simulation with larger number of simulation runs and

compare its results with those of a short simulation to see how the precision of

simulation estimates changes with a number of simulation runs.

Finally, we will build histograms of simulated random quantities

to see what the simulated distributions of random inputs and

random outputs to a simulation look like.

Okay, in session 3 we will look at the interpretation and

visualization of simulation results.

In session 2, we have set up and run a simulation of monthly payment values

under the new wireless data plan that our consultant is considering.

Here are the results of this simulation, copied from the Excel file DataPlan10.

We've used ten simulation runs in that simulation.

In other words, we've generated ten random instances of monthly data usage taken

from a normal distribution with a mean of 23 GBs and a standard deviation of 5 GBs.

Those values are shown in green.

For each of those data usage values,

we have calculated the corresponding monthly payment value.

So we have a sample of ten monthly payment values and they're shown in blue.

Ultimately, we're interested in the distribution of, Monthly payment values.

But first, let's look at the sample of data usage values.

Why?

Because we actually know the distribution they came from and

we can compare the sample mean and the sample standard deviation for

these ten values to the mean and

the standard deviation of the distribution from which this sample was generated.

Sample mean which is around 25 GBs is just the average of ten random

data usage values that Excel generated in this particular simulation.

It is of course just an approximation to the true expected value

of the distribution from which these ten numbers were generated.

That true expected value is 23 GBs.

In the same way the simple standard deviation which is around 8 GBs

is just the reflection of ten random data usage values

that were generated in the simulation.

It is an approximation to the true standard deviation

of the random variable from which the ten numbers were generated, 5 GBs.

So let's now look at the estimates of the reward and

the risk that we get from the simulation.

In other words let's look at the sample mean and the sample standard deviation

of the ten monthly payment values generated by our simulation.

The estimate for the reward measure is around $253, and

for the risk measure is around $92.

In this simulation, we only generated ten random data usage values.

The more data usage values we generate, the closer will the resulting sample of

the data usage values be to the true probability distribution.

As a result, this sample mean and

the sample standard deviation will also be closer to the true values.

With this in mind,

instead of using ten simulation runs, let's do 1000 simulation runs.

And let's see how our estimates of the reward and the risk change.

Let's open our Excel file DataPlan10 and

modify it to create a simulation with 1000 runs.

Let's set up a simulation of the monthly data usage and

payment values under the new plan with 1,000 simulation runs.

We start with the file DataPlan10 and

modify it to increase the number of simulation runs.

Let's delete the random numbers from cells D-D11.

Remove sample means and standard deviation headers.

And also remove formulas and values from cells E3 through E14.

We'll keep the formula in E2 for the later use.

So we're deleting all of these numbers here.

We're deleting all of these numbers here.

We're deleting all of these numbers here as well.

And now we're ready to extend our simulation file.

Let's increase the count of the simulation runs in column C

up until the cell C1001.

So we copy and paste the formula in the cell all the way to C1001.

Let's make sure we got the numbers right.

We go to the last number, yes, it is a count of 1000.

Next, let's call Analysis ToolPak.

We go to the Data tab, click on Data Analysis,

select Random Number Generation, click OK.

For Number of Variables we put 1, for Number of Random Numbers we now

put 1000, Distribution is Normal.

With parameters 23 and 5.

Let's put 1 to 3 as our random seed, and

as an output range let's select D2 again.

Now we can click OK.

And you could see that Excel has generated

1000 instances of the random data usage and

put them in the cells D2 through D1001.

We can go and check to D1001 to see that we have exactly 1000 samples.

Let's change the font in cells D2, D1001, into green, and

bold, to designate that here we have the random input values.

Now we just need to copy and

paste the formula in E2 into the cells E3 through E1001.

Okay Before

we calculate the average and the standard deviation of the simulated values.

Let's hide the columns 11 through 1000

to make sure that the entire file looks compact.

So we're going all the way here.

And do Hide.

Okay, now let's calculate the sample mean and the sample standard deviation values

for both samples, for the data usage and for the payment values.

Let's put the corresponding headers into the cells C1003 and C1004.

Sample Mean and Simple Standard Deviation.

Okay, let's calculate the sample mean and the sample standard deviation for

the monthly usage data and put those values into the cells D1003 and D1004.

So we do average of all the data usage values,

D2 through D1001 and we do the standard

deviation of the sample in D1004.

So we have a STDEV of D2 through D1001.

We can mark those two cells appropriately, and then we can just copy and

paste those formulas into the cells for payment values.

And change the color.

Okay, we have our new reward and risk estimates for

the new data plan based on the simulation with 1000 runs.

The reward estimate is about $220, and the risk estimate is about $58.

Here's the copy of the Excel file DataPlan1000.

That reflects the results of the simulation with 1000 runs.

Let's now compare the results of a simulation with 10 runs with those of

a simulation with 1000 runs.

Looking at the sample mean and the sample standard deviation for

the data usage values.

We see that the numbers for the loan stipulation, 1000 runs,

23.3 and 4.9 are pretty close to the true values 23 and 5.

Just like the sample mean and the sample standard deviation of the data usage

values generated by simulation with 1000 simulation runs are much closer

to the true values, the sample mean and the sample standard deviation for

the payment values from the same simulation are likely to be much

in a colloquial sense closer to their respective true values.

In other words, the longer is our simulation, the more simulation runs

it contains, the more confident we are in the estimates we obtain.

Now that we're comparing short simulations with long ones,

it is a good time to return to the notion of the random seed.

What value should we be using in a simulation?

To get a handle on this question, I have run a short simulation, the one with 10

runs, with three different seed values, and recorded the resulting values for

the sample mean and the sample standard deviation of the data usage values.

Here's what I got using the seed 123.

And this is what I got when I used the seed 1826, which

is the year in which Joseph Wharton the founder of the Wharton School was born.

And here's what the results were for the simulation with the seed

19104 which is the zip code where the Wharton School is located.

As you could see, the simple mean values jump all over the place.

From around 25 to around 19 and then back to around 25.

The sample's standard deviation values are not stable either.

They go from around 8 to around 5, to around 3.

But now let's see what happens if we run a long simulation,

a simulation with 1000 fronts with these 3 seed values.

Here's what we get using the seed 123.

And here's what we get for the seed 1826.

Now, what about that seed 19104?

Here are the results.

As you can see, Sample Mean and Sample Standard Deviation values do

not depend much on the seed values for this loan simulation.

So, we should just run loan simulations and use whatever seed values we want.

As long as the simulation has many simulation runs,

the seed value does not matter much.

Now that we understand that simulations convert probability distributions of

input random variables into probability distributions of output random variables,

can we go beyond looking at the sample mean and

the sample standard deviation numbers?

Can we look at the entire distributions?

Well histograms of simulation results allow us to add a full picture of what

happens with our random variables.

In the data plan example, our random input is data usage, and

the random output is the payment amount.

Let's go back to our Excel file DataPlan1000 and

build histograms of the data usage and of the payment values.

We will use the Excel file DataPlan1000 as a starting point for

building histograms of inputs and outputs of a simulation.

This file contains the results of a simulation of the monthly data usage and

the monthly payment values for the new wireless data plan.

This simulation was conducted using 1000 simulation runs.

The random sample of monthly data usage is located in the cells

D2 through D1001 and the corresponding 1000 monthly payment values

are located in the cells E2 through E1001.

Just keep in mind that we hid rows 11 through 1000 in displaying this file to

make sure that we have a compact picture of simulation results.

First, we'll build a histogram for

the simulated values of the monthly data usage.

Histogram is a frequency chart that displays how many simulated values

fall within a certain range.

The central concept for any histogram is a so called bin.

A bin is a range of values for a simulated random variable.

In order to build a histogram of the monthly data usage,

we need to provide Excel with a set of bins and then Excel will count and

put on a chart the numbers of random usage values within each bin.

How do we select a set of bins for a histogram of a random variable?

Let us look at the monthly usage data as an example.

Let's go to the cells D1005 and D1006, and calculate the smallest and the largest

among the random monthly data usage values in the cells D2 through D1001.

In particular, we put the formula

MIN(D2:D1001) into the cell D1005.

And the formula MAX(D2:D1001) into the cell D1006.

Let's also add headers MIN and

MAX, into the cells.

C1500 and C1006.

We see that the smallest random date usage value that Excel has generated for

us, is about 9.1 GBs.

And the largest is about 38.5 GBs.

Clearly, if we want to generate a meaningful frequency chart,

our bin should approximately cover the interval between 9.1 and 38.5 GBs.

So let's choose the set of bits that starts with the one that covers

the interval up to 10 GBs.

Followed by the bin that covers the interval between 10 and 11 GBs,

then the bin that covers then between 11 and 12 GBs and so

on with the last bin covering the interval between 38 and 39 GBs.

In Excel,

the bins are defined by the largest value that falls within a particular bin.

So that the bin that covers the interval up to 10 GBs is defined by the value 10

GBs.

Such a bin would cover the values of data usage that are less than or

equal to 10 GBs.

In a similar way, the next bin defined by the value 11 GBs will cover the values of

the data usage that are strictly greater than 10 GBs and now less than or

equal to 11 gigabytes.

The less than will be defined by the value 39 gigabytes, and

will cover the value of data usage that are strictly greater than 38 gigabytes,

and are less than or equal to 39 gigabytes.

Let's create the set of bin values Before we pass it to excel.

In the cell A 1008,

Let's put in a header U bins to indicated that what

set of bin values for the monthly data usage.

Then in the cell A 1009, let's put the value 10 to designate the bin for

the smallest value subdata usage generated in our simulation.

Next, in the cells A1010 through A1038,

let's put the defining numbers for the rest of the bins.

Let's just check, 39 that's good.

Okay, we're now ready to create a histogram of data

usage values generated by our simulation.

Let's go to data, data analysis, select histogram and click OK.

Now, in the histogram dialogue box, we go first for the input range,

and select the cells D2 through D1001 as our input values.

Then in the bin range we select

A1009 through A1038

to tell Excel which bins to use when calculating the frequencies for

the histogram.

Finally, in the output options we select the output range.

Let's choose a cell, let's say B1008 right here.

B1008 will be the top left corner of the table of the histogram frequencies that

Excel will generate for us.

Let's also check the chart output to make sure that Excel will also

present the results in the chart format, and then let's click OK.

As a result we get a table and a chart as the table's graphical representation.

The table contains the frequency for each bin.

Each frequency value indicates how many out of 1,000 simulated data usage numbers

fall within a particular bin.

For example, the frequency value 2 that corresponds to the bin label 10

indicates that exactly 2 out of 1,000 data usage values in the cells

D2 through D1001 are less than or equal to 10 gigabytes.

Another example, the frequency value 4 that corresponds to the bin label 12,

indicates that exactly 4 out of 1,000 simulated data usage values are strictly

greater than 11 gigabytes, and are less than or equal to 12 gigabytes.

If you look at the bottom of the frequency table, you will see that

the Excel has added another catch-all bin, labeled More that counts

any simulated data usage values that are strictly greater than 39 gigabits.

The frequency count for this bin is 0.

This should not be surprising,

remember the highest simulated data usage value was about 38.5 gigabytes.

There's simply no values in the simulated sample that exceed 39 gigabytes.

With respect to the chart, one can always make some minor

cosmetic modification to satisfy our personal tastes.

For example, we can change the color of the bars

on our chart to green to reflect the fact that we

are having a chart of the input values for our simulation.

We can also use Format Data Series to change the gap between the bars.

To make it look a little bit better.

Okay.

We can also use format axis and

go to labels to make sure that we have a label for every bin.

For example we can do it like this

Okay.

And also change the font here.

To make sure that we see all the labels that can be remove legend,

what can change the x's designation to data usage.

U in gigabytes.

You can do many other things to, like I said, to satisfy our tastes.

Okay, we're done with the histogram of the simulated data usage values.

Let's now move on to building a histogram for

the simulated monthly payment values reported in cells E2 through E1001.

As before let's first decide on the bins that we'll be using for our histogram.

If we copy and paste the formulas for the smallest and

the largest values from cells D1005 and

D1006, I'm trying to find them right here,

to the cells E1005 and E1006, like so.

Now we're talking about the minimum and the maximum values of the monthly payment.

So we see that all the thousand payment values coming out of our assimilation

are between $160 and a little over $438.

So if we select the set of bins that goes from 160 to

say 440 in a steps of 10 to make sure that we do not have too many bins, we

will be able to cover in our histogram all the monthly payment values we simulated.

Okay, so let's go to

the cell A1042 and

add the header P Bins.

After that, let's put 160 into the cell A1043,

and let's keep adding ten To each

preceding value until we reach 440,

so let's go to So, 1,007 changes, see?

1,071.

Okay, now we have our bins, let's call the Histogram tool.

We do Data > Data Analysis > Histogram, click OK.

Input Range, well, this time we're talking about the input range from E2

through E1001.

The bin range, well, we have new bins.

Here we have to go all the way here and

put in the bins from A1043 to A1000 and 71.

Output range, well, let's go and

put our table next to the bins.

So, let's put it here, B1042 and let's request the chart as well,

and let's click OK.

So, we have the table, the frequency table and the corresponding histogram chart.

So, we can find the chart and apply the same

improvements to this particular chart.

This will do it real quick.

Let's change the gap, as we did before.

Something like this.

Well, that's too much, maybe a little smaller.

Okay, and let's also go here and

change it to Payment,

B, in dollars

Let's make sure here that we have labels for every box, okay?

So here's a copy of the histogram for the data usage values we've created We stored

the result in a new file called data plan one thousand underscore histogram.

And here's what we got for the payment values.

Now the data usage values, as the histogram indicates as well,

came from a normal distribution The distribution of the payment values,

though, does not really look normal.

One interesting thing about this histogram is that it indicates that

in about 250 cases out of 1,000 we simulated,

the monthly payment amount was exactly equal to $160.

In other words, this histogram tells us that there's about 25% chance

that under the new plan, the payment amount will be exactly $160.

In many settings, simulation is the tool that must be used

to gain insights like this into the nature of future outcomes.

And histograms are useful visual complements to simulations.

So let's look again at our roadmap.

When we make decisions in high uncertainty settings

we should first define the reward and the risk measures.

In the data plan example the expected monthly payment was selected as the reward

measure and the standard deviation of the monthly payments As the risk measure.

Next, we should use simulation to obtain estimates for the reward and

risk measures.

Here are the estimates we have obtained from a simulation

with 1000 simulation runs in the data plan example.

Finally We can use optimization tool kit to choose the best alternative using for

example, the reward is the objective and the risk as a constraint or constraints.

You will see how this is done in week four.

If you want to run large complex simulations in practice The existing

number of commercial simulation software packages that can help you set up and

run simulation as well as visualize its results.

Here's something to a recently compiled list of commercial simulation software

packages.

Some software packages also have the capacity to figure historical data to

probability distributions.

To help you work through the task of coming with appropriate distributions for

the simulation.

The data plan simulation example look at is small.

It has one random input and one random output.

In practice simulation model can have many random input factors And

many random outputs whose distributions are of interest to decision makers.

In week two, when discussing the optimization toolkit,

we mentioned a journal called Interfaces.

This journal is also a great source of examples of successful

use of the simulation toolkit.

In numerous industries.

Here, we mention two such examples.

The first one describes how intel, a company that does not really need

an introduction uses simulation in addition to a number of other analytics

tools to support a process of purchasing its manufacturing equipment.

The second article, Is about the implementation of simulation-based

approach to making the best pharmacy inventory decisions at Kroger,

the US largest grocery retailer, and the world's fifth largest retailer.

As you can infer from this articles,

analytics approach is in practice are rarely used in isolation.

Most of the time That combined together like optimization and

simulation to form an effective solution tool.

We have reached the end of week three of operations analytics course.

During this week we have looked at how to use simulation tool kit to evaluate and

compare alternatives in settings where our managerial decisions result in

a distribution of outcomes.

Rather than one certain outcome.

In particular, we use an approach for comparing decisions and

uncertainty that involved identifying a reward and risk measures and

used a simulation to obtain estimates of those measures.

Similar to what we had prepared for you in week two, we

have put together an optional video review session and two practice problems to help

you master the simulation tool kit And get you ready for the homework questions.

Please keep in mind that if you feel comfortable with the simulation concepts

you have seen so far, you can skip the review session and

try the practice problems.

Just as in Week 2, the questions we're asking you on the practice problems for

Week 3 are very similar to the question you will see on Week 3's homework.

Now where do we go from here?

Once we know how to estimate reward and

risk, we can compare alternative decisions using for example reward

as the objective function and a risk measure or measures as constraints.

This opens up a way to combine an optimization and

a simulation together in search for the best decision.

This will be one of the topics of week four of our course, enjoy.