How to create empirical cumulative probability curve from observed data?

Posts: 1
Customer
Topic starter
(@ivanarellanonaranjo)
New Member
Joined: 2 days ago

Hello,

I have 20 observed daily production values and I want to create an empirical cumulative probability curve (NOT normal or any theoretical distribution) that I can view by clicking "Cumulative Probability" in the result view.

Production_Data := [415, 410, 409, 410, 425, 404, 416, 398, 422, 385, 410, 408, 412, 408, 426, 421, 423, 421, 418, 408]

I need the cumulative probability curve to be based exactly on these observed values (empirical distribution), not fitted to any theoretical distribution.

I've tried using Discrete() with unique values and their frequencies, but I keep getting "Value is not probabilistic" errors when trying to access the cumulative probability view.

What's the correct way to create an empirical distribution from observed data that supports the "Cumulative Probability" visualization?

Thanks!

 

2 Replies
Lonnie Chrisman
Posts: 46
Admin
(@lchrisman)
Member
Joined: 15 years ago

First, you'll need to decide whether you want a continuous or a discrete distribution. Given that you don't want it fitted in any way, I'm assuming you want a discrete distribution with these as the only possible values. For that, the easiest approach is just resampling. One (of many) possible ways to implement that is:

ChanceDist( 1, Production_data, Production_data )

I'll mention that the first parameter should, strictly speaking, be 1 / IndexLength(Production_data), but when you just want uniform resampling, ChanceDist accepts p:1 for convenience.
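As a cross-check on what the resulting cumulative probability curve should look like, here is a minimal sketch in Python (not Analytica code). It assumes each of the 20 observations carries equal weight 1/n, as in the resampling described above; the function name `empirical_cdf` is made up for illustration.

```python
from collections import Counter

production_data = [415, 410, 409, 410, 425, 404, 416, 398, 422, 385,
                   410, 408, 412, 408, 426, 421, 423, 421, 418, 408]

def empirical_cdf(values):
    """Return (value, cumulative probability) pairs, one per distinct value."""
    n = len(values)
    counts = Counter(values)
    cdf = []
    seen = 0
    for v in sorted(counts):
        seen += counts[v]            # each observation carries weight 1/n
        cdf.append((v, seen / n))    # P(X <= v)
    return cdf

for value, p in empirical_cdf(production_data):
    print(f"{value}: {p:.2f}")
```

The curve is a step function: it jumps at each distinct observed value by (number of occurrences) / 20, so repeated values such as 408 and 410 produce larger steps.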

For a continuous distribution you do need to adopt some distributional form, since something has to decide how to interpolate between these values and how to extrapolate the tails. Once again there are several choices. The Keelin MetaLog distribution is often a good choice, where you can use:

Keelin( Production_data, I:Production_data)

The Keelin distribution takes on the shape of your data -- if it happens to match a classic distribution, it can approximate those very closely, but it covers a much wider space of possible shapes.
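To give a feel for how a metalog can take its shape from data, here is a rough Python sketch (not Analytica code) of the simplest two-term case, x = a1 + a2 * ln(y / (1 - y)), fitted by ordinary least squares. With only two terms this is just a logistic distribution; the full metalog adds further terms to the series, which is where the extra flexibility comes from. The plotting positions (i + 0.5) / n and the function names are assumptions made for illustration, and none of this is claimed to match how Analytica's Keelin function works internally.

```python
import math

production_data = [415, 410, 409, 410, 425, 404, 416, 398, 422, 385,
                   410, 408, 412, 408, 426, 421, 423, 421, 418, 408]

def fit_two_term_metalog(values):
    """Least-squares fit of x = a1 + a2 * ln(y / (1 - y)) to the sorted data."""
    xs = sorted(values)
    n = len(xs)
    # Assumed plotting positions: the i-th sorted value sits at y = (i + 0.5) / n.
    logits = [math.log(y / (1 - y)) for y in ((i + 0.5) / n for i in range(n))]
    mean_l = sum(logits) / n
    mean_x = sum(xs) / n
    # Simple linear regression of x on the logit of y.
    a2 = (sum((l - mean_l) * (x - mean_x) for l, x in zip(logits, xs))
          / sum((l - mean_l) ** 2 for l in logits))
    a1 = mean_x - a2 * mean_l
    return a1, a2

def quantile(a1, a2, y):
    """Value x at cumulative probability y (0 < y < 1) under the fitted curve."""
    return a1 + a2 * math.log(y / (1 - y))

a1, a2 = fit_two_term_metalog(production_data)
for y in (0.05, 0.5, 0.95):
    print(y, round(quantile(a1, a2, y), 1))
```

Because the fit works directly on the quantile function, the fitted coefficients come from a linear regression rather than an iterative optimization.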

As a side note, for a distribution like this, I think displaying the PDF using Smoothing (rather than histogram) works better, even though histograms are more robust across the full spectrum of distribution types. To select smoothing, press Ctrl+U (uncertainty settings) while viewing the PDF and check the Smoothing radio button. This is just a graphing setting; it doesn't change the underlying result.

With the Keelin MetaLog, you have a couple of hyperparameters that you can adjust: the number of terms and the bounds. In the above example, Analytica auto-selects the number of terms it thinks is appropriate, but you can have it use a specific number of terms using, e.g.,

Keelin( Production_data, I:Production_data, nTerms: 7)

These are unbounded continuous distributions, where the tails go to -INF and +INF. However, since the tails drop off exponentially fast, there is essentially zero probability below 300 or above 500 in this case. If you know from other knowledge that there is a hard lower or upper bound, you can include either or both:

Keelin( Production_data, I:Production_data, nTerms:7, lb:350)

or

Keelin( Production_data, I:Production_data, nTerms:7, lb:350, ub:450)
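For intuition about how hard bounds can be imposed on an otherwise unbounded distribution, the published metalog family does it by transforming the unbounded quantile value: a log transform for a single bound and a logit-style transform for two bounds. The Python sketch below (not Analytica code) shows those transforms. The function name `apply_bounds` is hypothetical, and whether Analytica's Keelin function handles lb and ub exactly this way internally is an assumption.

```python
import math

def apply_bounds(m, lb=None, ub=None):
    """Map an unbounded quantile value m onto the requested support."""
    if lb is not None and ub is not None:
        # Two bounds: same as (lb + ub * e^m) / (1 + e^m), so lb < x < ub.
        return lb + (ub - lb) / (1 + math.exp(-m))
    if lb is not None:
        # Lower bound only: x = lb + e^m, so x > lb.
        return lb + math.exp(m)
    if ub is not None:
        # Upper bound only: x = ub - e^(-m), so x < ub.
        return ub - math.exp(-m)
    return m  # no bounds: leave the value unchanged

for m in (-3.0, 0.0, 3.0):
    print(m, apply_bounds(m, lb=350), apply_bounds(m, lb=350, ub=450))
```

Note that m here lives on the transformed scale, not in production units, so a bounded fit is not simply an unbounded fit with its tails clipped.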

Your original data consisted entirely of integers, whereas Keelin is a continuous distribution. If you really want integers, I would probably just round:

Round( Keelin( Production_data, I: Production_data, nTerms:7 ) )

 

Lonnie Chrisman
Posts: 46
Admin
(@lchrisman)
Member
Joined: 15 years ago

You had also said that you tried using Discrete(...). I thought I should also comment on that.

Discrete(...) is not a distribution function, but is instead a function used for specifying the Domain of a variable. When you set the Domain to "Discrete" using the domain-type pulldown, it sets the domain expression to:

Discrete( )

You can see this by selecting the "Expression" view in the Domain attribute.

Among other things, this gives the PDF and CDF calculations a clue as to whether they are computing a discrete or a continuous distribution. It is also used by the optimizer to determine the type of a decision variable, plus a few other things. You could instead set the domain expression to be

Continuous( )

Anyway, Discrete(...) does sound like it might be a distribution function, but isn't.
